Fine-tune your first model on free Colab: QLoRA in about 40 lines

Fine-tuning a model no longer requires an expensive cluster. With QLoRA, you can tune a small model on a free GPU in a few dozen lines of code.

Fine-tuning a language model used to be the business of expensive clusters. Today, with a method called QLoRA, you can tune a small model on the free GPU that Colab provides. This guide walks through the whole process in about forty lines of code.

Why QLoRA makes it possible

Fully fine-tuning a seven-billion-parameter model needs around 140 GB of GPU memory, a job for several GPUs side by side. LoRA brings that figure down to about 16 GB: instead of training the whole model, it trains only a few small matrices and keeps the main body frozen. QLoRA goes one step further: it quantizes the frozen main body to four-bit precision and trains those same small matrices on top of it. For the same seven-billion-parameter model, the result is about 6 GB, which fits on an ordinary free GPU. The price of this saving is small: training is a little slower and quality can be a little lower, which is acceptable for most jobs.
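
A rough back-of-the-envelope sketch of where the saving comes from. It counts only the memory taken by the base weights at different precisions. The weight_memory_gb helper is an illustrative assumption, and the totals quoted above also include gradients, optimizer state, activations, and overhead that this sketch ignores.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params = 7e9  # the seven-billion-parameter model from the text

# Compare the weight footprint at three common precisions.
for label, bits in [("fp32", 32), ("fp16", 16), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gb(params, bits):.1f} GB")
```

At four bits the weights alone come to about 3.5 GB, which leaves room inside the roughly 6 GB total for the adapters, their optimizer state, and activations.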

The path, step by step

The whole job comes down to a few parts: load the base model in four-bit, attach the LoRA adapters, prepare a small set of data, set a few key parameters, and train.

from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
from datasets import load_dataset
import torch

model_name = "your-base-model"

# 1. Load the base model in four-bit (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

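# 2. Prepare the quantized model for training, then attach the LoRA adapters
from peft import prepare_model_for_kbit_training  # imported here so this step is self-contained
# Freezes the base weights and makes the 4-bit model work with gradient checkpointing
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16, lora_alpha=32, target_modules="all-linear",
    lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports how small the trainable share is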

# 3. Load the data (point this at your own dataset)
dataset = load_dataset("your_dataset", split="train")

# 4. Training settings
args = TrainingArguments(
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    fp16=True,
    gradient_checkpointing=True,
    logging_steps=10,
    output_dir="./out")

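# 5. Train and save
# Note: older TRL releases accept max_seq_length here; newer ones expect the
# sequence-length limit in SFTConfig instead, so adjust to your installed version.
trainer = SFTTrainer(model=model, args=args,
                     train_dataset=dataset, max_seq_length=2048)
trainer.train()
model.save_pretrained("./my-lora-adapter")  # saves only the adapter weights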
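
The data step is the one most likely to need adapting. SFTTrainer commonly trains on a single text field per example, though the exact expectations depend on the TRL version. Below is a minimal sketch of building such a field from instruction/response pairs. The column names instruction and response, the prompt template, and the format_example helper are all hypothetical; adjust them to your dataset.

```python
def format_example(example: dict) -> dict:
    """Join one instruction/response pair into a single training string."""
    text = (
        f"### Instruction:\n{example['instruction'].strip()}\n\n"
        f"### Response:\n{example['response'].strip()}"
    )
    return {"text": text}


# A made-up record, used only to show the resulting layout.
sample = {"instruction": "Translate to French: cat ", "response": " chat"}
print(format_example(sample)["text"])
```

With the datasets library, a function like this would typically be applied with dataset = dataset.map(format_example) before the dataset is handed to the trainer.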

The key settings to know

A few numbers in this code matter more than the rest. The rank (r) sets the size of the trained matrices; a good starting point is r=16. Alpha (lora_alpha) scales the adapter's contribution by alpha/r and is usually set to twice the rank, i.e. 32. The option target_modules="all-linear" applies the adapters to all the linear layers, which is today's common practice. A learning rate of 2e-4 is a common starting point for QLoRA. And be careful about the number of epochs: one is often enough, and more epochs raise the risk of overfitting.
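
To make these numbers concrete, here is a small arithmetic sketch of what r and lora_alpha imply for one linear layer, and what the batch settings add up to. The 4096 x 4096 layer size is a hypothetical choice for illustration, and lora_extra_params is an illustrative helper rather than a library function.

```python
def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer (A is r x d_in, B is d_out x r)."""
    return r * (d_in + d_out)


d_in = d_out = 4096      # hypothetical layer size, for illustration only
r, lora_alpha = 16, 32   # values from the training script

full = d_in * d_out
extra = lora_extra_params(d_in, d_out, r)
scaling = lora_alpha / r  # standard LoRA scales the adapter update by alpha / r

print(f"frozen weights: {full:,}")
print(f"trainable LoRA: {extra:,} ({extra / full:.2%})")
print(f"scaling factor: {scaling}")

# Effective batch size from the training settings above
per_device_batch, grad_accum = 1, 16
print(f"effective batch size: {per_device_batch * grad_accum}")
```

Doubling r doubles the trainable parameter count for the layer; keeping lora_alpha at twice the rank keeps the scaling factor at 2.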

Merging and inference

After training, you have two options. You can keep the adapter separate and load it alongside the base model; the adapter is small (typically tens to a few hundred megabytes, depending on rank and model size), so this works well when you have several different jobs sharing one base model. Or you can merge the adapter into the base model to get a single, unified model with no adapter overhead at inference time; this is better suited to final deployment. To test the result, put the model in evaluation mode, generate from a sample prompt, and compare the output with the base model's.
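
Both options can be sketched with the peft and transformers APIs. This is a minimal sketch under stated assumptions, not a definitive recipe: the helper names are hypothetical, the base model is reloaded in fp16 before merging (a common choice, not something the steps above prescribe), and the heavy imports sit inside the functions so the helpers can be defined before a GPU session is available.

```python
def load_with_adapter(base_name: str, adapter_dir: str):
    """Option 1: load the base model in fp16 and attach the saved LoRA adapter."""
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained(
        base_name, torch_dtype=torch.float16, device_map="auto")
    return PeftModel.from_pretrained(base, adapter_dir)


def merge_and_save(base_name: str, adapter_dir: str, out_dir: str):
    """Option 2: fold the adapter into the base weights and save a standalone model."""
    model = load_with_adapter(base_name, adapter_dir)
    merged = model.merge_and_unload()  # returns the base model with LoRA weights merged in
    merged.save_pretrained(out_dir)
    return merged


def generate(model, tokenizer, prompt: str, max_new_tokens: int = 64) -> str:
    """Generate a completion for one prompt with the model in evaluation mode."""
    import torch

    model.eval()
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

A typical session would call merge_and_save("your-base-model", "./my-lora-adapter", "./my-merged-model"), where the output directory is a made-up name, and then run generate on the same prompt with both the tuned and the untouched base model to compare.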

When you hit a problem

Three common problems and their fixes. If you run out of memory, try these in order: enable gradient checkpointing, drop the batch size to one and make up for it with gradient accumulation, confirm the model is loaded in four-bit, and finally shorten the sequence length. If the loss isn't coming down, raise the learning rate a little and check that the adapters are actually attached (print_trainable_parameters should report a nonzero trainable count). If the output is gibberish, the learning rate is usually too high or there are too many epochs. These few adjustments resolve most early problems.
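
The batch-size trade mentioned for out-of-memory errors is simple arithmetic: shrink the per-device batch and raise gradient accumulation so that their product stays the same. A small sketch, where rebalance is a hypothetical helper and not part of any library:

```python
def rebalance(batch_size: int, grad_accum: int, new_batch_size: int) -> tuple[int, int]:
    """Shrink the per-device batch while keeping batch_size * grad_accum unchanged."""
    effective = batch_size * grad_accum
    if effective % new_batch_size != 0:
        raise ValueError("new_batch_size must divide the effective batch size")
    return new_batch_size, effective // new_batch_size


# Going from batch 4 x accumulation 4 down to batch 1 keeps 16 examples per update.
print(rebalance(4, 4, 1))  # (1, 16)
```

The two returned values map onto per_device_train_batch_size and gradient_accumulation_steps in the training settings.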