Fine-tuning with LoRA has few hyperparameters, but those few numbers can be the difference between a successful tune and gibberish output. The good news is that once you know what each one does, choosing them is no longer guesswork. Let’s go through them one by one.

Rank: the size of what you learn

The rank (r) sets the size of the two small matrices LoRA trains, and therefore how much learning capacity the adapter has. A larger rank gives more capacity, but it costs more memory and compute and raises the risk of overfitting. A practical guide by task type:

  • Style and formatting: r=4 to 8
  • Instruction following: r=8 to 16
  • Domain knowledge: r=16 to 32
  • Complex tasks: r=32 to 64

The golden rule: start with r=16. If the model underfits (training loss stalls at a high value and stops improving), raise the rank; if it overfits (validation loss climbs while training loss keeps falling), lower it.
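
To make the cost side of that trade-off concrete, here is a minimal sketch that counts the parameters LoRA adds to a single linear layer. The 4096 x 4096 layer size is a hypothetical chosen only for illustration, and lora_param_count is a helper name invented for this example.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer.

    LoRA trains two matrices: A with shape (r, d_in) and B with shape (d_out, r).
    """
    return r * d_in + d_out * r


# Hypothetical 4096 x 4096 projection, chosen only for illustration.
d_in = d_out = 4096
full = d_in * d_out  # parameters in the frozen weight itself

for r in (4, 8, 16, 32, 64):
    added = lora_param_count(d_in, d_out, r)
    print(f"r={r:>2}: {added:>9,} trainable params ({added / full:.2%} of the layer)")
```

The count grows linearly with r, so doubling the rank doubles the adapter size for that layer.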

Alpha and the alpha-over-rank scale

Alpha (lora_alpha) has little meaning on its own; what matters is the ratio α/r, because in the standard formulation LoRA’s update is multiplied by exactly that ratio. The common rule is to set α = 2r (so α/r = 2). The benefit is that if you later change the rank and keep α at 2r, the scale applied to LoRA’s update stays constant. If you change the rank but leave α fixed, the scale changes with it. If training is unstable, α = r is more conservative; and if you set α much larger than r, you’ll likely need to lower the learning rate.
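
A small sketch of the arithmetic, assuming the standard α/r scaling. The helper name lora_scale is hypothetical.

```python
def lora_scale(alpha: float, r: int) -> float:
    """Factor that multiplies the LoRA update in the standard alpha/r formulation."""
    return alpha / r


# Keeping alpha = 2r holds the scale at 2 for every rank.
for r in (8, 16, 32):
    print(f"r={r}, alpha={2 * r}: scale={lora_scale(2 * r, r)}")

# Leaving alpha fixed at 32 while raising the rank shrinks the scale instead.
for r in (8, 16, 32, 64):
    print(f"r={r}, alpha=32: scale={lora_scale(32, r)}")
```

This is why it helps to think in terms of the ratio: raising r from 16 to 64 with α left at 32 quietly cuts the scale from 2 to 0.5.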

Learning rate

LoRA tolerates a higher learning rate than full fine-tuning, because it trains only a few small matrices while the base weights stay frozen. A good range is 1e-4 to 3e-4; the commonly used 2e-4 with a cosine schedule and about three percent warmup is a safe starting point. If the output turns to gibberish, this is the first thing to check; a learning rate that’s too high is one of the most common causes of broken output.
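
The schedule described above can be sketched in plain Python. This is an illustrative implementation of linear warmup followed by cosine decay, not the scheduler of any particular library; lr_at_step and its parameters are names made up for this example.

```python
import math


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 2e-4, warmup_frac: float = 0.03) -> float:
    """Linear warmup to peak_lr, then cosine decay towards zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp up linearly so the first updates are small.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


total = 1000  # hypothetical number of optimizer steps
for step in (0, 15, 30, 500, 999):
    print(f"step {step:>3}: lr = {lr_at_step(step, total):.2e}")
```

With 1000 total steps, three percent warmup means the first 30 steps ramp up to the peak before the decay begins.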

Dropout

Dropout (lora_dropout) helps prevent overfitting, and the right value depends on your dataset size:

  • Fewer than a thousand samples: 0.1
  • Between one and twenty thousand samples: 0.05
  • More than twenty thousand samples: 0.0 to 0.05

The less data you have, the greater the risk of overfitting, and a higher dropout helps.
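
If you pick hyperparameters in a script, the table above can be written as a small lookup. This is only a sketch: suggested_dropout is a hypothetical helper, the treatment of the exact boundary values is a choice made here, and for the largest bucket it returns the low end of the 0.0 to 0.05 range.

```python
def suggested_dropout(num_samples: int) -> float:
    """Map dataset size to a starting dropout value from the table above."""
    if num_samples < 1_000:
        return 0.1
    if num_samples <= 20_000:
        return 0.05
    # Large datasets: anywhere from 0.0 to 0.05; this sketch picks the low end.
    return 0.0


for n in (500, 5_000, 100_000):
    print(f"{n:>7,} samples -> dropout {suggested_dropout(n)}")
```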

Which layers: target_modules

Another important question is which layers LoRA applies to. The common modern practice is "all-linear": apply LoRA to every linear layer. The older advice of targeting only q_proj and v_proj has largely been set aside, because targeting all linear layers usually gives better quality, at the cost of more trainable parameters. The reason is that attention layers route information and MLP layers hold knowledge; for a task that needs new knowledge, both matter.
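
To get a feel for what the two choices cost, the sketch below counts LoRA parameters per transformer block for both. The layer names beyond q_proj and v_proj (k_proj, o_proj, gate_proj, up_proj, down_proj) and the sizes 4096 and 11008 are assumptions modeled on a Llama-style block; your model’s names and shapes may differ.

```python
# Hypothetical Llama-style block dimensions, used only for illustration.
HIDDEN, INTERMEDIATE = 4096, 11008

# (d_in, d_out) for each linear layer in one block.
LINEAR_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params_per_block(targets: set, r: int = 16) -> int:
    """Sum r * (d_in + d_out) over the targeted linear layers of one block."""
    return sum(
        r * (d_in + d_out)
        for name, (d_in, d_out) in LINEAR_SHAPES.items()
        if name in targets
    )


qv_only = lora_params_per_block({"q_proj", "v_proj"})
all_linear = lora_params_per_block(set(LINEAR_SHAPES))
print(f"q_proj + v_proj: {qv_only:,} params per block")
print(f"all linear:      {all_linear:,} params per block")
print(f"ratio:           {all_linear / qv_only:.1f}x")
```

Under these assumed shapes, targeting every linear layer trains several times more parameters than q_proj and v_proj alone, with most of the extra going to the MLP layers.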

Number of epochs

Tie the number of epochs to the data size. Less data needs more epochs:

  • Fewer than a thousand samples: 3 to 5 epochs
  • One to ten thousand samples: 2 to 3 epochs
  • Ten to fifty thousand samples: 1 to 2 epochs
  • More than fifty thousand samples: 1 epoch

For large datasets, many epochs cause overfitting; one is often enough.
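
The epoch table also determines how many optimizer steps a run takes, which is what a warmup percentage is measured against. A minimal sketch, with hypothetical helper names, boundary handling chosen for this example, and an effective batch size of 16 picked only for illustration:

```python
import math


def suggested_epochs(num_samples: int) -> tuple:
    """Return the (min, max) epoch range from the table above."""
    if num_samples < 1_000:
        return (3, 5)
    if num_samples <= 10_000:
        return (2, 3)
    if num_samples <= 50_000:
        return (1, 2)
    return (1, 1)


def total_optimizer_steps(num_samples: int, epochs: int, effective_batch_size: int) -> int:
    """Optimizer steps for a run; the last partial batch of each epoch counts as a step."""
    return math.ceil(num_samples / effective_batch_size) * epochs


n = 5_000
low, high = suggested_epochs(n)
steps = total_optimizer_steps(n, epochs=low, effective_batch_size=16)
print(f"{n:,} samples: {low} to {high} epochs, {steps} steps at {low} epochs")
```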

Putting it together: a safe starting point

If you just want to start, this combination works for most jobs: r=16, α=32, learning rate 2e-4, target_modules="all-linear", dropout matched to your data size, and one or two epochs (more for very small datasets). Then watch the training and validation loss: if both come down, you’re on the right track; if validation loss climbs while training loss keeps falling, reduce the rank or the epochs. These few numbers decide most of the result.
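
As a sketch, the starting point can be written down as two plain dictionaries. The keys are chosen to match the argument names of LoraConfig in Hugging Face PEFT and TrainingArguments in Transformers as found in recent versions; that mapping, the task_type value, and the choice of 0.05 dropout and 2 epochs (a mid-sized dataset) are assumptions to check against your own setup.

```python
# Adapter settings; keys mirror PEFT's LoraConfig argument names (assumed).
lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,                # alpha = 2r, so alpha / r = 2
    "lora_dropout": 0.05,            # assumes a few thousand samples
    "target_modules": "all-linear",
    "task_type": "CAUSAL_LM",        # assumption: a causal language model
}

# Optimizer settings; keys mirror Transformers' TrainingArguments (assumed).
trainer_kwargs = {
    "learning_rate": 2e-4,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.03,            # about three percent warmup
    "num_train_epochs": 2,
}

# With both libraries installed, these would typically be unpacked as
# LoraConfig(**lora_kwargs) and TrainingArguments(output_dir=..., **trainer_kwargs).
print("scale alpha/r =", lora_kwargs["lora_alpha"] / lora_kwargs["r"])
```

Keeping the values in one place makes it easy to change a single number between runs and compare the loss curves.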