Fine-Tuning Basics

Learn when and how to fine-tune LLMs, including LoRA, QLoRA, and best practices for custom models

When to Fine-Tune

Fine-tuning adapts a pre-trained LLM to perform better on specific tasks or domains by training it on your own data. However, fine-tuning is expensive, time-consuming, and often unnecessary. Before fine-tuning, exhaust these alternatives: better prompting, few-shot examples, RAG, and prompt chaining.

Should You Fine-Tune?

Situation | Recommendation | Why
Need domain knowledge | Use RAG | Cheaper, updatable, no training needed
Want specific output format | Use prompting | Few-shot examples in prompt usually suffice
Need consistent style/tone | Consider fine-tuning | Style is hard to capture in prompts alone
Specific classification task | Fine-tune | Fine-tuned models are faster and cheaper per call
Reduce latency at scale | Fine-tune smaller model | Fine-tuned small model can match larger model quality

LoRA: Efficient Fine-Tuning

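LoRA (Low-Rank Adaptation) is one of the most widely used fine-tuning techniques. Instead of updating all of the model's parameters (billions), LoRA freezes them and trains a pair of small low-rank matrices alongside each targeted weight matrix. Typically around 1% of the parameters or fewer are trained, which sharply shrinks the memory needed for gradients and optimizer state (the original LoRA paper reports roughly a 3x reduction in GPU memory for GPT-3 175B) while achieving quality close to full fine-tuning.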
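To make the savings concrete, here is a minimal sketch of the parameter arithmetic. For a weight matrix of shape (d_out, d_in), LoRA trains two factors of shapes (d_out, r) and (r, d_in), so it adds r * (d_out + d_in) trainable values. The layer shapes and layer count below are assumptions for an 8B Llama-style model, not values read from a checkpoint; check model.config for the real ones.

```python
# Assumed (out_features, in_features) shapes for one transformer layer of an
# 8B Llama-style model. These are illustrative assumptions, not loaded values.
LAYER_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (1024, 4096),
    "v_proj": (1024, 4096),
    "o_proj": (4096, 4096),
    "gate_proj": (14336, 4096),
    "up_proj": (14336, 4096),
    "down_proj": (4096, 14336),
}
NUM_LAYERS = 32  # assumed number of transformer layers


def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Trainable values LoRA adds to one (d_out, d_in) weight matrix."""
    return r * (d_out + d_in)


def total_lora_params(target_modules: list[str], r: int) -> int:
    """Sum LoRA parameters over the targeted modules in every layer."""
    per_layer = sum(lora_param_count(*LAYER_SHAPES[name], r) for name in target_modules)
    return per_layer * NUM_LAYERS


attention_only = total_lora_params(["q_proj", "k_proj", "v_proj", "o_proj"], r=16)
all_linear = total_lora_params(list(LAYER_SHAPES), r=16)

print(f"one 4096x4096 matrix, full: {4096 * 4096:,}")
print(f"one 4096x4096 matrix, LoRA r=16: {lora_param_count(4096, 4096, 16):,}")
print(f"attention projections only: {attention_only:,}")
print(f"attention + MLP projections: {all_linear:,}")
```

Under these assumptions, adding the MLP projections to target_modules roughly triples the trainable parameter count, which is the usual trade-off between adapter capacity and memory.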

# Fine-tuning with LoRA using Hugging Face
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset
from trl import SFTTrainer

# Load base model
model_name = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,               # Rank of the update matrices (8-64 typical)
    lora_alpha=32,       # Scaling factor (usually 2x rank)
    lora_dropout=0.05,   # Dropout for regularization
    target_modules=[     # Which layers to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

# Apply LoRA to model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
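# e.g. "trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5196"
# (approximate for r=16 on all seven projection modules; targeting only the
# attention projections gives about 13.6M trainable parameters)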

# Prepare dataset
dataset = load_dataset("json", data_files="training_data.jsonl")

# Format: {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}

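# Training
# Note: this call matches older TRL releases. Newer releases expect a
# trl.SFTConfig for args (it also carries the sequence-length setting) and
# processing_class= instead of tokenizer=.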
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    args=TrainingArguments(
        output_dir="./output",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        warmup_steps=100,
        logging_steps=10,
        save_strategy="epoch",
        bf16=True,
    ),
    tokenizer=tokenizer,
    max_seq_length=2048,
)

trainer.train()
# For a PEFT model this saves the LoRA adapter weights, not a merged full model
trainer.save_model("./my-fine-tuned-model")
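The batch and warmup settings in the TrainingArguments above interact with dataset size. The sketch below estimates the effective batch size and the total number of optimizer steps. The dataset size of 500 examples is a hypothetical figure, and the step count is an approximation, since the exact rounding depends on the trainer version.

```python
import math


def training_schedule(
    num_examples: int,
    per_device_batch_size: int,
    grad_accum_steps: int,
    num_epochs: int,
    num_devices: int = 1,
) -> tuple[int, int]:
    """Return (effective_batch_size, approximate_total_optimizer_steps)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * num_epochs


# Values from the TrainingArguments above; 500 examples is a hypothetical size.
effective_batch, total_steps = training_schedule(
    num_examples=500,
    per_device_batch_size=4,
    grad_accum_steps=4,
    num_epochs=3,
)
warmup_steps = 100

print(f"effective batch size: {effective_batch}")
print(f"approximate total steps: {total_steps}")
if warmup_steps >= total_steps:
    print("warmup is longer than the whole run; lower warmup_steps")
```

With a small dataset like this one, 100 warmup steps would cover the entire run, so the learning rate would never reach its configured peak. Scaling warmup to a small fraction of the total steps is a common way to avoid that.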

QLoRA: Fine-Tuning on Consumer GPUs

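QLoRA combines LoRA with 4-bit quantization of the frozen base model, enabling fine-tuning of large models on far less hardware. The QLoRA paper fine-tunes a 65B-parameter model on a single 48 GB GPU; a 24 GB consumer GPU is a realistic fit for models up to roughly 30B parameters. A 70B model, as in the example below, needs about 35 GB for its 4-bit weights alone, so plan for a 48 GB card or multiple GPUs.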

# QLoRA - fine-tuning with 4-bit quantization
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",         # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,     # Double quantization for extra savings
)

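# Load model in 4-bit (the 70B weights alone take roughly 35 GB;
# device_map="auto" spreads layers across the available devices)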
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Apply LoRA on top of quantized model
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
))

# Train as normal - only LoRA weights are updated
# The 4-bit base model stays frozen
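A quick back-of-the-envelope check helps decide whether a model fits on a given GPU before downloading it. This sketch estimates memory for the weights only, as parameters times bits per parameter. It deliberately ignores quantization constants, LoRA adapters, activations, and optimizer state, so treat the result as a lower bound. The parameter counts are rounded assumptions.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights only, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Rounded parameter counts (assumptions for illustration).
MODELS = {"8B": 8e9, "70B": 70e9}

estimates = {
    name: {
        "bf16": weight_memory_gb(params, 16),
        "4-bit": weight_memory_gb(params, 4),
    }
    for name, params in MODELS.items()
}

for name, est in estimates.items():
    print(f"{name}: bf16 ~{est['bf16']:.0f} GB, 4-bit ~{est['4-bit']:.0f} GB")
```

Real training needs headroom on top of these figures, and the gap grows with sequence length and batch size.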

Preparing Training Data

// Preparing fine-tuning data for OpenAI format
interface TrainingExample {
  messages: { role: "system" | "user" | "assistant"; content: string }[];
}

function prepareTrainingData(
  examples: { input: string; output: string; systemPrompt?: string }[]
): TrainingExample[] {
  return examples.map(ex => ({
    messages: [
      ...(ex.systemPrompt
        ? [{ role: "system" as const, content: ex.systemPrompt }]
        : []),
      { role: "user" as const, content: ex.input },
      { role: "assistant" as const, content: ex.output },
    ],
  }));
}

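// OpenAI fine-tuning (managed service)
// training_data.jsonl must contain one TrainingExample per line,
// i.e. each element of prepareTrainingData(...) serialized with JSON.stringify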
import OpenAI from "openai";
import * as fs from "fs";

const openai = new OpenAI();

// Upload training file
const file = await openai.files.create({
  file: fs.createReadStream("training_data.jsonl"),
  purpose: "fine-tune",
});

// Create fine-tuning job
const job = await openai.fineTuning.jobs.create({
  training_file: file.id,
  model: "gpt-4o-mini-2024-07-18",
  hyperparameters: {
    n_epochs: 3,
    batch_size: "auto",
    learning_rate_multiplier: "auto",
  },
});

// Monitor progress
const status = await openai.fineTuning.jobs.retrieve(job.id);
console.log(status.status); // e.g. "validating_files", "queued", "running", "succeeded", "failed", "cancelled"

Training Data Best Practices

  • Quality over quantity: 100 high-quality examples often beat 10,000 noisy ones
  • Diverse examples: Cover edge cases and variations, not just the happy path
  • Consistent format: Use the same instruction format across all examples
  • Include negatives: Show examples of what NOT to do or how to refuse inappropriate requests
  • Validate with humans: Have domain experts review training data for accuracy
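The "consistent format" practice above can be partly automated before any human review. The following is a minimal sketch of a validator for chat-format JSONL, assuming each line holds an object with a messages list as in the examples earlier. The function names and the specific checks are illustrative choices, not part of any library.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}


def validate_example(example: dict) -> list[str]:
    """Return the problems found in one chat-format example (empty if OK)."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: invalid role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or non-string content")
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message is not from the assistant")
    return problems


def validate_jsonl_lines(lines) -> dict[int, list[str]]:
    """Validate an iterable of JSONL lines; return {line_number: problems}."""
    report = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            example = json.loads(line)
        except json.JSONDecodeError as exc:
            report[number] = [f"invalid JSON: {exc.msg}"]
            continue
        if not isinstance(example, dict):
            report[number] = ["line is not a JSON object"]
            continue
        problems = validate_example(example)
        if problems:
            report[number] = problems
    return report


sample = [
    '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]}',
    '{"messages": [{"role": "user", "content": "Hi"}]}',
    "not json",
]
report = validate_jsonl_lines(sample)
for number, problems in report.items():
    print(number, problems)
```

To check a real file, pass an open file handle, for example validate_jsonl_lines(open("training_data.jsonl", encoding="utf-8")). Structural checks like these do not replace expert review of whether the answers are actually correct.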

Fine-Tuning Pitfalls

  • Catastrophic forgetting: The model loses general capabilities — use low learning rates and few epochs
  • Overfitting: With too little data, the model memorizes instead of generalizing — use validation sets
  • Data contamination: Don't include eval data in training — it inflates metrics
  • Stale models: Fine-tuned models don't get base model updates — you'll need to re-fine-tune periodically
  • Hidden costs: Training compute, data preparation time, and ongoing maintenance add up fast
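Two of the pitfalls above, overfitting and data contamination, can be reduced with a held-out validation split that shares no examples with the training set. The sketch below removes exact duplicates by hashing each example and then splits the remainder. The helper names are hypothetical, and exact-match hashing does not catch paraphrased or near-duplicate examples, which need fuzzier matching.

```python
import hashlib
import json
import random


def fingerprint(example: dict) -> str:
    """Stable hash of an example, independent of key order."""
    canonical = json.dumps(example, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def split_without_leakage(examples: list[dict], val_fraction: float = 0.1, seed: int = 0):
    """Drop exact duplicates, shuffle, and return (train, validation)."""
    unique = {}
    for example in examples:
        unique.setdefault(fingerprint(example), example)
    items = list(unique.values())
    if len(items) < 2:
        raise ValueError("need at least two unique examples to split")
    random.Random(seed).shuffle(items)  # seeded for a reproducible split
    n_val = max(1, int(len(items) * val_fraction))
    return items[n_val:], items[:n_val]


# Ten hypothetical examples plus one exact duplicate of the first.
examples = [
    {"messages": [
        {"role": "user", "content": f"question {i}"},
        {"role": "assistant", "content": f"answer {i}"},
    ]}
    for i in range(10)
]
examples.append(examples[0])

train, val = split_without_leakage(examples, val_fraction=0.2)
overlap = {fingerprint(e) for e in train} & {fingerprint(e) for e in val}
print(len(train), len(val), len(overlap))
```

Deduplicating before the split matters: if duplicates are removed afterwards, a copy of the same example can land on both sides and inflate validation metrics.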

Summary

Fine-tuning is a powerful tool but should be a last resort, not a first step. Try prompting, RAG, and few-shot examples first. When fine-tuning is genuinely needed, LoRA and QLoRA make it accessible on reasonable hardware. Focus on high-quality training data, start with a small number of examples, evaluate rigorously, and be prepared for ongoing maintenance as base models evolve.