How to Fine-Tune LLMs with LoRA and QLoRA: A Production Python Guide

Learn to fine-tune LLMs with LoRA and QLoRA on a single GPU using Python and Hugging Face. Covers dataset prep, training with TRL and PEFT, evaluation, GGUF export, and Ollama deployment.

Fine-tuning large language models used to mean booking multi-GPU clusters and watching five-figure cloud bills pile up. Not anymore. LoRA and QLoRA changed the game entirely. These parameter-efficient fine-tuning (PEFT) techniques let you adapt billion-parameter models on a single consumer GPU — training just a small fraction of the weights while keeping most of the base model's knowledge intact.

This guide walks you through the whole production pipeline: how LoRA and QLoRA actually work under the hood, preparing datasets, training with Hugging Face's TRL and PEFT libraries, evaluating your results, and deploying your fine-tuned model with Ollama. Every code example here is tested against the current 2026 stack — Python 3.11+, PyTorch 2.5+, TRL 0.29, and PEFT 0.14+.

When Fine-Tuning Beats Prompting

Before you invest time in fine-tuning, it's worth asking: do you actually need it?

Honestly, prompt engineering with a few well-chosen examples can get you 70–80% of fine-tuning performance for simpler tasks. I've seen teams spend weeks fine-tuning a model when a good system prompt would've done the trick. But fine-tuning becomes genuinely essential when you need:

  • Consistent output formatting — structured JSON, specific schemas, or domain-specific syntax that the base model keeps getting wrong no matter how you prompt it
  • Domain-specific behavior — legal reasoning, medical terminology, or internal company jargon the base model was never trained on
  • Reduced latency and cost — a fine-tuned 8B model can outperform a 70B model with complex prompts, saving tokens and inference time
  • Behavioral consistency — reliable tone, style, and response patterns across thousands of interactions
  • Privacy requirements — running a fine-tuned model locally means you don't have to send sensitive data to third-party APIs

How LoRA Works

Low-Rank Adaptation (LoRA) is built on a clever insight: the weight updates during fine-tuning have a much lower intrinsic rank than the full weight matrices. So instead of updating all parameters, LoRA freezes the pretrained weights and injects small trainable matrices into specific transformer layers.

For a weight matrix W of dimensions d × k, LoRA decomposes the update like this:

W_new = W + B × A

Where:
  W    = frozen pretrained weights (d × k)
  B    = trainable matrix (d × r)
  A    = trainable matrix (r × k)
  r    = rank (typically 8–64, much smaller than d or k)

The math here is actually pretty elegant. Instead of training millions or billions of parameters, you're only training B and A, which together contain r × (d + k) parameters. For a typical transformer layer with d = k = 4096 and r = 16, that's 131,072 trainable parameters versus 16,777,216 for the full matrix. A 128x reduction.

LoRA is typically applied to the attention projection matrices (q_proj, k_proj, v_proj, o_proj) and sometimes the feed-forward layers (gate_proj, up_proj, down_proj). The scaling factor lora_alpha / r controls how much the adapter influences the output.
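The parameter arithmetic above takes only a few lines to sanity-check (a standalone sketch; the helper names are illustrative, not part of any library):

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters in a LoRA adapter for a d x k weight matrix:
    B is d x r and A is r x k, so the total is r * (d + k)."""
    return r * (d + k)

def full_param_count(d: int, k: int) -> int:
    """Parameters in the full weight matrix."""
    return d * k

d = k = 4096  # typical hidden size for a 7B-8B transformer
r = 16
lora = lora_param_count(d, k, r)   # 131,072
full = full_param_count(d, k)      # 16,777,216
print(f"LoRA: {lora:,}  Full: {full:,}  Reduction: {full / lora:.0f}x")
```

Scaling r up or down moves the trainable-parameter count linearly, which is why rank is the first knob to turn when a model under- or overfits.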

How QLoRA Extends LoRA

QLoRA (Quantized Low-Rank Adaptation) takes things a step further by quantizing the base model weights to 4-bit precision. It introduces three key innovations:

  • 4-bit NormalFloat (NF4) — an information-theoretically optimal data type for normally distributed weights, cutting memory by 75% compared to 16-bit storage
  • Double quantization — quantizes the quantization constants themselves, saving an additional ~0.4 bits per parameter at no performance cost (yes, really — free savings)
  • Paged optimizers — pages optimizer states to CPU memory during VRAM spikes, preventing those dreaded out-of-memory crashes mid-training

The base model loads in 4-bit precision, while the LoRA adapters and all computations stay in bfloat16 or float16. Training still happens in higher precision — only storage is compressed. The practical upshot? You can fine-tune an 8B-parameter model in under 10 GB of VRAM with sequence length ≤512, batch size 1, and gradient checkpointing enabled.
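The memory claim is easy to sanity-check with back-of-envelope arithmetic (a standalone sketch; the helper names and layer-count assumptions are mine, not from any library):

```python
def qlora_base_weights_gb(n_params_billion: float) -> float:
    """Rough memory for 4-bit (NF4) base weights: 0.5 bytes per parameter."""
    return n_params_billion * 1e9 * 0.5 / 1e9

def lora_adapter_mb(n_layers: int, d: int, r: int, n_matrices: int) -> float:
    """Rough bf16 adapter size (2 bytes/param), assuming square d x d
    projections: each target matrix gets B (d x r) plus A (r x d)."""
    params = n_layers * n_matrices * 2 * d * r
    return params * 2 / 1e6

print(f"8B base in NF4: ~{qlora_base_weights_gb(8):.1f} GB")
print(f"Adapter (32 layers, d=4096, r=16, 7 targets): "
      f"~{lora_adapter_mb(32, 4096, 16, 7):.0f} MB")
```

The 4-bit weights account for only ~4 GB of the ~10 GB figure; the rest goes to activations, gradients, optimizer states, and CUDA overhead, which is why the practical floor sits well above the raw weight size.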

Setting Up Your Environment

The 2026 fine-tuning stack needs Python 3.11+, PyTorch 2.5+, CUDA 12.x, and the Hugging Face ecosystem. Here's the complete environment setup:

# Create and activate a virtual environment
python -m venv llm-finetune
source llm-finetune/bin/activate

# Install core dependencies
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

# Install Hugging Face stack
pip install transformers datasets accelerate peft trl

# Install bitsandbytes for QLoRA quantization
pip install bitsandbytes

# Optional: Unsloth for 2x faster training and 60% less memory
pip install unsloth

Verify your GPU is detected:

import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

Hardware Requirements

Here's what you'll need for different model sizes:

  • 7B–8B models with QLoRA — 12 GB VRAM minimum (RTX 4070 Ti, RTX 4080, RTX 4090)
  • 7B–8B models with LoRA — 24 GB VRAM (RTX 4090, A5000, L4)
  • 13B models with QLoRA — 24 GB VRAM (RTX 4090, A5000)
  • 70B models with QLoRA — 48–80 GB VRAM (A100, H100, or multi-GPU with FSDP)
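The guidance above reduces to a simple lookup. A sketch that mirrors the list (the thresholds are the rough figures from this guide, not hard limits, and the function name is illustrative):

```python
def pick_method(model_size_b: float, vram_gb: float) -> str:
    """Map (model size in billions, available VRAM in GB) to a
    fine-tuning approach, mirroring the rough guidance above."""
    if model_size_b <= 8:
        if vram_gb >= 24:
            return "LoRA"
        if vram_gb >= 12:
            return "QLoRA"
    elif model_size_b <= 13 and vram_gb >= 24:
        return "QLoRA"
    elif model_size_b <= 70 and vram_gb >= 48:
        return "QLoRA (or multi-GPU FSDP)"
    return "insufficient VRAM: use a smaller model or a cloud GPU"

print(pick_method(8, 24))   # LoRA
print(pick_method(70, 48))  # QLoRA (or multi-GPU FSDP)
```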

Preparing Your Training Dataset

This is where most fine-tuning projects succeed or fail.

Dataset quality matters far more than quantity. Research consistently shows that 200 expert-validated examples outperform 2,000 hastily collected ones. For most tasks, 500–1,000 high-quality examples produce strong results with LoRA. I can't stress this enough — spend time curating your data rather than just throwing more of it at the model.

The standard format for instruction fine-tuning is the chat template format. Most modern models expect conversations structured as a list of messages:

from datasets import Dataset

data = [
    {
        "messages": [
            {"role": "system", "content": "You are a legal document analyzer."},
            {"role": "user", "content": "Extract the key terms from this contract clause: ..."},
            {"role": "assistant", "content": "Key terms:\n1. Indemnification: ..."}
        ]
    },
    # ... more examples
]

dataset = Dataset.from_list(data)
dataset = dataset.train_test_split(test_size=0.1, seed=42)

If you're working with an existing dataset from Hugging Face Hub:

from datasets import load_dataset

# Load a public instruction dataset
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
dataset = dataset.train_test_split(test_size=0.05, seed=42)

Data Quality Checklist

  • Remove duplicates and near-duplicates
  • Ensure input-output pairs are consistent — no contradictory examples
  • Balance your dataset across categories if doing classification
  • Validate that every example matches the exact output format you expect
  • Include edge cases and negative examples (what the model should refuse or flag)
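The first checklist item is the easiest to automate. A minimal exact-duplicate filter over chat-format examples (near-duplicate detection needs embeddings or MinHash, which is out of scope here):

```python
import json

def dedupe(examples: list[dict]) -> list[dict]:
    """Drop exact duplicates, keyed on a canonical JSON serialization
    of each example. First-occurrence order is preserved."""
    seen, unique = set(), []
    for ex in examples:
        key = json.dumps(ex, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique

data = [
    {"messages": [{"role": "user", "content": "hi"}]},
    {"messages": [{"role": "user", "content": "hi"}]},   # exact duplicate
    {"messages": [{"role": "user", "content": "hello"}]},
]
print(len(dedupe(data)))  # 2
```

Run this before the train/test split, not after, so duplicates can't leak across the boundary and inflate your eval numbers.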

LoRA Fine-Tuning with Hugging Face PEFT and TRL

Alright, let's get to the actual training code. This section covers standard LoRA fine-tuning without quantization. Use this when you've got 24+ GB of VRAM and want the best possible adapter quality.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

# 1. Load model and tokenizer
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",  # Requires flash-attn package
)

# 2. Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                            # Rank — 16 is a strong default
    lora_alpha=32,                   # Scaling factor (adapter scale = alpha/r)
    lora_dropout=0.05,               # Regularization
    bias="none",                     # Don't train bias parameters
    target_modules=[                 # Apply to attention + MLP layers
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_rslora=True,                 # Rank-Stabilized LoRA for better scaling
)

# 3. Load and prepare dataset
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:5000]")
eval_dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="test_sft[:500]")

# 4. Configure training
training_args = SFTConfig(
    output_dir="./llama3-lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,    # Effective batch size = 16
    learning_rate=2e-4,               # Higher LR for LoRA adapters
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    logging_steps=10,
    save_strategy="steps",
    save_steps=100,
    eval_strategy="steps",
    eval_steps=100,
    bf16=True,
    gradient_checkpointing=True,      # Trade compute for memory
    max_length=2048,                  # Max sequence length (max_seq_length in older TRL)
    report_to="none",                 # Set to "wandb" for W&B logging
)

# 5. Create trainer and start training
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    eval_dataset=eval_dataset,
    peft_config=lora_config,
    processing_class=tokenizer,
)

trainer.train()

# 6. Save the adapter
trainer.save_model("./llama3-lora-adapter")

The adapter directory will be surprisingly small — typically 10–100 MB depending on rank and target modules, versus 15+ GB for the full model. That's the beauty of LoRA.

QLoRA Fine-Tuning: Maximum Memory Efficiency

QLoRA adds 4-bit quantization to the base model, which dramatically cuts VRAM requirements. This is the approach you'll want when working with consumer GPUs (12–24 GB VRAM) or when fine-tuning larger models than your hardware would normally support.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, TaskType, prepare_model_for_kbit_training
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

# 1. Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NF4 is optimal for normal-distributed weights
    bnb_4bit_compute_dtype=torch.bfloat16, # Compute in bf16 for stability
    bnb_4bit_use_double_quant=True,        # Saves ~0.4 bits/param for free
)

# 2. Load quantized model
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# 3. Prepare model for k-bit training
model = prepare_model_for_kbit_training(model)

# 4. Configure LoRA (applied on top of quantized base)
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules="all-linear",  # Apply LoRA to all linear layers
    use_rslora=True,
)

# 5. Load dataset
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:5000]")
eval_dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="test_sft[:500]")

# 6. Training configuration (adjusted for QLoRA)
training_args = SFTConfig(
    output_dir="./llama3-qlora-output",
    num_train_epochs=3,
    per_device_train_batch_size=2,         # Lower batch size for less VRAM
    gradient_accumulation_steps=8,          # Effective batch size = 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    logging_steps=10,
    save_strategy="steps",
    save_steps=100,
    eval_strategy="steps",
    eval_steps=100,
    bf16=True,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    max_length=1024,                        # Shorter sequences save memory
    report_to="none",
)

# 7. Train
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    eval_dataset=eval_dataset,
    peft_config=lora_config,
    processing_class=tokenizer,
)

trainer.train()
trainer.save_model("./llama3-qlora-adapter")

Evaluating Your Fine-Tuned Model

Don't skip this step. Seriously.

Evaluation should combine automated metrics with qualitative analysis. Relying on training loss alone isn't enough — a model can have beautifully low loss while still producing garbage outputs. I've been burned by this more than once.

Quantitative Evaluation

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
import torch

# Load the fine-tuned adapter
model = AutoPeftModelForCausalLM.from_pretrained(
    "./llama3-qlora-adapter",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

def evaluate_task(model, tokenizer, test_cases):
    """Run test cases and calculate pass rate."""
    results = []
    for test in test_cases:
        messages = [
            {"role": "system", "content": test["system"]},
            {"role": "user", "content": test["input"]},
        ]
        input_ids = tokenizer.apply_chat_template(
            messages, return_tensors="pt"
        ).to(model.device)

        with torch.no_grad():
            output = model.generate(
                input_ids,
                max_new_tokens=512,
                temperature=0.1,
                do_sample=True,
            )
        response = tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)

        # Check against expected output criteria
        passed = test["validator"](response)
        results.append({"input": test["input"], "output": response, "passed": passed})

    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results

# Example test cases
test_cases = [
    {
        "system": "You are a legal document analyzer.",
        "input": "Extract the parties from: Agreement between Acme Corp and Beta LLC...",
        "validator": lambda r: "Acme Corp" in r and "Beta LLC" in r,
    },
    # Add 50-100 test cases for reliable metrics
]

pass_rate, results = evaluate_task(model, tokenizer, test_cases)
print(f"Task pass rate: {pass_rate:.1%}")

Side-by-Side Comparison

Always compare your fine-tuned model against the base model on the same test set. A fine-tuned model should show clear improvement on your target task without significant regression on general capabilities. If pass rates drop below your baseline, it's time to tweak hyperparameters — try a lower learning rate, fewer epochs, or more training data.
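One way to codify that comparison (the thresholds here are illustrative placeholders; tune them to your own risk tolerance):

```python
def compare_models(base_pass: float, tuned_pass: float,
                   base_general: float, tuned_general: float,
                   min_gain: float = 0.05, max_regression: float = 0.03) -> dict:
    """Flag whether fine-tuning helped on-task without hurting general
    ability. Pass rates are fractions in [0, 1]."""
    return {
        "task_improved": tuned_pass - base_pass >= min_gain,
        "general_regressed": base_general - tuned_general > max_regression,
    }

verdict = compare_models(base_pass=0.62, tuned_pass=0.91,
                         base_general=0.80, tuned_general=0.79)
print(verdict)  # {'task_improved': True, 'general_regressed': False}
```

Wiring a check like this into CI turns "eyeball the outputs" into a gate you can enforce on every training run.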

Merging Adapters and Exporting to GGUF

For deployment, you've got two paths: serve the adapter separately (handy for multi-tenant setups where one base model hosts multiple adapters) or merge the adapter into the base model for a self-contained deployment.

Merging the Adapter

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Load adapter and merge into base model
model = AutoPeftModelForCausalLM.from_pretrained(
    "./llama3-qlora-adapter",
    torch_dtype="auto",
    device_map="cpu",  # Merge on CPU to avoid VRAM issues
)
merged_model = model.merge_and_unload()

# Save the merged model
merged_model.save_pretrained("./llama3-merged")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer.save_pretrained("./llama3-merged")

Converting to GGUF for Local Inference

GGUF is the standard format for local inference engines like llama.cpp and Ollama. You'll need the llama.cpp conversion tools for this:

# Clone llama.cpp and install dependencies
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
pip install -r requirements/requirements-convert_hf_to_gguf.txt

# Convert the merged model to GGUF (F16 precision)
python convert_hf_to_gguf.py ../llama3-merged --outfile ../llama3-merged.gguf --outtype f16

# Build the quantization tool, then quantize to Q4_K_M for smaller size and faster inference
cmake -B build && cmake --build build --target llama-quantize
./build/bin/llama-quantize ../llama3-merged.gguf ../llama3-Q4_K_M.gguf Q4_K_M

Common quantization levels and their trade-offs:

  • Q8_0 — highest quality, ~8 GB for a 7B model
  • Q5_K_M — good balance of quality and size, ~5.5 GB
  • Q4_K_M — the recommended default, ~4.5 GB with minimal quality loss
  • Q3_K_M — smaller but you'll notice the quality drop

Deploying with Ollama

Ollama gives you the simplest path from GGUF to a running local API. Create a Modelfile that points to your quantized model:

# Modelfile
FROM ./llama3-Q4_K_M.gguf

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096

SYSTEM """You are a legal document analyzer. Extract structured information from legal documents accurately and concisely."""

Then register and run it:

# Create the model in Ollama
ollama create legal-analyzer -f Modelfile

# Test it
ollama run legal-analyzer "Extract the key terms from this NDA: ..."

# Serve as an API (OpenAI-compatible)
# Ollama automatically exposes the API at http://localhost:11434
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "legal-analyzer", "messages": [{"role": "user", "content": "Extract parties from this contract..."}]}'

One thing worth mentioning: Ollama's API is compatible with the OpenAI client library, so it works as a drop-in replacement in apps that already use the OpenAI API format. That makes migration pretty painless.
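Because the endpoint speaks the OpenAI chat-completions format, calling it from Python needs nothing beyond the standard library. A sketch (assumes Ollama is running locally; the network call is wrapped in a function so nothing fires on import, and the helper names are mine):

```python
import json
import urllib.request

def build_payload(model: str, prompt: str) -> dict:
    """OpenAI-compatible chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(model: str, prompt: str,
         base_url: str = "http://localhost:11434") -> str:
    """POST an OpenAI-style chat completion to a local Ollama server."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# chat("legal-analyzer", "Extract parties from this contract...")
```

The same payload works against the official OpenAI Python client by pointing its base_url at port 11434, which is what makes migration painless.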

Production Best Practices

Hyperparameter Selection Guide

  • Rank (r) — start with 16. Bump it to 32 or 64 if the model underfits; drop to 8 if you're seeing overfitting
  • Alpha (lora_alpha) — set to 2x the rank as a starting point (e.g., r=16, alpha=32)
  • Learning rate — use 1e-4 to 2e-4 for LoRA (roughly 10x higher than full fine-tuning)
  • Epochs — 1–3 for most tasks. Watch validation loss and stop if it starts climbing
  • Target modules — start with attention layers only (q_proj, v_proj), then expand to all linear layers if performance isn't where you need it

Avoiding Common Pitfalls

  • Catastrophic forgetting — keep training epochs low (1–3) and learning rate conservative. Always evaluate general capabilities alongside task-specific metrics
  • Overfitting on small datasets — increase dropout (lora_dropout=0.1), reduce rank, or use data augmentation
  • Inconsistent chat templates — always use the tokenizer's apply_chat_template() method during both training and inference. This one bites people more often than you'd think
  • Mixed precision issues — use bfloat16 over float16 when your hardware supports it to avoid loss scaling headaches
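To make the chat-template pitfall concrete, here is roughly what Llama 3.1's template renders for a single user turn. This is a hand-rolled approximation for illustration only; real code should always call tokenizer.apply_chat_template() so training and inference use byte-identical formatting:

```python
def llama3_style_render(messages: list[dict]) -> str:
    """Approximation of the Llama 3.1 chat template, for illustration.
    Never hand-roll this in production; use apply_chat_template()."""
    out = "<|begin_of_text|>"
    for m in messages:
        out += (f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
                f"{m['content']}<|eot_id|>")
    out += "<|start_header_id|>assistant<|end_header_id|>\n\n"  # generation prompt
    return out

text = llama3_style_render([{"role": "user", "content": "Hello"}])
print(text)
```

If training data is formatted one way and inference prompts another, even by a single newline or special token, quality degrades in ways that are maddening to debug.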

Multi-Adapter Serving

For multi-tenant applications, you can keep one base model in memory and load task-specific adapters dynamically. Since each adapter is only 10–100 MB, you can serve dozens of specialized models from a single GPU. Libraries like LoRAX and vLLM support efficient multi-adapter serving with minimal latency overhead.

LoRA vs. QLoRA vs. Full Fine-Tuning: Decision Guide

Choosing the right approach comes down to your hardware, budget, and accuracy requirements:

  • Full fine-tuning — maximum accuracy, but requires multi-GPU clusters (100+ GB VRAM for 7B models). Go this route when you have the budget and need every last percentage point on complex domains like code generation or mathematical reasoning
  • LoRA — gets you 90–95% of full fine-tuning quality at roughly 80% lower cost. Best when you have 24+ GB VRAM and want the highest adapter quality. Ideal for production deployments with multi-adapter serving
  • QLoRA — delivers 85–93% of full fine-tuning quality at the lowest cost. Your go-to when working with consumer GPUs (12–24 GB), prototyping rapidly, or fine-tuning models larger than your VRAM would normally handle

Here's a workflow that works well in practice: start with QLoRA for rapid experimentation, nail down the best configuration, then train the final production adapter with standard LoRA for maximum quality. You get fast iteration early on without sacrificing quality in your deployed model.

Frequently Asked Questions

How much data do I need to fine-tune an LLM with LoRA?

For straightforward tasks like classification or data extraction, 100–500 high-quality examples are typically enough. Complex domain-specific applications may need 1,000–5,000 examples. But here's the thing — quality consistently beats quantity. Two hundred expert-validated examples will outperform 2,000 hastily collected ones almost every time. Start small, evaluate, and scale up from there.

How long does LoRA fine-tuning take?

On a single GPU like an RTX 4090 or A100, LoRA fine-tuning of a 7B–8B model on 1,000–5,000 examples typically takes 1–4 hours. QLoRA is slightly slower due to the quantization/dequantization overhead but finishes in similar timeframes. Larger models (70B with QLoRA) can take 8–24 hours depending on dataset size and sequence length.

Can I fine-tune a model and still use it for general tasks?

Yes, but with some caveats. LoRA's big advantage is that the base model weights stay frozen, which preserves most general capabilities. However, aggressive fine-tuning — too many epochs, high learning rate, very narrow dataset — can cause catastrophic forgetting. Keep it to 1–3 epochs, use a moderate learning rate (1e-4 to 2e-4), and always evaluate general capabilities alongside your task-specific metrics.

What is the difference between LoRA rank and alpha?

The rank (r) controls the dimensionality of the adapter matrices — higher rank captures more complex adaptations but increases trainable parameters and memory usage. Alpha (lora_alpha) is a scaling factor applied to the adapter output. The effective scaling works out to alpha / r, so an r=16, alpha=32 config applies 2x scaling to the adapter contributions. With Rank-Stabilized LoRA (use_rslora=True), the scaling becomes alpha / sqrt(r), which helps stability when experimenting with different rank values.
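The difference between the two scaling rules is easy to see numerically (a standalone sketch; lora_scale is an illustrative helper, not a PEFT API):

```python
import math

def lora_scale(alpha: int, r: int, rslora: bool = False) -> float:
    """Adapter output scaling: alpha / r for vanilla LoRA,
    alpha / sqrt(r) with Rank-Stabilized LoRA."""
    return alpha / math.sqrt(r) if rslora else alpha / r

for r in (8, 16, 64):
    alpha = 2 * r  # the "alpha = 2x rank" rule of thumb from this guide
    print(r, lora_scale(alpha, r), round(lora_scale(alpha, r, rslora=True), 2))
```

With alpha pinned at 2x rank, vanilla LoRA holds the scale constant at 2.0 across ranks, while rsLoRA lets the effective contribution grow with sqrt(r), which is why it behaves better when you sweep rank during experiments.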

Should I use LoRA or QLoRA for production?

For the final production adapter, standard LoRA generally produces slightly higher quality results because the base model runs at full precision during training. That said, QLoRA is excellent for the experimentation phase — rapidly testing different datasets, hyperparameters, and target modules on affordable hardware. A lot of teams use QLoRA for prototyping and then switch to LoRA for their final production training run. It's a solid workflow.

Our team of expert writers and editors.