QLoRA
A memory-efficient fine-tuning method combining quantisation with LoRA adapters.
Full Definition
QLoRA (Quantised LoRA), introduced by Tim Dettmers et al. in 2023, fine-tunes large models by quantising the frozen base model's weights to 4-bit precision while training small LoRA adapters in 16-bit floating point. This combination cuts VRAM requirements so dramatically that a 65B-parameter model can be fine-tuned on a single 48GB GPU, a task that previously required a multi-GPU cluster. QLoRA introduces several innovations: NF4 (NormalFloat 4-bit) quantisation, double quantisation (quantising the quantisation constants themselves) for additional memory savings, and paged optimisers to absorb memory spikes. It democratised fine-tuning of very large open-weight models for individual researchers and startups.
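To make the NF4 idea concrete, here is a simplified, self-contained sketch of NormalFloat-style quantisation: a 16-level codebook derived from quantiles of a standard normal distribution, combined with block-wise absmax scaling. This is an illustration of the principle only; the exact NF4 codebook in the QLoRA paper is constructed differently (and includes an exact zero), and real implementations pack two 4-bit codes per byte.

```python
from statistics import NormalDist
import random

def nf4_style_levels():
    # Simplified NormalFloat-style codebook: 16 quantiles of N(0, 1),
    # rescaled into [-1, 1]. (Illustrative only; the actual NF4 codebook
    # from the QLoRA paper is built differently and includes an exact 0.)
    nd = NormalDist()
    qs = [nd.inv_cdf((i + 0.5) / 16) for i in range(16)]
    m = max(abs(q) for q in qs)
    return [q / m for q in qs]

def quantise_block(weights):
    # Block-wise absmax quantisation: scale the block so its largest
    # magnitude maps to 1, then snap each value to the nearest code level.
    levels = nf4_style_levels()
    absmax = max(abs(w) for w in weights) or 1.0
    codes = [min(range(16), key=lambda i: abs(levels[i] - w / absmax))
             for w in weights]
    return codes, absmax

def dequantise_block(codes, absmax):
    # Look up each 4-bit code and rescale by the block's absmax constant.
    levels = nf4_style_levels()
    return [levels[c] * absmax for c in codes]

random.seed(0)
block = [random.gauss(0, 0.02) for _ in range(64)]  # one 64-weight block
codes, scale = quantise_block(block)
recon = dequantise_block(codes, scale)
err = max(abs(a - b) for a, b in zip(block, recon))
print(f"max abs reconstruction error: {err:.5f}")
```

Double quantisation applies the same trick one level up: the per-block `absmax` constants are themselves quantised, shaving off a further fraction of a bit per parameter.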
Examples
Fine-tuning Llama 2 70B on a single 48GB A6000 workstation GPU using QLoRA with NF4 quantisation, achieving quality close to a full 16-bit fine-tune.
A solo developer fine-tuning Mistral 7B with QLoRA on a single laptop GPU, completing a run over a dataset of 1,000 examples in about two hours.
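In practice, setups like the examples above are typically built with the Hugging Face stack (transformers, peft, bitsandbytes). The sketch below shows the standard wiring of a QLoRA run under stated assumptions: the model id, rank, and target modules are illustrative choices, not values prescribed by the paper, and loading a real model requires a GPU and (for gated models) access approval.

```python
# Sketch of a QLoRA setup with Hugging Face transformers + peft + bitsandbytes.
# The model id and hyperparameters below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantise frozen base weights to 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat 4-bit quantisation
    bnb_4bit_use_double_quant=True,         # quantise the quantisation constants too
    bnb_4bit_compute_dtype=torch.bfloat16,  # 16-bit compute in forward/backward
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # any causal LM checkpoint works here
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # attention projections, a common choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # trainable 16-bit adapters, 4-bit base
model.print_trainable_parameters()
```

Pairing this with a paged optimiser such as `bitsandbytes.optim.PagedAdamW8bit` reproduces the third ingredient the definition mentions, letting occasional activation-memory spikes spill to CPU RAM instead of crashing the run.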
Related Terms
LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning method that updates only small low-rank matrices…
Parameter-Efficient Fine-Tuning: A family of methods that fine-tune large models by updating only a small fraction…
Fine-Tuning: Continuing training of a pretrained model on a smaller, task-specific dataset to…