Pretraining
The initial phase of training a model on massive text data to learn general language representations.
Full Definition
Pretraining is the first and most compute-intensive phase of building a large language model. The model is trained on trillions of tokens from diverse internet sources, books, and code using a self-supervised objective — typically next-token prediction (causal language modelling) for GPT-style models or masked language modelling for BERT-style models. No human-labelled data is required; the training signal comes from predicting the natural continuations of text. Pretraining encodes broad world knowledge, language syntax, and reasoning patterns into the model's weights. It requires thousands of GPUs running for weeks and constitutes the majority of the total training cost for frontier models.
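The next-token prediction objective can be sketched in a few lines. This is a simplified illustration, not a real training loop: the toy vocabulary, the hand-built uniform "model", and the tokenisation are all invented for the example. It shows the core idea of causal language modelling, namely that the inputs are the sequence shifted by one position relative to the targets, and the loss is the average negative log-probability the model assigns to each true next token.

```python
import math

# Toy sketch of the causal language-modelling (next-token prediction)
# objective. The vocabulary and "model" below are hypothetical.

vocab = ["the", "cat", "sat", "on", "mat"]
tok = {w: i for i, w in enumerate(vocab)}

# Self-supervised signal: the model reads tokens[:-1] and must
# predict tokens[1:] -- no human labels are needed.
tokens = [tok[w] for w in ["the", "cat", "sat", "on", "the", "mat"]]
inputs, targets = tokens[:-1], tokens[1:]

def model_probs(prev_token):
    """Stand-in for a neural network: returns a probability
    distribution over the vocabulary given the previous token.
    Uniform here, i.e. an untrained model."""
    return [1.0 / len(vocab)] * len(vocab)

# Cross-entropy loss: average negative log-probability of the true
# next token. Pretraining minimises this over trillions of tokens.
loss = -sum(math.log(model_probs(x)[y])
            for x, y in zip(inputs, targets)) / len(targets)
print(f"loss = {loss:.4f}")  # uniform model gives log(vocab_size) ~ 1.6094
```

For an untrained (uniform) model the loss equals log of the vocabulary size; training drives it down as the model learns which continuations are likely.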
Examples
GPT-4 reportedly being pretrained on approximately 13 trillion tokens drawn from internet text, books, and code repositories.
Llama 3 pretraining on over 15 trillion tokens, including a higher proportion of code and multilingual data than its predecessors.