
Large Language Model (LLM)

Also known as: LLM, Large Language Models

A deep learning model with billions of parameters, trained on massive amounts of text to understand and generate natural language.

Updated: 2026-01-03

Definition

A Large Language Model (LLM) is a deep learning model with billions of parameters, trained on massive text corpora to predict the next token in a sequence. This predictive capacity emerges as the ability to understand, generate, and manipulate natural language.

Modern LLMs are based on the Transformer architecture and undergo two training phases: pre-training on web-scale data (hundreds of billions to trillions of tokens) and subsequent alignment via RLHF (reinforcement learning from human feedback) or similar techniques.

Key Characteristics

Scale: frontier models have 100B-1T+ parameters. GPT-4 is estimated around 1.7T parameters (not officially confirmed). Smaller models (7B-70B) offer a favorable balance of capability, latency, and cost for many workloads.

Emergent abilities: capabilities that only appear beyond certain scale thresholds, such as multi-step reasoning, in-context learning, and complex instruction-following. The phenomenon is documented but not fully understood.

Context window: the amount of text the model can process in a single call. Ranges from 4K tokens (legacy models) to 128K-1M+ tokens (Claude, Gemini). Directly influences possible use cases.
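In practice this means prompts have to be measured in tokens, not characters, before each call. A minimal sketch using the tiktoken tokenizer library (the 128K limit and the output reservation below are illustrative assumptions, not any specific model's real values):

```python
import tiktoken  # OpenAI's open-source BPE tokenizer library

CONTEXT_WINDOW = 128_000      # assumed limit; check the actual model's documentation
RESERVED_FOR_OUTPUT = 4_000   # leave room for the model's reply

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by several recent OpenAI models

def fits_in_context(prompt: str) -> bool:
    """Return True if the prompt leaves enough room for the response."""
    n_tokens = len(enc.encode(prompt))
    return n_tokens + RESERVED_FOR_OUTPUT <= CONTEXT_WINDOW

print(fits_in_context("Summarize the following document: ..."))  # True for short prompts
```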

How It Works

The base architecture involves:

  1. Tokenization: text is converted to tokens (sub-word units) via algorithms like BPE or SentencePiece
  2. Embedding: each token becomes a dense vector
  3. Transformer layers: attention mechanisms process the sequence, capturing long-range dependencies
  4. Output: probability distribution over the vocabulary for the next token

Training occurs on next-token prediction: given a prefix, predict the next token. This seemingly simple objective, at sufficient scale, produces surprisingly general capabilities.
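A minimal sketch of these four steps, using the Hugging Face transformers library and GPT-2 (chosen only because it is small and openly available; any causal language model would do):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")           # 1. tokenization (BPE)
model = AutoModelForCausalLM.from_pretrained("gpt2")  # 2-3. embeddings + Transformer layers

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                   # 4. one score per vocabulary entry, per position

next_token_probs = torch.softmax(logits[0, -1], dim=-1)  # distribution for the next token
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tok.decode([int(token_id)]):>10}  p={prob.item():.3f}")

# Training reuses the same forward pass: labels are the input ids shifted by one,
# and the loss is cross-entropy between predicted and actual next tokens.
loss = model(**inputs, labels=inputs["input_ids"]).loss
print("next-token prediction loss:", loss.item())
```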

Main Models (2025)

Closed-source: GPT-4/4o (OpenAI), Claude 3.5 (Anthropic), Gemini 1.5 (Google). Accessible only via API, with per-token costs.

Open-weights: Llama 3 (Meta), Mistral, Qwen, DeepSeek. Public weights, local deployment possible. Variable licenses (some with commercial restrictions).

Reference benchmarks: MMLU for general knowledge, HumanEval for coding, GPQA for scientific reasoning.

Practical Considerations

Costs: per-token prices vary by 10-100x between models. GPT-4o costs ~$5/million input tokens, GPT-4o-mini ~$0.15. Model choice significantly impacts an application's total cost of ownership (TCO).
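As a back-of-the-envelope illustration of how those prices compound (the rates below simply reuse the figures above and should be checked against current provider pricing):

```python
# USD per million input tokens, per the figures quoted above (verify before relying on them)
PRICE_PER_M_INPUT = {"gpt-4o": 5.00, "gpt-4o-mini": 0.15}

def monthly_input_cost(model: str, requests_per_day: int, tokens_per_request: int) -> float:
    """Estimate monthly input-token spend for a steady workload."""
    tokens_per_month = requests_per_day * 30 * tokens_per_request
    return tokens_per_month / 1_000_000 * PRICE_PER_M_INPUT[model]

# 10,000 requests/day with 2,000-token prompts:
print(monthly_input_cost("gpt-4o", 10_000, 2_000))       # ~$3,000/month
print(monthly_input_cost("gpt-4o-mini", 10_000, 2_000))  # ~$90/month
```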

Latency: time-to-first-token (TTFT) and tokens/second vary by provider and model. For real-time applications, latency can be more constraining than cost.
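TTFT can be measured empirically by timing a streaming response. A rough sketch with the OpenAI Python SDK (the model name is an illustrative choice, and counting chunks only approximates tokens):

```python
import time
from openai import OpenAI  # assumes the OpenAI Python SDK and an API key in the environment

client = OpenAI()
start = time.perf_counter()
first_token_at = None
n_chunks = 0

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative choice
    messages=[{"role": "user", "content": "Explain TTFT in one sentence."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # time-to-first-token
        n_chunks += 1  # roughly one token per chunk

elapsed = time.perf_counter() - start
print(f"TTFT: {first_token_at - start:.2f}s, throughput: ~{n_chunks / elapsed:.1f} tok/s")
```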

Rate limits: APIs have limits on requests/minute and tokens/minute. At scale, these become an architectural constraint.
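The usual mitigation is to retry with exponential backoff when the API signals a rate limit. A provider-agnostic sketch (the call_llm callable and the broad exception handling are placeholders; real code should catch the provider's specific rate-limit error):

```python
import random
import time

def call_with_backoff(call_llm, max_retries: int = 5):
    """Retry a rate-limited API call with exponential backoff plus jitter.

    call_llm is a placeholder for whatever function performs the request;
    it is expected to raise on HTTP 429 responses.
    """
    for attempt in range(max_retries):
        try:
            return call_llm()
        except Exception:  # in real code, catch the provider's RateLimitError
            time.sleep(2 ** attempt + random.random())  # 1s, 2s, 4s, ... plus jitter
    raise RuntimeError("rate limit: retries exhausted")
```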

Common Misconceptions

“LLMs understand what they say”

No. They produce statistically plausible output based on learned patterns. They have no model of the world, beliefs, or understanding in the cognitive sense, which is one reason they hallucinate: output can be fluent and plausible yet factually wrong.

“The largest model is always best”

It depends on the task. For many use cases, a 7B-70B model fine-tuned for the task outperforms models 10x larger on the relevant metrics, at a fraction of the cost.

“LLMs remember previous conversations”

No. Each API call is stateless. “Memory” is simulated by resending the conversation history in the prompt, which consumes context window space.
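A sketch of how that simulated memory usually works: the application keeps the transcript itself and resends it in full on every call (the chat argument below is a placeholder for any chat-completion API that accepts a list of messages):

```python
# The application, not the model, stores the conversation; each call resends all of it.
history = [{"role": "system", "content": "You are a helpful assistant."}]

def ask(user_message: str, chat) -> str:
    """chat is a placeholder for any chat-completion call taking a message list."""
    history.append({"role": "user", "content": user_message})
    reply = chat(history)  # the entire history is fed into the context window each time
    history.append({"role": "assistant", "content": reply})
    return reply
```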
