Definition
Llama is a family of open-weights language models (models whose trained weights are publicly available) developed by Meta, first released in February 2023. Unlike GPT-4 or Claude, which are accessible only through proprietary APIs, Llama publishes its model weights, allowing researchers and developers to fine-tune, deploy locally, and conduct research without API restrictions.
Llama’s release has significantly democratized access to frontier models, catalyzing an ecosystem of derivatives and open-source applications.
Timeline and Versions
Llama 1 (February 2023): a series of 7B-65B parameter models with performance comparable to GPT-3. Weights were initially restricted to approved researchers but quickly leaked online; Meta moved to broader open-weights releases with subsequent versions.
Llama 2 (July 2023): 7B, 13B, and 70B versions, with improvements in coding and conversation, released under the Llama Community License. Fine-tuned chat variants (Llama 2-Chat) were competitive with GPT-3.5.
Llama 3 (April 2024): 8B and 70B versions with near-frontier performance (competitive with GPT-4 on several benchmarks), an 8K-token context window, and training on ~15 trillion tokens (roughly 7x the ~2T used for Llama 2).
Llama 3.1 and 3.2 (2024): Llama 3.1 added the 405B model, a 128K-token context window, and stronger multilingual support; Llama 3.2 added vision-capable models and small on-device variants (1B, 3B).
Technical Characteristics
Architecture: standard decoder-only Transformer, similar to GPT. Uses:
- Grouped-query attention for inference efficiency
- RoPE (Rotary Position Embeddings) for position encoding
- SwiGLU activation function
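The grouped-query attention idea above can be sketched in a few lines: several query heads share one key/value head, shrinking the KV cache. This is an illustrative toy with made-up sizes, not Meta's implementation.

```python
import numpy as np

# Toy grouped-query attention (GQA): 8 query heads share 2 KV heads,
# so the KV cache is 4x smaller than standard multi-head attention.
# All dimensions here are illustrative, not Llama's actual config.
n_q_heads, n_kv_heads, d_head, seq = 8, 2, 16, 10
group = n_q_heads // n_kv_heads  # query heads per KV head

q = np.random.randn(n_q_heads, seq, d_head)
k = np.random.randn(n_kv_heads, seq, d_head)  # only 2 KV heads stored
v = np.random.randn(n_kv_heads, seq, d_head)

out = np.empty_like(q)
for i in range(n_q_heads):
    kv = i // group                            # map query head -> shared KV head
    scores = q[i] @ k[kv].T / np.sqrt(d_head)  # scaled dot-product attention
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    out[i] = weights @ v[kv]

print(out.shape)  # (8, 10, 16): output shape is unchanged vs. full MHA
print(k.nbytes / (n_q_heads * seq * d_head * k.itemsize))  # 0.25: KV cache fraction
```

The key design point: output quality stays close to full multi-head attention, but the memory that dominates long-context inference (the KV cache) scales with `n_kv_heads` instead of `n_q_heads`.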
Training data: mix of web data, code, and synthetic conversations. ~15T tokens for Llama 3, roughly 7x the ~2T used for Llama 2.
Context window: 8K tokens for Llama 3, sufficient for most tasks but well short of GPT-4 Turbo's 128K; Llama 3.1 closed the gap with its own 128K window.
Inference efficiency: Llama models are trained far beyond the compute-optimal token count, so small variants (7B-8B) deliver quality that previously required much larger models, making them cheap to deploy.
Deployment and Customization
Open-weights = complete flexibility:
- Local deployment (on-premise, private cloud) with no data sent to a third party
- Custom fine-tuning on proprietary data that never leaves your infrastructure
- Quantization to reduce memory (int8, int4)
- Distillation for smaller models
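The quantization point above can be made concrete with a minimal sketch of symmetric int8 weight quantization, the basic trick behind the int8/int4 formats mentioned; real tools (e.g., llama.cpp, bitsandbytes) use finer-grained per-block schemes.

```python
import numpy as np

# Minimal symmetric int8 quantization of one weight row (illustrative size).
w = np.random.randn(4096).astype(np.float32)

scale = np.abs(w).max() / 127.0               # map [-max, max] -> [-127, 127]
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_back = w_int8.astype(np.float32) * scale    # dequantize at inference time

print(w_int8.nbytes / w.nbytes)               # 0.25: 1 byte per weight vs. 4
print(float(np.abs(w - w_back).max()))        # reconstruction error <= scale / 2
```

Memory drops 4x (or 8x at int4) at the cost of a small, bounded rounding error per weight, which is why a quantized 70B model can fit on hardware that its float16 form cannot.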
Ecosystem:
- Ollama: simple local model serving via CLI and API
- vLLM: high-throughput inference server
- LM Studio: GUI for downloading and running local models
- Hugging Face: quantized versions (GGUF, AWQ, GPTQ) for diverse hardware
Cost: pretraining Llama 2 70B took ~1.7M GPU-hours per Meta's paper, on the order of millions of dollars of compute. Custom fine-tuning with parameter-efficient methods costs tens to hundreds of dollars of GPU time. Local inference has near-zero marginal cost (hardware amortization and electricity).
Use Cases
Fine-tuning for specialized domains: medicine, law, finance. A Llama base model fine-tuned on specialized data can match or exceed GPT-4 on narrow in-domain tasks.
Edge deployment: quantized 7B-13B models run on a consumer GPU (e.g., RTX 3090) or, at 4-bit precision, on laptops and high-end phones. Useful for mobile, offline, and privacy-critical applications.
Cost optimization: at high volumes, self-hosted Llama can cost an order of magnitude or more less than the GPT-4 API. The break-even point depends on throughput and operations overhead, often cited in the tens of millions of tokens per month.
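The break-even reasoning can be sketched as simple arithmetic. All prices and throughput figures below are hypothetical placeholders, not quotes from any provider; plug in your own numbers.

```python
# Back-of-envelope break-even for self-hosting vs. an API.
# Every number here is an assumption for illustration only.
api_price_per_mtok = 10.0        # $/1M tokens via a hosted API (assumed)
gpu_hour = 1.50                  # $/hr for a rented GPU (assumed)
tokens_per_gpu_hour = 1_000_000  # serving throughput of a quantized model (assumed)
fixed_monthly_ops = 500.0        # monitoring, engineer time, etc. (assumed)

# Self-hosting cost per 1M tokens, from GPU rental alone.
self_host_per_mtok = gpu_hour / (tokens_per_gpu_hour / 1_000_000)

# Volume at which per-token savings pay off the fixed overhead.
break_even_mtok = fixed_monthly_ops / (api_price_per_mtok - self_host_per_mtok)
print(f"self-hosting wins above ~{break_even_mtok:.0f}M tokens/month")
```

The structure of the calculation matters more than the constants: per-token savings must amortize the fixed operational overhead, which is why self-hosting only pays off at sustained high volume.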
Research and experimentation: public weights enable research on interpretability, alignment, safety without proprietary restrictions.
Practical Considerations
Licensing and Commercial Use: the Llama Community License has restrictions; notably, entities with more than 700 million monthly active users need a separate license from Meta. Verify the terms carefully before commercial use.
Quality vs. Frontier: Llama 3 70B approaches GPT-4 on benchmarks such as MMLU (~82% vs. ~86%) and HumanEval, but a gap remains on complex reasoning. The trade-off is cost and latency vs. quality.
Fine-tuning quality: with LoRA or QLoRA, fine-tuning Llama 3 8B fits on a single consumer GPU for tens of dollars of compute; the 70B model typically requires a rented multi-GPU node. Gains of 5-15 accuracy points on in-domain tasks are common, often enough for production.
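Why LoRA is so cheap can be shown with a toy sketch: instead of updating a frozen weight matrix W, it trains a low-rank pair B and A, and the adapted layer uses W + (alpha/r)·BA. Dimensions below are illustrative, far smaller than a real Llama layer.

```python
import numpy as np

# Toy LoRA update on one linear layer (illustrative sizes, not Llama's).
d, r, alpha = 1024, 16, 32
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d)).astype(np.float32)        # frozen base weight
A = (rng.standard_normal((r, d)) * 0.01).astype(np.float32)  # trained, rank r
B = np.zeros((d, r), dtype=np.float32)                    # zero init: adapter starts as a no-op

delta = (alpha / r) * (B @ A)          # low-rank update added to W
trainable = A.size + B.size            # only A and B receive gradients

print(trainable / W.size)              # 0.03125: ~3% of the layer's parameters
print(np.allclose(W + delta, W))       # True at init, since B == 0
```

Because only A and B (a few percent of the parameters here, and far less at Llama scale) need gradients and optimizer state, memory requirements collapse, which is what puts fine-tuning within reach of a single GPU.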
Community support: the Llama community is large and active, so fine-tuning resources, guides, and troubleshooting help are abundant.
Common Misconceptions
"Llama is completely open-source"
Partially true. The weights are public, but the Llama Community License carries commercial restrictions; it is not an OSI-approved open-source license like GPL or MIT. Enterprise use may require review or negotiation with Meta.
"Llama is always the right choice for cost savings"
No. Self-hosting Llama 7B still incurs compute and operations costs while delivering lower quality than GPT-3.5-class APIs. For low volumes, tight latency, or high reliability requirements, hosted closed models can be the better choice.
"Llama is ready-to-deploy on every task"
Base Llama has decent general performance, but is generic. Fine-tuning on task-specific data is almost always required for production-grade performance.
Related Terms
- LLM: category of which Llama is a member
- Foundation Model: paradigm of which Llama is an instance
- Fine-tuning: common practice with Llama
Sources
- Touvron, H. et al. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971
- Touvron, H. et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288
- Meta AI (2024). The Llama 3 Herd of Models. arXiv:2407.21783
- Llama on Hugging Face Hub