Definition
Llama is a family of open-weights language models (models whose trained weights are publicly available) developed by Meta, first released in February 2023. Unlike GPT-4 or Claude, which are accessible only through proprietary APIs, Llama publishes its model weights, allowing researchers and developers to fine-tune, deploy locally, and conduct research without API restrictions.
Llama’s release has significantly democratized access to frontier models, catalyzing an ecosystem of derivatives and open-source applications.
Timeline and Versions
Llama 1 (February 2023): a series of 7B-65B parameter models with performance comparable to GPT-3. Weights were initially restricted to approved researchers but quickly leaked online; Meta moved to broader open-weights releases with subsequent versions.
Llama 2 (July 2023): 7B, 13B, and 70B versions, with improvements in coding and conversation, released under the Llama Community License. Fine-tuned chat variants (Llama 2-Chat) were competitive with GPT-3.5.
Llama 3 (April 2024): 8B and 70B versions with near-frontier performance (competitive with GPT-4 on several benchmarks), an 8K-token context window, and training on ~15 trillion tokens (roughly 7x the ~2T used for Llama 2).
Llama 3.1 and 3.2 (2024): Llama 3.1 added the 405B model, a 128K-token context window, and stronger multilingual support; Llama 3.2 added vision-capable models and small on-device variants (1B, 3B).
Technical Characteristics
Architecture: standard decoder-only Transformer, similar to GPT. Uses:
- Grouped-query attention for inference efficiency
- RoPE (Rotary Position Embeddings) for position encoding
- SwiGLU activation function
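The grouped-query attention idea above can be sketched in a few lines: several query heads share one key/value head, shrinking the KV cache. This is an illustrative toy with made-up sizes, not Meta's implementation.

```python
import numpy as np

# Toy grouped-query attention (GQA): 8 query heads share 2 KV heads,
# so the KV cache is 4x smaller than standard multi-head attention.
# All dimensions here are illustrative, not Llama's actual config.
n_q_heads, n_kv_heads, d_head, seq = 8, 2, 16, 10
group = n_q_heads // n_kv_heads  # query heads per KV head

q = np.random.randn(n_q_heads, seq, d_head)
k = np.random.randn(n_kv_heads, seq, d_head)  # only 2 KV heads stored
v = np.random.randn(n_kv_heads, seq, d_head)

out = np.empty_like(q)
for i in range(n_q_heads):
    kv = i // group                            # map query head -> shared KV head
    scores = q[i] @ k[kv].T / np.sqrt(d_head)  # scaled dot-product attention
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    out[i] = weights @ v[kv]

print(out.shape)  # (8, 10, 16): output shape is unchanged vs. full MHA
print(k.nbytes / (n_q_heads * seq * d_head * k.itemsize))  # 0.25: KV cache fraction
```

The key design point: output quality stays close to full multi-head attention, but the memory that dominates long-context inference (the KV cache) scales with `n_kv_heads` instead of `n_q_heads`.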
Training data: mix of web data, code, and synthetic conversations. ~15T tokens for Llama 3, roughly 7x the ~2T used for Llama 2.
Context window: 8K tokens for Llama 3, sufficient for most tasks but well short of GPT-4 Turbo's 128K; Llama 3.1 closed the gap with its own 128K window.
Inference efficiency: Llama models are trained far beyond the compute-optimal token count, so small variants (7B-8B) deliver quality that previously required much larger models, making them cheap to deploy.
Deployment and Customization
Open-weights = complete flexibility:
- Local deployment (on-premise, private cloud) with no data sent to a third party
- Custom fine-tuning on proprietary data that never leaves your infrastructure
- Quantization to reduce memory (int8, int4)
- Distillation for smaller models
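The quantization point above can be made concrete with a minimal sketch of symmetric int8 weight quantization, the basic trick behind the int8/int4 formats mentioned; real tools (e.g., llama.cpp, bitsandbytes) use finer-grained per-block schemes.

```python
import numpy as np

# Minimal symmetric int8 quantization of one weight row (illustrative size).
w = np.random.randn(4096).astype(np.float32)

scale = np.abs(w).max() / 127.0               # map [-max, max] -> [-127, 127]
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_back = w_int8.astype(np.float32) * scale    # dequantize at inference time

print(w_int8.nbytes / w.nbytes)               # 0.25: 1 byte per weight vs. 4
print(float(np.abs(w - w_back).max()))        # reconstruction error <= scale / 2
```

Memory drops 4x (or 8x at int4) at the cost of a small, bounded rounding error per weight, which is why a quantized 70B model can fit on hardware that its float16 form cannot.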
Ecosystem:
- Ollama: simple local model serving via CLI and API
- vLLM: high-throughput inference server
- LM Studio: GUI for downloading and running local models
- Hugging Face: quantized versions (GGUF, AWQ, GPTQ) for diverse hardware
Cost: pretraining Llama 2 70B took ~1.7M GPU-hours per Meta's paper, on the order of millions of dollars of compute. Custom fine-tuning with parameter-efficient methods costs tens to hundreds of dollars of GPU time. Local inference has near-zero marginal cost (hardware amortization and electricity).
Use Cases
Fine-tuning for specialized domains: medicine, law, finance. A Llama base model fine-tuned on specialized data can match or exceed GPT-4 on narrow in-domain tasks.
Edge deployment: quantized 7B-13B models run on a consumer GPU (e.g., RTX 3090) or, at 4-bit precision, on laptops and high-end phones. Useful for mobile, offline, and privacy-critical applications.
Cost optimization: at high volumes, self-hosted Llama can cost an order of magnitude or more less than the GPT-4 API. The break-even point depends on throughput and operations overhead, often cited in the tens of millions of tokens per month.
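The break-even reasoning can be sketched as simple arithmetic. All prices and throughput figures below are hypothetical placeholders, not quotes from any provider; plug in your own numbers.

```python
# Back-of-envelope break-even for self-hosting vs. an API.
# Every number here is an assumption for illustration only.
api_price_per_mtok = 10.0        # $/1M tokens via a hosted API (assumed)
gpu_hour = 1.50                  # $/hr for a rented GPU (assumed)
tokens_per_gpu_hour = 1_000_000  # serving throughput of a quantized model (assumed)
fixed_monthly_ops = 500.0        # monitoring, engineer time, etc. (assumed)

# Self-hosting cost per 1M tokens, from GPU rental alone.
self_host_per_mtok = gpu_hour / (tokens_per_gpu_hour / 1_000_000)

# Volume at which per-token savings pay off the fixed overhead.
break_even_mtok = fixed_monthly_ops / (api_price_per_mtok - self_host_per_mtok)
print(f"self-hosting wins above ~{break_even_mtok:.0f}M tokens/month")
```

The structure of the calculation matters more than the constants: per-token savings must amortize the fixed operational overhead, which is why self-hosting only pays off at sustained high volume.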
Research and experimentation: public weights enable research on interpretability, alignment, safety without proprietary restrictions.
Practical Considerations
Licensing and Commercial Use: the Llama Community License has restrictions; notably, entities with more than 700 million monthly active users need a separate license from Meta. Verify the terms carefully before commercial use.
Quality vs. Frontier: Llama 3 70B approaches GPT-4 on benchmarks such as MMLU (~82% vs. ~86%) and HumanEval, but a gap remains on complex reasoning. The trade-off is cost and latency vs. quality.
Fine-tuning quality: with LoRA or QLoRA, fine-tuning Llama 3 8B fits on a single consumer GPU for tens of dollars of compute; the 70B model typically requires a rented multi-GPU node. Gains of 5-15 accuracy points on in-domain tasks are common, often enough for production.
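Why LoRA is so cheap can be shown with a toy sketch: instead of updating a frozen weight matrix W, it trains a low-rank pair B and A, and the adapted layer uses W + (alpha/r)·BA. Dimensions below are illustrative, far smaller than a real Llama layer.

```python
import numpy as np

# Toy LoRA update on one linear layer (illustrative sizes, not Llama's).
d, r, alpha = 1024, 16, 32
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d)).astype(np.float32)        # frozen base weight
A = (rng.standard_normal((r, d)) * 0.01).astype(np.float32)  # trained, rank r
B = np.zeros((d, r), dtype=np.float32)                    # zero init: adapter starts as a no-op

delta = (alpha / r) * (B @ A)          # low-rank update added to W
trainable = A.size + B.size            # only A and B receive gradients

print(trainable / W.size)              # 0.03125: ~3% of the layer's parameters
print(np.allclose(W + delta, W))       # True at init, since B == 0
```

Because only A and B (a few percent of the parameters here, and far less at Llama scale) need gradients and optimizer state, memory requirements collapse, which is what puts fine-tuning within reach of a single GPU.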
Community support: the Llama community is large and active, so fine-tuning resources, guides, and troubleshooting help are abundant.
Common Misconceptions
"Llama is completely open-source"
Partially true. The weights are public, but the Llama Community License carries commercial restrictions; it is not an OSI-approved open-source license like GPL or MIT. Enterprise use may require review or negotiation with Meta.
"Llama is always the right choice for cost savings"
No. Self-hosting Llama 7B still incurs compute and operations costs while delivering lower quality than GPT-3.5-class APIs. For low volumes, tight latency, or high reliability requirements, hosted closed models can be the better choice.
"Llama is ready-to-deploy on every task"
Base Llama has decent general performance, but is generic. Fine-tuning on task-specific data is almost always required for production-grade performance.
Related Terms
- LLM: category of which Llama is a member
- Foundation Model: paradigm of which Llama is an instance
- Fine-tuning: common practice with Llama
Sources
- Touvron, H. et al. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971
- Touvron, H. et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288
- Meta AI (2024). The Llama 3 Herd of Models. arXiv:2407.21783
- Llama on Hugging Face Hub