The paradigm shift: from links to citations
GPTBot went from 5% to 30% of AI crawler traffic in one year. Traffic generated by user queries to AI assistants has grown 15-fold. Traditional SEO infrastructure does not capture this flow.
TL;DR: Optimization for generative search engines (GEO) requires specific technical interventions: configure robots.txt for 20+ AI crawlers, implement llms.txt to guide LLMs toward priority content, and extend structured data with JSON-LD, including a Person schema with complete E-E-A-T signals (73% higher selection rate). Structure content in answer blocks of 134-167 words to facilitate extraction. Multimodal content has a 156% higher selection rate. Princeton research shows that adding citations from authoritative sources increases visibility by up to 40%. Those who implement now build competitive advantages that latecomers will struggle to close.
Traditional SEO optimizes for a specific goal: ranking in the sorted lists returned by search engines. The user searches, receives ten blue links, clicks. Traffic arrives.
Generative search engines work differently. ChatGPT, Perplexity, Gemini, and Claude don’t return lists of links. They synthesize answers by drawing on multiple sources, citing (or not) those sources. The user gets an answer, not a list of options.
According to Cloudflare data from December 2025, GPTBot reached 30% of AI crawler traffic, up from 5% the previous year. Meta-ExternalAgent entered at 19%. ChatGPT-User, the bot that accesses web pages when users ask questions, registered growth of 2,825%. Traffic related to user queries increased 15 times over the course of the year.
This is not a marginal change. It’s a new acquisition channel that requires dedicated infrastructure.
robots.txt: configuration for AI crawlers
The robots.txt file communicates to crawlers which parts of the site they can access. For traditional search engines, the configuration is established. For AI crawlers, the landscape is fragmented: each provider uses different user-agents, with different purposes.
Map of major AI crawlers
OpenAI operates with three distinct crawlers:
User-agent: GPTBot
# Training foundational models. Collects data to train GPT.
User-agent: ChatGPT-User
# User browsing. Accesses pages when a user asks for information.
User-agent: OAI-SearchBot
# Search. Indexes content for ChatGPT's search function.
Anthropic uses:
User-agent: ClaudeBot
# Training and updating Claude.
User-agent: Claude-Web
# Web access for user functionality.
User-agent: anthropic-ai
# Generic Anthropic crawler.
Perplexity:
User-agent: PerplexityBot
# Indexing for AI answer engine.
User-agent: Perplexity-User
# Fetch for user queries.
Google has separated functions:
User-agent: Google-Extended
# Token for AI use. NOT a bot, it's a flag.
# Blocking this user-agent prevents use of content for AI training
# while maintaining standard indexing.
User-agent: Googlebot
# Traditional crawler for Search.
Meta:
User-agent: Meta-ExternalAgent
# Crawling for AI model training.
User-agent: Meta-ExternalFetcher
# Fetch for user requests. Can bypass robots.txt.
Other relevant crawlers:
User-agent: Amazonbot
User-agent: Bytespider # ByteDance
User-agent: Applebot-Extended # Apple AI (flag, not bot)
User-agent: CCBot # Common Crawl
User-agent: cohere-ai
User-agent: cohere-training-data-crawler
Configuration strategies
Strategy 1: Full access for maximum AI visibility
# Allow all AI crawlers
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: anthropic-ai
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: Meta-ExternalAgent
Allow: /
User-agent: Amazonbot
Allow: /
Strategy 2: AI search visibility, no training
This configuration allows AI systems to cite your content in responses, but prevents use for training models:
# Allow search/user crawlers
User-agent: ChatGPT-User
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Perplexity-User
Allow: /
# Block training crawlers
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: cohere-training-data-crawler
Disallow: /
Strategy 3: Selective access by directory
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Disallow: /api/
Disallow: /internal/
Disallow: /user-data/
Limitations of robots.txt
A critical point: robots.txt is a voluntary protocol. Crawlers can ignore it.
In August 2025, Cloudflare blocked Perplexity bots after documenting protocol violations. In October 2025, Reddit deliberately trapped Perplexity crawlers, demonstrating they bypassed restrictions through third-party tools. Legal action followed.
The operational consequence: robots.txt alone is not enough. For real enforcement, you need IP verification, WAF rules, or CDN-level blocks. Cloudflare reports that over 2.5 million sites use its managed robots.txt function to block AI training.
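For teams that want to enforce these policies rather than trust them, a minimal verification sketch in Python follows. It assumes you have downloaded the crawler operator’s published IP ranges (OpenAI, for example, publishes the ranges GPTBot crawls from) into a local list; the CIDR values below are documentation placeholders, not real ranges.

```python
import ipaddress

# CIDR ranges published by the crawler operator (e.g. OpenAI publishes
# the ranges GPTBot crawls from). The values below are RFC 5737
# documentation placeholders, not real ranges: load the real ones.
PUBLISHED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_verified_crawler(remote_ip: str) -> bool:
    """True if the requesting IP falls inside a published range."""
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in net for net in PUBLISHED_RANGES)

# A request whose User-Agent claims GPTBot but whose IP is unlisted
# is spoofed or non-compliant: block or challenge it at the WAF/CDN.
if not is_verified_crawler("203.0.113.7"):
    print("claims GPTBot, IP not in published ranges: block or challenge")
```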
llms.txt: the new standard for guiding LLMs
In September 2024, Jeremy Howard of Answer AI proposed llms.txt, a new standard file for communicating with Large Language Models. Unlike robots.txt, which controls access, llms.txt guides models toward the most relevant content.
What llms.txt does
The llms.txt file is a markdown document positioned at the domain root (/llms.txt). It works as a curated map that tells LLMs which pages contain the most important information and how to interpret them.
It’s not a blocking mechanism. It’s a recommendation system, like a librarian guiding a visitor to the right shelves instead of letting them wander.
File structure
# example.com
> Technical site on AI implementations for enterprise.
> Content verified, updated monthly.
## Core Documentation
- [Production RAG Guide](https://example.com/docs/rag-production):
RAG architectures tested in production, chunking patterns,
evaluation metrics. Updated Q4 2024.
- [API Reference](https://example.com/docs/api):
Complete REST API documentation. Includes code examples
in Python and cURL.
## Technical Articles
- [LLM Latency Optimization](https://example.com/blog/llm-latency):
Strategies to reduce p95 latency below 200ms.
Includes benchmarks on Claude, GPT-4, Mistral.
- [AI Cost Management](https://example.com/blog/ai-costs):
Framework for estimating and optimizing inference costs.
Real data from enterprise deployments.
## Resources
- [AI Glossary](https://example.com/glossary):
Technical definitions of 150+ AI/ML terms.
llms-full.txt: extended version
Beyond llms.txt, the standard provides an optional llms-full.txt file containing the full site content in flattened format. It removes non-essential HTML, CSS, JavaScript and presents text only. Some sites generate files of 100K+ words.
The advantage: it lets LLMs process the entire site in a single context. The limitation: it easily exceeds the context window of most models.
Adoption status
As of January 2025, OpenAI, Google, and Anthropic do not natively support llms.txt. Their crawlers don’t automatically read the file.
Current adoption is concentrated in specific niches:
- Technical documentation: Mintlify integrated llms.txt in November 2024. Documentation sites for Anthropic, Cursor, Cloudflare, Vercel use it.
- Dedicated directories: directory.llmstxt.cloud and llmstxt.site catalog sites with implementations.
- Manual use: Developers who upload the file directly to ChatGPT or Claude to provide context.
It’s an investment in future-proofing. When major providers adopt the standard, those who have already implemented will have an advantage.
Implementation
- Create /llms.txt at the domain root
- Use UTF-8 encoding and clean markdown
- Include only indexable pages (no noindex, nothing blocked in robots.txt)
- Add concise but informative descriptions for each URL
- Optional: reference it in robots.txt with # LLM-policy: /llms.txt
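Generating the file from a curated list keeps it in sync with editorial decisions. A minimal sketch follows; the section names, URLs, and descriptions are the illustrative placeholders used earlier in this article, not a required schema.

```python
# Minimal llms.txt generator: emits the markdown map from a curated list.
# Section names, URLs, and descriptions are illustrative placeholders.
SECTIONS = {
    "Core Documentation": [
        ("Production RAG Guide", "https://example.com/docs/rag-production",
         "RAG architectures tested in production. Updated Q4 2024."),
    ],
    "Technical Articles": [
        ("LLM Latency Optimization", "https://example.com/blog/llm-latency",
         "Strategies to reduce p95 latency below 200ms."),
    ],
}

def build_llms_txt(domain: str, summary: str) -> str:
    lines = [f"# {domain}", f"> {summary}", ""]
    for section, entries in SECTIONS.items():
        lines.append(f"## {section}")
        lines += [f"- [{title}]({url}): {desc}" for title, url, desc in entries]
        lines.append("")
    return "\n".join(lines)

with open("llms.txt", "w", encoding="utf-8") as f:
    f.write(build_llms_txt("example.com",
                           "Technical site on AI implementations for enterprise."))
```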
Differences with other standard files
| File | Purpose | Target | Format |
|---|---|---|---|
| robots.txt | Crawler access control | Search engines, AI crawlers | Plain text, directives |
| sitemap.xml | Complete page catalog | Search engines | XML |
| llms.txt | Curated priority content map | LLM | Markdown |
| humans.txt | Team credits | Humans | Plain text |
Structured Data and JSON-LD for AI
Structured data is not new. It’s been standard SEO since 2011. But its role changes in the context of generative search engines.
Why Structured Data matters for AI
LLMs process everything as tokens. They don’t natively distinguish between a price, a name, a date. Structured data provides an explicit semantic layer that disambiguates content.
An article with JSON-LD markup communicates in a machine-readable way: this is the author, this is the publication date, this is the publishing organization, these are the sources cited. The model doesn’t have to infer this structure from the text.
Basic JSON-LD implementation
JSON-LD (JavaScript Object Notation for Linked Data) is the preferred format. It’s embedded in a <script> tag, kept separate from the content HTML:
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "TechArticle",
"@id": "https://example.com/rag-production-guide",
"headline": "RAG in Production: Patterns and Anti-Patterns",
"description": "Technical guide to enterprise RAG implementation with real metrics",
"author": {
"@type": "Person",
"name": "Author Name",
"url": "https://example.com/team/author-name",
"jobTitle": "AI Team Leader",
"knowsAbout": ["RAG", "LLM", "Vector Databases", "AI Engineering"]
},
"datePublished": "2025-01-04",
"dateModified": "2025-01-04",
"publisher": {
"@type": "Organization",
"name": "Example.com",
"url": "https://example.com",
"logo": {
"@type": "ImageObject",
"url": "https://example.com/logo.png"
}
},
"mainEntityOfPage": {
"@type": "WebPage",
"@id": "https://example.com/rag-production-guide"
},
"articleSection": "Engineering",
"keywords": ["RAG", "Production", "Enterprise AI", "Vector Search"],
"wordCount": 3500,
"inLanguage": "en"
}
</script>
Priority schema types for AI visibility
Article / TechArticle / NewsArticle
For editorial content. TechArticle for technical documentation.
FAQPage
Q&A structure that generative search engines can extract directly:
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "What is the difference between SEO and GEO?",
"acceptedAnswer": {
"@type": "Answer",
"text": "SEO optimizes for lists of results from traditional search engines. GEO optimizes for being cited in synthesized responses from generative search engines like ChatGPT and Perplexity."
}
},
{
"@type": "Question",
"name": "Does llms.txt replace robots.txt?",
"acceptedAnswer": {
"@type": "Answer",
"text": "No. robots.txt controls crawler access. llms.txt guides LLMs toward priority content. They have complementary functions."
}
}
]
}
</script>
HowTo
For step-by-step guides:
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "HowTo",
"name": "How to configure robots.txt for AI crawlers",
"step": [
{
"@type": "HowToStep",
"position": 1,
"name": "Identify target AI crawlers",
"text": "Map the user-agents of AI crawlers you want to allow or block."
},
{
"@type": "HowToStep",
"position": 2,
"name": "Define access strategy",
"text": "Decide whether to allow training, search-only, or block completely."
},
{
"@type": "HowToStep",
"position": 3,
"name": "Implement directives",
"text": "Add User-agent and Allow/Disallow rules to your robots.txt file."
}
]
}
</script>
Organization and Person: E-E-A-T for AI
E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness) is no longer just a Google framework. Data shows that LLMs verify author credentials before citing: 96% of content in AI Overviews comes from sources with verified authors. Content with detailed author bios has a 73% higher selection probability.
The Person schema must go beyond the name. It needs to communicate credentials, affiliations, specific expertise:
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "Person",
"name": "Mario Rossi",
"url": "https://example.com/team/mario-rossi",
"image": "https://example.com/images/mario-rossi.jpg",
"jobTitle": "Senior AI Engineer",
"description": "10+ years of experience in ML/AI, specialized in enterprise RAG systems",
"worksFor": {
"@type": "Organization",
"name": "TechCorp Italia",
"url": "https://techcorp.it"
},
"alumniOf": {
"@type": "CollegeOrUniversity",
"name": "Politecnico di Milano"
},
"hasCredential": [
{
"@type": "EducationalOccupationalCredential",
"credentialCategory": "certification",
"name": "AWS Machine Learning Specialty"
},
{
"@type": "EducationalOccupationalCredential",
"credentialCategory": "certification",
"name": "Google Cloud Professional ML Engineer"
}
],
"knowsAbout": [
"Retrieval-Augmented Generation",
"Large Language Models",
"Vector Databases",
"MLOps",
"AI Engineering"
],
"sameAs": [
"https://linkedin.com/in/mariorossi",
"https://github.com/mariorossi",
"https://scholar.google.com/citations?user=xxx"
]
}
</script>
E-E-A-T checklist for Person schema:
- description with years of experience and specialization
- hasCredential for verifiable certifications
- knowsAbout with specific topics (not generic)
- sameAs with links to verified profiles (LinkedIn, GitHub, Google Scholar)
- alumniOf for academic affiliations
- worksFor with organization URL
Citation schema
For content that cites external sources, the Citation schema adds context:
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "Article",
"headline": "Analysis of GEO Paper Princeton",
"citation": [
{
"@type": "ScholarlyArticle",
"name": "GEO: Generative Engine Optimization",
"author": ["Pranjal Aggarwal", "et al."],
"datePublished": "2024",
"publisher": {
"@type": "Organization",
"name": "Princeton University"
},
"url": "https://arxiv.org/abs/2311.09735"
}
]
}
</script>
ImageObject and VideoObject for multimodal content
Multimodal content has a 156% higher probability of being selected in AI Overviews compared to text-only content. Gemini and Perplexity invest heavily in multimodal search. Schema markup for media becomes relevant:
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "ImageObject",
"contentUrl": "https://example.com/images/architettura-rag.png",
"name": "Enterprise RAG system architecture",
"description": "Architectural diagram showing data flow between vector store, retriever and LLM in a production RAG system",
"author": {
"@type": "Person",
"name": "Mario Rossi"
},
"datePublished": "2025-01-04",
"encodingFormat": "image/png"
}
</script>
For video with transcription:
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "VideoObject",
"name": "Deploy RAG in production: walkthrough",
"description": "Video tutorial on deploying a RAG system on AWS with monitoring",
"thumbnailUrl": "https://example.com/video/rag-deploy-thumb.jpg",
"uploadDate": "2025-01-04",
"duration": "PT12M30S",
"transcript": "https://example.com/video/rag-deploy-transcript.txt",
"author": {
"@type": "Person",
"name": "Mario Rossi"
}
}
</script>
Best practices for AI-friendly media:
- Descriptive and contextual alt text (not “image1.png”)
- Captions that explain the content, not just describe it, and that contextualize the figure in the surrounding text
- Transcriptions for all videos
Real impact on AI search
John Mueller of Google clarified in January 2025 that structured data is not a direct ranking factor. But the indirect impact is documented:
- Rich snippets from structured data increase CTR by 30% according to BrightEdge
- 72% of sites on Google’s first page use schema markup
- Google’s AI Overviews process structured data to build responses
Structured data doesn’t guarantee citations in generative search engines. But it provides the semantic context that facilitates correct interpretation of content.
Technical requirements for AI crawlers
Beyond structured data, there are technical requirements that influence LLMs’ ability to process and cite content.
Static HTML vs JavaScript rendering
AI crawlers struggle with JavaScript-rendered content. Unlike Googlebot, which executes JS, many AI crawlers prefer or require static HTML.
Operating rules:
- Critical content must be present in static HTML, not generated dynamically
- Avoid content hidden in tabs, accordions, or loaded on-scroll
- If you use JS frameworks (React, Vue, Next.js), verify that SSR or SSG produces complete HTML
- Test: view the page with JS disabled. What you see is what base AI crawlers see.
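The same test can be automated. A minimal sketch, assuming the third-party requests library; the URL and the critical phrase are placeholders for your own page and key content.

```python
import requests

# Fetch the raw HTML response, exactly as a crawler that does not
# execute JavaScript would see it, and check for a critical phrase.
# URL and phrase are placeholders: use your own page and key content.
url = "https://example.com/docs/rag-production"
critical_phrase = "chunking patterns"

html = requests.get(url, timeout=10).text
if critical_phrase in html:
    print("OK: critical content is present in the static HTML")
else:
    print("WARNING: content is likely rendered client-side; AI crawlers may miss it")
```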
Content freshness signals
23% of content selected in AI Overviews is less than 30 days old. Perplexity indexes daily. Freshness signals are prioritized over historical authority.
Implementation:
dateModified in schema must reflect actual updates:
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "TechArticle",
"headline": "Production RAG Guide",
"datePublished": "2024-06-15",
"dateModified": "2025-01-04"
}
</script>
Freshness checklist:
- Update dateModified only for substantial changes (not typo fixes)
- Prominently signal updates in content (“Updated: January 2025”)
- Quarterly review of evergreen content
- Update statistics and data at least annually
- Remove or mark obsolete content as archived
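A review like this can be scripted. The sketch below, assuming the requests library and a hand-maintained page list, pulls each page’s JSON-LD and flags anything whose dateModified is more than a year old; the regex is deliberately naive and the 365-day threshold is an arbitrary assumption.

```python
import json
import re
from datetime import date, datetime

import requests

# Flag pages whose JSON-LD dateModified is older than a threshold.
# The page list and 365-day threshold are illustrative assumptions;
# the regex assumes the exact <script type="application/ld+json"> form.
PAGES = ["https://example.com/docs/rag-production"]
LDJSON_RE = re.compile(r'<script type="application/ld\+json">(.*?)</script>',
                       re.DOTALL)

for url in PAGES:
    html = requests.get(url, timeout=10).text
    for block in LDJSON_RE.findall(html):
        modified = json.loads(block).get("dateModified")
        if not modified:
            continue
        age = (date.today() - datetime.fromisoformat(modified).date()).days
        if age > 365:
            print(f"{url}: dateModified {modified} is {age} days old; review")
```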
Citation verification and fact-checking
AI systems cross-reference claims against authoritative sources in real time. Content with verifiable citations has an 89% higher selection probability than content with unsupported claims.
Rules:
- Every statistic must have a linked source
- “According to research” without a link = unverifiable claim = penalized
- Prefer primary sources (papers, official documentation) over secondary sources
- Citations from Wikipedia, Statista, Pew Research, arXiv papers carry more weight
GEO strategies: what the research says
The “GEO: Generative Engine Optimization” paper from Princeton, Georgia Tech, Allen Institute, and IIT Delhi is the most rigorous available study on optimization for generative search engines. It tested 9 techniques on 10,000 queries.
The three most effective strategies
1. Cite Sources: +40% visibility
Adding citations from authoritative sources is the strategy with the highest overall impact. For sites with low ranking in traditional SERPs, the effect is even more pronounced: +115% for sites in fifth position.
Simply citing is not enough. The citation must come from a recognized, relevant, verifiable source.
2. Quotation Addition
Incorporating direct quotes from industry experts increases authenticity and perceived depth. Works particularly well for opinion-based content.
3. Statistics Addition
Quantitative data beats qualitative discussion. “42% of AI projects fail” has more impact than “many AI projects fail”. Works particularly well for Legal and Government domains.
Structuring content for extraction: Answer Blocks
LLMs don’t cite entire pages. They extract specific blocks. Optimizing for this pattern is critical.
Optimal passage length: 134-167 words per citable block. For direct FAQ answers: 40-60 words. Content with a summary box at the beginning has a 28-40% higher citation probability.
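These ranges are easy to check mechanically before publishing. A minimal sketch that splits a local markdown draft on H2/H3 headings and reports word counts against the 134-167-word target; the file name is a placeholder.

```python
import re

# Report the word count of each H2/H3 section in a markdown draft and
# flag blocks outside the 134-167-word target. File name is a placeholder.
with open("draft.md", encoding="utf-8") as f:
    text = f.read()

# re.split with a capturing group keeps the headings in the result:
# [preamble, heading1, body1, heading2, body2, ...]
parts = re.split(r"^(#{2,3} .+)$", text, flags=re.MULTILINE)
for heading, body in zip(parts[1::2], parts[2::2]):
    words = len(body.split())
    status = "OK" if 134 <= words <= 167 else "check"
    print(f"{status:>5}  {words:>4} words  {heading.strip()}")
```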
Practical implementation:
- TL;DR at the beginning: every article opens with a self-contained summary block. It’s not just for human readers: it’s the block that LLMs preferentially extract.
- Self-contained sections: each H2/H3 should be citable independently of the rest. An LLM should be able to extract that section and have a complete answer.
- Headings as questions: “What is RAG?” performs better than “RAG Overview”. Direct matching with conversational queries.
- Modular paragraphs: 75-300 words per section. No walls of text. Modular blocks are easier to extract and cite.
- Direct answers first, context after: the answer to the heading’s implicit question should appear in the first 2-3 sentences. Elaboration comes after.
Example of optimized structure:
## What is the difference between SEO and GEO?
SEO optimizes for ranking in lists of results from traditional
search engines. GEO optimizes for being cited in synthesized
responses from generative search engines like ChatGPT, Perplexity
and Gemini. [40-60 words of direct answer]
The fundamental change concerns the objective: from ranking to
citation. In classical SEO, success is position 1 in the SERPs.
In GEO, success is being the source that the AI cites when responding.
[Elaboration and context]
Domain-specific strategies
The paper found that effectiveness varies by domain:
- History: Authoritative and persuasive tone
- Facts: Citations from primary sources
- Law/Government: Statistics and quantitative data
- Science/Health: Technical terminology + authoritativeness
Platform-specific optimization
Each LLM has different preferences. An effective GEO strategy considers these differences:
| Platform | Main preferences | Optimization |
|---|---|---|
| ChatGPT | Wikipedia, popular brands, established content | Authority building, Wikipedia presence if applicable |
| Perplexity | Reddit, recent content, real-time | Freshness priority, community engagement |
| Gemini | Multimodal, Google ecosystem, schema markup | Video, optimized images, complete structured data |
| Claude | Accuracy, balanced content, attribution | Proper attribution, neutral and evidence-based framing |
| Google AI Overview | Top 10 organic, strong E-E-A-T | Traditional SEO + extended structured data |
Operational implications:
- ChatGPT cites Wikipedia in 48% of responses. For topics with a Wikipedia entry, presence there matters.
- Perplexity prefers Reddit (46.7% of citations). Content discussed in relevant subreddits has an advantage.
- Gemini integrates images and video into responses. Multimodal content performs better.
- Claude verifies accuracy more rigorously. Unsupported claims are discarded.
What doesn’t work
Keyword stuffing: Adding keywords from the query to content worsens visibility by 10% compared to baseline. Generative search engines penalize over-optimization.
Generic persuasive language: Persuasive tone without substance doesn’t improve ranking.
Democratization of results
An interesting aspect: GEO levels the playing field. Sites with low ranking in traditional SERPs benefit more from GEO optimizations than dominant sites. Cite Sources brings +115% to sites in fifth position and -30% to sites in first position.
For small publishers and independent businesses, it’s an opportunity to compete with corporate giants without comparable SEO budgets.
Implementation checklist
robots.txt
- Map all AI crawlers relevant to your industry
- Define strategy: full access, search-only, selective
- Implement directives for each user-agent
- Verify syntax with Google Robots Testing Tool
- Monitor server logs for crawler activity
- Verify actual compliance (IP check for suspicious crawlers)
- Quarterly review: new crawlers emerge regularly
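For the log-monitoring items above, a minimal sketch that counts hits per AI crawler by user-agent substring; the log path and format (any plain-text access log carrying the user-agent in each line) are assumptions to adjust for your stack.

```python
from collections import Counter

# Count hits per AI crawler in an access log, matching user-agent
# substrings. Log path and format are assumptions; adjust to your stack.
AI_CRAWLERS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
               "PerplexityBot", "Meta-ExternalAgent", "Amazonbot",
               "Bytespider", "CCBot"]

hits = Counter()
with open("/var/log/nginx/access.log", encoding="utf-8",
          errors="replace") as log:
    for line in log:
        for bot in AI_CRAWLERS:
            if bot in line:
                hits[bot] += 1
                break

for bot, count in hits.most_common():
    print(f"{bot:20} {count}")
```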
llms.txt
- Create markdown file at domain root
- Include site description and content type
- Organize URLs by category/priority
- Add concise descriptions for each link
- Verify that all URLs are indexable
- Consider llms-full.txt for sites with extended documentation
- Update when new priority content is published
Structured Data / JSON-LD
- Implement Organization schema for the site
- Add Person schema for authors with complete E-E-A-T:
  - description with years of experience and specialization
  - hasCredential for verifiable certifications
  - knowsAbout with specific topics
  - sameAs with LinkedIn, GitHub, Google Scholar
- Use Article/TechArticle for editorial content
- Implement FAQPage for Q&A sections
- Add Citation schema for research-based content
- Implement ImageObject/VideoObject for media
- Validate with Google Rich Results Test
- Verify parity between markup and visible content
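The last two checklist items can be partially automated. A minimal sketch, assuming the requests library and a placeholder URL, that verifies every JSON-LD block parses and that the declared headline actually appears in the page; Google’s Rich Results Test remains the authoritative validator.

```python
import json
import re

import requests

# Sanity-check a page's JSON-LD: every block must parse as JSON, and
# the declared headline should also appear in the visible markup.
# The URL is a placeholder; use a real validator for full coverage.
url = "https://example.com/rag-production-guide"
html = requests.get(url, timeout=10).text

blocks = re.findall(r'<script type="application/ld\+json">(.*?)</script>',
                    html, re.DOTALL)
for block in blocks:
    data = json.loads(block)  # raises on malformed JSON
    print(f"parsed block with @type={data.get('@type')}")
    headline = data.get("headline")
    if headline and headline not in html:
        print(f"  parity warning: headline not in page body: {headline!r}")
```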
GEO-optimized content
- TL;DR of 40-60 words at the beginning of each article
- Self-contained sections (citable independently)
- Headings formulated as questions where appropriate
- Modular paragraphs: 75-300 words per section
- Passage length: 134-167 words for key blocks
- Include citations from authoritative sources in every article
- Add statistics and quantitative data with source
- Use expert quotations where relevant
- Avoid keyword stuffing
- Calibrate tone for domain
Technical requirements
- Critical content in static HTML (not JS-only rendering)
- No content hidden in tabs/accordions/lazy-load
- Test page with JavaScript disabled
- dateModified updated for substantial changes
- Signal updates in content (“Updated: Month Year”)
- Quarterly review of evergreen content
- Every statistic with linked source
Media and Multimodal
- Descriptive and contextual alt text for images
- Captions that explain the content
- Transcriptions for all videos
- ImageObject/VideoObject schema implemented
- Captions that contextualize figures in surrounding text
Monitoring
- Track AI crawler activity in server logs
- Monitor brand mentions in ChatGPT/Perplexity/Gemini responses
- Analyze competitor citation share
- Measure referral traffic from AI platforms
- Monthly metrics review
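For the referral-traffic item, a minimal sketch over a combined-format access log; the referrer domains listed are common AI frontends and, like the log path and format, are assumptions to adjust for your stack.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Count referral hits from AI platforms in a combined-format access log.
# Referrer domains and log path are assumptions; extend as needed.
AI_REFERRERS = {"chatgpt.com", "chat.openai.com", "perplexity.ai",
                "gemini.google.com", "claude.ai"}
# Combined log: ... "REQUEST" STATUS BYTES "REFERER" "USER-AGENT"
REFERER_RE = re.compile(r'" \d{3} \S+ "([^"]*)"')

hits = Counter()
with open("/var/log/nginx/access.log", encoding="utf-8",
          errors="replace") as log:
    for line in log:
        match = REFERER_RE.search(line)
        if not match:
            continue
        host = urlparse(match.group(1)).netloc.lower()
        if host in AI_REFERRERS:
            hits[host] += 1

for host, count in hits.most_common():
    print(f"{host:22} {count}")
```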
The window of opportunity
Cloudflare data shows that crawling for AI training still dominates traffic, with volumes 8 times higher than search crawling and 32 times higher than user-action crawling. But the trend is clear: user-action traffic is growing faster than any other category.
Those who implement GEO infrastructure now build advantages that accumulate over time. Citations generate other citations. Authority recognized by models strengthens. First-mover advantage in this space isn’t just about technical positioning: it’s about building an established presence before competition intensifies.
Traditional SEO doesn’t disappear. It continues to serve the 70% of search traffic that still goes through classic SERPs. But the remaining 30%, with its growth trajectory, requires new tools.
Sources
Aggarwal, P., et al. (2024). GEO: Generative Engine Optimization. arXiv:2311.09735. Princeton University, Georgia Tech, Allen Institute for AI, IIT Delhi.
AI Mode Boost. (2025). AI Overview Ranking Factors: 2025 Comprehensive Study.
Cloudflare. (2025, December). From Googlebot to GPTBot: Who’s Crawling Your Site in 2025. Cloudflare Blog.
Dataslayer. (2025). Google AI Overviews Impact 2025: CTR Down 61%.
Howard, J. (2024, September). llms.txt Proposal. Answer AI.
W3C Schema Community. (2024). Schema Vocabulary Documentation.
SEO Sherpa. (2025, October). Google AI Search Guidelines 2025.
Single Grain. (2025, October). Google AI Overviews: The Ultimate Guide to Ranking in 2025.
Yoast. (2025). Structured Data with Schema for Search and AI.
Overdrive Interactive. (2025, July). LLMs.txt: The New Standard for AI Crawling.