10 AI Search (GEO) optimization tips from the makers of an AI Search Engine

Mark Lowe, 24.10.2025

As makers of an AI search engine, we know a thing or two about how web content is processed so that an LLM can interpret it. Common AI search engines such as ChatGPT, Perplexity, and Google all use these techniques.

This post gives an overview of this process and presents 10 tips on how to optimize web content for LLMs.


How ChatGPT comes up with answers

If you ask ChatGPT, "When did Napoleon Bonaparte rule France?", it can answer immediately from the LLM's trained knowledge. If you ask "Hey, what's the weather tomorrow?", however, an LLM like GPT-5 needs help because it simply cannot know the answer. ChatGPT, running on top of GPT-5, can use a "Web Search" tool that looks up this information on the web and passes it to the LLM so it can generate an answer. The weather data that was found is referred to as retrieved knowledge, and this process is called Retrieval-Augmented Generation (RAG).

Trained vs Retrieved Knowledge

| Trained Knowledge | Retrieved Knowledge |
| --- | --- |
| Information used at training time. This is usually made up of millions of websites, books, source code, scientific papers, and synthetic content built for training purposes. | Information that is looked up live while a chat agent answers you. It is retrieved using tools like Web Search. |
| Good for general knowledge. | Good for recent events and for specific information that is not part of the trained knowledge. |
| Has a cutoff date: the date up to which the training data reaches. The LLM cannot know anything that happened after that date. For example, GPT-5, published in August 2025, has a cutoff date of October 2024. | Data is fetched live; there is no cutoff date. |

So why are LLMs not trained continuously on new data? One reason is that training an LLM takes a lot of time (usually months) and a massive amount of energy. LLMs are not like databases that data can simply be appended to: every new set of training data requires re-balancing a model's billions of parameters. Read this article for details.

Retrieval techniques, on the other hand, have proven much more effective for keeping answers current. We can all see this in our daily work with ChatGPT, which uses retrieval techniques to search the web.

How does ChatGPT search the web?

There are a number of steps involved when generative engines search for information. Let's look at an example:

Hey ChatGPT, tell me what KeySemantics say about RAG

1. Search for matching web pages

ChatGPT will automatically search for something like "keysemantics RAG" and look at the top results' titles and descriptions, similar to what we do when we "google" for things. ChatGPT uses its own search engine for this.

Search Lookup

Screenshot showing Google search results for "keysemantics RAG"

2. Fetch content from interesting pages

It will read the content of the most interesting pages. To keep this process fast, ChatGPT does not execute any scripts on these pages.

Fetched HTML

This image shows raw HTML that has been fetched from a URL

3. Clean and convert the fetched content to Markdown

  • Remove header, footer, navigation, and ad banners
  • Keep only the main content

Cleaned content

This image shows content that has been cleaned and converted to markdown.
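Step 3 can be sketched with nothing but Python's standard library. This is a toy illustration of the idea, not what any real engine runs; production extractors use far more robust, readability-style heuristics:

```python
# Toy sketch: drop page chrome (nav/header/footer) and emit light Markdown.
from html.parser import HTMLParser

SKIP_TAGS = {"header", "footer", "nav", "aside", "script", "style"}

class MainContentExtractor(HTMLParser):
    """Keeps main-content text, skips navigation/chrome,
    and prefixes headings with Markdown '#' markers."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0        # >0 while inside a skipped element
        self.current_heading = None
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in ("h1", "h2", "h3") and self.skip_depth == 0:
            self.current_heading = "#" * int(tag[1]) + " "

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in ("h1", "h2", "h3", "p"):
            self.current_heading = None

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            prefix = self.current_heading or ""
            self.parts.append(prefix + data.strip())
            self.current_heading = None

def html_to_markdown(html: str) -> str:
    parser = MainContentExtractor()
    parser.feed(html)
    return "\n\n".join(parser.parts)

page = ("<html><nav>Home | About</nav><main><h2>RAG</h2>"
        "<p>Retrieval helps.</p></main><footer>© 2025</footer></html>")
print(html_to_markdown(page))  # nav and footer text are gone
```

Note how the extractor keys off semantic tags like `<main>`, `<nav>`, and `<footer>`: well-marked-up pages survive this step with their content intact.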

4. Split the cleaned content into smaller "chunks"

The content is split into smaller text blocks called "chunks". They typically contain one or two paragraphs of text.

Split chunks

This image shows text chunks that have been created from the previously extracted markdown
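A paragraph-based splitter along these lines is easy to sketch. The 500-character budget and the strategy are assumptions for illustration; real splitters typically count tokens and overlap adjacent chunks:

```python
# Toy sketch: greedily pack paragraphs into chunks of at most max_chars.
def split_into_chunks(markdown: str, max_chars: int = 500) -> list[str]:
    chunks, current = [], ""
    for para in markdown.split("\n\n"):
        # start a new chunk when adding this paragraph would overflow
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = current + "\n\n" + para if current else para
    if current:
        chunks.append(current)
    return chunks

# five ~212-character paragraphs pack into three chunks under the 500 limit
doc = "\n\n".join(f"Paragraph {i} " + "x" * 200 for i in range(5))
print([len(c) for c in split_into_chunks(doc)])
```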

5. Select best matching chunks

The best-matching chunks are selected, often by calculating each chunk's similarity to the user's question. We won't go into details about vector embeddings and cosine similarity here.

Selected Chunks

This image shows relevant parts of a blog post that have been selected.
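To illustrate the selection step without a real embedding model, plain word counts can stand in for vectors; the cosine formula is the same one used on learned embeddings:

```python
# Toy sketch: rank chunks by cosine similarity of bag-of-words vectors.
import math
import re
from collections import Counter

def tokens(text: str) -> Counter:
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def top_chunks(question: str, chunks: list[str], k: int = 2) -> list[str]:
    q = tokens(question)
    return sorted(chunks, key=lambda c: cosine(q, tokens(c)), reverse=True)[:k]

chunks = [
    "RAG combines retrieval with generation.",
    "Our office is closed on public holidays.",
    "Retrieval augmented generation fetches fresh data.",
]
print(top_chunks("what is retrieval augmented generation", chunks, k=2))
```

Real systems replace `tokens()` with an embedding model so that paraphrases match even without shared words; the ranking logic stays the same.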

6. Create an LLM prompt from the selected chunks

A prompt is constructed and sent to the LLM. It includes only the selected chunks as the data from which the LLM will try to answer the question.

RAG Prompt

This image shows a ChatGPT prompt that asks it to answer a question based on provided data.
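Step 6 is essentially string assembly. The prompt wording below is purely illustrative; each provider uses its own internal template:

```python
# Toy sketch: build a grounding prompt from the selected chunks.
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    # number the sources so the model can cite them
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using ONLY the sources below. "
        "Cite sources by number; if the answer is not in them, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "What does KeySemantics say about RAG?",
    ["RAG combines retrieval with generation.",
     "Retrieval keeps answers current."],
)
print(prompt)
```

The "use ONLY the sources" instruction is what grounds the answer in retrieved rather than trained knowledge.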

Why are steps 3 to 5 needed? Couldn't LLMs just process HTML?

They could, but for the best performance (speed, answer quality, energy efficiency) it is important to minimize the data fed into the LLM's context. It mainly comes down to energy usage (more tokens to process means more energy), which makes it very likely that ChatGPT, Perplexity, and co. follow a similar process of cleaning data before generating answers. This is an assumption on our part, though.

Why should they convert HTML to Markdown instead of just plain text?

Markdown is a format that retains simple formatting (headings, lists, tables, ...) while using minimal characters. LLMs have been trained on vast amounts of Markdown-formatted scientific papers, converted documents, and more, which lets them natively "understand" this syntax better than plain text.

10 AI Search Optimization Tips

Phew, we made it through the theory. Let's continue with best practices based on what we've learned so far.

Tip 1: Disable JavaScript and test if your site still displays content

  • Generative Engines do not execute scripts
  • Many Single-Page applications won't display content without JavaScript
  • Server-side rendering (SSR) and static site generation (SSG) can help with this

Test whether your site still returns all important content when JavaScript is disabled. In Chrome: open Dev Tools (F12), press Ctrl/Command + Shift + P, type "JavaScript", and select "Disable JavaScript".
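You can also check this programmatically. The sketch below assumes you have already fetched the server-rendered HTML (e.g. with curl); two inline samples stand in for real responses:

```python
# Toy sketch: what does a crawler see when it never runs your scripts?
import re

def visible_text(html: str) -> str:
    """Strip scripts and tags, leaving only the text a non-JS crawler sees."""
    html = re.sub(r"<script.*?</script>", " ", html, flags=re.S)
    return re.sub(r"<[^>]+>", " ", html)

spa_shell = ('<html><body><div id="root"></div>'
             '<script src="app.js"></script></body></html>')
ssr_page = ("<html><body><main><h1>RAG explained</h1>"
            "<p>Retrieval...</p></main></body></html>")

print("RAG" in visible_text(spa_shell))  # False: an SPA shell ships no content
print("RAG" in visible_text(ssr_page))   # True: SSR ships the content as HTML
```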

Disable Javascript

The image shows how to disable Javascript in Chrome's dev tools.

Tip 2: Make sure your content is structured well

  • Avoid huge paragraphs
  • Use subheadings, paragraphs, lists, and quotes where applicable

Content Structure

Well structured content.

Tip 3: One Page per Topic

  • Avoid spreading content that belongs together over multiple pages
  • Create main topic pages with summaries and the most important information
  • Branch down into detail pages for very specific information.

Tip 4: Use Semantic Markup

  • Use the <main> / <article> tags for the main content block
  • Use <nav> tags for navigation
  • Use <header> / <footer> tags

Note: Using JSON-LD (Schema.org) structured data can make sense for structured information (product info, author, etc.), but it is under debate whether AI search engines actually process this data or ignore it altogether.

Semantic Markup

This image shows basic semantic HTML

Tip 5: Add a Summary at the beginning of Articles

  • Summaries help users decide if this text is worth reading
  • Often, a question can already be answered from the summary alone, which makes processing easier for AI search

Summaries

A sample of a summary at the beginning of a long content page.

Tip 6: Add a Q&A Section on longer articles

  • Q&As help users revisit the content
  • Q&As convert nicely into chunks, which helps LLMs

Q&A sections are a fantastic data source for any AI-based search because they precisely answer a specific question and can usually be processed as a single chunk.

Q&A Section

Example of a Q&A Section

Tip 7: Ensure Images have Captions and Alt-Text

  • An image without caption or alt-text is invisible to LLMs.
  • Although LLMs could read text in images, they very likely don't because of the time and cost involved.

Image Captions

Image and an image caption.

Tip 8: Include Transcripts for videos

To make video content visible to LLMs, add a transcript.

Video Transcripts

Video player with an attached transcript which can be toggled.

Tip 9: Web Content is King

PDFs and other documents are hard for any search engine to process and are also difficult for accessibility tools like screen readers to handle. PDF downloads can make sense as an additional source of information (e.g., for printing), but the main information should be available as web content on your website.

PDF Download

Example of a quick summary together with a download link for a PDF.

Tip 10: Page Speed is Key

  • Fast loading times have suddenly become critically important
  • Use static site generation and CDNs to improve loading speed
  • Look specifically for a good Time to First Byte (TTFB) metric

Lighthouse Performance Score

This image shows a Lighthouse Performance Score

How do I control which parts of my content AI search engines are allowed to process?

robots.txt

  • Controls which crawlers are allowed to see which parts of your site
  • Used by OpenAI. See their documentation on this.
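A minimal robots.txt might look like this. The bot names follow OpenAI's published crawler documentation (GPTBot gathers training data, OAI-SearchBot powers ChatGPT search); the paths and sitemap URL are hypothetical:

```
# Hypothetical example - verify current bot names in each provider's docs
User-agent: GPTBot
Disallow: /internal/

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml
```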

sitemap.xml

  • Tells crawlers which pages are available on your site
  • Ensure the sitemap is always up to date
  • Ensure no broken links or irrelevant content is included here
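A minimal sitemap entry follows the sitemaps.org schema; the URL and date below are placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/blog/rag-explained</loc>
    <lastmod>2025-10-24</lastmod>
  </url>
</urlset>
```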

LLMs.txt

This is a proposed standard. The file should act as a guide for LLM-powered search engines to what your site is about and what content is important. Currently (Oct 2025) none of the major AI search providers read this file and it is unclear if they ever will. Our tests show that the file is currently only read by a small number of specialized bots.

There is no harm in setting up a good LLMs.txt, but the content needs to be crafted carefully and maintained regularly as your site evolves.
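For reference, the proposed llms.txt format is itself Markdown: an H1 with the site name, a blockquote summary, then sections of annotated links. A sketch following that proposal (all URLs hypothetical):

```markdown
# KeySemantics

> AI-powered site search with natural-language answers and retrieval APIs.

## Docs

- [RAG overview](https://www.example.com/blog/rag): How retrieval augmented generation works
- [Search API](https://www.example.com/docs/search): Endpoints, parameters, and examples
```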

Q&A

How does ChatGPT come up with answers?

When you ask ChatGPT a question, it doesn't "know" the answer itself: the underlying LLM (like GPT-5) has a knowledge cutoff and can't access real-time information. Instead, ChatGPT searches the web, fetches relevant pages, cleans and converts the content (often into Markdown), splits it into small "chunks", selects the best-matching ones, and then constructs a prompt for the LLM to generate a response. This process is designed to make retrieval fast, accurate, and energy-efficient.

Why are LLMs not trained continuously?

Training a large language model takes months and consumes vast amounts of energy. Continuously retraining with new data isn't practical or sustainable. Attempts to "incrementally" add new data during training have led to major issues that remain unsolved. Instead, AI systems rely on retrieval techniques that dynamically fetch fresh information from the web, which is faster, cheaper, and more reliable.

How do AI search engines process web content?

AI engines don't process raw HTML directly. They:

  • Fetch web pages (without running scripts)
  • Extract only the main content (remove ads, navigation, etc.)
  • Convert it to Markdown
  • Split it into “chunks”
  • Select the most relevant chunks for answering the question

This ensures only clean, concise data is sent to the LLM, improving speed, quality, and energy efficiency.

How can I optimize my content for AI search?

Follow these 10 practical tips, including:

  • Ensure your content is visible without JavaScript (no client-side rendering only)
  • Use clear structure: subheadings, paragraphs, lists
  • Create one page per topic
  • Use semantic HTML tags (<main>, <article>, <header>, <footer>)
  • Add summaries, Q&A sections, alt text, and video transcripts
  • Prefer web content over PDFs
  • Keep your site fast and accessible
  • Maintain accurate robots.txt, sitemap.xml, and optionally an LLMs.txt

These help both users and AI engines understand, index, and reuse your content effectively.

What can KeySemantics offer?

We offer AI-powered search for your own website. Our search engine can easily be set up through our online portal and then integrated directly into your page using our widgets or our APIs.

  • Regular crawling of your web content => your information is always up to date
  • State-of-the-art retrieval system combining AI-powered retrieval with our strong knowledge graph
  • Search in natural language. Over 50 languages supported.
  • APIs for Search, Chat, RAG, Q&As, Related Content
  • Use our SaaS platform or host it yourself

Send us a message if you're interested in a demo. We'll be in touch with you straight away.

Semantic Tags

AI Search, LLM, ChatGPT, retrieval, RAG, web content, Markdown, semantic markup, optimization tips, Q&A sections, JavaScript, page speed, structured content, PDF, transcripts, training data, knowledge cutoff, vector embeddings, cosine similarity, LLMs.txt