How ChatGPT comes up with answers
If you ask, "When did Napoleon Bonaparte rule France?", an LLM can answer immediately from its trained knowledge. If you ask "Hey, what's the weather tomorrow?", however, an LLM like GPT-5 needs help, because it simply cannot know the answer. ChatGPT running on top of GPT-5, though, can use a "Web Search" tool that looks up this information on the web and passes it to the LLM so it can generate an answer. The weather data that was found is called retrieved knowledge, and the overall process is called Retrieval-Augmented Generation (RAG).
Trained vs Retrieved Knowledge
| Trained Knowledge | Retrieved Knowledge |
|---|---|
| Information absorbed at training time. This is usually made up of millions of websites, books, source-code repositories, scientific papers, and synthetic content built for training purposes. | Information that is looked up live while a chat agent answers you. |
| Good for general knowledge. | Good for recent events and for specific information that is not part of the trained knowledge. |
| Has a cutoff date: the date up to which the training data reaches. The LLM cannot know anything that happened after that date. | Data is fetched live; there is no cutoff date. |
So why aren't LLMs trained continuously on new data? One reason is that training an LLM takes a lot of time (usually months) and a massive amount of energy. LLMs are not like databases where data can simply be appended: every new batch of training data requires re-balancing the model's billions of parameters. Read this article for details.
Retrieval techniques on the other hand have proven to be much more effective for this. We can all see this in our daily work with ChatGPT which uses retrieval techniques to search the web.
How does ChatGPT search the web?
Several steps are involved when generative engines search for information. Let's look at an example:
Hey ChatGPT, tell me what KeySemantics say about RAG
1. Search for matching web pages
ChatGPT will automatically search for something like "keysemantics RAG" and scan the top results' titles and descriptions, similar to what we do when we "google" things. ChatGPT uses its own search engine for this.
2. Fetch content from interesting pages
It will read the content of the most promising pages. To keep this step fast, ChatGPT does not execute any scripts on these pages.
3. Clean and convert the fetched content to Markdown
- Remove Header, Footer, Navigation, Ad banners
- Take only main content
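The cleaning step can be sketched with Python's standard `html.parser` module. This is a simplified illustration, not ChatGPT's actual pipeline: it only strips boilerplate elements and collects plain text, stopping short of a full Markdown conversion.

```python
from html.parser import HTMLParser

# Tags whose content is boilerplate and should be skipped entirely.
SKIP_TAGS = {"header", "footer", "nav", "aside", "script", "style"}

class MainContentExtractor(HTMLParser):
    """Collects text while ignoring header/footer/nav/script blocks."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while we are inside a boilerplate element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every boilerplate element.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_main_text(html: str) -> str:
    parser = MainContentExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)
```

Feeding it a page with a nav bar and a footer returns only the article text in between.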
4. Split the cleaned content into smaller "chunks"
The content is split into smaller text blocks called "chunks". They typically contain one or two paragraphs of text.
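A minimal chunker along these lines could split on blank lines and group a fixed number of paragraphs per chunk. Production systems usually also respect headings and token limits; this sketch ignores both.

```python
def chunk_paragraphs(text: str, paragraphs_per_chunk: int = 2) -> list[str]:
    """Split text on blank lines, then group paragraphs into chunks."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [
        "\n\n".join(paragraphs[i:i + paragraphs_per_chunk])
        for i in range(0, len(paragraphs), paragraphs_per_chunk)
    ]
```

With the default of two paragraphs per chunk, a three-paragraph article becomes two chunks.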
5. Select best matching chunks
The best matching chunks are selected, often by calculating each chunk's similarity to the user's question. We won't go into the details of vector embeddings and cosine similarity here.
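To make the idea concrete without embeddings: the same cosine-similarity formula also works on plain word-count vectors. Real systems compare embedding vectors instead, which also captures synonyms; this word-overlap version is only an illustration.

```python
import math
from collections import Counter

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_chunks(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the question."""
    q_vec = Counter(question.lower().split())
    return sorted(
        chunks,
        key=lambda c: cosine_similarity(q_vec, Counter(c.lower().split())),
        reverse=True,
    )[:k]
```

A chunk that shares many words with the question ranks first; unrelated chunks score near zero.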
Why are steps 3 to 5 needed? Couldn't LLMs just process raw HTML?
They could, but for best performance (speed, answer quality, energy efficiency) it is important to minimize the data fed into the LLM's context. It mainly comes down to energy usage (more tokens to process means more energy), which makes it very likely that ChatGPT, Perplexity and co. follow a similar data-cleaning process before generating answers. This is an assumption, though.
Why should they convert HTML to Markdown instead of just plain text?
Markdown is a format that retains simple formatting (headings, lists, tables, ...) while using minimal characters. LLMs have been trained on vast amounts of Markdown-formatted content, including scientific papers and converted documents, which lets them natively "understand" this syntax better than plain text.
10 AI Search Optimization Tips
Phew, we made it through the theory. Let's continue with best practices based on what we've learned so far.
Tip 1: Disable JavaScript and test if your site still displays content
- Generative Engines do not execute scripts
- Many Single-Page applications won't display content without JavaScript
- Server-side rendering (SSR) and static site generation (SSG) can help with this
Test whether your site still shows all important content with JavaScript disabled. In Chrome: open DevTools (F12), press Ctrl/Cmd + Shift + P, type "JavaScript" and select "Disable JavaScript".
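You can also automate this check: fetch the raw HTML (which is what a non-JS crawler sees, since no scripts run) and verify your key phrases are present. The URL and phrases below are placeholders for your own site.

```python
import urllib.request

def fetch_without_js(url: str) -> str:
    """Fetch raw HTML the way a script-less crawler sees it."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def missing_phrases(html: str, phrases: list[str]) -> list[str]:
    """Return the key phrases that are NOT present in the raw HTML."""
    return [p for p in phrases if p not in html]
```

Example use: `missing_phrases(fetch_without_js("https://www.example.com"), ["Our pricing", "Contact us"])` — any phrase in the result is invisible to generative engines.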
Tip 2: Make sure your content is structured well
- Avoid huge paragraphs
- Use Subheadings, Paragraphs, Lists, Quotes where applicable
Tip 3: One Page per Topic
- Avoid spreading content that belongs together over multiple pages
- Create main topic pages with summaries and the most important information
- Branch down into detail pages for very specific information.
Tip 4: Use Semantic Markup
- Use main / article tag for the main content block
- Use nav tags for navigation
- Use header / footer tags
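Put together, a semantically marked-up page skeleton looks like this (content is placeholder text):

```html
<body>
  <header>Site title and logo</header>
  <nav>Main navigation links</nav>
  <main>
    <article>
      <h1>Article headline</h1>
      <p>The main content a generative engine should keep.</p>
    </article>
  </main>
  <footer>Copyright and legal links</footer>
</body>
```

With this structure, the cleaning step can reliably drop `header`, `nav` and `footer` and keep only `main`.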
Note: Adding Schema.org structured data (e.g. as JSON-LD) can make sense for structured information (product info, author, etc.), but it is debated whether AI search engines actually process this data or ignore it altogether.
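For reference, such structured data is typically embedded as a JSON-LD script block using Schema.org types; the values below are invented for illustration:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "What is RAG?",
  "author": { "@type": "Organization", "name": "Example Corp" }
}
</script>
```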
Tip 5. Add a Summary at the beginning of Articles
- Summaries help users decide if this text is worth reading
- Often, a question can already be answered from the summary alone, which makes processing easier for AI search
Tip 6. Add a Q&A Section on longer articles
- Q&A sections help users revisit the content
- Q&A pairs convert into chunks nicely, which helps LLMs
Q&A sections are a fantastic data source for any AI-based search because they precisely answer a specific question and can usually be processed as a single chunk.
Tip 7. Ensure Images have Captions and Alt-Text
- An image without caption or alt-text is invisible to LLMs.
- Although LLMs could read text in images, it is very likely that they skip this during search because of the time and cost involved.
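A captioned image with alt-text could look like this (file name and wording are examples, not a prescription):

```html
<figure>
  <img src="rag-pipeline.png"
       alt="Diagram of a RAG pipeline: search, fetch, clean, chunk, select">
  <figcaption>The five retrieval steps of a generative engine.</figcaption>
</figure>
```

Both the `alt` attribute and the `figcaption` end up in the extracted text, so the image contributes to the page's chunks.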
Tip 8. Include Transcripts for videos
To make video content visible to LLMs, add a transcript
Tip 9: Web Content is King
PDFs and other documents are hard to process for any search engine and are also difficult for accessibility tools like screen readers. PDF downloads can make sense as an additional resource, e.g. for printing, but the main information should be available as web content on your website.
How can I control which parts of my content AI search engines process?
robots.txt
- Specifically control which crawlers are allowed to see specific parts of your site
- Used by OpenAI. See their documentation on this.
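For example, OpenAI documents separate user agents for search and for training. The policy below is just one possible choice, allowing the search crawler while blocking the training crawler, and the `/internal/` path is a placeholder:

```txt
# Allow OpenAI's search crawler
User-agent: OAI-SearchBot
Allow: /

# Block the training crawler
User-agent: GPTBot
Disallow: /

# All other crawlers: keep internal pages out
User-agent: *
Disallow: /internal/
```

Check OpenAI's crawler documentation for the current list of user agents, as these can change.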
sitemap.xml
- Tells crawlers which pages are available on your site
- Ensure the sitemap is always up to date
- Ensure no broken links or irrelevant content is included here
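A minimal valid sitemap contains one `<url>` entry per page; the URL and date below are placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/rag-guide</loc>
    <lastmod>2025-10-01</lastmod>
  </url>
</urlset>
```

Keeping `<lastmod>` accurate gives crawlers a hint about which pages changed and are worth re-fetching.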
LLMs.txt
This is a proposed standard. The file should act as a guide for LLM-powered search engines to what your site is about and what content is important. Currently (Oct 2025) none of the major AI search providers read this file and it is unclear if they ever will. Our tests show that the file is currently only read by a small number of specialized bots.
There is no harm in setting up a good LLMs.txt but the content needs to be crafted carefully and regularly maintained as your site evolves.
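Following the proposed llms.txt format (an H1 with the site name, a blockquote summary, then sections of annotated links), a minimal file could look like this — site name and URLs are invented for illustration:

```markdown
# Example Corp

> Example Corp explains retrieval-augmented generation (RAG) and AI search optimization.

## Key pages

- [What is RAG?](https://www.example.com/rag): Introduction to retrieval-augmented generation
- [AI Search Tips](https://www.example.com/ai-search-tips): Practical optimization checklist
```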