The Two Data Pools Behind Every AI Answer | Understanding GEO Part 2

When AI builds an answer for a marketing query, it does not read your website. It reaches into two different pools of information, weighs what it finds, and constructs a response. One pool is what the model already knows, sealed in at training time. The other is what it can fetch from the live web in the moment.

Most marketers do not see this. They picture AI as a system that reads pages the way Google did. The shift from indexing to memory is bigger than that. Which pool the model draws from at any given moment changes which brands appear, which sources it trusts, and how confident it sounds about you.

If a CMO is going to develop one new lens on AI search this quarter, this is the one. Part 1 of this series covered the entity model. This part explains where the model gets its information from in the first place.

The two pools, plainly explained

Every AI engine works from two information sources, in different proportions depending on the model and the question.

The first pool is training data. This is what the model learned when it was built. A frozen snapshot of the internet, baked in at training time, weighed against billions of examples, then packed into the model's weights. The model does not look it up when you ask a question. The pattern is already inside it. This is the model's memory.

The second pool is live retrieval. This is what the model fetches at the moment you ask. The same idea as a search engine reading the web in real time. The model issues queries, pulls back current pages, reads what it finds, and uses that to construct or supplement its answer. This is the model's eyes.

Every answer ChatGPT, Gemini, Perplexity or Claude gives is some mix of the two. Some lean on memory. Some lean on retrieval. Most blend both. Knowing which one is doing the work in your category is where useful GEO strategy begins.

Memory tells the model what to believe about your brand. Live retrieval shows the model what is being said about your brand right now. The two do not always agree, and when they conflict, the brand pays for the model's decision.

Pool 1: Training data, the model's memory

Training data is everything the model was exposed to during its build. Books, Wikipedia, Reddit, news archives, forums, code repositories, the web scraped and filtered. Billions of pages, ingested, weighed for quality, then used to teach the model what concepts look like and which patterns hang together.

Three things matter about training data for a CMO.

It is frozen.

A model trained in October 2024 has a knowledge cutoff at that date. Brands that did not exist, or had no meaningful web presence before then, are absent. So is your last quarter of PR. A partnership announced in March 2026 is invisible to a model trained the year before, unless retrieval picks it up at query time. The cutoff date is the model's last memory.

It rewards consistency.

A model does not memorise pages from its training data; it learns patterns. A brand mentioned ten times across ten different trusted sources, and described consistently each time, becomes a strong pattern. A brand mentioned once on a single page barely registers. Five years of being talked about the same way by independent publishers beats one good month of coverage.
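
One way to make the idea of a "pattern" concrete is to count how many independent sources describe a brand, and how often they agree. The sketch below is a toy proxy only, not how any model is actually trained. The function name, the source domains and the descriptions are all hypothetical.

```python
from collections import Counter

def pattern_strength(mentions: list[tuple[str, str]]) -> dict:
    """Summarise brand mentions given as (source_domain, description) pairs.

    Toy proxy: counts independent sources and measures how many of them
    share the most common description. Hypothetical helper, for illustration.
    """
    if not mentions:
        return {"sources": 0, "top_description": None, "consistency": 0.0}
    # Keep the first description per source, so one site repeating itself
    # cannot inflate the count of independent sources.
    per_source: dict[str, str] = {}
    for source, description in mentions:
        per_source.setdefault(source, description.strip().lower())
    counts = Counter(per_source.values())
    top_description, top_count = counts.most_common(1)[0]
    return {
        "sources": len(per_source),
        "top_description": top_description,
        "consistency": top_count / len(per_source),
    }

# Invented mentions for an invented brand.
mentions = [
    ("news-site.example", "performance marketing agency"),
    ("trade-press.example", "Performance marketing agency"),
    ("directory.example", "performance marketing agency"),
    ("blog.example", "web design studio"),
    ("news-site.example", "ad agency"),  # same source again, ignored
]
print(pattern_strength(mentions))
```

In this made-up data, three of four independent sources use the same description. A score like this says nothing about what a model has actually learned, but it is a cheap way to spot inconsistent positioning across third-party coverage.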

It is high confidence when present.

When the model has a strong pattern about your brand from training, it talks about you with conviction. The descriptions are stable across queries. The associations are clear. The placement in the category is unambiguous. This is the position established brands occupy by default, and it is the prize you build over years.

Months to years

How long it takes for a major model's training data to refresh.

Frontier models from OpenAI, Google and Anthropic ship new training runs every few quarters at best. If you waited for training data to catch up, your brand would be effectively invisible for half a year at a time.

Pool 2: Live retrieval, the model's eyes

Live retrieval is what the model fetches in real time when it needs information it does not have, or wants to verify what it does have. Most modern AI products do this. ChatGPT browses through Bing. Perplexity queries its own index, which runs on Vespa. Gemini taps Google's index. Each engine has its own retrieval architecture, but the principle is the same: when the model needs current or specific information, it goes and gets it.

Three things matter about live retrieval for a CMO.

It is real time.

A page published yesterday can be cited today. A campaign launched this week can shape AI answers this week. A press release from this morning is fair game. This is the part of GEO that responds quickly. A clever piece on a competitor's site this Friday can move share of voice by Tuesday.

It rewards crawlability and structure.

Retrieval favours pages the engine can read, parse and trust. Solid metadata, schema markup, headings, factual density and content depth all matter here. So does the publisher. A page on a domain the engine already trusts is more likely to be pulled than the same content on a domain it does not.
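
Crawlability is the one part of this that can be checked mechanically. As a minimal sketch, the snippet below uses Python's standard `urllib.robotparser` to test whether a robots.txt file lets particular crawlers fetch a page. The robots.txt content and the URL are invented, and the list of user agents is an assumption for illustration; check each engine's own documentation for the crawler names it currently uses.

```python
from urllib.robotparser import RobotFileParser

def allowed_agents(robots_txt: str, url: str, agents: list[str]) -> dict[str, bool]:
    """Return, per user agent, whether robots_txt permits fetching url."""
    parser = RobotFileParser()
    # Parse the text directly, so the check needs no network access.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}

# Invented robots.txt that blocks one AI crawler and allows everything else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

agents = ["GPTBot", "PerplexityBot", "Googlebot"]  # assumed list, for illustration
print(allowed_agents(ROBOTS_TXT, "https://www.example.com/services/geo", agents))
```

A page that a crawler is not allowed to fetch cannot be retrieved, however good its content is, so a check like this is a sensible first step before any work on structure or schema.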

Confidence is lower than memory.

Retrieval gives the model an input, not a belief. The information surfaces in the answer but with weaker grounding than a training-data pattern. The retrieval citation might disappear from the next session if a different source ranks higher in that moment. This is the volatility that surprises marketing teams used to organic search rankings.

Real time

How fast live retrieval lets your brand appear in AI answers.

If your page is on the live web, indexable, and on a domain the engine respects, the answer can include you today. Retrieval lowers the time-to-citation from months to hours. The bar is structural, not editorial.

How AI decides which pool to use

Models do not choose pools arbitrarily. The decision follows the question.

General knowledge questions lean on training. 'What is Generative Engine Optimisation' pulls almost entirely from memory. The model has seen the concept enough times to talk about it with confidence. Retrieval may add nothing useful.

Specific or recent questions lean on retrieval. 'Who won the New Generation Award for AI in 2025' forces retrieval, because the answer changes year to year and the model knows it cannot trust its memory for it. The whole answer is constructed from whatever crawl results come back.

Brand questions blend both. 'Which AI marketing agencies should I consider in South Africa' pulls from memory (the brands the model already associates with the category) and from retrieval (current listings, recent press, fresh evidence), then weighs the two together. Most CMO-relevant questions sit here.
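
The three cases above can be sketched as a toy routing rule. Real engines make this decision inside the model and its tooling, not with keyword lists. The function name and both marker sets below are invented purely to illustrate the logic.

```python
import re

# Hypothetical marker words, chosen only to illustrate the three cases.
RECENCY_MARKERS = {"latest", "today", "current", "won", "price", "news"}
BRAND_MARKERS = {"which", "best", "recommend", "consider", "agencies", "vs"}

def likely_pool(question: str) -> str:
    """Guess which pool a question leans on: 'memory', 'retrieval' or 'blend'."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    mentions_year = re.search(r"\b20\d{2}\b", question) is not None
    if mentions_year or words & RECENCY_MARKERS:
        return "retrieval"  # time-sensitive: memory cannot be trusted
    if words & BRAND_MARKERS:
        return "blend"      # shortlist-style brand question
    return "memory"         # general knowledge

for q in [
    "What is Generative Engine Optimisation",
    "Who won the New Generation Award for AI in 2025",
    "Which AI marketing agencies should I consider in South Africa",
]:
    print(f"{likely_pool(q):9} <- {q}")
```

Sorting your own category's common questions into these three buckets, even by hand, shows where memory work matters most and where retrieval work does.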

Each engine differs in how it balances the two. ChatGPT leans more on training data because of how its retrieval is layered in. Perplexity is retrieval-first by design. Claude weighs both. Gemini has direct access to Google's index, which changes the calculation again. We cover the consequences of that fragmentation, with the supporting data, in our research piece on why your SEO does not make you visible in ChatGPT.

What this means for your brand

Two pools means two work plans, running in parallel. Skip either one and you ship a half-strategy.

Long-game: feed the training data. Build consistent, repeatable signals across enough trusted sources for the next training run to embed them. Press visibility, expert positioning, Wikipedia accuracy, third-party publication mentions, sameAs and schema work, thought leadership from your executives in stable venues. None of this delivers same-quarter wins. All of it compounds.

Short-game: optimise for live retrieval. Crawlable, well-structured, high-quality content on owned and partner properties. Schema markup on every page that matters. Coverage on the platforms each engine prefers to read. This is the part of GEO that delivers in weeks, not years, and the part most agencies focus on because the wins are visible.
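
The schema and sameAs work named in both plans is easier to brief when there is something concrete to point at. The sketch below builds a schema.org Organization block with sameAs links, ready to embed in a page as JSON-LD. The brand name, URLs and function name are placeholders, and a real implementation would usually carry more properties than this.

```python
import json

def organization_jsonld(name: str, url: str, same_as: list[str]) -> str:
    """Build a minimal schema.org Organization description as a JSON-LD string."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        # sameAs points at other profiles that describe the same entity.
        "sameAs": list(same_as),
    }
    return json.dumps(data, indent=2)

# Placeholder brand and profile URLs, for illustration only.
markup = organization_jsonld(
    "Example Brand",
    "https://www.example.com",
    [
        "https://www.linkedin.com/company/example-brand",
        "https://en.wikipedia.org/wiki/Example_Brand",
    ],
)
print(f'<script type="application/ld+json">\n{markup}\n</script>')
```

The point of sameAs is consistency: every profile it lists should describe the brand the same way the page does, which serves the long game and the short game at once.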

The mistake brands make is choosing one. Pure retrieval optimisation works for new entrants but caps the upside. Pure brand authority work compounds slowly and leaves quick wins on the table. The brands winning GEO in 2026 are doing both at once, and treating them as two halves of the same strategy.

The established brand advantage, and the newer brand opportunity

Established brands have a structural advantage in pool one. Decades of news coverage, Wikipedia entries, partnership pages and reviews accumulate into deep training-data patterns. When the model is asked, it already has a confident view. Standard Bank, Allan Gray, FNB, Old Mutual, Discovery: the model knows them. The work for these brands is keeping the pattern current, not introducing it.

Newer brands have to rely more on pool two. They do not have a decade of distributed third-party signal. What they have is the live web, and the discipline to be present on it consistently. The work is faster, more direct, more measurable, and more competitive, but it is also the only path open.

Both groups have to do both kinds of work eventually. Established brands cannot lean forever on memory the model formed three years ago, because the world updates and the next training run will reflect that. Newer brands cannot stay forever on retrieval, because to get cited with confidence in the long run, they need the pattern to land in memory too. The bridge between the two pools is the GEO work that compounds.

Two pools, two playbooks

1. Every AI answer is some blend of training data (memory) and live retrieval (eyes). Knowing which pool is in play for your category is the foundation of GEO strategy.
2. Training data is frozen at the cutoff, refreshes every few quarters, rewards multi-year consistent signal, gives the model confidence when present.
3. Live retrieval is real-time, crawlable, weighted by source trust. It gives the model evidence, with weaker grounding than memory.
4. Established brands benefit structurally from both pools. Newer brands lean more on live retrieval, and need a consistent, crawlable third-party presence to compete at all.
5. A serious GEO programme runs two work plans in parallel: the long-game that feeds training data, and the short-game that wins live retrieval. One without the other is a half-strategy.

How Algorithm thinks about this

Algorithm built its GEO methodology around the two-pool model because the alternative does not survive client work. Treating AI as 'the next SEO' misses the structural difference. Treating it as one undifferentiated channel misses the per-engine variation.

Our Lighthouse GEO platform measures both pools, separately, for every client we run. Memory by tracking the consistent description the model carries about the brand across queries. Retrieval by tracking which live sources each engine pulls into the answer in real time. The gap between the two is usually the work plan.

This is what we mean when we say performance architecture, not agency activity. Two pools, two playbooks, one connected strategy. The brands that get this right in 2026 will be the brands AI cites by default in 2027.

The next step

Part 3 of the Understanding GEO series unpacks the difference between citations and mentions. Both move the model. Each does different work. Sign up for the series to get every new part the day it publishes, or book a Lighthouse visibility audit to see exactly which pool is working for your brand and which is not.


Graeme Stiles
