Retrieval Eligibility: The Pre-Ranking Layer Nobody Is Measuring
Retrieval eligibility is whether your page is admitted into the candidate set that ranking algorithms and AI engines actually evaluate. It sits one layer below ranking, and for most of the last decade it was invisible to SEO tooling because the candidate window was small enough that "indexed" was a reasonable proxy for "eligible". That proxy is breaking.
I have been telling customers for the better part of a year that ranking position is the wrong unit of analysis for the next era of search. The receipts for that argument arrived earlier this month in a Search Engine Land piece by Martin Jeffrey, which pulled the most important admission of the decade from federal court transcripts. The case is US v. Google. The witness is Pandu Nayak, Google's VP of Search. The exchange runs four questions long.
Q: RankBrain looks at the top 20 or 30 documents and may adjust their initial score, correct? A: That is correct.
Q: RankBrain is an expensive process to run? A: It's certainly more expensive than some of our other ranking components.
Q: So that's, in part, one of the reasons why you just wait until you're down to the final 20 or 30 before you run RankBrain? A: That is correct.
Q: RankBrain is too expensive to run on hundreds or thousands of results? A: That is correct.
Four straight confirmations under oath. The deep-learning layer of Google ranking, the layer the entire SEO industry has built theory around, is deliberately withheld from the bulk of Google's index because Google cannot afford to apply it more widely. It runs on the final 20 to 30 pages. The rest of the corpus is culled to that window by classical retrieval before any modern signal touches it.
That ceiling held for a reason. The reason is about to change. This article is about what that means and what to do about it.
What Is Retrieval Eligibility?
Retrieval eligibility is admission. A page that is indexed but never enters a candidate set for any meaningful query is invisible. It can be perfectly written, technically sound, and rich with entities, and still earn zero traffic and zero AI citations. Ranking only matters once you are eligible.
There are two layers most rank trackers conflate. Layer one: is your page pulled into the candidate pool at all? Layer two: where do you sit within that pool? Most SEO tools measure layer two. Almost none measure layer one, because for years the layer-one decision was opaque and the candidate window was narrow enough that the question rarely surfaced.
This is true for traditional Google ranking. It is even clearer for AI retrieval. When ChatGPT, Perplexity, Claude, or Google AI Overviews answers a question, the system does not rank a result list. It retrieves a candidate set, evaluates the pages in that set, and synthesizes an answer from the strongest sources. Pages outside the set never had a chance.
Optimization that ignores the retrieval layer is optimization on the wrong axis.
Why the Google Ranking Window Is 20 to 30 Pages Wide
The 20 to 30 number is a budget, not a property of the algorithm. Google's hardware can afford to run deep-learning evaluation on 20 to 30 pages per query at Google scale. Run it on 200 pages and the bill goes up by an order of magnitude. Run it on the entire postings-list output and Google stops working as a business.
Here is the architecture Nayak described to Judge Mehta in the same testimony. Google starts with classical postings-list retrieval, walking inverted indexes for the terms in the query. The corpus gets culled to "tens of thousands" of pages. From that pool, only the top 20 to 30 reach the deep-learning reranker.
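To make the shape of that pipeline concrete, here is a minimal sketch of two-stage retrieval. The scoring functions are deliberately crude placeholders, not Google's systems; the only point is that the expensive model never sees more than a fixed-size window.

```python
# Two-stage retrieval: a cheap lexical pass culls the corpus, then an
# expensive learned reranker runs only on a small fixed-size window.
# Both scoring functions are placeholders, not Google's actual systems.

RERANK_WINDOW = 30  # the "20 to 30" budget from the testimony


def cheap_lexical_score(query: str, doc: str) -> float:
    """Crude term-overlap proxy for postings-list retrieval."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / (len(q_terms) or 1)


def expensive_rerank_score(query: str, doc: str) -> float:
    """Stand-in for a deep-learning reranker; imagine a model call here."""
    return cheap_lexical_score(query, doc)  # placeholder only


def search(query: str, corpus: list[str]) -> list[str]:
    # Stage 1: score everything cheaply, keep a wide but bounded pool.
    pool = sorted(corpus, key=lambda d: cheap_lexical_score(query, d), reverse=True)
    pool = pool[:10_000]  # "tens of thousands" in the testimony

    # Stage 2: the expensive model only ever sees the top of that pool.
    window = pool[:RERANK_WINDOW]
    return sorted(window, key=lambda d: expensive_rerank_score(query, d), reverse=True)
```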
The constraint is hardware economics, and the hardware has limits no amount of capital spend can solve in the short term. On April 7, Sundar Pichai sat down on the Cheeky Pint podcast with John Collison and Elad Gil and named five concurrent supply constraints: foundry wafer starts, memory, power and energy, data center permitting, and skilled labor. The line he pressed hardest on was memory. In his words, there is no way the leading memory companies are going to dramatically improve their capacity in the next 12 to 24 months. Higher prices do not create more capacity.
Nearest-neighbor vector search, the mechanism behind every modern semantic retrieval pipeline, is memory-bound. The wider the candidate set, the more memory you need. The cost of widening the candidate window has held the window in place.
How TurboQuant Changes the Math
Two weeks before that Pichai interview, Google Research published TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. The headline numbers from the paper:
- 4x to 4.5x compression of vector representations with performance comparable to unquantized models on the LongBench benchmark
- Nearest-neighbor search indexing time reduced to "virtually zero"
- Outperforms existing product quantization techniques on recall
The paper covers two applications. The first is KV-cache compression inside Gemini. The second is nearest-neighbor search in vector databases. The second one is what matters for retrieval.
If indexing is virtually free and memory per vector drops by a factor of four, the cost economics that capped the candidate window at 20 to 30 no longer apply. The same hardware that could evaluate 20 to 30 pages can now plausibly evaluate several times that. Google has not confirmed that TurboQuant is deployed in production search, but it already runs ScaNN, TurboQuant's predecessor, in production, and TurboQuant extends that work. The infrastructure is published; deployment is now an engineering question.
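To see where the headroom comes from, here is a back-of-the-envelope sketch of the memory math, using plain int8 scalar quantization as a stand-in for TurboQuant's more sophisticated scheme; the corpus size and embedding width are illustrative assumptions.

```python
import numpy as np

# Back-of-the-envelope memory math for a vector index. Plain int8 scalar
# quantization stands in for TurboQuant's scheme; the float32 -> int8 drop
# is where a roughly 4x reduction in bytes per vector comes from.

dims = 768               # typical embedding width (assumption)
docs = 100_000_000       # hypothetical corpus size

float32_bytes = docs * dims * 4
int8_bytes = docs * dims * 1

print(f"float32 index: {float32_bytes / 1e9:.1f} GB")   # ~307 GB
print(f"int8 index:    {int8_bytes / 1e9:.1f} GB")      # ~77 GB

# Same hardware, roughly 4x more vectors held in memory, or a candidate
# window several times wider at the same memory budget.

vec = np.random.randn(dims).astype(np.float32)
scale = np.abs(vec).max() / 127.0
quantized = np.clip(np.round(vec / scale), -127, 127).astype(np.int8)
reconstructed = quantized.astype(np.float32) * scale     # lossy but close
```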
The question for everyone else has shifted. It is no longer whether the cost boundary can be moved. It is what to do before it moves.
What Changes for Classical Google SEO When the Window Widens
Pages currently sitting outside the top 30 become competitors. Sites ranking in positions 11 to 50 for a query, effectively invisible today, become eligible for the deep-learning reranker. The reranker is what understands intent. It rewards content built around semantic depth instead of keyword density, and it is, on average, kinder to pages that earn citations from AI models than the classical retrieval layer is.
The implication: pages built for the next five to ten years of search (entity-rich, citation-friendly, structurally clear) start winning when the window widens. Pages built to game the current top-30 window start losing relative ground. Briefing content against "what ranks in positions 1 to 10 today" is briefing against a snapshot of a window that is narrower than it needs to be.
This is the SEO and GEO overlap. About 80% of what makes a page eligible for a wider candidate set is the same work that makes a page citation-friendly to AI engines: clear entity signaling, structurally clean content, technical health, real topical authority. The remaining 20% is where GEO has its own discipline, and it lives almost entirely in the retrieval layer. If you have read the GEO pillar or the GEO vs SEO comparison, this is the structural reason the overlap is so high. The same forces that earn AI citations are about to earn better Google rankings.
Retrieval Eligibility in AI Search
AI retrieval pipelines have always operated with retrieval eligibility as the front gate. The constraint there is different from Google's. Google was limited by the cost of running the reranker. AI engines are limited by whether they have crawled your site at all. For most sites I audit, the AI-side gate is the entire game.
When you ask Perplexity, ChatGPT, or Claude a question, the system decomposes your prompt into sub-queries. This is called query fan-out, and I covered the mechanics in What Is GEO in Marketing? It retrieves a candidate set from its index, built by its own dedicated retrieval bots. It evaluates the candidates. It synthesizes an answer from the strongest pages with citations.
Pages those retrieval bots have not fetched cannot be cited. Pages those bots have fetched but did not classify as authoritative get skipped. Pages with vague H2s, no clear answer in the first 100 words, no entity signaling, and no structured data become invisible by elimination at the retrieval stage, well before the synthesis layer ever looks at them.
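A minimal sketch of that fan-out-then-retrieve shape. The sub-query decomposition is hand-written for illustration; a real engine generates it with a model, and the retrieval function here is a toy lexical match, not any engine's actual scorer.

```python
# Query fan-out, sketched: one prompt becomes several sub-queries, each
# sub-query pulls its own candidates, and synthesis only ever sees the union.
# The decomposition below is hand-written; real engines generate it with a model.

def retrieve(sub_query: str, index: dict[str, str], k: int = 5) -> list[str]:
    """Toy lexical retrieval over a {url: page text} index."""
    terms = set(sub_query.lower().split())
    scored = [(len(terms & set(text.lower().split())), url) for url, text in index.items()]
    return [url for score, url in sorted(scored, reverse=True)[:k] if score > 0]


prompt = "What is retrieval eligibility and how do I measure it?"
sub_queries = [
    "retrieval eligibility definition",
    "measure AI bot crawl coverage server logs",
    "candidate set ranking vs retrieval",
]

index: dict[str, str] = {}  # {url: page text}, built by the engine's own crawlers
candidates = {url for sq in sub_queries for url in retrieve(sq, index)}
# Synthesis happens over `candidates` only; pages outside it were never in the running.
```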
The AI Bot Taxonomy You Should Be Tracking
Two classes of bot matter. The first builds the candidate corpus AI systems pull from. The second fetches pages on demand when someone asks an AI model about a topic your page covers.
Index crawlers worth tracking right now:
- OAI-SearchBot: ChatGPT search index
- Claude-SearchBot: Claude search index
- PerplexityBot: Perplexity index
- Applebot: Apple search and Apple Intelligence
- GPTBot: OpenAI training data corpus
- Google-Extended: robots.txt control token for Google's AI products (crawling happens under Google's existing user agents, so it does not appear as its own user agent in your logs)
- CCBot: Common Crawl, used by multiple AI training pipelines
- Bytespider: ByteDance / Doubao retrieval
User-driven agents that fetch on demand:
- ChatGPT-User: ChatGPT live browsing
- Claude-User: Claude live browsing
- Perplexity-User: Perplexity live browsing
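For tracking purposes, here is the same taxonomy expressed as data. A minimal sketch: the tokens are the published user-agent substrings at the time of writing, matching is deliberately loose because full UA strings vary and drift, and Google-Extended is omitted because it is a robots.txt control token rather than a fetching user agent.

```python
# AI bot user-agent tokens, grouped by class. Substring matching is deliberately
# loose because full UA strings vary by vendor and change over time.
INDEX_CRAWLERS = {
    "OAI-SearchBot": "ChatGPT search index",
    "Claude-SearchBot": "Claude search index",
    "PerplexityBot": "Perplexity index",
    "Applebot": "Apple search / Apple Intelligence",
    "GPTBot": "OpenAI training corpus",
    "CCBot": "Common Crawl",
    "Bytespider": "ByteDance / Doubao",
}

USER_DRIVEN_AGENTS = {
    "ChatGPT-User": "ChatGPT live browsing",
    "Claude-User": "Claude live browsing",
    "Perplexity-User": "Perplexity live browsing",
}

ALL_AI_BOT_TOKENS = {**INDEX_CRAWLERS, **USER_DRIVEN_AGENTS}


def classify_ai_bot(user_agent: str) -> str | None:
    """Return the matching bot token, or None if the UA is not a known AI bot."""
    for token in ALL_AI_BOT_TOKENS:
        if token.lower() in user_agent.lower():
            return token
    return None
```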
User-driven agents do not execute JavaScript. They make an HTTP request, parse the response, and move on. Every visit they make to your site is invisible to GA4, Adobe Analytics, Plausible, Fathom, and every other client-side analytics tool you use. If you cannot see them, you cannot tell whether your retrieval-side optimization is working.
The bot list moves. New AI providers ship new bots. Existing bots change their user-agent strings. Maintenance is real. If you build the tracking yourself, the maintenance becomes yours.
How to Measure Your Retrieval Coverage Today
The fastest check is server log analysis. Pull your access logs for the last 30 days. Filter for hits whose user-agent string matches one of the bots above. Count distinct page paths that have at least one matching hit. Divide by the count of canonical pages on your site. The percentage is your retrieval coverage.
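A minimal sketch of that calculation, assuming combined-format access logs and a canonical URL set pulled from your sitemap; the regex, field layout, and file paths are assumptions to adapt to your own stack, and filtering to the last 30 days is left out for brevity.

```python
import re
from urllib.parse import urlparse

# Retrieval coverage from raw access logs: distinct paths fetched by AI bots,
# divided by your canonical page count. Assumes combined log format; adjust
# the regex and the canonical URL source for your own stack.

LOG_LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"')

AI_BOT_TOKENS = [
    "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "Applebot",
    "GPTBot", "CCBot", "Bytespider",
    "ChatGPT-User", "Claude-User", "Perplexity-User",
]


def retrieval_coverage(log_path: str, canonical_urls: set[str]) -> float:
    covered: set[str] = set()
    with open(log_path) as f:
        for line in f:
            m = LOG_LINE.search(line)
            if not m:
                continue
            ua = m.group("ua")
            if any(token.lower() in ua.lower() for token in AI_BOT_TOKENS):
                covered.add(m.group("path").split("?")[0])

    canonical_paths = {urlparse(u).path for u in canonical_urls}
    return len(canonical_paths & covered) / len(canonical_paths) if canonical_paths else 0.0


# Example: coverage = retrieval_coverage("/var/log/nginx/access.log", sitemap_urls)
```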
For most sites I audit, including sites that rank well on classical Google, retrieval coverage starts low. Below 30% is typical. Below 10% is common. Pages with strong rankings and zero AI bot visits show up regularly, and those pages are leaking the most opportunity.
Four ways to capture this signal, in order of effort:
- Tail your existing access logs. Nginx, Apache, Caddy, CloudFront, ALB. Free, immediate, no integration. The trade-off: someone has to grep, parse, deduplicate, and join against your URL inventory. This becomes manual work fast.
- Cloudflare Worker or equivalent edge function. Capture matched bot visits in real time and forward them to a dashboard. Lower latency than log parsing, no impact on origin. Requires an edge layer; a server-side sketch of the same real-time matching follows this list.
- Server-side analytics with bot detection. Some platforms (Plausible Bots, Tinybird user-agent parsing, custom Snowplow event setups) can be configured to capture bot traffic specifically. Usually built for one analytics use case, so retrieval coverage is a side effect.
- A purpose-built retrieval tracker. This is what we are building inside IndexMind right now. The Retrieval Bot Tracker combines an edge collector or server log adapter with automatic bot classification and surfaces retrieval coverage as a first-class metric. Bob (our Agent OS) proactively flags high-impression pages with zero AI bot coverage as priority fixes. Live in alpha on getwrecked.com, shipping to IndexMind customers this quarter.
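For the real-time options, here is a minimal sketch of server-side capture as WSGI middleware, assuming a Python application server; a Cloudflare Worker would do the same user-agent matching at the edge before the request reaches origin.

```python
import json
import time

# Minimal real-time capture: WSGI middleware that records AI bot fetches
# server-side as JSON lines, before any JavaScript question arises.
AI_BOT_TOKENS = (
    "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "Applebot",
    "GPTBot", "CCBot", "Bytespider",
    "ChatGPT-User", "Claude-User", "Perplexity-User",
)


class AIBotCaptureMiddleware:
    def __init__(self, app, sink_path="ai-bot-hits.jsonl"):
        self.app = app
        self.sink_path = sink_path  # swap for a queue or HTTP sink in production

    def __call__(self, environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        token = next((t for t in AI_BOT_TOKENS if t.lower() in ua.lower()), None)
        if token:
            event = {
                "ts": time.time(),
                "bot": token,
                "path": environ.get("PATH_INFO", ""),
                "ua": ua,
            }
            with open(self.sink_path, "a") as f:
                f.write(json.dumps(event) + "\n")
        return self.app(environ, start_response)


# Usage: app = AIBotCaptureMiddleware(app)
```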
Why GA4 Cannot Show You This
GA4 fires when a page renders in a browser and JavaScript executes. AI retrieval bots do not execute JavaScript. From GA4's perspective, those visits never happened.
For sites whose primary value is content, this means GA4 systematically underreports the audience that determines whether you get cited in AI answers. Pages getting fetched by twelve different AI bots and recommended by Perplexity for queries in your category look identical in GA4 to pages that are invisible to every AI engine. The signal that matters for the next era of search lives outside the analytics layer most teams trust.
This is the gap behind the AI Attribution and Outcomes framing we use inside IndexMind. GA4 is necessary but insufficient. AI referral attribution needs server-side capture. Bot fetch capture needs server-side capture. The entire class of signal that matters for retrieval eligibility lives outside the GA4 layer. When a customer says "we are good, we have GA4", that is the moment to share this article. They are blind to the retrieval layer entirely.
Retrieval Eligibility Is a Different Audit From Ranking Eligibility
Ranking signals you already know: topical authority, link equity, query-intent match, technical health. Retrieval systems look for something more specific. They want a clear, self-contained, citable claim that can be extracted and evaluated without reading the entire document.
When I audit a page for retrieval eligibility, here is the checklist I run:
- Is the primary claim in the first 100 words?
- Is that claim tied to a verifiable entity, statistic, or named source?
- Are the H2s phrased as questions a real user would ask an AI engine, or are they decorative ("Our Approach", "Why It Matters", "The Solution")?
- Does each section answer one specific question?
- Is there an FAQPage schema with concise, citable Q and A pairs? (A minimal markup sketch follows this checklist.)
- Is the page free of buried-lede paragraphs that delay the answer past the third or fourth screen?
- Can you screenshot the first 300 words and have it function as a standalone answer?
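On the FAQPage item above, here is a minimal sketch of the markup, generated from Python for consistency with the other examples. The output is standard schema.org JSON-LD that belongs in a script tag of type application/ld+json; the single Q and A pair is illustrative.

```python
import json

# Minimal FAQPage structured data: one concise, citable Q&A pair per question.
# Embed the printed JSON in a <script type="application/ld+json"> tag.

faq = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
        {
            "@type": "Question",
            "name": "What is retrieval eligibility in SEO?",
            "acceptedAnswer": {
                "@type": "Answer",
                "text": "Retrieval eligibility is whether a page is admitted into the "
                        "candidate set that ranking algorithms or AI retrieval pipelines "
                        "evaluate. A page that never enters a candidate set is invisible "
                        "regardless of how well it is optimized.",
            },
        },
    ],
}

print(json.dumps(faq, indent=2))
```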
A page written purely for ranking often buries its main claim under preamble, context-setting, and caveats. A retrieval-ready page leads with the claim and earns trust with evidence afterward. The audits overlap, and they are different audits.
This is the work IndexMind's content audit skill enforces today, and it is the work we are tightening in the next release with a dedicated first-100-words citability check.
What to Do Before the Window Widens
Three actions, in order:
- Measure your retrieval coverage today. Run the log analysis above or install a tracker. Get a number before you start optimizing. The baseline is the most important diagnostic.
- Identify your highest-impression uncovered pages. Join your retrieval coverage data with GSC impression data. Sort by impressions descending. Filter to pages with zero AI bot visits. These are the pages with the most ranking demand and the least AI eligibility. They are the priority queue; a join sketch follows this list.
- Audit the priority queue for retrieval-friendliness. Use the checklist above. Fix the issues. Resubmit through sitemap, link from authoritative internal pages, and verify return visits over the following weeks.
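A minimal sketch of that join, assuming you have exported GSC page-level impressions to CSV and stored per-path AI bot hit counts from the log analysis above; the file names and column names are assumptions.

```python
import pandas as pd

# Priority queue: pages with high GSC impressions and zero AI bot fetches.
# File and column names are assumptions; adapt to however you export GSC data
# and however you stored per-path bot hit counts from your logs.

gsc = pd.read_csv("gsc_pages.csv")          # columns: page, impressions, clicks
coverage = pd.read_csv("ai_bot_hits.csv")   # columns: page, ai_bot_hits

merged = gsc.merge(coverage, on="page", how="left").fillna({"ai_bot_hits": 0})

priority_queue = (
    merged[merged["ai_bot_hits"] == 0]
    .sort_values("impressions", ascending=False)
    .loc[:, ["page", "impressions"]]
)

print(priority_queue.head(20))  # the most demand, the least AI eligibility
```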
The window is moving. Pages you fix today land inside the wider candidate set when it arrives. Pages you ignore stay outside it. The competitive surface shifts before rank trackers can register the change, so the work has to happen ahead of the visible signal.
Frequently Asked Questions
What is retrieval eligibility in SEO?
Retrieval eligibility is whether a page is admitted into the candidate set that ranking algorithms or AI retrieval pipelines evaluate. A page that is indexed but never enters a candidate set is invisible regardless of how well-optimized it is. Retrieval eligibility sits one layer below ranking and is determined by whether retrieval bots fetch and index the page, and whether the page's structure and content match what those systems prioritize.
How is retrieval eligibility different from indexation?
Indexation means Google or another search system has your page in its index. Retrieval eligibility means your page is admitted into the candidate set for a specific query. A page can be indexed and still never retrieved for any meaningful query, in which case indexation is a hollow signal. AI engines add another layer: even when your page is in their index, if their retrieval bots have not recently fetched it, citation eligibility is at risk.
Why is the Google ranking window only 20 to 30 pages?
Court testimony from Pandu Nayak in US v. Google confirmed Google's RankBrain deep-learning reranker is deliberately applied only to the final 20 to 30 candidate pages because it is too expensive to run on the full retrieved corpus. The constraint is hardware economics. Google Research's TurboQuant paper from March 2026 describes vector quantization that reduces these costs substantially, suggesting the window could widen meaningfully in the near future.
How do I know if AI bots are crawling my site?
The fastest method is server log analysis. Filter your access logs for user-agent strings matching known AI retrieval bots: OAI-SearchBot, Claude-SearchBot, PerplexityBot, Applebot, ChatGPT-User, Claude-User, Perplexity-User. Count distinct page paths these bots have visited. Pages they have not visited cannot be cited in AI answers regardless of how well they rank in Google. Purpose-built tools like IndexMind's Retrieval Bot Tracker automate this capture and join the data with GSC impressions for prioritization.
Will GA4 show me AI bot traffic?
No. AI retrieval bots do not execute JavaScript, so they do not fire GA4 tags. From GA4's perspective those visits never happened. Server-side capture is the only way to see this traffic. This is why retrieval eligibility tracking sits outside the standard analytics stack and requires either log analysis or an edge collector. Relying on GA4 alone leaves you blind to the retrieval layer entirely.
We ran this article through IndexMind's own AI visibility and retrieval-eligibility scoring before publishing. We practice what we preach. The Retrieval Bot Tracker described above is live in alpha on getwrecked.com, our live test environment, and ships to IndexMind customers this quarter. If you want to see your own retrieval coverage number, the free tier of IndexMind includes the baseline diagnostic.
IndexMind analyzes how ChatGPT, Perplexity, and other AI engines perceive your website, then helps you fix what is holding back your citations and visibility. We built it because we needed it ourselves. Every feature ships on getwrecked.com before it ships to you.
Ready to see how AI sees your business?
Measure your AI visibility, track citations, and get actionable recommendations.
Sign up today