Information gain vector audit: stop publishing redundant content
Information gain is the new dominant ranking signal. Audit your content inventory for cosine similarity to the SERP centroid, prune the redundant, inject orthogonal data. Q2 2026 playbook with Cubitrek client case.


The era of “10x Content” is effectively over. The new algorithmic imperative is Information Gain.
For years, the standard content strategy was straightforward: scrape the top 10 search results, aggregate their headings, refine the prose, and enhance the graphics. In the age of LLMs and semantic search, this strategy is now a liability.
Google’s ranking systems and modern Answer Engines (like SearchGPT or Perplexity) do not value “better” versions of the same information. They value novelty. If your content vector aligns too closely with the existing consensus, you are mathematically classified as “redundant.”
This article breaks down the engineering reality behind Google’s “Information Gain” patent concepts and provides a mathematical framework for auditing your content inventory.

The Engineering Reality: From Keywords to Vector Space
To understand why your “comprehensive guide” isn’t ranking, you must understand how modern search engines parse relevance. They no longer just match keywords; they map content into High-Dimensional Vector Space.
Every piece of content is converted into a vector embedding, a numerical representation of its semantic meaning.
This is also why mitigating hallucination with structured content has become an engineering requirement. Without explicit structure, LLMs interpolate meaning from noisy embeddings, increasing the probability of misinterpretation and synthetic inaccuracies.
- The Consensus Cluster: For any given query (e.g., “SaaS churn benchmarks”), the current top-ranking pages tend to cluster together in vector space. They cover the same definitions, cite the same three studies, and offer the same best practices.
- The Centroid: We can calculate the “centroid” (the geometric centre) of this top-ranking cluster. This represents the “average knowledge” currently available on the topic.
The Algorithm’s View
When you publish a new article that effectively rewrites the top 10 results, your content’s vector embedding lands almost exactly on top of that centroid.
From an engineering standpoint, your Cosine Similarity to the existing results is nearly 1.0 (or 100%).
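To make the geometry concrete, here is a minimal sketch of the centroid and cosine similarity calculation. The three-dimensional vectors are toy values invented for illustration (real embeddings have hundreds or thousands of dimensions), and the variable names are assumptions, not part of any search engine's implementation.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for three top-ranking pages (hypothetical values).
top_ranking = [
    [0.90, 0.10, 0.00],
    [0.80, 0.20, 0.00],
    [0.85, 0.15, 0.00],
]
rewrite_of_top_10 = [0.86, 0.14, 0.00]  # restates the consensus
original_research = [0.50, 0.10, 0.70]  # adds a direction the cluster lacks

serp_centroid = centroid(top_ranking)
sim_rewrite = cosine_similarity(rewrite_of_top_10, serp_centroid)
sim_original = cosine_similarity(original_research, serp_centroid)
print(round(sim_rewrite, 3), round(sim_original, 3))
```

In this toy setup the rewrite scores very close to 1.0 against the centroid, while the page with a component the cluster lacks scores far lower.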

Here is the harsh reality: If your content has high cosine similarity to the consensus, it adds zero entropy to the system. It is redundant. In a retrieval-augmented generation (RAG) environment, the AI will simply prune your document because it offers no new tokens to generate an answer.
Decoding the "Information Gain" Patent
Google’s research into “Information Gain” (referenced in patents such as US20200349169A1 concerning contextualising content) explicitly targets this redundancy.
The goal of the scoring system is to determine whether a user has already consumed Document A and, if so, how much new knowledge they gain by reading Document B.
If your content effectively mirrors the semantic footprint of the existing search engine results page (SERP), your Information Gain Score is negligible.
The Equation of Redundancy
We can conceptualise the “Information Gain Penalty” as follows:

- Relevance: How well you answer the prompt (Standard SEO).
- Similarity: How closely you mimic the competition.
Most content teams maximise relevance but overlook similarity. If your Similarity is 0.99, your Information Gain Score approaches zero. You are effectively invisible to the ranking algorithm because you are offering a “duplicate vector.”
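One way to write this relationship, consistent with the two terms above and with the observation that a Similarity of 0.99 drives the score toward zero, is sketched below. The multiplicative form is an assumption for illustration, not a formula taken from the patent.

```latex
\text{Information Gain Score} \approx \text{Relevance} \times \left(1 - \text{Similarity}\right)
```

Under this sketch, a page with Relevance 0.9 and Similarity 0.99 scores 0.9 × 0.01 = 0.009, while the same Relevance at Similarity 0.60 scores 0.9 × 0.40 = 0.36.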
The Audit: Are You Publishing Noise?
To audit your content strategy, you need to stop asking “Is this well-written?” and start asking “Does this change the vector?”
A true Information Gain Audit evaluates your planned content against three “Data Vectors”:
- The Entity Vector
Does your content introduce new named entities (people, proprietary tools, specific locations, new frameworks) that do not appear in the top 10 results?
- Redundant: Mentioning “HubSpot” in a CRM article.
- Gain: Introducing a new proprietary metric like “Customer Velocity Rate.”
- The Data Vector
Are you citing the same stats as everyone else?
- Redundant: Citing the 2021 McKinsey report everyone else links to.
- Gain: Publishing exclusive N=500 survey data that contradicts the McKinsey report.
- The Perspective Vector
Is the sentiment and structural logic identical?
- Redundant: “5 Ways to Improve Retention” (Listicle format, positive sentiment).
- Gain: “Why Retention Strategies Fail” (Diagnostic format, critical/contrarian sentiment).
And in the AI era, this is directly tied to measuring brand presence with SOM, because visibility is no longer about rankings alone, but about how often your brand appears inside model-generated answers.
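The Entity Vector check lends itself to a simple set comparison. The sketch below assumes the named entities have already been extracted from each page (by hand or with an NER tool of your choice); the function name `novel_entities` and the competitor entity lists are hypothetical placeholders.

```python
def novel_entities(draft_entities, competitor_entity_sets):
    """Return draft entities that no competitor page mentions (case-insensitive)."""
    seen = set()
    for entities in competitor_entity_sets:
        seen |= {e.lower() for e in entities}
    return sorted(e for e in set(draft_entities) if e.lower() not in seen)

# Hypothetical entity sets for two top-10 pages on a CRM query.
competitors = [
    {"HubSpot", "Salesforce"},
    {"hubspot", "Pipedrive"},
]
draft = {"HubSpot", "Customer Velocity Rate"}

gain = novel_entities(draft, competitors)
print(gain)
```

An empty result suggests the draft introduces no entity the SERP does not already cover.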
The Strategic Pivot: Budgeting for Vector Injection
This is where the Content Strategist must pivot the conversation with finance and leadership.
Cheap content (AI-generated or low-cost freelance) is a “Mean Reversion” machine. LLMs are trained to predict the most likely next word, which effectively means they are designed to produce the average of all human knowledge. If you use AI to write your core content without heavy modification, you are mechanically generating a vector that sits perfectly on the centroid. You are paying for mediocrity.
The Business Case for Original Research
To achieve a high Information Gain Score, you must force the vector to move orthogonal to the consensus. The only reliable way to do this is Original Research.
When you commission a survey, interview subject matter experts (SMEs), or release proprietary internal data, you are essentially buying a New Data Vector.
- Old Pitch: “We need $5,000 for a whitepaper because it establishes thought leadership.”
- New Pitch: “We need $5,000 for original research because currently, our content is mathematically indistinguishable from our competitors. The search algorithms are pruning our pages because they have a high semantic overlap with existing results. This budget allows us to inject novel data points, lowering our cosine similarity score and triggering the Information Gain boost.”
Q2 2026 update: information gain is now the dominant ranking signal
Two shifts since this post first published in early 2026:
- Google's March 2026 core update doubled down on the information-gain scoring. Sites that ship reskinned versions of competitor content saw 20-40% organic-traffic drops. Sites publishing first-person operator data, original benchmarks, or proprietary research either held flat or gained. The vector-similarity penalty became sharper, not softer.
- AI engines (ChatGPT, Perplexity, Gemini 2, Claude 4) now explicitly cite the most-orthogonal source in synthesised answers. When five sources all say the same thing about a topic, the AI cites the one that adds something the other four did not. That citation drives 3-4x higher conversion than cold Google traffic, which means information gain now compounds across both classic search AND AI-cited search.
The strategic implication: if you have a content backlog from 2024-2025 that was written to "out-comprehensive" competitors, every page on that backlog is a liability. Audit, prune, or re-vector.
Case study: Cubitrek client, information gain audit drives a 4x AI-citation lift
A B2B SaaS client in the marketing automation category came to us in Q4 2025 with a familiar problem: 80 published blog posts, top-5 Google ranks on the head terms, but zero citations across ChatGPT, Perplexity, or Gemini when buyers asked any question in their category.
We ran an information gain audit on all 80 posts. The pattern was unmistakable:
- 62 posts had cosine similarity ≥ 0.92 to the top-10 SERP for their target query (mathematically redundant)
- 14 posts had cosine similarity 0.7-0.9 (partially novel, mostly recycled)
- Only 4 posts carried meaningful information gain (proprietary data or first-person operator insight)
The intervention (3 months):
- Killed 28 of the 62 redundant posts outright (301-redirected to closest non-redundant peer)
- Re-vectored 34 of the redundant posts by injecting one unique data point each (proprietary benchmark, named operator example, or contrarian framing)
- Commissioned three rounds of original research (N=200 marketing-ops survey, internal product-usage benchmark, customer interview series)
Results after 3 months:
[Figure: Information gain audit results]
The Google lift was a bonus — the AI citation lift was the headline. The same orthogonality that earned citations inside ChatGPT also pulled the retained posts away from the SERP-cluster centroid, which is what the March 2026 core update started rewarding directly.
Tooling: how to actually run an information gain audit
Three steps any team can execute:
- Pull the top 10 SERP for each target query. Use Ahrefs, Semrush, or DataForSEO. Scrape the body text.
- Embed each competitor page and your draft. OpenAI's text-embedding-3-large or Voyage's voyage-large-2 work. Calculate cosine similarity between your draft and the centroid of the top-10.
- Set a threshold. Below 0.85 cosine similarity is good (genuinely novel content). 0.85-0.92 is borderline (inject more proprietary data). Above 0.92 is a liability (rewrite or kill).
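The threshold step can be sketched as a small classifier applied to similarity scores you have already computed. The URLs and scores below are invented for illustration, and the handling of the exact boundary values (0.85 and 0.92 both count as borderline here) is an assumption.

```python
def classify(similarity, novel_below=0.85, liability_above=0.92):
    """Map a cosine similarity to the three audit bands described above."""
    if similarity < novel_below:
        return "novel"
    if similarity > liability_above:
        return "liability"
    return "borderline"

# Hypothetical per-page similarity to the top-10 centroid.
scores = {
    "/blog/post-a": 0.97,
    "/blog/post-b": 0.88,
    "/blog/post-c": 0.61,
}

report = {url: classify(score) for url, score in scores.items()}
for url, band in report.items():
    print(url, band)
```

Keeping the thresholds as parameters makes it easy to recalibrate them, since absolute cosine values vary between embedding models.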
The Cubitrek AEO Platform automates this audit across an entire content inventory. It flags every redundant page and surfaces the specific vectors that need injection (entity, data, or perspective). Free tier returns your top-50 page scores in 90 seconds.
Frequently asked questions
1) What is information gain in SEO?
Information gain is the measurable novelty your content adds to a topic relative to what already ranks for the same query. Google's "Information Gain Score" (referenced in patent US20200349169A1) compares the semantic vector of your page against the centroid of the existing top-10 results. The higher the orthogonality (lower cosine similarity), the higher the gain. As of 2026, gain is one of Google's strongest ranking signals, and the primary signal AI engines use to decide which source to cite when multiple results overlap.
2) How do you measure information gain on a single page?
Embed your page and the top-10 SERP results for your target query using a modern embedding model (OpenAI text-embedding-3-large, Voyage voyage-large-2, or Cohere Embed v3). Calculate the cosine similarity between your page vector and the average of the top-10 vectors. Below 0.85 is genuinely novel. Above 0.92 is redundant. The Cubitrek AEO Platform automates this at scale.
3) Does AI-generated content always have low information gain?
Yes, by default. LLMs are trained to predict the most likely next token, which means they reproduce the consensus by design. Pure AI-generated content lands almost exactly on the SERP centroid. To make AI-drafted content gain-positive, you have to inject orthogonal data manually: proprietary stats, named operator quotes, contrarian framing, or original research. AI as a production tool is fine. AI as the entire content strategy is a liability under the March 2026 core update.
4) How is information gain different from E-E-A-T?
E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness) measures who is publishing the content. Information gain measures what the content adds. The two reinforce each other: high-E-E-A-T authors are more likely to publish gain-positive content (because they have firsthand data and operator experience). But a high-E-E-A-T author who publishes a generic listicle still scores low on information gain, and Google now demotes it accordingly.
5) What is the fastest way to add information gain to existing content?
Three moves, ranked by impact: (1) inject original first-person data (your own funnel metrics, your own customer interviews, your own A/B test results), (2) introduce a proprietary framework or named metric the SERP centroid does not use, (3) take a contrarian stance backed by reasoning, even when the consensus is correct. Cosmetic edits (better intros, more images, longer paragraphs) do not move the vector.
6) Should I delete low-information-gain pages or rewrite them?
If the page has zero backlinks and no Google traffic, delete and 301-redirect to a closer non-redundant peer. If the page has historical traffic or backlinks, rewrite it: inject one or two orthogonal data points and bump the updatedAt. The 28 posts we killed for the client above were all backlink-zero, traffic-zero pages. The 34 we re-vectored had historical equity worth preserving.
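The delete-or-rewrite rule above can be expressed as a short decision function. This is a sketch only: the function name and the `monthly_visits` input are assumptions, and it is meant to be applied to pages already flagged as low information gain.

```python
def prune_or_rewrite(backlinks, monthly_visits):
    """Decide what to do with a page already flagged as low information gain."""
    if backlinks == 0 and monthly_visits == 0:
        # No equity to preserve: remove and redirect to a non-redundant peer.
        return "delete and 301-redirect"
    # Some historical equity: keep the URL and inject orthogonal data.
    return "rewrite"

print(prune_or_rewrite(backlinks=0, monthly_visits=0))
print(prune_or_rewrite(backlinks=3, monthly_visits=0))
```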
Conclusion
In the age of AI, redundancy is the primary failure mode.
If you cannot mathematically prove that your content adds new information to the corpus, you shouldn't publish it. The Information Gain Score is not just a patent metric; it is the new definition of quality. The March 2026 core update made that explicit. Stop paying for words. Start paying for new vectors.
Key takeaways
- Audit your content inventory for cosine similarity to the top-10 SERP. Anything above 0.92 is a liability.
- AI-generated content lands on the SERP centroid by design. Inject proprietary data or kill the page.
- Original first-person research is the highest-leverage information gain move. One survey > 10 generic blog posts.
- Pair information gain with a Brand Hub plus llms.txt so the AI knows which novel data point came from your brand.
- Track information gain alongside Google rank. The Cubitrek AEO Platform automates the cosine-similarity scoring at scale.

Faizan Ali Khan
Founder of Cubitrek. Ships agentic AI systems that automate sales, marketing, and operations for SaaS, e-commerce, and real estate companies. Coined the term 'single-player agency' in 2026.