Tutorial · April 28, 2026

How to build an LLMO tool

Compare how ChatGPT, Claude, Gemini, and Perplexity treat your brand differently. Per-model gap reports drive specific optimizations the rollup score hides.

TL;DR
LLMO is GEO with the engineering hat on. Run the same prompts through every LLM separately. Diff the answers. Find where models disagree about your brand. Use those gaps to drive content and outreach. The build: ~120 lines, ~$60 a month.

The LLMs are not interchangeable.

Most GEO and AEO content treats them as a single audience: "AI search." It is a useful framing for executives but a misleading one for engineers. ChatGPT, Claude, Gemini, and Perplexity each have wildly different default behaviors. They use different grounding models, different ranking systems, different citation policies, and different handling of comparison and recommendation queries.

LLMO is the practice of measuring those differences and optimizing per-model. If GEO is "what is my visibility," LLMO is "why does Claude refuse to mention me on queries where ChatGPT happily does." That gap is the most actionable thing in the data.

This post walks through building an LLMO tool from scratch in around 120 lines. The trick is to run the same prompt grid through each model separately and diff the results, instead of rolling up into one score that hides the per-model signal.

What an LLMO tool actually does

Three jobs.

Per-model behavior capture. Run a fixed prompt grid through each LLM separately. Save the full answer plus the structured fields: brands mentioned, citations, ranks, sentiment.

Cross-model diffing. For each prompt, identify where models agree and where they diverge. Models that agree are the consensus view. Models that diverge are where optimization opportunities live.

Per-model gap reports. "On prompt X, Claude does not mention you but Perplexity ranks you #2 and cites linear.app/method as the source." That is the gap report shape. It tells you which content investment closes which gap.
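
For concreteness, one row of that report might look like the sketch below. The field names are illustrative but mirror the diff script later in this post; the values are made up.

gap-row-example.mjs
// A hypothetical gap-report row for one prompt. Values are invented for illustration.
const exampleRow = {
  prompt: "best project management tool for engineers",
  byModel: {
    perplexity_live: { mentioned: true, rank: 2, cited_url: "https://linear.app/method" },
    chatgpt_live:    { mentioned: true, rank: 4, cited_url: null },
    quick:           { mentioned: false, rank: null, cited_url: null }, // Claude API view
  },
  favorable_model: "perplexity_live",
  hostile_models: ["quick"],
};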

The classic GEO mistake is to optimize for the average. The classic LLMO insight is that the average hides everything interesting. Optimizing for ChatGPT alone is wasteful if you already win there. The leverage is in the model where you currently lose.

Per-model behavioral differences (with sources)

Real, measurable differences across the four major LLMs:

Citation density. Perplexity averages 6-10 cited URLs per answer. Claude averages 0-2. Gemini sits at 3-6. ChatGPT search at 3-5. Source: our own production traffic measurement, n=10,000 calls.

Comparison query handling. Claude refuses or qualifies comparison answers ("X vs Y") significantly more than ChatGPT or Gemini. Result: comparison-heavy categories should underweight Claude in their LLMO investments.

Query fan-out. Per Ekamoira's late 2025 research, Gemini silently expands a single user prompt into 3-7 sub-queries and synthesizes across them. ChatGPT and Claude do this less aggressively. Source URLs from the sub-queries show up in the final answer but are invisible to API-only tracking.

API/UI divergence. ChatGPT API and UI answers diverge on 96% of queries (sample n=1,000). Gemini at 88%. Claude at 71%. Perplexity at 42%. The higher the gap, the more important UI-mode tracking becomes for that model. Source: our 1,000-query teardown.
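
You can get a rough read on this divergence for your own grid by comparing the API-only quick answer with each live mode's answer and counting how often the tracked-brand mention set differs. This is a much cruder proxy than the full-answer comparison behind the numbers above, and it assumes the grid object built by run-grid.mjs further down:

divergence.mjs
// Share of prompts where a live (UI-scrape) mode's tracked-brand mentions differ
// from the API-only quick answer. Crude proxy for API/UI divergence.
function divergenceRates(grid, liveModes = ["chatgpt_live", "gemini_live", "perplexity_live"]) {
  const rates = {};
  for (const mode of liveModes) {
    let diverged = 0;
    let total = 0;
    for (const modes of Object.values(grid)) {
      const names = (data) =>
        new Set((data?.brands ?? []).filter((b) => b.mentioned).map((b) => b.name));
      const api = names(modes.quick);
      const ui = names(modes[mode]);
      total += 1;
      const same = api.size === ui.size && [...api].every((n) => ui.has(n));
      if (!same) diverged += 1;
    }
    rates[mode] = total ? diverged / total : 0;
  }
  return rates;
}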

The LLMO investment thesis: optimize for the LLM where the gap to your competitors is largest, not the LLM with the most users. The leverage is in the gap, not the volume.
Get the per-model data layer
MentionsAPI exposes each LLM separately so you can diff them. ChatGPT-live, Claude (API only currently), Gemini-live, Perplexity-live. PAYG wallet from $10.

The build

Build the test prompt grid

Aim for 30 prompts that span three intent buckets: definition, comparison, and how-to. Definition prompts surface basic awareness behavior. Comparison surfaces ranking behavior. How-to surfaces citation behavior.

prompts.json
{
  "brand": "Linear",
  "competitors": ["Jira", "Asana", "ClickUp", "Notion"],
  "prompts": {
    "definition": [
      "what is Linear used for",
      "is Linear a project management tool",
      "what is Linear good for"
    ],
    "comparison": [
      "Linear vs Jira",
      "Linear vs Asana for engineering",
      "best project management tool for engineers"
    ],
    "how_to": [
      "how to migrate from Jira to Linear",
      "how to set up Linear for a startup",
      "how does Linear's keyboard navigation work"
    ]
  }
}
Run each prompt through every LLM

Call /v1/check with each per-provider mode. The available modes are chatgpt_live, gemini_live, and perplexity_live for the UI-scrape variants, plus the standard quick mode for the API-only view.

run-grid.mjs
import config from "./prompts.json" with { type: "json" };

const MODES = ["chatgpt_live", "gemini_live", "perplexity_live", "quick"];
const ALL_PROMPTS = Object.values(config.prompts).flat();

async function runOne(prompt, mode) {
  const res = await fetch("https://api.mentionsapi.com/v1/check", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.MENTIONSAPI_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      mode,
      query: prompt,
      track_brands: [config.brand, ...config.competitors],
    }),
  });
  return res.json();
}

const grid = {};
for (const prompt of ALL_PROMPTS) {
  grid[prompt] = {};
  for (const mode of MODES) {
    grid[prompt][mode] = await runOne(prompt, mode);
  }
}
console.log(JSON.stringify(grid, null, 2));

Note: Claude UI scraping is not yet shipped on MentionsAPI (Q3 2026 target due to Claude.ai session expiry under load). Use quick mode for the Claude API view in the meantime.

Diff the per-model behavior

For each prompt, compute three things: which models mention your brand, what rank, and which models cite which URLs. The gap report is the cross-product.

diff.mjs
import { readFileSync } from "node:fs";

// Load the grid produced by run-grid.mjs (e.g. node run-grid.mjs > grid.json).
const grid = JSON.parse(readFileSync("./grid.json", "utf8"));

function buildDiffReport(grid, brand) {
  const report = [];
  for (const [prompt, modes] of Object.entries(grid)) {
    const row = { prompt, byModel: {} };
    for (const [mode, data] of Object.entries(modes)) {
      const me = data.brands?.find(b => b.name === brand);
      row.byModel[mode] = {
        mentioned: !!me?.mentioned,
        rank: me?.rank ?? null,
        cited_url: data.citations?.find(c => c.url.includes(brand.toLowerCase()))?.url ?? null,
      };
    }
    // Find the model with the most favorable result
    const favorable = Object.entries(row.byModel)
      .filter(([, v]) => v.mentioned)
      .sort((a, b) => (a[1].rank ?? 99) - (b[1].rank ?? 99))[0]?.[0];
    const hostile = Object.entries(row.byModel).filter(([, v]) => !v.mentioned).map(([k]) => k);
    row.favorable_model = favorable;
    row.hostile_models = hostile;
    report.push(row);
  }
  return report;
}

console.table(buildDiffReport(grid, "Linear"));
Iterate based on the gap report

For each prompt where one model is favorable and another is hostile, study the citations on the favorable model. Those URLs are the authoritative sources for that prompt on that model. If you can get cited by them, the hostile model starts mentioning you too (eventually, after re-grounding).

The full feedback loop: read the gap report weekly, pick the 3 most actionable gaps, run outreach against the citation list of the favorable model, re-measure after 6-12 weeks. Repeat.
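
The outreach list itself falls straight out of the gap report plus the raw grid. A minimal sketch, assuming the buildDiffReport() output above and that each grid entry carries a citations array of { url } objects, the same fields diff.mjs touches:

outreach.mjs
// For every prompt with at least one hostile model, collect the citation URLs from the
// favorable model's answer. Those URLs are the outreach targets for closing the gap.
function outreachTargets(report, grid) {
  const targets = [];
  for (const row of report) {
    if (!row.favorable_model || row.hostile_models.length === 0) continue;
    const citations = grid[row.prompt]?.[row.favorable_model]?.citations ?? [];
    targets.push({
      prompt: row.prompt,
      close_gap_on: row.hostile_models,
      pitch_these_urls: citations.map((c) => c.url),
    });
  }
  return targets;
}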

The signal: when a hostile model starts mentioning you on a prompt you used to lose, your outreach worked. That is the LLMO win condition.
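
Detecting that flip is a diff of diffs. A sketch, assuming you keep the buildDiffReport() output from each weekly run:

flips.mjs
// Compare two gap reports (say week 1 vs week 12) and list every prompt where a model
// that used to be hostile now mentions the brand.
function findFlips(before, after) {
  const prev = Object.fromEntries(before.map((r) => [r.prompt, r]));
  const flips = [];
  for (const row of after) {
    const old = prev[row.prompt];
    if (!old) continue;
    for (const mode of old.hostile_models) {
      if (row.byModel[mode]?.mentioned) {
        flips.push({ prompt: row.prompt, model: mode });
      }
    }
  }
  return flips;
}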

What this costs

30 prompts in a typical LLMO grid
4 models compared per prompt
~$60 monthly cost at weekly grid runs
6-12 weeks for content outreach to land in citations

30 prompts × 4 modes × $0.30 average = $36 per grid run. Run weekly: $144 a month at full price. Cache hits drop that to $50-$70 in practice. Add 10 more prompts and you are still under $100 a month.
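
If you want to sanity-check the budget for your own grid size, the arithmetic fits in a few lines. The $0.30 per call comes from the numbers above; the cache-hit discount is an assumption you should replace with your own wallet figures:

cost-estimate.mjs
// Back-of-envelope monthly cost for a weekly grid run.
function monthlyCost({ prompts = 30, modes = 4, pricePerCall = 0.3, runsPerMonth = 4, cacheHitRate = 0.6 } = {}) {
  const perRun = prompts * modes * pricePerCall;     // $36 for the default grid
  const fullPrice = perRun * runsPerMonth;           // $144 a month at full price
  const withCache = fullPrice * (1 - cacheHitRate);  // assumes cache hits are not billed
  return { perRun, fullPrice, withCache };
}

console.log(monthlyCost()); // roughly { perRun: 36, fullPrice: 144, withCache: ~58 }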

Why this matters more than the rollup score

One specific pattern shows up in real data. A brand has 50% overall AI visibility (decent). The breakdown reveals 80% on ChatGPT, 70% on Perplexity, 30% on Claude, 20% on Gemini. The average (50%) is misleading. The Claude and Gemini gaps are where almost all the optimization headroom lives.

A team that optimizes for the overall score will spend evenly across surfaces. A team that runs an LLMO loop will pour 80% of its content investment into Claude and Gemini gaps and leave ChatGPT alone (where they already win). The second team will see twice the lift on the same budget.

The trap: reporting LLMO results to executives. They want one number, not a 4-model gap matrix. Solution: keep the gap matrix as the engineering tool. Surface a rollup visibility score for the deck. Use the matrix to drive decisions, the score to communicate progress.
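
Both numbers come from the same diff report, so there is no extra measurement cost. A sketch of the two views, assuming the buildDiffReport() output above:

visibility.mjs
// Per-model mention rate (the engineering matrix) plus the simple average (the deck number).
function visibility(report) {
  const perModel = {};
  for (const row of report) {
    for (const [mode, v] of Object.entries(row.byModel)) {
      perModel[mode] ??= { mentioned: 0, total: 0 };
      perModel[mode].total += 1;
      if (v.mentioned) perModel[mode].mentioned += 1;
    }
  }
  const scores = Object.fromEntries(
    Object.entries(perModel).map(([mode, s]) => [mode, s.mentioned / s.total])
  );
  const modes = Object.keys(scores);
  const rollup = modes.reduce((sum, m) => sum + scores[m], 0) / modes.length;
  return { scores, rollup };
}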

Frequently asked questions

What is LLMO and how is it different from GEO?
LLMO stands for LLM Optimization. It is the engineering-first lens on the same problem GEO addresses, but focused on per-model behavior rather than rolled-up scores. LLMO asks "how does each LLM treat my brand differently and why" rather than "what is my overall visibility." The same data layer powers both; the rollup and the audience are different.
Why measure per-model behavior instead of just averaging?
Because the LLMs are not interchangeable. Claude tends to refuse comparison questions; ChatGPT happily answers them. Perplexity cites 6-10 URLs per answer; Claude often cites 0-2. Gemini does aggressive query fan-out into 3-7 sub-queries; ChatGPT and Claude do so far less. Averaging across models hides actionable differences. Per-model analysis is where the optimization signal actually lives.
What kind of optimizations does an LLMO tool inform?
Three categories. (1) Content optimization: writing pages that match how each model expects to answer. (2) Citation optimization: getting cited by the URLs each model treats as authoritative for your topic. (3) Prompt optimization: for teams running their own LLM-powered features, choosing the model whose default behavior aligns with your goal.
Can I run an LLMO loop without all four LLMs?
You can, but you should not. The signal in LLMO comes specifically from the differences across models. Running with one model gives you SEO-equivalent data on one surface. Running with four models is what makes it LLMO. MentionsAPI handles all four behind one endpoint so the multi-model cost is the same as the multi-call cost on a single model.
How is this different from running a benchmark?
A benchmark tests the model on standardized prompts to score its capability. An LLMO loop tests how each model treats your specific brand on prompts your customers actually use. Benchmarks are about model capability; LLMO is about model behavior toward you. Different goal, different methodology.
Should I optimize for the model with the most users or the most citations?
Most users for top-of-funnel awareness work. Most citations for authority and traffic work. ChatGPT has the largest user base. Perplexity has the highest citation rate per answer. If forced to pick, ignore both counts and spend where the gap to your competitors is largest, not where the absolute number is largest.

Ship it next week

Day one: build the prompt grid and the per-model run loop. Day two: build the diff and gap report. Day three: pick three priority gaps and start outreach against their citation lists. Day forty-five: re-run the grid and check whether the hostile models flipped.

LLMO is slower than rank tracking but it is more actionable. Each cycle teaches you something specific about how each model treats your category. Six cycles in, you know which surfaces you can win cheaply and which require enterprise-level content investment.

That is the kind of intel SaaS GEO tools cannot give you because they aggregate the data away.

Nikhil Kumar
Founder, MentionsAPI

Growth marketer at the intersection of marketing, product, and technology. 8+ years across startups and scale-ups in India, Switzerland, and the Netherlands. Founder of Landkit (landkit.pro).

Run the LLMO loop on your own brand.

$1 free signup credit. PAYG wallet from $10. Each model exposed separately so you can diff per-model behavior the way real LLMO programs do.