LLMs are not interchangeable.
Most GEO and AEO content treats them as a single audience: "AI search." It is a useful framing for executives but a misleading one for engineers. ChatGPT, Claude, Gemini, and Perplexity each have wildly different default behaviors. They use different grounding models, different ranking systems, different citation policies, and different handling of comparison and recommendation queries.
LLMO is the practice of measuring those differences and optimizing per-model. If GEO is "what is my visibility," LLMO is "why does Claude refuse to mention me on queries where ChatGPT happily does." That gap is the most actionable thing in the data.
This post walks through building an LLMO tool from scratch in around 120 lines. The trick is to run the same prompt grid through each model separately and diff the results, instead of rolling up into one score that hides the per-model signal.
What an LLMO tool actually does
Three jobs.
Per-model behavior capture. Run a fixed prompt grid through each LLM separately. Save the full answer plus the structured fields: brands mentioned, citations, ranks, sentiment. (A sketch of one captured record follows this list.)
Cross-model diffing. For each prompt, identify where models agree and where they diverge. Models that agree are the consensus view. Models that diverge are where optimization opportunities live.
Per-model gap reports. "On prompt X, Claude does not mention you but Perplexity ranks you #2 and cites linear.app/method as the source." That is the gap report shape. It tells you which content investment closes which gap.
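Before wiring anything up, it helps to fix the shape of one captured record. A minimal sketch; the field names here are illustrative assumptions, not the exact MentionsAPI schema:
// Hypothetical shape of one captured record: one prompt, one model/mode.
// Field names are assumptions for illustration, not the exact API schema.
const record = {
  prompt: "Linear vs Jira",
  mode: "perplexity_live",
  answer: "...", // full answer text, saved verbatim for later inspection
  brands: [
    { name: "Linear", mentioned: true, rank: 2, sentiment: "positive" },
    { name: "Jira", mentioned: true, rank: 1, sentiment: "neutral" },
  ],
  citations: [{ url: "https://linear.app/method" }],
};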
Per-model behavioral differences (with sources)
Real, measurable differences across the four major LLMs:
Citation density. Perplexity averages 6-10 cited URLs per answer. Claude averages 0-2. Gemini sits at 3-6. ChatGPT search at 3-5. Source: our own production traffic measurement, n=10,000 calls. (A measurement sketch follows this list.)
Comparison query handling. Claude refuses or qualifies comparison answers ("X vs Y") significantly more often than ChatGPT or Gemini. Result: comparison-heavy categories should underweight Claude in their LLMO investments.
Query fan-out. Per Ekamoira's late 2025 research, Gemini silently expands a single user prompt into 3-7 sub-queries and synthesizes across them. ChatGPT and Claude do this less aggressively. Source URLs from the sub-queries show up in the final answer but are invisible to API-only tracking.
API/UI divergence. ChatGPT API and UI answers diverge on 96% of queries (sample n=1,000). Gemini at 88%. Claude at 71%. Perplexity at 42%. The higher the gap, the more important UI-mode tracking becomes for that model. Source: our 1,000-query teardown.
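The citation-density numbers are easy to reproduce once you have the per-model grid built in the next section. A sketch, assuming each response carries a citations array as in the diff code below:
// Average cited URLs per answer, per mode, over the whole prompt grid.
// Assumes the grid shape built in "The build" below.
function citationDensity(grid) {
  const totals = {}; // mode -> { cited, answers }
  for (const modes of Object.values(grid)) {
    for (const [mode, data] of Object.entries(modes)) {
      totals[mode] ??= { cited: 0, answers: 0 };
      totals[mode].cited += data.citations?.length ?? 0;
      totals[mode].answers += 1;
    }
  }
  return Object.fromEntries(
    Object.entries(totals).map(([mode, t]) => [mode, t.cited / t.answers])
  );
}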
The LLMO investment thesis: optimize for the LLM where the gap to your competitors is largest, not the LLM with the most users. The leverage is in the gap, not the volume.
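In code, the thesis is a max-gap selection, not a max-volume one. A sketch with hypothetical visibility scores (yours vs a competitor's, per mode):
// Pick the mode where the competitor's visibility most exceeds yours.
// Both maps are assumed inputs, e.g. computed from the grid built below.
function pickTargetMode(mine, competitor) {
  return Object.keys(mine).reduce((best, mode) =>
    competitor[mode] - mine[mode] > competitor[best] - mine[best] ? mode : best
  );
}

// Hypothetical numbers: the gap is largest on gemini_live, so that is the
// target, even though chatgpt_live has far more users.
pickTargetMode(
  { chatgpt_live: 0.8, perplexity_live: 0.7, quick: 0.3, gemini_live: 0.2 },
  { chatgpt_live: 0.85, perplexity_live: 0.75, quick: 0.5, gemini_live: 0.6 }
); // -> "gemini_live"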
The build
Aim for 30 prompts that span three intent buckets: definition, comparison, and how-to. Definition prompts surface basic awareness behavior. Comparison surfaces ranking behavior. How-to surfaces citation behavior.
{
"brand": "Linear",
"competitors": ["Jira", "Asana", "ClickUp", "Notion"],
"prompts": {
"definition": [
"what is Linear used for",
"is Linear a project management tool",
"what is Linear good for"
],
"comparison": [
"Linear vs Jira",
"Linear vs Asana for engineering",
"best project management tool for engineers"
],
"how_to": [
"how to migrate from Jira to Linear",
"how to set up Linear for a startup",
"how does Linear's keyboard navigation work"
]
}
}
Call /v1/check with each per-provider mode. The available modes are chatgpt_live, gemini_live, and perplexity_live for the UI-scrape variants, plus the standard quick mode for API-only.
import config from "./prompts.json" assert { type: "json" };

const MODES = ["chatgpt_live", "gemini_live", "perplexity_live", "quick"];
const ALL_PROMPTS = Object.values(config.prompts).flat();

// One call: one prompt, one mode, tracking our brand plus all competitors.
async function runOne(prompt, mode) {
  const res = await fetch("https://api.mentionsapi.com/v1/check", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.MENTIONSAPI_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      mode,
      query: prompt,
      track_brands: [config.brand, ...config.competitors],
    }),
  });
  if (!res.ok) throw new Error(`check failed: ${res.status} (${mode}: "${prompt}")`);
  return res.json();
}
// Run the full grid sequentially: the live modes are slow, and this keeps
// the run inside rate limits.
const grid = {};
for (const prompt of ALL_PROMPTS) {
  grid[prompt] = {};
  for (const mode of MODES) {
    grid[prompt][mode] = await runOne(prompt, mode);
  }
}
console.log(JSON.stringify(grid, null, 2));
Note: Claude UI scraping is not yet shipped on MentionsAPI (Q3 2026 target, due to Claude.ai session expiry under load). Use quick mode for the Claude API view in the meantime.
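The live UI modes are slower and flakier than quick, so in practice you want a retry wrapper around runOne. A minimal sketch; the attempt count and backoff base are arbitrary starting points:
// Retry with exponential backoff. UI-scrape modes fail more often than
// the API-only quick mode; three attempts and a 2s base are arbitrary.
async function runOneWithRetry(prompt, mode, attempts = 3) {
  for (let i = 0; i < attempts; i++) {
    try {
      return await runOne(prompt, mode);
    } catch (err) {
      if (i === attempts - 1) throw err;
      await new Promise(resolve => setTimeout(resolve, 2000 * 2 ** i));
    }
  }
}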
For each prompt, compute three things: which models mention your brand, what rank, and which models cite which URLs. The gap report is the cross-product.
function buildDiffReport(grid, brand) {
const report = [];
for (const [prompt, modes] of Object.entries(grid)) {
const row = { prompt, byModel: {} };
for (const [mode, data] of Object.entries(modes)) {
const me = data.brands?.find(b => b.name === brand);
row.byModel[mode] = {
mentioned: !!me?.mentioned,
rank: me?.rank ?? null,
cited_url: data.citations?.find(c => c.url.includes(brand.toLowerCase()))?.url ?? null,
};
}
// Find the model with the most favorable result: mentioned, best rank
const favorable = Object.entries(row.byModel)
  .filter(([, v]) => v.mentioned)
  .sort((a, b) => (a[1].rank ?? 99) - (b[1].rank ?? 99))[0]?.[0];
// Models that do not mention the brand at all
const hostile = Object.entries(row.byModel)
  .filter(([, v]) => !v.mentioned)
  .map(([mode]) => mode);
row.favorable_model = favorable;
row.hostile_models = hostile;
report.push(row);
}
return report;
}
console.table(buildDiffReport(grid, "Linear"));
For each prompt where one model is favorable and another is hostile, study the citations on the favorable model. Those URLs are the authoritative sources for that prompt on that model. If you can get cited by them, the hostile model starts mentioning you too (eventually, after re-grounding).
The full feedback loop: read the gap report weekly, pick the 3 most actionable gaps, run outreach against the citation list of the favorable model, re-measure after 6-12 weeks. Repeat.
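Picking the three gaps is mechanical once the report and grid exist. A sketch: keep rows with both a favorable and a hostile model, and pull the favorable model's citation list as the outreach target.
// Most actionable gaps: a favorable model to copy from, at least one
// hostile model to win. Outreach targets are the favorable model's
// cited URLs for that prompt.
function topGaps(report, grid, n = 3) {
  return report
    .filter(row => row.favorable_model && row.hostile_models.length > 0)
    .sort((a, b) => b.hostile_models.length - a.hostile_models.length)
    .slice(0, n)
    .map(row => ({
      prompt: row.prompt,
      win_on: row.hostile_models,
      copy_from: row.favorable_model,
      outreach_urls:
        grid[row.prompt][row.favorable_model].citations?.map(c => c.url) ?? [],
    }));
}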
What this costs
30 prompts × 4 modes × $0.30 average = $36 per grid run. Run weekly: $144 a month at full price. Cache hits drop that to $50-$70 in practice. Add 10 more prompts and you are still under $100 a month.
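The arithmetic in one place, for tuning grid size against budget. The cache multiplier below is an assumed midpoint of the observed range, not a guarantee:
// Cost per grid run, then per month at weekly cadence (4 runs/month).
const gridCost = (prompts, modes, pricePerCall) => prompts * modes * pricePerCall;
gridCost(30, 4, 0.3) * 4;        // $144/month at full price
gridCost(30, 4, 0.3) * 4 * 0.45; // ~$65/month with typical cache hits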
Why this matters more than the rollup score
One specific pattern shows up in real data. A brand has 50% overall AI visibility (decent). The breakdown reveals 80% on ChatGPT, 70% on Perplexity, 30% on Claude, 20% on Gemini. The average (50%) is misleading. The Claude and Gemini gaps are where almost all the optimization headroom lives.
A team that optimizes for the overall score will spend evenly across surfaces. A team that runs an LLMO loop will pour 80% of its content investment into Claude and Gemini gaps and leave ChatGPT alone (where they already win). The second team will see twice the lift on the same budget.
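Computing that breakdown from the grid takes a few lines, reusing the shapes from the build above:
// Share of prompts where the brand is mentioned, per mode. The flat
// average of these values is the rollup score that hides the signal.
function visibilityByMode(grid, brand) {
  const hits = {}; // mode -> count of prompts with a mention
  const promptCount = Object.keys(grid).length;
  for (const modes of Object.values(grid)) {
    for (const [mode, data] of Object.entries(modes)) {
      const me = data.brands?.find(b => b.name === brand);
      hits[mode] = (hits[mode] ?? 0) + (me?.mentioned ? 1 : 0);
    }
  }
  return Object.fromEntries(
    Object.entries(hits).map(([mode, h]) => [mode, h / promptCount])
  );
}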
Frequently asked questions
What is LLMO and how is it different from GEO?
GEO asks "what is my overall visibility in AI search." LLMO measures how each model behaves on the same prompts and optimizes per model. The per-model gap, not the rollup score, is the actionable signal.
Why measure per-model behavior instead of just averaging?
Because the average hides the headroom. A 50% rollup can decompose into 80% on ChatGPT and 20% on Gemini, and those two numbers call for completely different investments.
What kind of optimizations does an LLMO tool inform?
Gap-driven content and outreach. When one model mentions you and another does not, the favorable model's citations tell you which sources to get cited by to flip the hostile one.
Can I run an LLMO loop without all four LLMs?
Yes. The loop works with any subset of modes; start with the models where you and your competitors diverge most and add the others later.
How is this different from running a benchmark?
A benchmark scores models in general. An LLMO loop diffs the models on your prompt grid, for your brand, and feeds the differences back into content decisions.
Should I optimize for the model with the most users or the most citations?
Neither by default. Optimize for the model where the gap to your competitors is largest; the leverage is in the gap, not the volume.
Ship it next week
Day one: build the prompt grid and the per-model run loop. Day two: build the diff and gap report. Day three: pick three priority gaps and start outreach against their citation lists. Day forty-five: re-run the grid and check whether the hostile models flipped.
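The day-forty-five check is a diff of two gap reports from the same grid. A sketch: a model has flipped when it was hostile in the old report and mentions you in the new one.
// Compare two gap reports (same prompt grid, weeks apart) and list the
// models that moved from hostile to mentioning the brand.
function flips(oldReport, newReport) {
  const oldByPrompt = Object.fromEntries(oldReport.map(r => [r.prompt, r]));
  return newReport.flatMap(row => {
    const before = oldByPrompt[row.prompt];
    if (!before) return [];
    const flipped = before.hostile_models.filter(m => row.byModel[m]?.mentioned);
    return flipped.length ? [{ prompt: row.prompt, flipped }] : [];
  });
}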
LLMO is slower than rank tracking but it is more actionable. Each cycle teaches you something specific about how each model treats your category. Six cycles in, you know which surfaces you can win cheaply and which require enterprise-level content investment.
That is the kind of intel SaaS GEO tools cannot give you because they aggregate the data away.