Methodology · 10 min read · April 10, 2026

How We Benchmark AI Platforms Against Real Brands — And Why Our First Public Benchmark Is Portable Power Stations

We get asked every week: “Which AI platform is most accurate about my brand?” The honest answer is that, to our knowledge, nobody (not the platforms themselves, not the major analyst firms) has published a rigorous, reproducible benchmark of AI accuracy at the brand level. We think that gap matters, so we are filling it. This article explains the methodology we use, why we chose portable power stations as our first public category, and how brands in the categories we will cover next can get their data first.

What this article is

A full walkthrough of the Arenza benchmark methodology — so you can evaluate it, critique it, or replicate it.
An announcement of our first public category benchmark (portable power stations) and the schedule for the next few.
An invitation, if you run marketing at a brand in one of these categories, to get your specific results before public release.

Why benchmark AI accuracy at all?

Existing research on AI platforms focuses almost entirely on visibility — whether your brand is mentioned, how often it is cited, and which prompts trigger a citation. Visibility is what the emerging GEO (Generative Engine Optimization) category measures. It is useful, but it is only half the picture.

The other half is accuracy. When an AI platform mentions your brand, does it describe you correctly? Does it get the country of origin, the founding year, the product line, the pricing, and the competitive position right? Mid-size brands — our core focus — consistently show material error rates on this second measure, because training data on mid-size brands is uneven and retrieval layers often surface stale or conflated information.

We wrote a full framing of the distinction in GEO vs SEO vs AI Brand Accuracy. This article is the companion piece on how we actually measure it.

The benchmark design in one page

Every Arenza category benchmark follows the same protocol:

12 brands per category. Selected by category revenue and active distribution in at least three of four regions: North America, the EU, Japan, and Southeast Asia. No pay-to-include; brands cannot sponsor a slot.
25 questions per brand. Structured across five dimensions: identity, product / service offering, positioning, competition, and safety & compliance. The full question template is identical across categories, localized only in product terminology.
9 AI platforms. ChatGPT (GPT-4o and GPT-5), Perplexity, Google Gemini, Microsoft Copilot, DeepSeek, Kimi, Doubao, xAI Grok, and Meta AI. Default system configurations, no personalization history, no paid placements.
Three runs per question. Each prompt is issued three times on each platform to smooth randomness. The reported answer is the majority answer across the three runs.
Ground-truth verification. Every answer is compared against a ground-truth file verified against the manufacturer’s official documentation, current pricing pages, regulatory databases, and independent review publications.
Error classification by business impact. We do not treat all errors equally. A safety-dimension error (for example, a recall attributed to the wrong brand) counts more than a stylistic error. The classification rubric is published alongside the benchmark so others can apply it to their own data.
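To make the protocol concrete, here is a minimal Python sketch of three pieces of it: the prompt volume implied by the numbers above, the majority vote across three runs, and impact-weighted error scoring. It is an illustration under stated assumptions, not the Arenza implementation. The function names, the rule that an answer with no majority is returned as None, the simple text normalization, and the numeric weights are all hypothetical; the published rubric defines the real classification.

```python
from collections import Counter
from typing import Optional

# Protocol sizes taken from the benchmark design above.
BRANDS, QUESTIONS, PLATFORMS, RUNS = 12, 25, 9, 3


def prompts_per_category() -> int:
    """Total prompts issued for one category benchmark."""
    return BRANDS * QUESTIONS * PLATFORMS * RUNS


def majority_answer(runs: list[str]) -> Optional[str]:
    """Return the answer given by more than half of the runs.

    Assumption: answers are compared after trimming and lowercasing,
    and a question with no majority yields None (flag for review).
    """
    if not runs:
        return None
    normalized = [r.strip().lower() for r in runs]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer if count > len(normalized) / 2 else None


# Hypothetical weights, for illustration only.
IMPACT_WEIGHTS = {"safety": 3.0, "factual": 2.0, "stylistic": 0.5}


def weighted_error_score(error_types: list[str]) -> float:
    """Sum the impact weights of the errors found in a set of answers."""
    return sum(IMPACT_WEIGHTS[e] for e in error_types)
```

With these sizes, one category comes to 12 × 25 × 9 × 3 = 8,100 prompts. In practice, free-text answers would need a stronger equivalence check than lowercasing before a majority vote is meaningful.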

The five dimensions in detail

The question template covers the same five dimensions every time, because these are the facts a buyer researching with AI tools actually asks about.

Identity — Who is this company? Country of origin, founding year, ownership structure, scale. Errors here undermine trust signals for B2B buyers and regulated procurement.
Product / Service Offering — What do they sell? Flagship products, feature sets, service scope, pricing tiers. Errors here directly affect purchase decisions. Applies equally to physical products, software, and professional services.
Positioning — Who is this company for? Target segments, use-case fit, differentiation. Errors here send buyers to the wrong competitor.
Competition — How does this company compare? Head-to-head questions with the most common alternatives. Errors here bias purchase shortlists.
Safety & Compliance — Any issues the buyer should know about? Recalls, lawsuits, compliance status. This is the highest-stakes dimension: an error here can pin a recall or lawsuit on the wrong brand entirely.
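One way to hold such a template in code is sketched below, assuming a simple placeholder scheme for the brand name and the category's product term. The class names, the placeholder names, and the sample question wording are hypothetical; the source states only that there are 25 questions across five dimensions, so the check enforces the total and full dimension coverage without assuming how the questions are split.

```python
from dataclasses import dataclass
from enum import Enum


class Dimension(Enum):
    IDENTITY = "identity"
    OFFERING = "product / service offering"
    POSITIONING = "positioning"
    COMPETITION = "competition"
    SAFETY = "safety & compliance"


@dataclass(frozen=True)
class Question:
    dimension: Dimension
    template: str  # may use the {brand} and {product_term} placeholders

    def render(self, brand: str, product_term: str) -> str:
        """Fill in the brand and the category-specific product term."""
        return self.template.format(brand=brand, product_term=product_term)


def validate_template(questions: list[Question], expected_total: int = 25) -> None:
    """Raise ValueError if a dimension is missing or the count is off."""
    missing = set(Dimension) - {q.dimension for q in questions}
    if missing:
        names = sorted(d.value for d in missing)
        raise ValueError(f"dimensions without questions: {names}")
    if len(questions) != expected_total:
        raise ValueError(f"expected {expected_total} questions, got {len(questions)}")


# Hypothetical sample question, localized only through product_term.
sample = Question(
    Dimension.IDENTITY,
    "In which country was {brand}, the {product_term} maker, founded?",
)
print(sample.render("ExampleBrand", "portable power station"))
```

Keeping the template fixed and swapping only the product term is what lets results be compared across categories.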

Why portable power stations first

We chose portable power stations as the first public category for three reasons:

Dense with mid-size global brands. Jackery, EcoFlow, Anker SOLIX, Bluetti, Goal Zero, DJI Power, and a long tail of challengers. This is exactly the tier where AI error rates tend to spike.
Real consumer demand signal. “Best portable power station” is a high-volume query on both Google and AI platforms. The category is being actively researched by buyers using AI tools, so the accuracy gap has tangible commercial consequences.
Cross-geography. Strong presence in North America, Europe, Japan, and China. Comparing platforms trained largely on Chinese-language corpora (Doubao, DeepSeek, Kimi) with platforms trained largely on English-language corpora (ChatGPT, Gemini, Grok) gives us a clean contrast.

Upcoming categories in the public-benchmark pipeline, in order: robot vacuums, smart locks, DTC skincare, standing desks, and professional services (consulting). Each benchmark publishes here as it completes.

What the published benchmark will include

When the portable power station benchmark publishes, the article will contain:

A cross-platform visibility matrix (each brand, each platform, how often the brand appears in relevant answers).
Error rate per brand per platform, broken down by dimension.
Notable examples of each error type — country-of-origin errors, stale pricing, feature attribution errors, safety misattribution — with the original prompts and responses.
A platform leaderboard for the category — which AI platforms were most accurate overall.
Concrete recommendations for brands in the category on which errors to prioritize fixing.
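The per-brand, per-platform error rates and the platform leaderboard listed above amount to two aggregations over graded answers. The Python sketch below shows one plausible shape for them. The record layout and function names are assumptions, and the leaderboard here uses plain unweighted accuracy for simplicity, whereas the benchmark itself classifies errors by business impact.

```python
from collections import defaultdict
from typing import NamedTuple


class Graded(NamedTuple):
    """One majority answer after comparison with the ground-truth file."""
    brand: str
    platform: str
    dimension: str
    correct: bool


def error_rates(rows: list[Graded]) -> dict[tuple[str, str, str], float]:
    """Error rate keyed by (brand, platform, dimension)."""
    totals: dict = defaultdict(int)
    errors: dict = defaultdict(int)
    for r in rows:
        key = (r.brand, r.platform, r.dimension)
        totals[key] += 1
        if not r.correct:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


def platform_leaderboard(rows: list[Graded]) -> list[tuple[str, float]]:
    """Platforms sorted by overall accuracy, best first (ties by name)."""
    totals: dict = defaultdict(int)
    correct: dict = defaultdict(int)
    for r in rows:
        totals[r.platform] += 1
        correct[r.platform] += int(r.correct)
    scored = [(p, correct[p] / totals[p]) for p in totals]
    return sorted(scored, key=lambda item: (-item[1], item[0]))
```

Feeding in one Graded record per brand, platform, and question (12 × 9 × 25 = 2,700 per category) would yield the full table; the visibility matrix would need a separate count of brand mentions, which this sketch does not cover.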

We share every brand’s specific findings with that brand privately before public release, to give them a window to respond or correct. We think that is the ethical default for any benchmark that names real companies.

What this is not

A few things we want to be explicit about:

It is not a visibility ranking. We are measuring whether AI platforms describe brands correctly, not whether they mention them often. These are related but distinct questions.
It is not sponsored. Brands cannot pay to be included, excluded, or rated favorably. We will publish our selection criteria with every benchmark and stick to them.
It is not static. AI platforms change frequently, so a benchmark is only a snapshot. We will refresh each category on a roughly quarterly cadence and timestamp the results.
It is not a substitute for a brand-specific scan. A category benchmark tells you where your category sits on average. A brand-specific scan tells you exactly what AI says about your brand today. Both are useful; they answer different questions.

If you run marketing at one of these brands

If your company is in one of the categories above — portable power stations, robot vacuums, smart locks, DTC skincare, standing desks, or professional services — we will run a full Arenza scan on your brand before the public benchmark lands, at no cost. You get your complete per-platform results first, and can act on them before the numbers become public. Reply to the scan-confirmation email and we will prioritize your category.

If your category is not yet on the list, tell us. We pick categories based on reader demand and density of mid-size brands, and an inbound request from a marketing team is the strongest signal.

Get your brand’s results before the benchmark publishes

Run a free scan across 9 AI platforms today. If your brand is in a category we are benchmarking, we will give you the public-benchmark data on your brand first.
