How we measure that the brand scan works
The Barnum effect is the biggest risk with AI reports. Schwung systematically measures whether a brand scan is truly about your organisation, or just sounds convincing.
Imagine you buy an AI tool for brand analysis. The sales page promises in-depth insights, a personalised diagnosis and sharp recommendations. How do you know it delivers? Almost no provider measures its own quality systematically. Schwung does, and the results are interesting enough to share.
Claiming is easy, measuring is something else
Most AI tools prove their quality with testimonials and screenshots. Understandable, but it doesn't solve the real problem: how do you know whether a report is really about your organisation, or just sounds good enough to feel convincing?
That second risk has a name. In 1948, psychologist Bertram Forer gave his students a personality test and then handed each of them what was presented as an individual personality description, asking how accurate it was. On average they rated it 4.26 out of 5. In reality, every student had received the same generic text. Students recognised themselves in it, not because the description was correct, but because it was vague enough to fit anyone. This is called the Barnum effect, and it is the biggest quality risk in AI reports: a feeling of recognition is not proof of specificity.
The conclusion that follows is uncomfortable. If you want to break through the Barnum effect, there is only one defence: the report has to say something that can only apply to this specific organisation.
Four layers for measuring scan quality
Schwung uses four layers to test whether a report clears that bar.
The first layer is a sharpness score on five criteria. Does it surprise the reader? Is it specific to this organisation? Does it point to a concrete first step? Is it non-interchangeable with five other clients? Are multiple sources interwoven? Each question yields zero to two points, together a scale from zero to ten. A report that scores eight or higher lands: the client says "they saw us". A report between five and seven is competent but forgotten within a day. Anything below that is boilerplate with the client's name dropped in.
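As an illustration, the scoring arithmetic described above can be sketched in a few lines. This is a minimal sketch, not Schwung's implementation: the criterion keys and the band labels are names chosen here for the example, and only the zero-to-two points per criterion and the band boundaries come from the text.

```python
# Hypothetical criterion keys; the article names the five questions, not these identifiers.
CRITERIA = (
    "surprising",
    "specific",
    "concrete_first_step",
    "non_interchangeable",
    "sources_interwoven",
)


def sharpness_score(points: dict[str, int]) -> int:
    """Sum five criteria, each worth 0, 1 or 2 points, into a 0-10 score."""
    missing = set(CRITERIA) - set(points)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    for name in CRITERIA:
        if points[name] not in (0, 1, 2):
            raise ValueError(f"{name} must be 0, 1 or 2")
    return sum(points[name] for name in CRITERIA)


def band(score: int) -> str:
    """Map a score to the three bands described in the text (labels are assumed)."""
    if score >= 8:
        return "lands"
    if score >= 5:
        return "competent"
    return "boilerplate"


example = {
    "surprising": 2,
    "specific": 2,
    "concrete_first_step": 1,
    "non_interchangeable": 2,
    "sources_interwoven": 1,
}
print(sharpness_score(example), band(sharpness_score(example)))
```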
The second layer is a jury of three reviewers. A single model that judges its own output is not reliable enough. Schwung therefore uses three independent reviewers, each scoring separately, after which the spread between their scores becomes visible. If all three agree, that is a strong signal. If they diverge, that is information too.
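How the spread between reviewers might be made visible can be sketched as follows. This is an assumption-laden sketch: the function name, the returned keys and the `max_spread` threshold of two points are all hypothetical, since the article does not say how agreement is defined.

```python
from statistics import mean


def jury_summary(scores: list[int], max_spread: int = 2) -> dict:
    """Summarise independent reviewer scores on the 0-10 sharpness scale.

    max_spread is an assumed cut-off for calling the jury "in agreement".
    """
    if not scores:
        raise ValueError("at least one reviewer score is required")
    spread = max(scores) - min(scores)
    return {
        "mean": mean(scores),
        "spread": spread,
        "agree": spread <= max_spread,
    }


print(jury_summary([8, 9, 8]))  # close scores: a strong signal
print(jury_summary([9, 5, 8]))  # wide spread: worth a second look
```

A wide spread does not say which reviewer is right; it only flags the report for closer reading.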
The third layer is a website delta test. Does the report contain information that is not on the organisation's website? A report that only echoes what the website already says is an expensive summary. The test measures whether the scan adds something to what the organisation already communicates itself.
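One crude way to approximate such a delta is to look for report sentences whose vocabulary barely appears on the website. The sketch below is a lexical proxy only and rests on assumptions: the function names, the four-letter word filter and the 0.5 threshold are invented for this example, and the article does not describe how Schwung actually runs the test.

```python
import re


def content_words(text: str) -> set[str]:
    """Lower-cased words longer than three characters, as a rough content filter."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def novel_sentences(report: str, website: str, threshold: float = 0.5) -> list[str]:
    """Return report sentences where at least `threshold` of the content words
    do not occur anywhere in the website text (threshold is an assumption)."""
    site_words = content_words(website)
    novel = []
    for sentence in re.split(r"(?<=[.!?])\s+", report.strip()):
        words = content_words(sentence)
        if not words:
            continue
        unseen_share = len(words - site_words) / len(words)
        if unseen_share >= threshold:
            novel.append(sentence)
    return novel


website = "We run five schools in the region."
report = (
    "The foundation runs five schools in the region. "
    "Each school tells a separate story to parents."
)
print(novel_sentences(report, website))
```

Word overlap cannot tell whether a new sentence is true or useful, so a measure like this would at most flag candidates for a reviewer to judge.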
The fourth layer is the interchangeability test. Would this report, with different names plugged in, also be true for five comparable organisations? If the answer is yes, the report is not a diagnosis but a template. This is the operational countermeasure against the Barnum effect.
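A simple way to prepare a report for this check is to mask the organisation's names, so a reviewer judges the content rather than the label. The sketch below assumes that approach; `mask_names`, `is_template` and the placeholder are hypothetical helpers, and the per-peer verdicts would come from reviewers, not from this code.

```python
import re


def mask_names(report: str, names: list[str], placeholder: str = "[ORGANISATION]") -> str:
    """Replace every organisation name with a neutral placeholder.

    Longer names are masked first so "Acme Schools" is handled before "Acme".
    """
    for name in sorted(names, key=len, reverse=True):
        report = re.sub(re.escape(name), placeholder, report, flags=re.IGNORECASE)
    return report


def is_template(fits_peer: list[bool]) -> bool:
    """True when reviewers judged the masked report to hold for every
    comparable organisation, i.e. the report is interchangeable."""
    return len(fits_peer) > 0 and all(fits_peer)


masked = mask_names(
    "Acme Schools has five schools. Acme lacks a shared story.",
    ["Acme", "Acme Schools"],
)
print(masked)
print(is_template([True, False, True, True, True]))
```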
The standing as of June 2026: average sharpness is 8.3 out of 10, and one hundred per cent of measured scans score eight or higher, the top band. Added value beyond the website scores an average of 8 out of 10.
What an independent model made of it
In June 2026, Schwung asked Google Gemini to assess three real reports. The model was given no leading prompts, no access to the internal scores and no context about the methodology. Gemini called the reports "specific, local and painfully accurate, not boilerplate marketing talk". The model recognised the underlying techniques itself: differential diagnosis, active listening and what it described as an adversarial jury panel.
That is interesting, and it calls for an honest caveat. An AI assessing the work of another AI is a serious quality proxy, not an external audit and not client validation. The value of the Gemini test does not lie in the absolute score. It lies in the fact that an independent model, without prompting, arrives at the same conclusion as the internal jury. That is convergence, and convergence is not proof.
Three cases, three opposing recommendations
The most direct evidence that there is no template behind the scan is a comparison of outcomes. Three organisations, the same tool, three radically different diagnoses.
An educational foundation with five schools lacks a shared brand foundation. Each school communicates from its own identity; the umbrella organisation has no story that ties the schools together. The recommendation: start with the essence, not with the expressions.
A talent agency already has the foundation in place. The driving force is sharp, the internal culture is recognisable. The problem lies in visibility: the proposition has not been translated to the market. The recommendation: the work is not building the foundation, the work is making the foundation visible.
In the same round, Schwung.ai itself scored comparably to the talent agency: a strong foundation and a recognisable position, but an outward translation that lags behind the internal clarity. That is a different task from the educational foundation's, even though both initially sound like "the brand isn't clear enough".
Three organisations, and the recommendation differs wherever the diagnosis does. That is what happens when there is no template behind it.
How best-of-N safeguards quality
Alongside jury evaluation after the fact, Schwung also uses a technique at the front end: best-of-N sampling. For the core tension and the first concrete step in each report, several independent versions are generated, after which a jury picks the sharpest. Best-of-N sampling is a documented way to improve LLM output quality at inference time, without retraining the model.
Translated to the brand scan: the opening of a report, the core tension that carries the whole, is not the first version the model produces. It is the version a jury rated as the sharpest. That costs more compute time, and the quality difference is measurable.
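The selection step can be sketched generically. This is a minimal sketch of best-of-N sampling under stated assumptions: `generate` and the `judges` are hypothetical callables standing in for a model call and for jury members, the default of four candidates is arbitrary, and averaging the judges' scores is one possible aggregation, not necessarily the one Schwung uses.

```python
from statistics import mean
from typing import Callable, Sequence


def best_of_n(
    generate: Callable[[], str],
    judges: Sequence[Callable[[str], float]],
    n: int = 4,
) -> tuple[str, float]:
    """Generate n candidates and return the one with the highest mean judge score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if not judges:
        raise ValueError("at least one judge is required")
    candidates = [generate() for _ in range(n)]
    # Score every candidate with every judge, then average per candidate.
    scored = [(mean([judge(c) for judge in judges]), c) for c in candidates]
    best_score, best_candidate = max(scored, key=lambda pair: pair[0])
    return best_candidate, best_score
```

In practice `generate` would call a language model with some sampling randomness and each judge would return a sharpness score; here both are left to the caller so the sketch stays independent of any particular API.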
Honest about what the measurement isn't
A language model assessing the work of a language model has a structural weak spot. It can be systematically wrong in a way you don't see. Schwung's approach lowers that risk with an adversarial jury panel and an external blind test; it does not eliminate it.
What the measurement is: a transparent quality threshold that applies to every report, made public so you can judge it. What the measurement is not: a substitute for client validation. The only real test is whether an organisation, after reading the report, says "this is right, I hadn't seen it this way myself". That test takes place in the conversation, not in a benchmark.
The question the report itself must withstand
During the conversation that drives the scan, you can correct the reading. The scan is conversational: if an observation isn't right, you say so, and the report adjusts. Afterwards, Schwung measures each report again with the sharpness score, so that quality doesn't slip as more scans are carried out.
The question you can ask yourself after reading a report: would this also be true for five comparable organisations? If yes, the report is not sharp enough. If no, the scan has done its job.
Schwung works like this: first get the position and behaviour clear, only then the expression. The brand scan is the first step in that process, and the quality measurement is how we take that first step seriously.
Sources
- JudgeBench: A Benchmark for Evaluating LLM-based Judges · 2025
- CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation · 2026
- Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better · 2024
- The Sophisticated Barnum Effect: How AI Became the Ultimate Yes-Man · 2025
- 'Specially For You' – Examining the Barnum Effect's Influence on the Perceived Quality of System Recommendations · 2023
- Top 10 AI Brand Visibility Tools in 2026 - InLinks · 2026