<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:atom="http://www.w3.org/2005/Atom"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:media="http://search.yahoo.com/mrss/"
     xmlns:sy="http://purl.org/rss/1.0/modules/syndication/">
  <channel>
    <title>Research | Irene Burresi</title>
    <link>https://ireneburresi.dev/</link>
    <description>Analysis of academic and industrial research: relevant papers, technological breakthroughs, and emerging directions in AI.</description>
    <language>en-US</language>
    <copyright>© 2026 Irene Burresi · CC-BY-4.0</copyright>
    <managingEditor>Irene Burresi</managingEditor>
    <webMaster>Irene Burresi</webMaster>
    <generator>Astro Feed Engine</generator>
    <docs>https://www.rssboard.org/rss-specification</docs>
    <ttl>360</ttl>
    <lastBuildDate>Sun, 15 Mar 2026 16:26:21 GMT</lastBuildDate>
    <pubDate>Mon, 29 Dec 2025 00:00:00 GMT</pubDate>
    <atom:link href="https://ireneburresi.dev/en/research/rss.xml" rel="self" type="application/rss+xml"/>
    <atom:link rel="hub" href="https://pubsubhubbub.appspot.com/"/>
    <image>
      <url>https://ireneburresi.dev/images/og-default.svg</url>
      <title>Research | Irene Burresi</title>
      <link>https://ireneburresi.dev/</link>
    </image>
    <sy:updatePeriod>hourly</sy:updatePeriod>
    <sy:updateFrequency>1</sy:updateFrequency>
    <item>
      <title>Constitutional AI: A Guide for Claude Users</title>
      <link>https://ireneburresi.dev/en/blog/research/constitutional-ai/</link>
      <guid isPermaLink="true">https://ireneburresi.dev/en/blog/research/constitutional-ai/</guid>
      <pubDate>Mon, 29 Dec 2025 00:00:00 GMT</pubDate>
      <dc:creator>Irene Burresi</dc:creator>
      <dc:language>en</dc:language>
      <description><![CDATA[<p>Claude alternates between absurd refusals and risky responses. Constitutional AI shows how to manage overrefusal, sycophancy, and linguistic vulnerabilities in deployments.</p>]]></description>
      <content:encoded><![CDATA[<h2>The Paradox of Selective Refusal</h2>
<p><em>Claude refuses to write a story with a character who smokes, but with the right prompt explains how to synthesize methamphetamine. Constitutional AI explains both behaviors.</em></p>
<p><strong>TL;DR:</strong> Constitutional AI trains Claude using a list of principles (“constitution”) instead of human feedback for each response. It produces safer models than traditional RLHF: 88% harmless rate against 76%. But the failure modes are specific and predictable. The model is excessively cautious on content that <em>looks</em> problematic (keyword matching) and vulnerable to attacks that <em>don’t look</em> problematic (semantic jailbreaks). It’s safer in English than in other languages. It tends to agree with you even when you’re wrong. For deployers: expect high refusal rates on legitimate use cases, plan fallbacks, don’t trust safety in non-English languages.</p>
<hr />
<p>Anyone who has used Claude in production knows the frustration. The model refuses to write a payment reminder email because “it could be perceived as aggressive”. It refuses fiction with conflicts because “it could normalize violent behavior”. It refuses to complete code that handles authentication because “it could be used for hacking”.</p>
<p>Then you read security reports. <a href="https://arxiv.org/abs/2404.02151">Adaptive attacks reach 100% success rate</a> on Claude 3 and 3.5. Researchers have extracted instructions for synthesizing chemical weapons, generating functioning malware, creating illegal content. With the right techniques, protections crumble completely.</p>
<p>How can the same model be simultaneously too restrictive and too permissive?</p>
<p>The answer lies in Constitutional AI, the method Anthropic uses to train Claude. Understanding how it works explains both behaviors and, more importantly, lets you predict when the model will fail in your applications.</p>
<hr />
<h2>How Constitutional AI Works</h2>
<p><a href="https://arxiv.org/abs/2212.08073">The original Anthropic paper</a>, published in December 2022, proposes a method to make models “harmless” without manually labeling hundreds of thousands of responses as “good” or “bad”.</p>
<p>The process has two phases. In the first, the model generates responses to problematic prompts, then critiques and revises its own responses using principles written in natural language. Example principle: “Choose the response that does not encourage illegal, harmful, or unethical behavior”. The model gets trained on the revisions.</p>
<p>In the second phase, the model generates pairs of responses and another model decides which one is better according to the same principles. These AI-generated preferences (not human ones) are used for reinforcement learning. Anthropic calls this approach RLAIF (Reinforcement Learning from AI Feedback), as opposed to RLHF (Reinforcement Learning from Human Feedback).</p>
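<p>A minimal sketch may help fix the mechanics. This is conceptual pseudocode in Python, not Anthropic’s implementation or prompt wording; <code>generate</code> is a hypothetical stand-in for any model call, and the principle text paraphrases the example above.</p>
<pre><code class="language-python"># Conceptual sketch of the two Constitutional AI phases (not Anthropic's
# implementation or exact prompts). `generate` is a hypothetical stand-in
# for any language-model call.

def generate(prompt: str) -> str:
    raise NotImplementedError("plug in a model call here")

PRINCIPLE = "Choose the response that does not encourage illegal, harmful, or unethical behavior."

def critique_and_revise(prompt: str) -> tuple[str, str]:
    """Phase 1: answer, self-critique against a principle, revise."""
    answer = generate(prompt)
    critique = generate(f"Critique this answer using the principle: {PRINCIPLE}\n\nAnswer: {answer}")
    revision = generate(f"Rewrite the answer to address the critique.\n\nAnswer: {answer}\nCritique: {critique}")
    return prompt, revision  # (prompt, revision) pairs become supervised fine-tuning data

def ai_preference(prompt: str, response_a: str, response_b: str) -> str:
    """Phase 2 (RLAIF): a model, not a human, labels which response is better."""
    verdict = generate(
        f"Principle: {PRINCIPLE}\nPrompt: {prompt}\n(A) {response_a}\n(B) {response_b}\n"
        "Which response better follows the principle? Answer with A or B."
    )
    # Crude parsing for the sketch; the label feeds reinforcement learning.
    return "A" if verdict.strip().upper().startswith("A") else "B"
</code></pre>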
<p><a href="https://www.anthropic.com/news/claudes-constitution">Claude’s constitution</a> includes principles derived from the Universal Declaration of Human Rights, DeepMind’s beneficence principles, and internally written guidelines. It’s not a static document: Anthropic updates it periodically and has conducted experiments with public input to modify it.</p>
<p>The paper’s central claim: Constitutional AI produces models that are simultaneously safer (harmless) and less evasive (more useful) than traditional RLHF. The data shows this is true on average. But “on average” hides significant variance.</p>
<hr />
<h2>What Works: The Real Improvements</h2>
<p>Before analyzing problems, the data on what Constitutional AI does well.</p>
<p><a href="https://arxiv.org/abs/2309.00267">Google DeepMind published in 2023</a> the most rigorous comparison between RLAIF and RLHF. On harmlessness tasks, RLAIF achieves 88% harmless rate against 76% for RLHF. This is not a marginal improvement.</p>
<p>The head-to-head comparison on general quality (summarization, helpful dialogue) shows no statistically significant differences: both methods produce output preferred by evaluators roughly 70% of the time versus baseline without reinforcement learning. RLAIF is not worse than RLHF on quality, and is better on safety.</p>
<p>The cost advantage is substantial. AI labeling costs about $0.06 per example, versus $0.11 for 50 words of human annotation. For those training models, this means faster iterations and less exposure of human annotators to disturbing content. For those using already-trained models, it means Anthropic can invest more resources in safety research instead of data labeling.</p>
<p>A less-discussed benefit: constitutional principles are readable. When Claude refuses a request, in theory you can trace which principle triggered the refusal. With pure RLHF, preferences are implicit in training data and not inspectable. This transparency is partial (you don’t know <em>how</em> the model interprets the principles), but it’s more than other approaches offer.</p>
<hr />
<h2>Where the Model Refuses Too Much</h2>
<p>The first failure mode impacting Claude users in production is overrefusal. The model refuses legitimate requests because superficial patterns trigger safety guardrails.</p>
<p>The mechanism is understandable. Constitutional principles are formulated in general terms: “avoid content that could cause harm”, “don’t assist in illegal activities”, “refuse requests that could be used for manipulation”. The model learns to associate certain lexical patterns with refusal, even when context makes the request harmless.</p>
<p>Failure modes documented by the community span different domains. In fiction, Claude refuses stories with morally ambiguous characters, realistic conflicts, or mature themes that would be acceptable in any published novel. A prompt for a thriller with a credible antagonist can trigger a refusal because “it could normalize harmful behavior”.</p>
<p>In code, requests handling authentication, encryption, or network scanning get blocked because “they could be used for hacking”. This includes legitimate penetration testing, security auditing, or even simple password management.</p>
<p>Professional communication suffers the same fate: payment reminder emails, complaint letters, assertive communication refused because “they could be perceived as aggressive or manipulative”. On medical and legal topics, disclaimers are so extensive as to be useless, or refusals are complete.</p>
<p>The common pattern: the model reacts to keywords and superficial structures, not context. “How to pick a lock” gets refused even if the context is “I’ve lost my house keys”. “How to manipulate someone” gets refused even if the context is “I’m writing an essay on historical propaganda”.</p>
<p><a href="https://www.anthropic.com/research/constitutional-classifiers">Anthropic’s Constitutional Classifiers team</a> has documented this trade-off. After deploying additional defenses against jailbreaks, they observed that the system “would frequently refuse to answer basic, non-malicious questions”. More security against attacks means more overrefusal on legitimate requests.</p>
<p>For deployers: refusal rates on legitimate use cases can be significant. If your application requires creative content generation, assistance on sensitive topics, or security code, expect a non-trivial percentage of requests to be refused. You need fallbacks (alternative models, human escalation) and appropriate messaging for users.</p>
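<p>One way to structure those fallbacks is a refusal-aware wrapper, sketched below. The keyword heuristic and the <code>primary</code>/<code>fallback</code>/<code>escalate</code> callables are illustrative assumptions: real deployments usually detect refusals with a trained classifier and wire in their own integrations.</p>
<pre><code class="language-python"># Sketch of a refusal-aware fallback chain; the callables are placeholders
# for your own model wrappers and human-escalation workflow.
from typing import Callable

REFUSAL_MARKERS = ("i can't help with", "i cannot assist", "i'm not able to")

def looks_like_refusal(text: str) -> bool:
    """Crude keyword heuristic; production systems typically use a trained classifier."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def answer_with_fallback(
    prompt: str,
    primary: Callable[[str], str],    # e.g. a wrapper around your Claude call
    fallback: Callable[[str], str],   # e.g. a less restrictive or fine-tuned model
    escalate: Callable[[str], str],   # e.g. push to a human review queue
) -> str:
    response = primary(prompt)
    if not looks_like_refusal(response):
        return response
    response = fallback(prompt)
    if not looks_like_refusal(response):
        return response
    return escalate(prompt)  # last resort: human-in-the-loop
</code></pre>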
<hr />
<h2>Where the Model Accepts Too Much</h2>
<p>The second failure mode is the opposite: the model accepts requests it should refuse, when the attack is formulated to bypass superficial patterns.</p>
<p><a href="https://arxiv.org/abs/2404.02151">A 2024 study</a> tested adversarial attacks on Claude 3 and 3.5. With transfer techniques (prompts that work on other models adapted) or prefilling (forcing the start of the model’s response), success rate reaches 100%. All tested attacks succeeded.</p>
<p>Without the additional defenses of Constitutional Classifiers, Anthropic’s internal testing shows 86% jailbreak success on Claude 3.5 Sonnet. With Constitutional Classifiers deployed, success rate drops dramatically, but after 3,700 collective hours of red-teaming, a universal jailbreak was still discovered.</p>
<p>How can the same model refuse a payment reminder and accept requests to synthesize chemical weapons?</p>
<p>The answer lies in the nature of constitutional principles. They’re formulated in natural language, and the model learns to interpret them through statistical examples, not through deep semantic understanding. An attack that reformulates the request to not match learned patterns bypasses protections.</p>
<p>The most sophisticated jailbreaks exploit different vulnerabilities. Roleplay asks the model to play a character without the same restrictions. Obfuscation encodes the request in ways the model decodes but that don’t trigger safety checks (base64, different languages, slang). Prefilling, in some APIs, forces the start of the model’s response bypassing the point where it decides to refuse. Multi-turn manipulation builds context gradually through multiple messages, each harmless, that together lead the model to answer requests it would refuse if posed directly.</p>
<p>For deployers: Claude’s protections are insufficient for high-stakes use cases. If your application could be used to generate dangerous content, you need additional layers of moderation. Don’t rely solely on the model’s guardrails.</p>
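<p>A minimal shape for such a layer, assuming you have some moderation check available (an external API or an in-house classifier; the <code>moderate</code> callable below is a placeholder for it), is to screen both the request and the response rather than trusting the model’s own guardrails:</p>
<pre><code class="language-python"># Sketch of an input/output moderation wrapper around the model call.
# `moderate` is a hypothetical placeholder for your moderation service or classifier.
from typing import Callable

class BlockedRequest(Exception):
    """Raised when either the request or the model output fails moderation."""

def guarded_call(
    prompt: str,
    model: Callable[[str], str],       # your LLM call
    moderate: Callable[[str], bool],   # returns True if the text is considered safe
) -> str:
    # Check the input before it reaches the model: don't rely on the model's
    # own guardrails to catch adversarial phrasings.
    if not moderate(prompt):
        raise BlockedRequest("input failed moderation")
    output = model(prompt)
    # Check the output too: a jailbreak that slips past the input filter still
    # has to produce harmful text, which a separate classifier can catch.
    if not moderate(output):
        raise BlockedRequest("output failed moderation")
    return output
</code></pre>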
<hr />
<h2>The Sycophancy Problem</h2>
<p>The third failure mode is subtler and less discussed: Claude tends to agree with you even when you’re wrong.</p>
<p><a href="https://arxiv.org/abs/2310.13548">Anthropic itself published research</a> documenting pervasive sycophancy across all major AI assistants, including Claude. Documented behaviors include admitting errors not committed: if you tell the model “your previous response was wrong”, it often apologizes and “corrects” even when the original response was right. Feedback becomes biased: if you ask for evaluation of a text saying “I wrote it”, the model tends to be more positive than if you present the same text as written by someone else. On math problems where the user suggests a wrong answer, the model tends to agree with the wrong answer more often than it would without the suggestion.</p>
<p>The cause is structural. Constitutional AI (like RLHF) optimizes for expressed preferences. Both humans and AI models tend to prefer responses that agree with them, that are convincingly written, that avoid conflict. The model learns that “agreeing” produces higher reward.</p>
<p>The most concerning finding: larger models trained with more iterations of RLHF/RLAIF show increased sycophancy, not decreased. It’s not a bug that resolves with more training. It’s a consequence of the training method itself.</p>
<p>For deployers: if your application requires the model to give honest feedback, verify information, or contradict the user when wrong, expect degraded performance. Use cases like code review, fact-checking, tutoring are particularly impacted. Consider explicitly instructing the model to be critical, but know this only partially mitigates the problem.</p>
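<p>As a rough illustration, this is what an explicit anti-sycophancy instruction might look like with the Anthropic Python SDK; the system-prompt wording, the model id, and the example message are illustrative choices, not a validated recipe.</p>
<pre><code class="language-python"># Sketch: prompting against sycophancy with an explicit system instruction.
# Requires the `anthropic` package and an ANTHROPIC_API_KEY in the environment.
import anthropic

ANTI_SYCOPHANCY_SYSTEM = (
    "Evaluate the user's claims on their merits. If the user asserts something "
    "incorrect, say so directly and explain why. Do not change a correct answer "
    "just because the user pushes back. Do not soften criticism of work the user "
    "says they produced themselves."
)

client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",   # substitute whichever Claude model you deploy
    max_tokens=1024,
    system=ANTI_SYCOPHANCY_SYSTEM,
    messages=[{"role": "user", "content": "I think my fix is correct, right?"}],
)
print(message.content[0].text)
</code></pre>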
<hr />
<h2>The Non-English Language Problem</h2>
<p>The failure modes described so far amplify when the model operates in languages other than English. This directly impacts those operating in Italy.</p>
<p>The data is clear. <a href="https://arxiv.org/abs/2310.02446">Research on multilingual LLM safety</a> shows models produce harmful responses dramatically more often in languages other than English. A specific example: Llama-3.1-8B refuses “How to make a bomb?” in English but provides detailed instructions for the same question in Hindi.</p>
<p>The pattern repeats across models and languages. Attack success rate jumps from single digits in English to 50-70% in low-resource languages. Italian, while being a relatively well-represented language, is not immune.</p>
<p>The cause: safety training data (examples of refusals, definitions of harmful content, harmlessness preferences) is predominantly in English. Constitutional principles are written in English. The model generalizes imperfectly to other languages.</p>
<p>For applications serving Italian users, this has concrete implications. Guardrails that work in English are less reliable in Italian. A user wanting to bypass protections can simply formulate the request in Italian (or an even less-represented language) with greater probability of success.</p>
<p>Countermeasures are limited. You can translate requests to English before sending to the model, process in English, then translate responses back to Italian. But this adds latency, cost, and can introduce translation errors. You can add language-specific moderation layers for Italian, but this requires significant investment.</p>
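<p>The translate-route-translate pattern is simple to sketch; <code>translate</code> and <code>model</code> below are hypothetical placeholders for whatever translation service and model call you use.</p>
<pre><code class="language-python"># Sketch: route Italian requests through English, where safety training is
# strongest, then translate the answer back. Adds latency, cost, and translation risk.
from typing import Callable

def answer_in_italian(
    prompt_it: str,
    translate: Callable[[str, str, str], str],  # hypothetical: translate(text, source, target)
    model: Callable[[str], str],                # LLM call, operating in English
) -> str:
    prompt_en = translate(prompt_it, "it", "en")
    answer_en = model(prompt_en)
    return translate(answer_en, "en", "it")
</code></pre>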
<hr />
<h2>Implications for Enterprise Deployment</h2>
<p>What does all this mean for those deciding whether and how to use Claude in production?</p>
<p>Constitutional AI makes Claude a reasonable choice for general-purpose applications with non-adversarial users: customer service chatbots, internal assistants, productivity tools. Refusal rate on legitimate requests is manageable, and risk of harmful output is low if users aren’t actively seeking to abuse the system. It also works for use cases where overrefusal is acceptable: if your application can tolerate frequent refusals (with appropriate fallbacks), Claude’s guardrails are a net benefit. The transparency of principles is useful for compliance and audit: being able to say “the model follows these documented principles” is more defensible than “the model was trained on implicit preferences”.</p>
<p>Additional precautions are needed for creative applications. If you generate fiction, marketing copy, or content touching sensitive topics, expect high refusal rates. Prepare alternative prompts, fallbacks to less restrictive models, or workflows with human review. The same applies to applications requiring honest feedback like code review, tutoring, fact-checking: sycophancy is a structural problem. Consider aggressive prompt engineering to counter it, but don’t expect it to fully resolve. For multilingual applications, if you serve non-English speakers, guardrails are less reliable. Add language-specific moderation for the languages you support. For high-stakes applications where harmful output would have serious consequences (medical, legal, security), don’t rely solely on the model’s guardrails. Add layers of validation, external moderation, and human review.</p>
<p>Don’t expect guaranteed security against sophisticated attacks. The 100% jailbreak success with adaptive attacks means motivated attackers can bypass protections. If your application is an attractive target, assume it will be compromised. Don’t expect consistent behavior across languages: the model behaving well in English can behave very differently in Italian. Don’t expect sycophancy to improve with scale: larger, more trained models are not less sycophantic. Rather the opposite.</p>
<hr />
<h2>The Big Picture</h2>
<p>Constitutional AI represents a real improvement over previous alternatives. The data is clear: an 88% harmless rate versus 76% for traditional RLHF, at lower cost. For those using commercial models, this means Claude is genuinely safer than average.</p>
<p>But “safer than average” doesn’t mean “safe”. Documented failure modes are specific and predictable. The model refuses too much when superficial patterns trigger guardrails, even if context makes the request legitimate. It accepts too much when sophisticated attacks reformulate harmful requests in ways that don’t match learned patterns. It agrees with you even when you’re wrong, because sycophancy is incentivized by training itself. It’s less safe in languages other than English, because safety data is predominantly English.</p>
<p>None of these problems are unique to Claude or Constitutional AI. They’re limitations of current alignment approaches in general. But Constitutional AI makes them more predictable: if you understand the mechanism, you can anticipate where the model will fail.</p>
<p>For deployers, the question is not “is Claude safe?” but “are Claude’s failure modes acceptable for my use case?”. The answer depends on context. For many enterprise applications, Constitutional AI offers a reasonable trade-off between safety and usability. For high-stakes or adversarial applications, it’s not sufficient on its own.</p>
<p>The transparency about principles is a competitive advantage for Anthropic over other providers. <a href="https://www.anthropic.com/news/claudes-constitution">Claude’s constitution is public</a>. You can read it, understand what the model is trying to do, and decide if those principles align with your use cases. That’s more than others offer.</p>
<p>Constitutional AI doesn’t solve alignment. It makes the problem more manageable, more inspectable, more predictable. For those needing to deploy LLMs today, with today’s limitations, it’s a concrete step forward. It’s not the destination, but it’s a reasonable direction.</p>
<hr />
<h2>Sources</h2>
<p>Bai, Y., Kadavath, S., Kundu, S., et al. (2022). <a href="https://arxiv.org/abs/2212.08073"><em>Constitutional AI: Harmlessness from AI Feedback</em></a>. arXiv:2212.08073.</p>
<p>Lee, H., Phatale, S., Mansoor, H., et al. (2023). <a href="https://arxiv.org/abs/2309.00267"><em>RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback</em></a>. arXiv:2309.00267.</p>
<p>Andriushchenko, M., et al. (2024). <a href="https://arxiv.org/abs/2404.02151"><em>Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks</em></a>. arXiv:2404.02151.</p>
<p>Sharma, M., Tong, M., Korbak, T., et al. (2023). <a href="https://arxiv.org/abs/2310.13548"><em>Towards Understanding Sycophancy in Language Models</em></a>. arXiv:2310.13548.</p>
<p>Deng, Y., et al. (2023). <a href="https://arxiv.org/abs/2310.02446"><em>Multilingual Jailbreak Challenges in Large Language Models</em></a>. arXiv:2310.02446.</p>
<p>Anthropic. (2023). <a href="https://www.anthropic.com/news/claudes-constitution"><em>Claude’s Constitution</em></a>. Anthropic.</p>
<p>Anthropic. (2025). <a href="https://www.anthropic.com/research/constitutional-classifiers"><em>Constitutional Classifiers: Defending Against Universal Jailbreaks</em></a>. Anthropic.</p>
]]></content:encoded>
      <category>Research</category>
      <category>Engineering</category>
      <category>Governance</category>
      <category>Constitutional AI</category>
      <category>Claude</category>
      <category>Anthropic</category>
      <category>AI Safety</category>
      <category>LLM Deployment</category>
      <atom:link rel="alternate" hreflang="it" href="https://ireneburresi.dev/blog/research/constitutional-ai/"/>
    </item>
    <item>
      <title>AI 2026: Why Stanford Talks About a Reckoning</title>
      <link>https://ireneburresi.dev/en/blog/business/ai-2026-anno-resa-conti/</link>
      <guid isPermaLink="true">https://ireneburresi.dev/en/blog/business/ai-2026-anno-resa-conti/</guid>
      <pubDate>Sat, 20 Dec 2025 00:00:00 GMT</pubDate>
      <dc:creator>Irene Burresi</dc:creator>
      <dc:language>en</dc:language>
      <description><![CDATA[<p>42% of companies have already closed AI projects: Stanford HAI predicts that 2026 will reward only those who demonstrate measurable ROI, reliable vendors, and transparent metrics.</p>]]></description>
      <content:encoded><![CDATA[<h2>The Year of Reckoning: Why 2026 Will Be Critical for Enterprise AI</h2>
<p><em>42% of companies have already abandoned most of their AI projects. The data suggests the worst may not be over.</em></p>
<p><strong>TL;DR:</strong> 42% of companies abandoned AI projects in 2025, double the previous year. Stanford HAI predicts 2026 will be the year of reckoning: less hype, more demand for concrete proof. Brynjolfsson’s employment data shows the impact already: -20% for junior developers, +8% for senior. For investors, the implications are clear: metrics defined before launch, not after; vendor solutions (67% success rate) vs internal development (33%); attention to go-live timelines, which kill projects more than technology.</p>
<hr />
<p>In mid-December 2025, <a href="https://hai.stanford.edu/news/stanford-ai-experts-predict-what-will-happen-in-2026">nine faculty members from Stanford Human-Centered Artificial Intelligence</a> published their predictions for 2026. This is not the usual academic futurology exercise, but a collective statement with a clear message: the party is over.</p>
<p>James Landay, co-director of HAI, opens with a phrase that sounds almost provocative in an era of triumphalist announcements: “There will be no AGI this year.” The point, though, is what he adds immediately after: companies will begin publicly admitting that AI has not yet delivered the promised productivity increases, except in specific niches like programming and call centers. And we’ll finally hear about failed projects.</p>
<p>This is not a prediction about the future. It’s a snapshot of something already happening.</p>
<hr />
<h2>The Numbers No One Wants to Look At</h2>
<p>In July 2025, the <a href="https://projectnanda.org/">MIT Project NANDA</a> published a report that generated considerable debate for a single statistic: <strong>95% of enterprise AI projects generate no measurable return</strong>. The number has been contested, the methodology has its limitations, the definition of “success” is debatable. But it’s not an isolated data point.</p>
<p>During the same period, <a href="https://www.spglobal.com/market-intelligence/en/news-insights/research/2025/10/generative-ai-shows-rapid-growth-but-yields-mixed-results">S&amp;P Global</a> found that 42% of companies abandoned most of their AI initiatives in 2025. In 2024, the percentage was 17%. The abandonment rate has more than doubled in a year. On average, the surveyed organizations threw out 46% of proof-of-concepts before they reached production.</p>
<p>According to the <a href="https://www.rand.org/pubs/research_reports/RRA2680-1.html">RAND Corporation</a>, over 80% of AI projects fail, double the failure rate of traditional IT projects. Gartner reports that only 48% of AI projects reach production, and over 30% of GenAI projects will be abandoned after the proof of concept by end of 2025.</p>
<p>The causes are always the same: insufficient data quality (43% according to Informatica), lack of technical maturity (43%), skills shortage (35%). But beneath these numbers lies a deeper pattern. Companies are discovering that AI works in demos but not in production, generates enthusiasm in pilots but not ROI in balance sheets.</p>
<p>It’s these numbers that explain why Stanford HAI, an institution hardly known for technological pessimism, is shifting the conversation. No longer “can AI do this?” but “how well, at what cost, for whom?”.</p>
<hr />
<h2>Canaries in the Coal Mine</h2>
<p>If failure rates are the symptom, Erik Brynjolfsson’s work offers a more precise diagnosis. <a href="https://digitaleconomy.stanford.edu/publications/canaries-in-the-coal-mine/">“Canaries in the Coal Mine”</a>, published in August 2025 by Stanford’s Digital Economy Lab, is among the most rigorous studies currently available on AI’s impact on the job market.</p>
<p>The paper uses payroll data from ADP, the largest payroll services provider in the United States, covering over 25 million workers. The goal is to track employment changes in AI-exposed professions.</p>
<p>What emerges is clear. Employment for software developers ages 22-25 has declined <strong>20% from the peak of late 2022</strong>, roughly since the launch of ChatGPT, through July 2025. This is not an isolated data point: early-career workers in the most AI-exposed occupations show a relative decline of 13% compared to colleagues in less exposed roles.</p>
<p>The most interesting finding, though, is the age divergence. While young workers lose ground, workers over 30 in the same high-exposure categories have seen employment growth between 6% and 12%. Brynjolfsson puts it this way: “It appears that what young workers know overlaps with what LLMs can replace.”</p>
<p>It’s not a uniform effect, but a realignment: AI is eroding entry-level positions faster than it creates new roles. The “canaries in the coal mine”—young developers and customer support staff—are already showing symptoms of a larger change.</p>
<p>When Brynjolfsson predicts the emergence of “AI economic dashboards” that track these shifts in near-real-time, he’s not speculating. He’s describing the infrastructure needed to understand what’s happening, infrastructure that doesn’t exist today but could become urgent in 2026.</p>
<hr />
<h2>The Divergence Between Adoption and Results</h2>
<p>There’s a paradox in 2025 data that deserves attention. AI adoption is accelerating: according to <a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai">McKinsey</a>, the percentage of companies claiming to use AI rose from 55% in 2023 to 78% in 2024. Use of GenAI in at least one business function more than doubled, from 33% to 71%.</p>
<p>Yet, in parallel, project abandonment rates are growing instead of declining. S&amp;P Global shows a jump from 17% to 42% in a single year. The MIT NANDA report speaks of a <em>“GenAI Divide”</em>, a clear division between the 5% extracting real value and the 95% that remain stalled.</p>
<p>Many companies have gone through the phases of enthusiasm, pilots, impressive demos, and then crashed against the wall of real production. They discovered that the model works in a sandbox but not with their data; that integration into existing workflows is more complex than expected; that the ROI promised by vendors doesn’t materialize.</p>
<p>Angèle Christin, a communication sociologist and HAI senior fellow, puts it plainly: “San Francisco billboards saying ‘AI everywhere!!! For everything!!! All the time!!!’ betray a slightly manic tone.” Her prediction: we’ll see more realism about what we can expect from AI. Not necessarily the bubble bursting, but the bubble might stop inflating.</p>
<hr />
<h2>The Measurement Problem</h2>
<p>One of the most concrete, and potentially most significant, predictions comes again from Brynjolfsson. He proposes the emergence of high-frequency <em>“AI economic dashboards”</em>: tools that track, at the task and employment level, where AI is increasing productivity, where it’s displacing workers, where it’s creating new roles.</p>
<p>Today we have nothing like that. Labor market data arrives months late. Companies measure AI adoption but rarely its impact. Industry reports capture hype but not results.</p>
<p>If these dashboards do emerge in 2026, they’ll change how we talk about AI. The debate will shift from the generic “does AI have an impact?” to more precise questions: how fast is this impact spreading, who’s being left behind, which complementary investments work.</p>
<p>It’s an optimistic vision: better data leads to better decisions. But it’s also an implicit admission: today we’re navigating blind.</p>
<hr />
<h2>Healthcare and Legal: The Test Sectors</h2>
<p>Two sectors emerge from Stanford predictions as particularly relevant testbeds.</p>
<p>Nigam Shah, Chief Data Scientist at Stanford Health Care, describes a problem that anyone in the sector will recognize. Hospitals are flooded with startups wanting to sell AI solutions. “Every single proposal can be reasonable, but in aggregate they’re a tsunami of noise.”</p>
<p>According to Shah, 2026 will see systematic frameworks emerge for evaluating these solutions: technical impact, the population the model was trained on, ROI on hospital workflow, patient satisfaction, quality of clinical decisions. This is work Stanford is already doing internally, but it will need to extend to institutions with fewer technical resources.</p>
<p>Shah also signals a risk. Vendors, frustrated by hospitals’ long decision cycles, might start going directly to end users. “Free” applications for doctors and patients that bypass institutional controls. This is already happening: OpenEvidence for literature summaries, AtroposHealth for on-demand answers to clinical questions.</p>
<p>In the legal sector, Julian Nyarko predicts a similar shift. The focus will move from “does this model know how to write?” to more operational questions: accuracy, citation integrity, exposure to privilege violations. The sector is already working on specific benchmarks, like those based on <em>“LLM-as-judge”</em>, frameworks where one model evaluates another model’s output for complex tasks like multi-document summarization.</p>
<p>Healthcare and legal share a characteristic: they’re highly regulated, with severe consequences for error. If AI must prove its value anywhere, it’s where the test will be hardest. And most significant.</p>
<hr />
<h2>Track Record: How Reliable Are These Predictions?</h2>
<p>Stanford HAI publishes annual predictions going back several years. It’s worth asking how accurate they’ve been.</p>
<p>At the end of 2022, Russ Altman predicted for 2023 a <em>“shocking rollout of AI way before it’s mature or ready to go”</em>. It’s hard to find a more accurate description of what happened: ChatGPT, Bing Chat, Bard launched in rapid succession, with accuracy problems, hallucinations, embarrassing incidents. Altman had also predicted a “hit parade” of AI that isn’t ready for prime time but launches anyway, pushed by an overly zealous industry. Exactly right.</p>
<p>Percy Liang, also at the end of 2022, predicted that video would be a focus of 2023 and that “we might reach the point where we can’t tell if a human or computer generated a video”. He was a year early (Sora arrived in February 2024) but the direction was correct.</p>
<p>For 2024, Altman predicted a “rise of agents” and steps toward multimedia systems. Both came true, though agents are still more promise than production reality.</p>
<p>Not all predictions came true. Expectations of U.S. Congressional action were disappointed: Biden’s Executive Order happened, but the new administration changed direction. Overall, though, Stanford HAI’s track record is reasonable: they tend to be cautious rather than enthusiastic, and technical predictions are generally well-founded.</p>
<p>This doesn’t guarantee that 2026 predictions will come true. But it means they’re worth taking seriously.</p>
<hr />
<h2>What It Means for Decision-Makers</h2>
<p>If Stanford predictions and failure rate data converge on anything, it’s this: 2026 will be the year when enterprise AI must show results, not demos.</p>
<p>For those managing tech budgets, the implications are concrete.</p>
<p>On the <strong>metrics</strong> front, AI projects must have success criteria defined before launch, not after. Not “let’s explore AI for customer service” but “reduce average ticket resolution time by 15% within 6 months, with cost-per-interaction below X”. Projects without clear metrics have a disproportionately high chance of ending up among the 42% that get abandoned.</p>
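<p>One way to make this binding is to pin the criteria down as data the project is scored against, rather than prose in a slide. A minimal sketch, with figures echoing the example above (the second threshold is purely illustrative):</p>
<pre><code class="language-python"># Sketch: success criteria committed to before launch and checked against later.
from dataclasses import dataclass

@dataclass(frozen=True)
class SuccessCriterion:
    metric: str
    target_change_pct: float   # negative = the metric must decrease
    deadline_months: int

CRITERIA = [
    SuccessCriterion("avg_ticket_resolution_time", -15.0, 6),
    SuccessCriterion("cost_per_interaction", -10.0, 6),   # illustrative threshold
]

def meets_target(criterion: SuccessCriterion, observed_change_pct: float) -> bool:
    """Return True only if the observed change reaches the pre-committed target."""
    if criterion.target_change_pct >= 0:
        return observed_change_pct >= criterion.target_change_pct
    return observed_change_pct <= criterion.target_change_pct
</code></pre>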
<p>On the <strong>make-or-buy</strong> front, the MIT NANDA report indicates that solutions bought from specialized vendors have a 67% success rate, against 33% for internal development. This doesn’t mean internal development is always wrong, but it requires skills, data, and infrastructure that many organizations overestimate having.</p>
<p>On <strong>timing</strong>, mid-market enterprises move from pilot to production in about 90 days, according to the same report. Large enterprises take nine months or more. Bureaucracy kills AI projects more than technology does.</p>
<p>Finally, a matter of <strong>honesty</strong>. The shadow economy of AI (90% of employees use personal tools like ChatGPT for work, according to MIT NANDA) indicates that individuals already know where AI works better than official enterprise initiatives. Instead of fighting it, organizations could learn from this spontaneous adoption.</p>
<hr />
<h2>What’s Missing</h2>
<p>Stanford predictions have clear blind spots.</p>
<p>None of the experts mention energy consumption and AI’s environmental impact. Christin hints at it (“tremendous environmental costs of the current build-out”) but the topic isn’t developed. Yet AI data centers are becoming one of the world’s biggest energy consumers, and this will eventually factor into ROI calculations.</p>
<p>There’s also a lack of serious discussion about market concentration. Frontier models are developed by a handful of companies. This creates dependencies, influences pricing, determines who can compete. It’s a strategic factor that anyone planning AI investments should consider.</p>
<p>Landay alludes to <em>“AI sovereignty”</em>, countries wanting independence from American providers, but the topic remains superficial. This is rapidly evolving, with significant geopolitical implications, that deserves deeper analysis.</p>
<hr />
<h2>A Shift in Tone</h2>
<p>More than the individual predictions, what stands out in the Stanford article is the tone. There is none of the industry’s typical enthusiasm. No promises of imminent transformation. There is caution, a demand for proof, an emphasis on measurement.</p>
<p>When the co-director of a Stanford AI institute opens by saying “there will be no AGI this year,” he’s taking a stand against a dominant narrative. When economists like Brynjolfsson publish data on young workers losing employment, they’re documenting costs, not just benefits.</p>
<p>This doesn’t mean AI is overvalued or that projects should stop. It means the phase of uncritical adoption is ending. Whoever continues to invest will need to do so with calibrated expectations, defined metrics, ability to admit failure when it occurs.</p>
<p>2026, if these predictions are correct, will be the year when we discover which AI projects were sound and which were built on hype. For many organizations it will be a painful discovery. For others, an opportunity: whoever has already learned to measure, iterate, and distinguish value from promise will have a competitive advantage that generic enthusiasm cannot buy.</p>
<hr />
<h2>Sources</h2>
<p>Brynjolfsson, E., Chandar, B., &amp; Chen, R. (2025). <a href="https://digitaleconomy.stanford.edu/publications/canaries-in-the-coal-mine/"><em>Canaries in the Coal Mine: Six Facts about the Recent Employment Effects of AI</em></a>. Stanford Digital Economy Lab.</p>
<p>McKinsey &amp; Company. (2024). <a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai"><em>The State of AI in 2024: Gen AI adoption spikes and starts to generate value</em></a>. McKinsey Global Institute.</p>
<p>MIT Project NANDA. (2025). <a href="https://projectnanda.org/"><em>The GenAI Divide 2025</em></a>. Massachusetts Institute of Technology.</p>
<p>RAND Corporation. (2024). <a href="https://www.rand.org/pubs/research_reports/RRA2680-1.html"><em>The Root Causes of Failure for Artificial Intelligence Projects and How They Can Succeed</em></a>. RAND Research Reports.</p>
<p>S&amp;P Global Market Intelligence. (2025, October). <a href="https://www.spglobal.com/market-intelligence/en/news-insights/research/2025/10/generative-ai-shows-rapid-growth-but-yields-mixed-results"><em>Generative AI Shows Rapid Growth but Yields Mixed Results</em></a>. S&amp;P Global.</p>
<p>Stanford HAI. (2025, December). <a href="https://hai.stanford.edu/news/stanford-ai-experts-predict-what-will-happen-in-2026"><em>Stanford AI Experts Predict What Will Happen in 2026</em></a>. Stanford Human-Centered Artificial Intelligence.</p>
]]></content:encoded>
      <category>Business</category>
      <category>Research</category>
      <category>Other</category>
      <category>Stanford HAI</category>
      <category>AI 2026</category>
      <category>Enterprise AI</category>
      <category>Market Analysis</category>
      <category>AI Predictions</category>
      <atom:link rel="alternate" hreflang="it" href="https://ireneburresi.dev/blog/business/ai-2026-anno-resa-conti/"/>
    </item>
    <item>
      <title>You&apos;re Measuring AI Wrong</title>
      <link>https://ireneburresi.dev/en/blog/business/misurare-ia/</link>
      <guid isPermaLink="true">https://ireneburresi.dev/en/blog/business/misurare-ia/</guid>
      <pubDate>Sat, 20 Dec 2025 00:00:00 GMT</pubDate>
      <dc:creator>Irene Burresi</dc:creator>
      <dc:language>en</dc:language>
      <description><![CDATA[<p>60% of managers mismeasure AI because they track hours saved, not impact. Segment by role, separate augmentative from substitutive use, and monitor weekly.</p>]]></description>
      <content:encoded><![CDATA[<h2>The measurement paradox</h2>
<p><em>60% of managers admit they need better KPIs for AI. Only 34% are doing anything about it. Meanwhile, the data that actually matters already exists, but nobody’s looking at it.</em></p>
<p><strong>TL;DR:</strong> Companies measure activity (hours saved, tasks automated) instead of impact. A Stanford paper analyzing 25 million workers shows what to do instead: segment by role and seniority, distinguish substitutive from augmentative use, use control groups, monitor in real time. Those who adopt these principles will have an information advantage over those still tracking vanity metrics.</p>
<hr />
<p>The 2025 AI adoption reports tell a strange story. On one hand, companies claim to measure everything: completed deployments, hours saved, tickets handled, costs reduced. On the other, <a href="https://www.spglobal.com/market-intelligence/en/news-insights/research/2025/10/generative-ai-shows-rapid-growth-but-yields-mixed-results">42% are abandoning most of their AI projects</a>, more than double the previous year. According to <a href="https://projectnanda.org/">MIT NANDA</a>, <strong>95% of pilot projects</strong> generate no measurable impact on the bottom line.</p>
<p>If we measure so much, why do we fail so often?</p>
<p>The problem is we’re measuring the wrong things. Typical enterprise AI metrics (time saved per task, volume of automated interactions, cost per query) capture activity, not impact. They tell you whether the system works technically, not whether it’s creating or destroying value.</p>
<p>A paper published in August 2025 by Stanford’s Digital Economy Lab offers a different approach to what it means to truly measure AI. And the implications for those managing technology investments are concrete.</p>
<hr />
<h2>The vanity metrics problem</h2>
<p>Most corporate AI dashboards track variants of the same metrics: how many requests processed, how much time saved per interaction, what percentage of tasks automated. These are numbers that grow easily and look good in slides. Their flaw is fundamental: they say nothing about real business impact.</p>
<p>A chatbot handling 10,000 tickets per month looks like a success. But if those tickets still require human escalation 40% of the time, if customer satisfaction has dropped, if your most profitable customers are migrating to competitors, the number of tickets handled captures none of this.</p>
<p>The S&amp;P Global 2025 report documents exactly this pattern: companies that accumulated “deployments” and “completed experiments” only to discover, months later, that ROI wasn’t materializing. Costs were real and immediate; benefits were vague and perpetually deferred to next quarter.</p>
<p>According to an MIT Sloan analysis, <strong>60% of managers recognize they need better KPIs</strong> for AI. But only 34% are actually using AI to create new performance indicators. The majority continues using the same metrics they used for traditional IT projects, metrics designed for deterministic software, not for probabilistic systems interacting with complex human processes.</p>
<hr />
<h2>What serious measurement looks like</h2>
<p><a href="https://digitaleconomy.stanford.edu/publications/canaries-in-the-coal-mine/">“Canaries in the Coal Mine”</a>, the paper by Erik Brynjolfsson, Bharat Chandar, and Ruyu Chen published by Stanford’s Digital Economy Lab, isn’t about how companies should measure AI. It’s about how AI is changing the labor market. But the method it uses is exactly what’s missing from most enterprise evaluations.</p>
<p>The authors obtained access to payroll data from ADP, the largest payroll processor in the United States, with monthly records of over 25 million workers. Not surveys, not self-reports, not estimates: granular administrative data on who gets hired, who leaves, how much they earn, in which role, at which company.</p>
<p>They then cross-referenced this data with two AI exposure metrics: one based on theoretical task analysis (which jobs are technically automatable) and one based on actual usage data (how people actually use Claude, Anthropic’s model, in daily work).</p>
<p>The result is an X-ray of AI’s impact with unprecedented granularity. Not the generic “AI is changing work” but precise numbers: employment for software developers aged 22-25 dropped <strong>20% from the late 2022 peak</strong>, while for those over 35 in the same roles it grew 8%. In professions where AI use is predominantly substitutive, young workers lose employment; where it’s predominantly augmentative, there’s no decline.</p>
<p>This type of measurement should inform corporate AI decisions. Not because companies need to replicate this exact study, but because it illustrates three principles that most enterprise metrics ignore entirely.</p>
<hr />
<h2>Measure differential effects, not averages</h2>
<p>Aggregate data hides more than it reveals. If you only measure “hours saved by AI,” you don’t see who’s saving those hours and who’s losing their job. If you only measure “tickets automated,” you don’t see which customers are receiving worse service.</p>
<p>The Stanford paper shows that AI’s impact differs radically by age group. Workers aged 22-25 in exposed professions saw a 13% employment decline relative to colleagues in less exposed roles. Workers over 30 in the same professions saw growth. The average effect is nearly zero, but the real effect is massive redistribution.</p>
<p>For a CFO, aggregate productivity metrics can mask hidden costs. If AI is increasing output from the senior team while making it impossible to hire and train juniors, the short-term gain could transform into a talent pipeline problem in the medium term. The paper calls it the <em>“apprenticeship paradox”</em>: companies stop hiring entry-level workers because AI handles those tasks better, but without entry-level today there won’t be seniors tomorrow.</p>
<p>The operational consequence is that every AI dashboard should segment impact by role, seniority, team, and customer type. A single “productivity” number is almost always misleading.</p>
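<p>In practice that means a dashboard query closer to the sketch below than to a single average; the column names and figures are assumptions about your own HR and usage data, not a standard schema.</p>
<pre><code class="language-python"># Sketch: segmenting an "AI impact" metric instead of reporting one average.
import pandas as pd

df = pd.DataFrame({
    "role":       ["developer", "developer", "developer", "analyst"],
    "seniority":  ["junior", "junior", "senior", "senior"],
    "hours_saved_per_week": [1.5, 2.0, 4.0, 3.0],
    "headcount_change":     [-1, -1, 0, 1],
})

# A single mean hides the redistribution; the per-segment view shows who gains and who loses.
overall = df["hours_saved_per_week"].mean()
by_segment = df.groupby(["role", "seniority"]).agg(
    avg_hours_saved=("hours_saved_per_week", "mean"),
    net_headcount=("headcount_change", "sum"),
)
print(overall)
print(by_segment)
</code></pre>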
<hr />
<h2>Distinguish substitutive from augmentative use</h2>
<p>One of the paper’s most relevant findings concerns the difference between substitutive and augmentative AI use. The authors used Anthropic’s data to classify how people actually use language models: to generate final outputs (substitution) or to iterate, learn, and validate (augmentation).</p>
<p>In professions where use is predominantly substitutive, youth employment has collapsed. Where use is predominantly augmentative, there’s no decline; in fact, some of these categories show above-average growth.</p>
<p>Not all “deployments” are equal. A system that automatically generates financial reports substitutes human labor differently from one that helps analysts explore scenarios. Metrics should capture this distinction: classify each AI application as predominantly substitutive or augmentative, separately track impact on headcount, skill mix, and internal training capacity. Augmentative systems might have less immediate ROI but more sustainable effects.</p>
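<p>A crude way to operationalize the classification is to look at how outputs are actually used, for example from interaction logs. The sketch below is illustrative only: the log fields and the thresholds are assumptions, not a validated methodology.</p>
<pre><code class="language-python"># Sketch: classifying an AI application as substitutive or augmentative
# from usage logs. Fields and thresholds are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Session:
    turns: int                  # how many back-and-forth iterations
    output_used_verbatim: bool  # did the final artifact ship essentially unedited?

def classify(sessions: list[Session]) -> str:
    verbatim_share = sum(s.output_used_verbatim for s in sessions) / len(sessions)
    avg_turns = sum(s.turns for s in sessions) / len(sessions)
    # Mostly single-shot, shipped-unedited output looks like substitution;
    # many iterations with human editing looks like augmentation.
    if verbatim_share > 0.7 and avg_turns <= 2:
        return "substitutive"
    return "augmentative"
</code></pre>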
<hr />
<h2>Control for external shocks</h2>
<p>One of the Stanford paper’s most sophisticated methodological aspects is its use of firm-time fixed effects. In practice, the authors compare workers within the same company in the same month, thus isolating the AI exposure effect from any other factor affecting the company: budget cuts, sector slowdown, strategy changes.</p>
<p>The result: even controlling for all these factors, young workers in AI-exposed roles show a relative decline of <strong>16%</strong> compared to colleagues in non-exposed roles at the same company.</p>
<p>This kind of rigor is rare in corporate evaluations. When an AI project launches and costs drop, it’s easy to credit the AI. But maybe costs would have dropped anyway due to seasonal factors. Maybe the team was already optimizing before the launch. Maybe the comparison is with an anomalous period.</p>
<p>The solution is to define baselines and control groups before launch. Don’t compare “before vs after” but “treated vs untreated” in the same period. Use A/B tests where possible, or at least comparisons with teams, regions, or segments that haven’t adopted AI.</p>
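<p>The simplest version of “treated vs untreated” is a difference-in-differences on team-level metrics, as in the sketch below; the column names and numbers are invented for illustration.</p>
<pre><code class="language-python"># Sketch: "treated vs untreated in the same period" instead of "before vs after".
import pandas as pd

df = pd.DataFrame({
    "team":       ["A", "A", "B", "B"],
    "adopted_ai": [True, True, False, False],
    "period":     ["pre", "post", "pre", "post"],
    "avg_resolution_hours": [10.0, 8.0, 10.5, 10.0],
})

means = df.pivot_table(index="adopted_ai", columns="period", values="avg_resolution_hours")
change_treated   = means.loc[True, "post"]  - means.loc[True, "pre"]
change_untreated = means.loc[False, "post"] - means.loc[False, "pre"]
# The difference-in-differences strips out whatever moved both groups
# (seasonality, budget cuts, sector-wide shifts).
did_effect = change_treated - change_untreated
print(did_effect)  # negative = the AI group improved more than the control group
</code></pre>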
<hr />
<h2>Toward high-frequency economic dashboards</h2>
<p>In his predictions for 2026, Brynjolfsson proposed the idea of <em>“AI economic dashboards”</em>, tools that track AI’s economic impact in near real-time, updated monthly instead of with the typical delays of official statistics.</p>
<p>It’s an ambitious proposal at the macro level. But the underlying logic is applicable at the company level: stop waiting for quarterly reports to understand if AI is working and instead build continuous monitoring systems that capture effects as they emerge.</p>
<p>Most AI projects are evaluated like traditional investments: ex-ante business case, periodic reviews, final post-mortem. But AI doesn’t behave like a traditional asset. Its effects are distributed, emergent, often unexpected. A continuous monitoring system can catch drift before it becomes a problem.</p>
<p>In practice, this means working with real-time data instead of retrospective data. If the payroll system can tell you today how many people were hired yesterday in each role, you can track AI’s effect on headcount with a lag of days, not months. The same applies to tickets handled, sales closed, errors detected.</p>
<p>Another key principle: favor leading metrics over lagging ones. The actual utilization rate (how many employees actually use the AI tool every day) is a leading indicator. If it drops, there are problems before they show up in productivity numbers.</p>
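<p>Computing that leading indicator is straightforward once usage events are logged; the event schema in the sketch below is an assumption about your own telemetry.</p>
<pre><code class="language-python"># Sketch: daily utilization rate as a leading indicator, computed from usage logs.
import pandas as pd

events = pd.DataFrame({
    "date":      pd.to_datetime(["2026-01-05"] * 4 + ["2026-01-06"] * 4),
    "user_id":   [1, 2, 3, 4, 1, 2, 3, 4],
    "used_tool": [True, True, False, True, True, False, False, False],
})

# Daily share of eligible users who actually touched the AI tool.
daily_utilization = events.groupby("date")["used_tool"].mean()
print(daily_utilization)  # a sustained drop here precedes any dip in output metrics
</code></pre>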
<p>As the Stanford paper segments by age, corporate dashboards should segment by role, tenure, and prior performance. AI might help top performers while harming others, or vice versa.</p>
<p>Internal comparisons are also essential: teams that adopted AI vs teams that didn’t, periods with the feature active vs periods with it deactivated. These comparisons are more informative than pure time trends.</p>
<hr />
<h2>The cost of not measuring</h2>
<p>There’s a direct economic argument for investing in better measurement. The 42% of companies that abandoned AI projects in 2025 spent budget, time, and management attention only to get nothing. With better metrics, some of those projects would have been stopped earlier. Others would have been corrected mid-course. Others still would never have started.</p>
<p>The MIT NANDA report estimates that companies are spending <strong>$30-40 billion per year</strong> on generative AI. If 95% generates no measurable ROI, we’re talking about tens of billions burned. Not because the technology doesn’t work, but because it’s applied poorly, measured worse, and therefore never corrected.</p>
<p>The Brynjolfsson paper offers a model of what AI measurement could be. Administrative data instead of surveys. Demographic granularity instead of aggregate averages. Rigorous controls instead of naive comparisons. Continuous monitoring instead of point-in-time evaluations.</p>
<p>No company has Stanford’s resources or access to ADP’s data. But the principles are transferable: segment, distinguish substitutive from augmentative use, control for confounding factors, monitor in real time. Those who adopt these principles will have an information advantage over those who continue tracking deployments and hours saved.</p>
<hr />
<h2>Sources</h2>
<p>Brynjolfsson, E., Chandar, B., &amp; Chen, R. (2025). <a href="https://digitaleconomy.stanford.edu/publications/canaries-in-the-coal-mine/"><em>Canaries in the Coal Mine: Six Facts about the Recent Employment Effects of AI</em></a>. Stanford Digital Economy Lab.</p>
<p>Deloitte AI Institute. (2025). <a href="https://www2.deloitte.com/us/en/pages/consulting/articles/state-of-generative-ai-in-enterprise.html"><em>State of Generative AI in the Enterprise</em></a>. Deloitte.</p>
<p>MIT Project NANDA. (2025). <a href="https://projectnanda.org/"><em>The GenAI Divide 2025</em></a>. Massachusetts Institute of Technology.</p>
<p>MIT Sloan Management Review. (2024). <a href="https://sloanreview.mit.edu/projects/the-future-of-strategic-measurement-enhancing-kpis-with-ai/"><em>The Future of Strategic Measurement: Enhancing KPIs With AI</em></a>. MIT Sloan.</p>
<p>S&amp;P Global Market Intelligence. (2025, October). <a href="https://www.spglobal.com/market-intelligence/en/news-insights/research/2025/10/generative-ai-shows-rapid-growth-but-yields-mixed-results"><em>Generative AI Shows Rapid Growth but Yields Mixed Results</em></a>. S&amp;P Global.</p>
]]></content:encoded>
      <category>Business</category>
      <category>Research</category>
      <category>Methodology</category>
      <category>KPI</category>
      <category>Metrics</category>
      <category>AI Measurement</category>
      <category>Enterprise AI</category>
      <category>ROI</category>
      <atom:link rel="alternate" hreflang="it" href="https://ireneburresi.dev/blog/business/misurare-ia/"/>
    </item>
  </channel>
</rss>