Evals – The Product Scientist

By Amrita Sarkar, The Product Scientist

Illustration depicting the concept of evaluation sovereignty in AI, featuring a bridge labeled 'The Sovereignty Gap' that contrasts imported benchmarks with sovereign evaluation infrastructure. The image includes elements such as the Golden Gate Bridge, diverse cultural landmarks, and key points about correct, safe, and relevant evaluation criteria.

Over the past two years, governments across APAC and the GCC have announced sovereign AI strategies with genuine conviction: national language models, domestic data centers, local cloud regions, GPU clusters measured in the tens of thousands. The UAE’s Falcon models now lead the Open Arabic LLM Leaderboard. Saudi Arabia’s ALLaM anchors a national AI champion. India’s AI Mission has deployed tens of thousands of GPUs and seeded India-specific datasets. Singapore has built what may be the world’s most complete AI governance toolkit.

And then, almost without exception, these sovereign systems are judged using foreign scorecards.

We have localized the infrastructure, but not the judgment.

That is the contradiction this essay is about. AI sovereignty is not only about where the model is hosted, where the GPUs sit, or whose cloud you use. It is also about who gets to decide whether the AI is good, safe, fair, useful, and legally acceptable. A country can build its own model, host it in its own data center, and regulate it under its own AI law — but if it still evaluates that model using benchmarks designed elsewhere, the final authority over “good AI” still sits elsewhere.

No serious nation would say: “We will run our own economy, but another country can define inflation, risk, and acceptable volatility for us.” Monetary policy is sovereign because whoever sets the price of money shapes the economy. Benchmarks set the price of intelligence. Right now, most of the world imports that price.

Benchmarks are not neutral

Infographic comparing two evaluation paths: Imported Benchmark Path and Sovereign Evaluation Path, highlighting their key characteristics and final approval authorities.

The hidden assumption in global AI is that a good benchmark is universal. It is not. No benchmark is neutral; every benchmark carries a worldview.

Take MMLU, among the most cited capability benchmarks in the industry. It spans 57 subjects across STEM, the humanities, the social sciences, and professional fields such as law. That makes it useful, and also unmistakably a product of a particular academic and institutional context, much of it shaped by Western educational systems. HELM, Stanford’s holistic evaluation framework, is broader and more thoughtful, covering accuracy, calibration, robustness, fairness, and toxicity. Yet even HELM’s own authors acknowledge gaps, including neglected dialects and underrepresented trustworthiness metrics.

Let me be precise, because precision matters here. The problem is not that MMLU or HELM are bad. They have advanced the field enormously. The problem is that they were never designed as sovereign deployment instruments for APAC and GCC societies. They reflect the datasets, languages, institutions, harm taxonomies, and academic assumptions available to their creators. That makes them useful global baselines — and weak final arbiters of national AI readiness.

A simple metaphor: a benchmark is a driving test. You would not certify a driver in Mumbai, Riyadh, or Singapore using only a California road test. The physics of driving is universal; the road conditions, signs, laws, and risks are local. AI is the same. General intelligence may be global. Deployment fitness is local.

Illustration depicting a car labeled 'AI' undergoing a driving test benchmark in different locations: California (Pass), Mumbai, Riyadh, and Singapore (all marked as uncertain), with a note about general intelligence being global but deployment fitness being local.

Why APAC and the GCC are different

This is not an abstract concern, because the regions in question are not abstract markets.

A benchmark assumption map displaying various evaluations such as local law, cultural context, and regulatory aspects for AI deployment with scores indicating levels of fit.

Singapore is multilingual, trust-first, and financial-services-heavy. Its regulators have already conceded the core of this argument in writing: IMDA’s own testing guidance notes that bias and data-leakage evaluation are highly context-dependent and that generic public benchmarks will not cover them, pointing instead to Singapore-context datasets. The Monetary Authority of Singapore has proposed AI risk management guidelines for all financial institutions, covering AI inventories, lifecycle controls, and board accountability. When evaluation results need to function as supervisory evidence, an imported benchmark with no mapping to local regulatory expectations is not just insufficient — it is inadmissible.

A data visualization panel displaying the jurisdiction readiness for Singapore, including local language requirements, data protection and AI governance, high-risk deployment sectors, cultural-context risk areas, required golden datasets, and judge model calibration. The panel indicates a sovereignty maturity score of 78 out of 100, categorized as mature.

The UAE and Saudi Arabia have achieved something remarkable: genuine model sovereignty. Falcon-H1 Arabic tops the Open Arabic LLM Leaderboard. ALLaM was refined with input from more than six hundred domain experts and two hundred fifty evaluators, a national human-evaluation capability most countries cannot assemble. Notably, the Gulf has already proven the training-data version of this thesis: Falcon Arabic was deliberately built on native, non-translated Arabic spanning Modern Standard Arabic and regional dialects, because translated data misses the language. The same logic applies to testing the model. A sovereign model validated mostly on evaluation methodologies designed elsewhere is a sovereign asset with a foreign auditor.

A dashboard displaying the jurisdiction readiness panel for the UAE, highlighting local language requirements, data protection, high-risk deployment sectors, cultural-context risk areas, required datasets, and judge model calibration metrics.

India is the hardest test of the thesis and therefore its most persuasive case. It has twenty-two scheduled languages and pervasive code-switching, and AI is being wired into welfare delivery, agriculture advisories, courts, and credit through population-scale digital public infrastructure. A model scoring 90% on MMLU tells you almost nothing about whether it can safely advise a farmer in Bhojpuri, parse a vernacular loan document, or avoid caste- and community-specific harms that appear in no Western taxonomy. India’s own institutions have effectively conceded the premise: AIKosh exists precisely because Western, English-dominated datasets underperform in India’s multilingual context, and the IndiaAI Safety Institute was chartered to build indigenous safety research relevant to the developing world. The country that built UPI because foreign payment rails did not fit India should recognize evaluation as the next set of rails.

A data visualization panel displaying jurisdiction readiness information for India, including local language requirements, data protection governance, sovereign maturity score, and high-risk deployment sectors.

The sovereign evaluation stack

Philosophy without architecture is commentary. Here is the system.

A dashboard displaying the Sovereign AI Evaluation Control Plane metrics, including a Sovereignty Score of 74/100, a high Benchmark Dependency Risk, a Local Language Coverage of 61%, a Regulator Alignment of 82%, and a Sector Deployment Readiness rated as Medium.

Domestic golden datasets. Not generic internet data — curated, regulator-reviewed, domain-specific evaluation sets. For Singapore: MAS compliance queries, PDPA consent scenarios, Singlish and multilingual customer-service cases, cross-border payments. For the Gulf: Arabic dialect handling, Islamic finance rules, government-service workflows, Arabic-English code-switching, culturally calibrated refusal behavior. For India: multilingual public-service queries, UPI fraud patterns, caste, gender, and religion bias tests, low-literacy user journeys. And these must be governed like living assets: versioned, owned, with explicit staleness criteria. A benchmark is a documented hypothesis about quality. Like any hypothesis, it needs falsifiable predictions and kill signals.
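As a minimal sketch of what "versioned, owned, with explicit staleness criteria" could look like in practice, the Python below models dataset metadata and flags sets whose last review is too old. The field names, dataset names, owners, and review windows are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class GoldenDataset:
    """Metadata for one curated evaluation set (field names are illustrative)."""
    name: str
    version: str
    owner: str            # accountable team or agency (hypothetical values below)
    last_reviewed: date   # date of the most recent expert or regulator review
    max_age_days: int     # staleness criterion: a review older than this is stale

    def is_stale(self, today: date) -> bool:
        """True when the last review is older than the allowed age."""
        return today - self.last_reviewed > timedelta(days=self.max_age_days)


def stale_datasets(datasets, today):
    """Return the names of datasets that have passed their staleness limit."""
    return [d.name for d in datasets if d.is_stale(today)]


# Hypothetical registry entries, not real datasets.
registry = [
    GoldenDataset("sg-pdpa-consent", "1.2.0", "owner-a", date(2025, 1, 10), 180),
    GoldenDataset("in-upi-fraud", "0.9.1", "owner-b", date(2025, 6, 1), 90),
]
flagged = stale_datasets(registry, today=date(2025, 8, 1))
print(flagged)
```

A staleness flag is one possible "kill signal": a dataset that fails the check would be reviewed or retired before its scores are used as evidence.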

Region-calibrated judge models. Modern evaluation increasingly runs on LLM-as-judge pipelines. But if the judge model is trained on external norms, the evaluation is still imported — you have localized the exam and imported the examiner. A sovereign judge model must understand local law, local language, local idiom, local risk thresholds, and local protected categories, and its agreement with human judgment must be audited against local annotators, not crowdworkers from another continent. The judge does not replace human review; it scales first-pass evaluation while staying aligned to the jurisdiction.
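One common way to audit a judge model against local annotators is a chance-corrected agreement statistic such as Cohen's kappa. The sketch below is an assumption about how such an audit might be run, not a method the essay prescribes; the labels are made-up sample data, and the acceptable kappa level would be a policy choice for each jurisdiction.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    # Expected agreement if each rater labelled independently at their own rates.
    expected = sum(
        (judge_counts[label] / n) * (human_counts[label] / n)
        for label in set(judge_counts) | set(human_counts)
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up verdicts on eight responses: judge model vs. local human annotators.
judge = ["safe", "safe", "unsafe", "safe", "unsafe", "unsafe", "safe", "safe"]
human = ["safe", "unsafe", "unsafe", "safe", "unsafe", "safe", "safe", "safe"]
kappa = cohens_kappa(judge, human)
print(round(kappa, 3))
```

Here the raw agreement is 6 of 8, but kappa is lower because some of that agreement would occur by chance, which is why raw accuracy alone can overstate how well a judge tracks local human judgment.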

Regulator-aligned rubrics. Every metric should trace to an actual supervisory expectation — MAS guidelines and PDPA in Singapore; DIFC, ADGM, and data protection regimes in the Gulf; RBI, sectoral regulators, and the DPDP Act in India. This is what converts evaluation output from a research artifact into supervisory evidence.
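The traceability requirement can be checked mechanically. The sketch below assumes a simple mapping from metric name to the supervisory expectation it evidences; every metric name and reference string is a hypothetical placeholder, not a citation of any real guideline clause.

```python
# Map each metric to the supervisory expectation it is meant to evidence.
# All names and reference strings are hypothetical placeholders.
RUBRIC_TRACE = {
    "consent_handling_accuracy": "placeholder: data-protection consent obligation",
    "suitability_disclosure_rate": "placeholder: financial-advice suitability expectation",
    "fluency_score": None,  # no supervisory expectation identified yet
}


def untraced_metrics(trace):
    """Return metrics that map to no supervisory expectation, sorted by name."""
    return sorted(name for name, source in trace.items() if not source)


orphans = untraced_metrics(RUBRIC_TRACE)
print(orphans)
```

Metrics that come back from this check are still useful research signals, but under the argument above they would not count as supervisory evidence until they are traced.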

Sector-specific failure libraries. A bad AI answer in a shopping app is not the same as a bad AI answer in a bank, a hospital, a court, or a benefits system. Sovereign evaluation should maintain catalogued failure modes by sector: fraud, AML, and suitability in banking; triage boundaries and clinical hallucination in healthcare; citation accuracy and jurisdictional mismatch in legal AI; eligibility and appeal pathways in government services.
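A failure library can start as a plain catalogue keyed by sector. The sketch below uses the sectors and failure modes named in the paragraph above; the identifiers and the `coverage_gaps` helper are illustrative assumptions about how a team might check which catalogued failures their test suite does not yet exercise.

```python
# Sectors and failure modes taken from the paragraph above; identifiers are illustrative.
FAILURE_LIBRARY = {
    "banking": ["fraud", "aml", "suitability"],
    "healthcare": ["triage_boundary", "clinical_hallucination"],
    "legal": ["citation_accuracy", "jurisdictional_mismatch"],
    "government_services": ["eligibility", "appeal_pathway"],
}


def coverage_gaps(sector, tested_modes):
    """Return catalogued failure modes for a sector that no test case exercises."""
    tested = set(tested_modes)
    return [mode for mode in FAILURE_LIBRARY[sector] if mode not in tested]


gaps = coverage_gaps("banking", tested_modes=["fraud"])
print(gaps)
```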

Infographic depicting the five-layer sovereign evaluation stack, including layers for Domestic Golden Datasets, Region-Calibrated Judge Models, Regulator-Aligned Rubrics, Sector Failure Library, and Deployment Readiness Gate, with respective key components listed under each layer.

The business implication

This is not only an agenda for governments. A global bank deploying generative AI in Singapore, Dubai, Riyadh, and Mumbai cannot rely on one universal safety score. It needs jurisdiction-specific evaluation packs — the same model, certified differently for each regulatory and linguistic environment it operates in. The institutions that build this capability early will clear regulatory review faster, deploy with more confidence, and carry measurably less model risk than competitors still waving a single global leaderboard score at four different supervisors.
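To make "the same model, certified differently" concrete, here is a minimal sketch in which each jurisdiction carries its own pack of minimum scores and one model is checked against each. The metric names, thresholds, and scores are invented for illustration and do not reflect any real regulatory requirement.

```python
from dataclasses import dataclass


@dataclass
class EvaluationPack:
    """Minimum acceptable scores for one jurisdiction (values are hypothetical)."""
    jurisdiction: str
    thresholds: dict  # metric name -> minimum acceptable score


def certify(scores, pack):
    """Return (passed, failures) for one model's scores against one pack.

    A metric missing from `scores` is treated as 0.0, so it counts as a failure.
    """
    failures = sorted(
        metric
        for metric, minimum in pack.thresholds.items()
        if scores.get(metric, 0.0) < minimum
    )
    return (not failures, failures)


packs = [
    EvaluationPack("Singapore", {"local_language": 0.80, "regulatory_alignment": 0.90}),
    EvaluationPack("India", {"local_language": 0.85, "regulatory_alignment": 0.80}),
]
# The same model, measured on each jurisdiction's own evaluation sets (made-up numbers).
model_scores = {
    "Singapore": {"local_language": 0.88, "regulatory_alignment": 0.92},
    "India": {"local_language": 0.61, "regulatory_alignment": 0.84},
}
results = {p.jurisdiction: certify(model_scores[p.jurisdiction], p) for p in packs}
print(results)
```

In this toy run the model clears the Singapore pack and fails the India pack on local-language coverage, which a single global score would have hidden.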

Screenshot of a recommendation engine interface listing six directives for evaluating AI systems, including generating an evaluation sovereignty report.

There is also a market being born here. Someone will build the assurance layer for sovereign AI — the testing firms, the judge-model auditors, the certification regimes. The regions that built the models should not leave the trust layer to be built elsewhere. Whoever defines “trustworthy AI” for a region owns the most defensible position in its AI economy.

The closing argument

Sovereign AI without sovereign evaluation is branding, not sovereignty. A nation that imports AI evaluation imports AI judgment. And the next geopolitical layer of AI will not be model weights — it will be benchmark power.

The countries that win sovereign AI will not simply be the ones with the most compute. They will be the ones that can prove, in their own languages and under their own laws, that an AI system is safe enough, useful enough, and accountable enough to serve their people.

The models are getting built. The question that remains is older than AI and more important than any leaderboard: who judges the model?

I have designed a dashboard that walks through the steps to build a sovereign evaluation layer for the three geographies discussed in this post. Try it out here: Sovereign Eval Dashboard


Who Judges the Model? The Missing Layer in Sovereign AI Strategy
