Tag: Singapore

  • Who Judges the Model? The Missing Layer in Sovereign AI Strategy

    By Amrita Sarkar · The Product Scientist

    Illustration depicting the concept of evaluation sovereignty in AI, featuring a bridge labeled 'The Sovereignty Gap' that contrasts imported benchmarks with sovereign evaluation infrastructure. The image includes elements such as the Golden Gate Bridge, diverse cultural landmarks, and key points about correct, safe, and relevant evaluation criteria.

    Over the past two years, governments across APAC and the GCC have announced sovereign AI strategies with genuine conviction: national language models, domestic data centers, local cloud regions, GPU clusters measured in the tens of thousands. The UAE’s Falcon models now lead the Open Arabic LLM Leaderboard. Saudi Arabia’s ALLaM anchors a national AI champion. India’s AI Mission has deployed tens of thousands of GPUs and seeded India-specific datasets. Singapore has built what may be the world’s most complete AI governance toolkit.

    And then, almost without exception, these sovereign systems are judged using foreign scorecards.

    We have localized the infrastructure, but not the judgment.

    That is the contradiction this essay is about. AI sovereignty is not only about where the model is hosted, where the GPUs sit, or whose cloud you use. It is also about who gets to decide whether the AI is good, safe, fair, useful, and legally acceptable. A country can build its own model, host it in its own data center, and regulate it under its own AI law — but if it still evaluates that model using benchmarks designed elsewhere, the final authority over “good AI” still sits elsewhere.

    No serious nation would say: “We will run our own economy, but another country can define inflation, risk, and acceptable volatility for us.” Monetary policy is sovereign because whoever sets the price of money shapes the economy. Benchmarks set the price of intelligence. Right now, most of the world imports that price.

    Benchmarks are not neutral

    Infographic comparing two evaluation paths: Imported Benchmark Path and Sovereign Evaluation Path, highlighting their key characteristics and final approval authorities.

    The hidden assumption in global AI is that a good benchmark is universal. It is not. No benchmark is neutral; every benchmark carries a worldview.

    Take MMLU, the most cited capability benchmark in the industry. It spans 57 tasks across the humanities, sciences, law, and history. That makes it useful — and also unmistakably a product of a particular academic and institutional context, much of it shaped by Western educational systems. HELM, Stanford’s holistic evaluation framework, is broader and more thoughtful, covering accuracy, calibration, robustness, fairness, and toxicity. Yet even HELM’s own authors acknowledge gaps: neglected dialects, underrepresented trustworthiness metrics.

    Let me be precise, because precision matters here. The problem is not that MMLU or HELM are bad. They have advanced the field enormously. The problem is that they were never designed as sovereign deployment instruments for APAC and GCC societies. They reflect the datasets, languages, institutions, harm taxonomies, and academic assumptions available to their creators. That makes them useful global baselines — and weak final arbiters of national AI readiness.

    A simple metaphor: a benchmark is a driving test. You would not certify a driver in Mumbai, Riyadh, or Singapore using only a California road test. The physics of driving is universal; the road conditions, signs, laws, and risks are local. AI is the same. General intelligence may be global. Deployment fitness is local.

    Illustration depicting a car labeled 'AI' undergoing a driving test benchmark in different locations: California (Pass), Mumbai, Riyadh, and Singapore (all marked as uncertain), with a note about general intelligence being global but deployment fitness being local.

    Why APAC and the GCC are different

    This is not an abstract concern, because the regions in question are not abstract markets.

    A benchmark assumption map displaying various evaluations such as local law, cultural context, and regulatory aspects for AI deployment with scores indicating levels of fit.

    Singapore is multilingual, trust-first, and financial-services-heavy. Its regulators have already conceded the core of this argument in writing: IMDA’s own testing guidance notes that bias and data-leakage evaluation are highly context-dependent and that generic public benchmarks will not cover them, pointing instead to Singapore-context datasets. The Monetary Authority of Singapore has proposed AI risk management guidelines for all financial institutions, covering AI inventories, lifecycle controls, and board accountability. When evaluation results need to function as supervisory evidence, an imported benchmark with no mapping to local regulatory expectations is not just insufficient — it is inadmissible.

    A data visualization panel displaying the jurisdiction readiness for Singapore, including local language requirements, data protection and AI governance, high-risk deployment sectors, cultural-context risk areas, required golden datasets, and judge model calibration. The panel indicates a sovereignty maturity score of 78 out of 100, categorized as mature.

    The UAE and Saudi Arabia have achieved something remarkable: genuine model sovereignty. Falcon-H1 Arabic tops the Open Arabic LLM Leaderboard. ALLaM was refined with input from more than six hundred domain experts and two hundred fifty evaluators, a national human-evaluation capability most countries cannot assemble. Notably, the Gulf has already proven the training-data version of this thesis: Falcon Arabic was deliberately built on native, non-translated Arabic spanning Modern Standard Arabic and regional dialects, because translated data misses the language. The same logic holds for testing it. A sovereign model validated mostly on evaluation methodologies designed elsewhere is a sovereign asset with a foreign auditor.

    A dashboard displaying the jurisdiction readiness panel for the UAE, highlighting local language requirements, data protection, high-risk deployment sectors, cultural-context risk areas, required datasets, and judge model calibration metrics.

    India is the hardest test of the thesis and therefore its most persuasive case. Twenty-two scheduled languages, pervasive code-switching, and AI being wired into welfare delivery, agriculture advisories, courts, and credit through population-scale digital public infrastructure. A model scoring 90% on MMLU tells you almost nothing about whether it can safely advise a farmer in Bhojpuri, parse a vernacular loan document, or avoid caste- and community-specific harms that appear in no Western taxonomy. India’s own institutions have effectively conceded the premise: AIKosh exists precisely because Western, English-dominated datasets underperform in India’s multilingual context, and the IndiaAI Safety Institute was chartered to build indigenous safety research relevant to the developing world. The country that built UPI because foreign payment rails did not fit India should recognize evaluation as the next set of rails.

    A data visualization panel displaying jurisdiction readiness information for India, including local language requirements, data protection governance, sovereign maturity score, and high-risk deployment sectors.

    The sovereign evaluation stack

    Philosophy without architecture is commentary. Here is the system.

    A dashboard displaying the Sovereign AI Evaluation Control Plane metrics, including a Sovereignty Score of 74/100, a high Benchmark Dependency Risk, a Local Language Coverage of 61%, a Regulator Alignment of 82%, and a Sector Deployment Readiness rated as Medium.

    Domestic golden datasets. Not generic internet data — curated, regulator-reviewed, domain-specific evaluation sets. For Singapore: MAS compliance queries, PDPA consent scenarios, Singlish and multilingual customer-service cases, cross-border payments. For the Gulf: Arabic dialect handling, Islamic finance rules, government-service workflows, Arabic-English code-switching, culturally calibrated refusal behavior. For India: multilingual public-service queries, UPI fraud patterns, caste, gender, and religion bias tests, low-literacy user journeys. And these must be governed like living assets: versioned, owned, with explicit staleness criteria. A benchmark is a documented hypothesis about quality. Like any hypothesis, it needs falsifiable predictions and kill signals.
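
    To make "versioned, owned, with explicit staleness criteria" concrete, here is a minimal sketch of a golden dataset record. Every field name, the dataset name, and the thresholds are hypothetical, and treating judge-human agreement as the kill signal is my assumption for illustration, not a prescribed rule.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class GoldenDataset:
    """A versioned, owned evaluation set with an explicit staleness rule."""
    name: str
    version: str
    owner: str                   # accountable team (hypothetical)
    last_reviewed: date
    max_age_days: int            # staleness criterion: maximum review interval
    min_judge_agreement: float   # assumed kill signal: retire below this value

    def is_stale(self, today: date) -> bool:
        # Stale when the last review is older than the allowed interval.
        return today - self.last_reviewed > timedelta(days=self.max_age_days)

    def should_retire(self, observed_agreement: float) -> bool:
        # Kill signal fires when observed agreement drops under the floor.
        return observed_agreement < self.min_judge_agreement


# Illustrative values only.
sg_pack = GoldenDataset(
    name="sg-pdpa-consent-scenarios",
    version="1.2.0",
    owner="model-risk-team",
    last_reviewed=date(2025, 1, 15),
    max_age_days=180,
    min_judge_agreement=0.7,
)
```

    The point of the record is that staleness and retirement become questions the pipeline can ask of the dataset, rather than judgments someone remembers to make.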

    Region-calibrated judge models. Modern evaluation increasingly runs on LLM-as-judge pipelines. But if the judge model is trained on external norms, the evaluation is still imported — you have localized the exam and imported the examiner. A sovereign judge model must understand local law, local language, local idiom, local risk thresholds, and local protected categories, and its agreement with human judgment must be audited against local annotators, not crowdworkers from another continent. The judge does not replace human review; it scales first-pass evaluation while staying aligned to the jurisdiction.
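
    One way to audit a judge model against local annotators is a chance-corrected agreement statistic. The sketch below uses Cohen's kappa, which is my choice for illustration rather than a method named in this essay; the labels are made up.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between a judge model and human annotators."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Expected agreement if both raters labelled independently at their own rates.
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    categories = judge_freq.keys() | human_freq.keys()
    expected = sum(judge_freq[c] * human_freq[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up first-pass verdicts on six local-language test cases.
judge = ["pass", "pass", "fail", "fail", "pass", "fail"]
local_annotators = ["pass", "fail", "fail", "fail", "pass", "pass"]
kappa = cohens_kappa(judge, local_annotators)
```

    In this toy case the raters agree on four of six items, yet kappa is only about 0.33 once chance agreement is removed, which is why raw agreement rates flatter a judge model.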

    Regulator-aligned rubrics. Every metric should trace to an actual supervisory expectation — MAS guidelines and PDPA in Singapore; DIFC, ADGM, and data protection regimes in the Gulf; RBI, sectoral regulators, and the DPDP Act in India. This is what converts evaluation output from a research artifact into supervisory evidence.
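
    A rubric's traceability can be checked mechanically. The sketch below assumes a simple mapping from metric name to the supervisory expectations it serves; the metric names are hypothetical and the references are descriptive labels, not clause citations.

```python
# Metric -> supervisory expectations it traces to (illustrative labels only).
RUBRIC = {
    "consent_scope_respected": ["PDPA: Consent Obligation"],
    "ai_inventory_recorded": ["MAS proposed AI risk management guidelines"],
    "fluency_score": [],  # no supervisory anchor recorded yet
}


def untraced_metrics(rubric):
    """Return metrics that do not trace to any supervisory expectation."""
    return sorted(name for name, refs in rubric.items() if not refs)


orphans = untraced_metrics(RUBRIC)
```

    Any metric this returns is still a research artifact by the standard above: it may be useful, but it cannot yet be presented as supervisory evidence.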

    Sector-specific failure libraries. A bad AI answer in a shopping app is not the same as a bad AI answer in a bank, a hospital, a court, or a benefits system. Sovereign evaluation should maintain catalogued failure modes by sector: fraud, AML, and suitability in banking; triage boundaries and clinical hallucination in healthcare; citation accuracy and jurisdictional mismatch in legal AI; eligibility and appeal pathways in government services.

    Infographic depicting the five-layer sovereign evaluation stack, including layers for Domestic Golden Datasets, Region-Calibrated Judge Models, Regulator-Aligned Rubrics, Sector Failure Library, and Deployment Readiness Gate, with respective key components listed under each layer.

    The business implication

    This is not only an agenda for governments. A global bank deploying generative AI in Singapore, Dubai, Riyadh, and Mumbai cannot rely on one universal safety score. It needs jurisdiction-specific evaluation packs — the same model, certified differently for each regulatory and linguistic environment it operates in. The institutions that build this capability early will clear regulatory review faster, deploy with more confidence, and carry measurably less model risk than competitors still waving a single global leaderboard score at four different supervisors.
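
    A jurisdiction-specific evaluation pack can be sketched as the same scores checked against different thresholds. The metric names, thresholds, and scores below are invented for illustration; a real pack would draw on the golden datasets and rubrics described earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalPack:
    jurisdiction: str
    thresholds: dict  # metric name -> minimum acceptable score (illustrative)


def certify(scores, pack):
    """Return (passed, failing_metrics) for one model against one pack."""
    # A metric the model was never scored on counts as a failure.
    failures = [m for m, floor in pack.thresholds.items()
                if scores.get(m, 0.0) < floor]
    return (not failures, sorted(failures))


PACKS = [
    EvalPack("Singapore", {"global_safety": 0.90, "singlish_support": 0.80}),
    EvalPack("UAE", {"global_safety": 0.90, "arabic_dialects": 0.85}),
]

# One model, one set of scores, two different verdicts.
scores = {"global_safety": 0.93, "singlish_support": 0.84, "arabic_dialects": 0.70}
results = {p.jurisdiction: certify(scores, p) for p in PACKS}
```

    Here the model clears the Singapore pack and fails the UAE pack on dialect handling, even though its single global safety score is identical in both.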

    Screenshot of a recommendation engine interface listing six directives for evaluating AI systems, including generating an evaluation sovereignty report.

    There is also a market being born here. Someone will build the assurance layer for sovereign AI — the testing firms, the judge-model auditors, the certification regimes. The regions that built the models should not leave the trust layer to be built elsewhere. Whoever defines “trustworthy AI” for a region owns the most defensible position in its AI economy.

    The closing argument

    Sovereign AI without sovereign evaluation is branding, not sovereignty. A nation that imports AI evaluation imports AI judgment. And the next geopolitical layer of AI will not be model weights — it will be benchmark power.

    The countries that win sovereign AI will not simply be the ones with the most compute. They will be the ones that can prove, in their own languages and under their own laws, that an AI system is safe enough, useful enough, and accountable enough to serve their people.

    The models are getting built. The question that remains is older than AI and more important than any leaderboard: who judges the model?

    I have designed a dashboard to walk through the steps to build a sovereign evaluation layer for three specific geographies mentioned in this post. Try out the dashboard here – Sovereign Eval Dashboard

  • The PDPA-Aware Data Product Canvas: Architecting for Consent-Led AI Ingestion

    By Amrita Sarkar  ·  The Product Scientist

    You cannot build a defensible AI strategy on a broken data foundation. As Singapore and the wider APAC region move from data-protection principles to active enforcement, the enterprises that win at AI will be the ones that treat consent, lineage, and purpose as engineered attributes of the data itself, not as paperwork filed somewhere upstream.

    A regional platform decides to launch an AI assistant across Singapore, India, Indonesia, and Malaysia. The model is good. The demo lands. Leadership is ready to ship.

    Then launch stalls — not in engineering, but in the review room. Nobody can answer the questions that suddenly matter most. Which customer data entered the vector store? Was any of it consented for AI retrieval, or only for the original transaction? Can it legally move across all four markets? What happens when a customer withdraws consent — does the model forget? And if a regulator asks which source informed a specific recommendation, can the company prove it?

    The model works. The data foundation cannot defend itself. This is the pattern I keep seeing in regulated APAC platforms, and it is why so many enterprise AI pilots die quietly between an impressive demo and a production launch that never clears review. Enterprise AI does not fail only because models hallucinate. It fails because the data foundation cannot prove what the model was allowed to know.

    The foundation was never built for this

    Most enterprises are rushing to build AI on top of the data they already have: the lake, the CRM, customer records, call transcripts, product events, document stores. The reflex is understandable. The data is there, the models are available, and the board wants a strategy. But the foundation underneath was designed for a different era and a different consumer. It was built for reporting, dashboards, analytics, compliance logs, and operational workflows — systems where a known human runs a known query for a known purpose.

    Generative AI changes the risk profile entirely, because the data is no longer merely queried. It is retrieved, embedded, summarised, reasoned over, recombined with other data, and turned into decisions and recommendations — often without a human in the loop and often in ways nobody anticipated at the point of collection. The same dataset that was safe in a quarterly report becomes hazardous the moment it is vectorised and made retrievable by an autonomous agent.

    Dashboard displaying metrics for consent-led AI ingestion, including consent coverage, data sources, retrieval rate, and unreviewed AI retrievals. Features sections for immature AI flow and PDPA-aware data ingestion processes.

    Before generative AI, weak data governance produced a bounded failure: a bad dashboard, a misleading metric, a report someone had to correct. The blast radius was small. Now the same weakness produces a different class of failure:

    • unauthorised personal data entering embeddings, where it is difficult to find and harder to remove;
    • customer information surfaced in the wrong context to the wrong user;
    • consent quietly violated, invisibly, at machine speed;
    • cross-border data movement with no enforceable control;
    • model outputs that no one can audit or explain after the fact;
    • and AI pilots that are technically excellent yet never survive compliance review.

    The conclusion follows directly. In regulated industries, the AI failure is frequently not a model problem at all. It is a data product problem wearing a model’s clothing.

    The model is only the visible layer. The real enterprise moat is the consent-aware data substrate underneath it.

    Dashboard displaying metrics for consent-led AI ingestion, highlighting consent coverage, approval rates, blocked risky retrievals, and audit readiness with an emphasis on governance and data quality.

    Governed data products, not data dumps

    The dominant enterprise AI narrative is “connect all our data to an LLM.” The more defensible narrative is the opposite: turn every high-value dataset into a governed data product before any AI system is allowed to consume it.

    A governed data product is a dataset that carries its own passport. It can answer, of itself, where it came from, who consented to its use, what that consent was for, whether it is accurate and fresh, whether it may cross a border, and which classes of AI use — retrieval, personalisation, training, decisioning — it is cleared for. The dataset stops being raw material that each team re-evaluates from scratch and becomes a reusable, trusted, policy-aware asset with a known risk profile.
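
    The "passport" can be pictured as a small typed record that travels with the dataset. This is a sketch under my own assumptions: the class, its fields, and the example values are hypothetical, and a production version would also carry lineage, freshness, and quality.

```python
from dataclasses import dataclass
from enum import Enum


class AIUse(Enum):
    RETRIEVAL = "retrieval"
    PERSONALISATION = "personalisation"
    TRAINING = "training"
    DECISIONING = "decisioning"


@dataclass(frozen=True)
class DataProductPassport:
    name: str
    owner: str
    source_system: str             # where the data came from
    consented_purposes: frozenset  # what the individuals agreed to
    cleared_uses: frozenset        # which classes of AI use are approved
    allowed_regions: frozenset     # where the data may be processed

    def is_cleared_for(self, use: AIUse, region: str) -> bool:
        # Both the class of AI use and the processing region must be approved.
        return use in self.cleared_uses and region in self.allowed_regions


txn_history = DataProductPassport(
    name="customer-transactions",
    owner="payments-data-team",
    source_system="core-banking",
    consented_purposes=frozenset({"fraud_monitoring"}),
    cleared_uses=frozenset({AIUse.RETRIEVAL}),
    allowed_regions=frozenset({"SG"}),
)
```

    A consuming team asks the passport, not the legal department, whether `txn_history.is_cleared_for(AIUse.TRAINING, "SG")` holds, and gets the same answer every time.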

    This is a product-management move more than a legal one. A product has an owner, a defined purpose, quality standards, a lifecycle, and consumers with expectations. The discipline that makes a good product — clarity about who it serves and what it is allowed to do — is exactly the discipline missing from the average data lake, which is optimised for accumulation rather than accountability.

    A flow map illustrating the architecture stages from raw data sources to a governed AI substrate, including stages like governed data products, consent and lineage checks, approved retrieval layer, AI system, and audit trail.

    Consent is a data attribute, not a checkbox

    This is the heart of the thesis and the place where most organisations are furthest behind. Today, consent is treated as a legal event: the user clicked accept, the form was submitted, the policy was acknowledged, and a row was written somewhere. The event passes, and the consent is presumed to persist as a static fact in a system the data pipeline never consults.

    In AI-native platforms, that model breaks. Consent has to become runtime metadata — a live attribute that travels with the data wherever it flows and that the pipeline can read and enforce at the moment of use. Consider a single, ordinary case.

    A customer consents to her transaction history being used for fraud monitoring. That does not mean the same data can be used to train a marketing personalisation model. It does not mean it can be sent to another country. It does not mean it can be embedded into a vector database and retrieved by a support agent answering an unrelated question. These are four different purposes with four different risk profiles, and one consent event cannot stand in for all of them.
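
    The fraud-monitoring case above can be expressed as a purpose check performed at the moment of use. The record shape and purpose names here are hypothetical; the sketch only shows consent being read by the pipeline rather than presumed.

```python
from dataclasses import dataclass, field


@dataclass
class ConsentRecord:
    subject_id: str
    purposes: set = field(default_factory=set)  # purposes the person agreed to
    withdrawn: bool = False


def may_use(consent, purpose):
    """Allow a use only if consent is live and covers this exact purpose."""
    return not consent.withdrawn and purpose in consent.purposes


# The customer consented to fraud monitoring and nothing else.
consent = ConsentRecord("cust-001", {"fraud_monitoring"})

decisions = {
    purpose: may_use(consent, purpose)
    for purpose in ("fraud_monitoring", "marketing_training",
                    "cross_border_transfer", "support_retrieval")
}
```

    Only `fraud_monitoring` comes back allowed. The other three purposes each need their own consent, which is exactly the distinction a single stored "accepted" flag cannot express.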

    In AI-native platforms, consent cannot live in a PDF, a policy page, or a CRM note. It has to become a queryable, enforceable, machine-readable attribute attached to every data product.

    Dashboard for Customer Transaction Data Product displaying details such as data owner, AI usage options, quality score, revocation handling, purpose, residency rule, lineage, and consent scope.

    The hardest edge of this is revocation. When a customer withdraws consent, a compliant system must do more than stop new processing; it must propagate that withdrawal into places the original architects never planned for, most painfully into embeddings already written to a vector store. “Forgetting” data that has been encoded into a model’s retrieval layer is a genuine engineering problem, not a policy footnote, and a platform that cannot do it has not actually honoured the withdrawal.

    There is a regulatory signal worth reading carefully here. When Singapore’s Personal Data Protection Commission published its Advisory Guidelines on the Use of Personal Data in AI Recommendation and Decision Systems in March 2024, it confirmed that consent and notification obligations apply to AI systems unless a specific exception is available. But it deliberately limited its scope to recommendation and decision systems and did not address the training and deployment of generative AI. The guidance is not legally binding, yet the Commission has signalled it will enforce in a manner consistent with it. The practical meaning for product leaders is twofold: the direction of travel on consent is unambiguous, and the embedding-and-retrieval layer at the centre of generative AI sits on a frontier that formal guidance has not yet fully mapped. That is precisely the territory a consent-led data architecture is built to govern: ahead of the rules, rather than scrambling behind them.
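
    For the retrieval layer, propagating a withdrawal is tractable if every embedded chunk is tagged with the data subject it came from. The toy in-memory store below is a sketch of that idea only; it is not any real vector database's API, and it does not address data already absorbed into model weights through training.

```python
class ToyVectorStore:
    """In-memory stand-in for a vector index, tagged by data subject."""

    def __init__(self):
        self._chunks = {}  # chunk_id -> (subject_id, embedding)

    def upsert(self, chunk_id, subject_id, embedding):
        self._chunks[chunk_id] = (subject_id, list(embedding))

    def purge_subject(self, subject_id):
        """Delete every chunk derived from one person; return the removed ids."""
        doomed = [cid for cid, (sid, _) in self._chunks.items()
                  if sid == subject_id]
        for cid in doomed:
            del self._chunks[cid]
        return doomed  # keep this list as evidence the withdrawal was honoured

    def subjects(self):
        return {sid for sid, _ in self._chunks.values()}


store = ToyVectorStore()
store.upsert("c1", "cust-001", [0.1, 0.2])
store.upsert("c2", "cust-001", [0.3, 0.4])
store.upsert("c3", "cust-002", [0.5, 0.6])

removed = store.purge_subject("cust-001")  # consent withdrawn
```

    The design choice that matters is made at write time: without the subject tag on each chunk, there is nothing to delete by, and revocation becomes a search problem instead of a lookup.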

    A screenshot of a policy engine dashboard displaying the status of various compliance checks, including consent status, purpose match, data freshness, cross-border transfer review, sensitive attribute detection, retrieval permission, and training permission.

    The PDPA-Aware Data Product Canvas

    If a governed data product needs a passport, the canvas is the form that issues it. It is a product canvas, but for datasets. A conventional product canvas asks who the user is, what problem we are solving, what the value proposition is, and how we will measure success. A data product canvas asks a parallel set of questions about permission and provenance: what is this dataset allowed to do, who granted that permission, what are the limits, which AI systems may consume it, what happens when consent changes, how do we prove lineage, and how do we prevent misuse.

    | Canvas Block | The question it forces | Why it matters |
    | --- | --- | --- |
    | Dataset Purpose | What business or AI use case is this data product actually for? | Kills the “collect everything, decide later” reflex that makes governance impossible downstream. |
    | Source Lineage | Where did this data originate, and through what path did it arrive? | Lets you prove provenance and trace any AI output back to the inputs that shaped it. |
    | Consent Metadata | What did the individual actually agree to? | Turns consent from buried legal text into a machine-readable signal the pipeline can enforce. |
    | Purpose Boundary | Is this cleared for analytics, retrieval, personalisation, training, or decisioning? | Different AI uses carry different risk. One consent does not unlock all of them. |
    | Data Quality Score | Is it complete, accurate, fresh, and fit for use? | Poor inputs produce confidently wrong AI behaviour that is expensive to detect. |
    | Retrieval Fidelity | When AI retrieves this, does it get the right chunk, version, and context? | Prevents hallucination born of stale, partial, or mismatched data. |
    | Residency & Transfer | Can this data cross borders or tenants? | Decisive for any ASEAN-scale platform operating under divergent transfer regimes. |
    | Expiry & Revocation | What happens when consent is withdrawn or the data expires? | AI systems need real deletion and suppression workflows, not a flag in a CRM. |
    | Human Escalation | When should this data not be used automatically? | Keeps sensitive or ambiguous cases out of blind automation. |
    | Audit Evidence | Can we prove what data was used, when, and why? | Converts compliance from a cost centre into trust infrastructure. |

    A digital interface displaying an AI model's data layer governance, emphasizing the importance of consent and metadata in enterprise AI applications.

    Read the canvas a second time and a pattern surfaces. It is not a compliance checklist bolted onto a data platform; it is the substance of Singapore’s PDPA re-expressed as product specifications. Purpose Boundary operationalises the Purpose Limitation Obligation. Consent Metadata makes the Consent and Notification Obligations enforceable at runtime rather than merely asserted in a policy document. Residency and Transfer Rules encode the Transfer Limitation Obligation. Expiry and Revocation Logic gives practical effect to consent withdrawal and the Retention Limitation Obligation. Audit Evidence is the Accountability Obligation made queryable. The canvas does not ask a product leader to memorise the statute. It asks a more useful question of every dataset: what does the law already require of this data, and how do we build that requirement into the asset itself?
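
    The pairing of canvas blocks with PDPA obligations can be turned into a completeness check on a draft canvas. The block names come from the canvas and the pairings from the paragraph above; the function names and the draft answers are hypothetical.

```python
CANVAS_BLOCKS = [
    "Dataset Purpose", "Source Lineage", "Consent Metadata", "Purpose Boundary",
    "Data Quality Score", "Retrieval Fidelity", "Residency & Transfer",
    "Expiry & Revocation", "Human Escalation", "Audit Evidence",
]

# Canvas block -> PDPA obligations it operationalises, as paired in the text.
PDPA_OBLIGATIONS = {
    "Purpose Boundary": ["Purpose Limitation"],
    "Consent Metadata": ["Consent", "Notification"],
    "Residency & Transfer": ["Transfer Limitation"],
    "Expiry & Revocation": ["Retention Limitation"],  # also consent withdrawal
    "Audit Evidence": ["Accountability"],
}


def unanswered_blocks(canvas):
    """List canvas blocks that have no answer yet, in canvas order."""
    return [b for b in CANVAS_BLOCKS if not str(canvas.get(b, "")).strip()]


def open_obligations(canvas):
    """List the PDPA obligations left unaddressed by the unanswered blocks."""
    missing = unanswered_blocks(canvas)
    return sorted({o for b in missing for o in PDPA_OBLIGATIONS.get(b, [])})


# A draft canvas with only three blocks filled in (illustrative answers).
draft = {
    "Dataset Purpose": "Fraud monitoring on card transactions",
    "Consent Metadata": "Consent captured at onboarding, per purpose",
    "Purpose Boundary": "Retrieval only",
}
gaps = open_obligations(draft)
```

    For this draft, the open obligations are Accountability, Retention Limitation, and Transfer Limitation, which tells the product owner which statutory questions the dataset cannot yet answer.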

    The architecture, in plain terms

    Picture the AI platform of a bank, a super-app, an insurer, or a healthcare provider. The immature architecture, which is also the common one, looks like this:

    Immature: Data lake → embeddings → LLM → user answer

    Every governance question — consent, purpose, residency, lineage, revocation — is either asked too late or not asked at all, because there is no point in the flow designed to ask it. The data moves from storage to model with nothing in between to check what it is allowed to do.

    A defensible architecture inserts the missing layer:

    Governed: Raw data → governed data product → consent + lineage + quality gate → approved retrieval → AI system → audit trail

    The governed data product is where the canvas is applied. The consent, lineage, and quality gate is where a dataset’s passport is checked before it is allowed to travel. The approved retrieval layer ensures the model can only reach data cleared for the use at hand. The audit trail closes the loop, so that after any output the platform can reconstruct what data was used, under what permission, and why. Nothing in this flow slows the model. It governs what reaches the model — which is a different and far more tractable thing to control.
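
    The governed flow can be sketched end to end in a few functions: a gate that checks the passport, a retrieval step that only runs if the gate passes, and an audit entry written either way. The dictionary layout, field names, and substring "retrieval" are simplifying assumptions, not a reference design.

```python
from datetime import datetime, timezone

AUDIT_LOG = []  # in production this would be an append-only store


def gate(product, purpose, region):
    """Check consent, residency, and quality before data may travel."""
    checks = {
        "consent": purpose in product["consented_purposes"],
        "residency": region in product["allowed_regions"],
        "quality": product["quality_score"] >= product["min_quality"],
    }
    return all(checks.values()), checks


def governed_retrieve(product, purpose, region, query):
    """Retrieve only through the gate, and record every attempt."""
    allowed, checks = gate(product, purpose, region)
    AUDIT_LOG.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "product": product["name"],
        "purpose": purpose,
        "region": region,
        "query": query,
        "allowed": allowed,
        "checks": checks,
    })
    if not allowed:
        return []  # blocked requests reach no data at all
    # Toy retrieval: substring match stands in for vector search.
    return [row for row in product["rows"] if query.lower() in row.lower()]


product = {
    "name": "customer-transactions",
    "consented_purposes": {"fraud_monitoring"},
    "allowed_regions": {"SG"},
    "quality_score": 0.92,
    "min_quality": 0.80,
    "rows": ["Card payment flagged in SG", "Refund issued"],
}

hits = governed_retrieve(product, "fraud_monitoring", "SG", "flagged")
blocked = governed_retrieve(product, "marketing", "SG", "flagged")
```

    Note that the blocked request is logged too. The audit trail is most valuable for the retrievals that did not happen, because those are the ones a regulator will ask about.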

    Why this is an ASEAN advantage, not an ASEAN tax

    The obvious version of this argument is “you need data governance before AI.” True, and unremarkable. The sharper version is that, in this region and at this moment, consent-led data liquidity is becoming a competitive advantage in its own right.

    The regulatory clock is no longer abstract. Singapore’s PDPA already imposes accountability, notification, consent, purpose limitation, and transfer obligations on how organisations handle personal data, and the PDPC has extended that thinking explicitly into AI systems. India’s Digital Personal Data Protection regime moved from statute to operational reality when the DPDP Rules were notified in November 2025, with core obligations on consent and data-fiduciary duties enforceable from May 2027 and penalties of up to ₹250 crore (2.5 billion rupees) per breach. A regional platform serving Singapore, India, Indonesia, and Malaysia is now operating across regimes that are converging on the same demands (provable consent, purpose limitation, controlled transfer, and the right to withdraw) on a timeline measured in months, not years.

    Against that backdrop, the instinct to treat governance as a brake is exactly backwards. The team that has to renegotiate risk from scratch for every new AI use case is the slow team. The team with a library of pre-approved, consent-aware data products can move a dataset into a new AI use case in days, because the hard questions were answered once, at the source, and travel with the data.

    Governance is not the brake. Bad governance is the brake. Good governance creates reusable, pre-approved data products that accelerate AI delivery rather than stall it.

    This is the inversion worth internalising. In regulated APAC markets, compliance stops being a legal afterthought and becomes a product capability — and the company that can prove safe AI ingestion will win enterprise adoption faster than the company that merely demonstrates a clever model. The moat is not the model. It is the substrate that lets the model be trusted.

    The traveller at the border

    Infographic outlining the PDPA-aware data product canvas for architecting consent-led AI ingestion, highlighting the risks of traditional data ingestion methods and presenting a governed, traceable, and scalable solution.

    There is a useful way to picture what a governed data product really is. A dataset entering an AI system should behave like a traveller entering Changi: it needs an identity, a verifiable origin, a declared destination, a permitted purpose, security clearance, an expiry, and a trail that can be followed. Strip those away and the AI system becomes a borderless zone where sensitive data moves faster than governance can follow — which is exactly the condition that turns an impressive pilot into a stalled launch.

    The first generation of enterprise AI leaders asked a single question of their systems: can the model answer this? The next generation will ask three harder ones.

    Was the model allowed to know this? Can we prove why it knew it? And can we revoke that knowledge when the person changes their mind?

    In regulated APAC markets, those questions are not compliance overhead. They are the foundation of defensible AI — and the work of building that foundation belongs not to the legal team after launch, but to the product leader before the first dataset is ever embedded.

    I have built a dashboard to visualize this thesis. You can try it here: https://consent-aware-ai-canvas.lovable.app