From Public Medical Knowledge to Sovereign Clinical Intelligence
A framework for doctor-only national LLMs
Abstract
Large language models are already demonstrating utility in medicine, but their clinical ceiling is becoming clearer. Models trained predominantly on public-domain information, including biomedical literature, guidelines, educational resources, and public-facing medical databases, can summarize, explain, and reason across formal knowledge, yet they do not fully capture how medicine is actually practiced.
This article argues that the next major advance in clinically useful medical AI will not come from scale alone, but from country-specific, doctor-only clinical language models that integrate public biomedical knowledge with tightly governed national non-public clinical data.
The objective is not autonomous diagnosis, but augmented clinical decision making under physician supervision.
This paper proposes a practical framework for such systems, centered on lawful secondary use of sensitive data, episode-level structuring of clinical information, rotating physician validation, staged benchmarking from technical performance to patient outcomes, and continuous improvement under controlled governance.
It further argues that while discussion and methodological exchange may be international, implementation must be national because privacy law, consent models, health system architecture, and data sovereignty differ by jurisdiction.
On current evidence, countries with strong state capacity, integrated healthcare infrastructure, and credible health data governance are best placed to build such systems first.[1][2]
Author's note
This piece is a perspective and implementation framework, not a completed clinical study. It is intended to stimulate informed debate, stress-test the model, and invite collaboration from doctors, clinical academics, informaticians, regulators, and health system leaders.
Introduction
Large language models have changed the practical conversation around artificial intelligence in medicine. They can already summarize evidence, generate explanations, assist with differential diagnosis, draft patient-facing materials, and support administrative work. But their limitations in serious clinical use are increasingly visible. The problem is not only hallucination. It is that these systems are largely shaped by the boundaries of public information. Public medical data are essential, but they are incomplete.
Medicine is not practiced only in journals, guidelines, or public databases. A large share of clinically valuable reasoning exists in non-public settings: in patient records, in multidisciplinary cancer meetings, in specialty registries, in referral trajectories, in follow-up outcomes, in case presentations, and in the discussions physicians have with one another when confronting uncertainty. WHO guidance on large multimodal models in health emphasizes that value depends not just on capability, but on governance, human oversight, accountability, and safe use in real clinical contexts.[1]
The next leap in medical AI may therefore depend less on building ever-larger general-purpose models and more on whether health systems can lawfully and ethically convert their own non-public clinical experience into doctor-facing intelligence that measurably improves patient outcomes. That is the focus of this article.
A recent Nature Medicine evaluation sharpens this problem. In a structured stress test of ChatGPT Health, investigators used 60 clinician-authored vignettes across 21 clinical domains and 16 factorial conditions, generating 960 responses. They reported an inverted U-shaped error pattern, with some of the most dangerous failures concentrated at the clinical extremes: nonurgent presentations and emergency cases.[11]
Most importantly, among gold-standard emergencies, the system under-triaged 52% of cases, sometimes directing patients with diabetic ketoacidosis or impending respiratory failure toward delayed evaluation rather than the emergency department. The same study also found that anchoring cues from family or friends could shift recommendations toward less urgent care in edge cases, and that crisis-intervention messages for suicidal ideation activated inconsistently.[11]
These findings do not prove that sovereign doctor-only clinical LLMs are the answer. They do, however, reinforce the central argument of this article: broad consumer-facing medical LLM deployment without rigorous prospective validation is unsafe, and the next serious phase of medical AI should be clinician-facing, tightly governed, and evaluated against real workflow performance and patient outcomes rather than surface fluency alone. The study’s own data were based on synthetic clinician-authored vignettes rather than real longitudinal patient records, which underscores the gap this article is trying to address: the missing bridge between public medical knowledge and governed, outcome-linked clinical reality.[11]
The ceiling of public-data medical AI
Public biomedical knowledge remains the necessary foundation for any serious clinical model. Literature, guidelines, trial registries, formularies, adverse-event bulletins, and structured public datasets provide a broad and legally accessible base. But public-domain corpora omit much of the tacit and operational reasoning that drives real-world care.
They do not fully encode what happens when a clinician revises a differential after subtle deterioration, when a tumor board rejects an otherwise guideline-concordant plan, when a rare disease is recognized because a pattern feels wrong, or when a discharge note reflects a care tradeoff never made explicit in formal literature. The result is that current models are often strongest where medicine is most explicit and weakest where medicine is contextual, longitudinal, and uncertain. That is not a trivial flaw. It is the boundary between performing well on medical questions and becoming genuinely useful in clinical decision support.
The missing layer: non-public clinical data
The missing layer is not one dataset. It is a class of information distributed across national health systems in fragmented form. It includes at least six high-value sources:
- patient records containing longitudinal histories, investigations, management changes, and eventual outcomes
- disease and specialty registries that can connect decisions to population-level outcomes
- multidisciplinary team records that preserve complex expert reasoning and disagreement
- morbidity and mortality reviews that convert hindsight into clinically relevant learning
- specialty case conferences and interest groups that preserve tacit reasoning rarely published
- referral and discharge pathways that reveal how decisions evolve across institutions and over time
None of this means these sources are automatically trustworthy. Real-world clinical data are often messy, biased, incomplete, and inconsistently recorded. Informal case discussion may be rich in expertise but poor in standardization. Notes may reflect defensive medicine, local culture, or low-quality documentation. OECD work on secondary use of health data makes clear that the issue is not simply whether valuable data exist, but whether they can be lawfully accessed, safely governed, linked in meaningful ways, and used in the public interest.[2]
That means the central challenge is not data access alone. It is clinical evidence conversion.
From raw clinical activity to usable supervisory signal
A country seeking to build a doctor-only clinical LLM should not think in terms of documents. It should think in terms of clinical episodes.
A useful training and evaluation unit is not a free-text note in isolation, but a structured record of:
- the patient state at a given decision point
- the key information available at that time
- the differential diagnoses considered
- the action taken
- the reasoning or uncertainty attached to that action
- the subsequent outcome over a defined follow-up period
That is the key conversion step. Raw records need to be transformed into outcome-linked episodes that can teach a model not just what clinicians said, but what happened next.
The national pipeline for this looks like: ingest, de-identify or pseudonymize, normalize terminology, link longitudinal records, extract episode structure, attach outcome labels, quality-score the record, and submit high-value samples to expert adjudication. The reason this matters is simple: a model cannot reliably improve clinical decision-making if it is trained mainly on language without validated links to consequence.
A framework for sovereign doctor-only national LLMs
The right implementation model is not global. It is national.
International discussion can and should occur. Methods can be shared. Benchmark design can be shared. Governance patterns can be compared. But the actual execution must be country-based because law, trust, privacy, health-system structure, professional regulation, and data sovereignty are jurisdiction-dependent.
A serious national framework has five layers.
1. Public medical foundation layer
The base model continues to learn from lawful public medical information: literature, guidelines, trial registries, labels, safety alerts, and public educational material.
2. National non-public clinical layer
A country-specific consortium - likely involving government, regulators, health systems, specialty bodies, and AI developers - establishes lawful access to non-public data sources including EMRs, registries, MDT records, and other specialist materials.
3. Clinical evidence-conversion layer
Raw records are converted into episode-level, outcome-linked case structures rather than dumped into a model as undifferentiated text.
4. Physician verification layer
A large rotating physician workforce validates cases, identifies model errors, adjudicates disagreements, and continually improves the quality of supervision. Rotation matters. The same experts should not control the validation layer indefinitely. Real medicine changes, and the human oversight layer must keep pace with living practice.
5. Controlled deployment layer
The final system is doctor-facing only, with credentialed access, audit logs, specialty-specific deployment rules, explicit uncertainty handling, and clear boundaries that preserve physician responsibility. WHO's guidance supports this kind of structured governance and human oversight model rather than open-ended autonomous deployment.[1]
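As an illustration of these deployment controls, access might be gated on current credentials and specialty scope, with every query attempt, including refusals, written to an audit log. The credential fields, registry shape, and log format below are purely illustrative assumptions; a real system would verify against the national medical register.

```python
import datetime

# Illustrative in-memory credential registry (assumption, not a real API).
CREDENTIALS = {"dr_chen": {"specialty": "oncology", "license_active": True}}
AUDIT_LOG: list[dict] = []

def query_model(user: str, specialty_scope: str, prompt: str) -> str:
    """Gate a doctor-facing query on credentials and log the attempt."""
    cred = CREDENTIALS.get(user)
    allowed = bool(cred and cred["license_active"]
                   and cred["specialty"] == specialty_scope)
    # Every attempt is recorded, including refusals, for later audit.
    AUDIT_LOG.append({
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "user": user, "scope": specialty_scope, "allowed": allowed,
    })
    if not allowed:
        return "ACCESS DENIED: credentials do not cover this scope"
    return f"[model response scoped to {specialty_scope}]"

print(query_model("dr_chen", "oncology", "staging question"))
print(query_model("dr_chen", "cardiology", "out-of-scope question"))
```

The design choice worth noting is that refusals are logged too: the audit trail must show what was attempted, not just what was answered.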
How to know whether it works
This is the decisive question.
The goal is not to make the model sound more medical. It is not to make answers longer. It is not to impress physicians with apparent sophistication. The goal is to determine whether the integration of non-public clinical data produces better patient-relevant outcomes.
That requires three levels of evaluation:
Technical uplift
Does the model improve on adjudicated benchmarks for diagnosis ranking, triage safety, test selection, omission of red flags, guideline-aware reasoning, and calibration?
Workflow uplift
In controlled clinical tasks, does it help physicians reach safer or faster decisions, reduce unnecessary investigations, improve recognition of overlooked possibilities, or support less experienced clinicians without degrading senior judgement?
Patient outcome uplift
In defined use cases, does model-assisted care improve what actually matters: time to correct diagnosis, avoidable deterioration, readmission, complication rates, length of stay, or other clinically meaningful outcomes?
This final layer is the real test. It is also the slowest and hardest. But without it, the system remains an intriguing tool rather than a justified clinical intervention.
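For the technical-uplift layer, even simple adjudicated metrics make the evaluation concrete. The sketch below computes an under-triage rate (the failure mode highlighted in the Nature Medicine stress test[11]) and a Brier score as a basic calibration measure. The cases, labels, and probabilities are invented for illustration only.

```python
# Hypothetical adjudicated triage benchmark: each case has a gold urgency
# (0 = routine, 1 = urgent, 2 = emergency), the model's predicted urgency,
# and the model's stated probability that the case is an emergency.
cases = [
    {"gold": 2, "pred": 2, "p_emergency": 0.90},
    {"gold": 2, "pred": 1, "p_emergency": 0.40},  # under-triaged emergency
    {"gold": 1, "pred": 1, "p_emergency": 0.20},
    {"gold": 0, "pred": 0, "p_emergency": 0.05},
    {"gold": 2, "pred": 0, "p_emergency": 0.10},  # dangerous under-triage
]

def under_triage_rate(cases: list[dict]) -> float:
    """Share of gold-standard emergencies routed below emergency care."""
    emergencies = [c for c in cases if c["gold"] == 2]
    missed = [c for c in emergencies if c["pred"] < 2]
    return len(missed) / len(emergencies)

def brier_score(cases: list[dict]) -> float:
    """Mean squared error of the stated emergency probability
    (lower is better); a minimal calibration measure."""
    return sum((c["p_emergency"] - (1.0 if c["gold"] == 2 else 0.0)) ** 2
               for c in cases) / len(cases)

print(f"under-triage rate: {under_triage_rate(cases):.2f}")
print(f"Brier score: {brier_score(cases):.3f}")
```

Metrics like these are only the first rung: they test the model against adjudicated labels, not against workflow behaviour or patient outcomes.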
Continuous learning without uncontrolled drift
No system built on this framework will be perfect at launch. It should not pretend to be. The point is not one training run, but a governed cycle of continuous improvement.
That cycle should be: capture, clean, link, label, benchmark, deploy narrowly, monitor, audit harms, retrain selectively, and re-test. Continuous learning in medicine must not mean uncontrolled self-reinforcement. It should mean continuous improvement under explicit oversight.[1]
Some elements should update rapidly: benchmark sets, error taxonomies, retrieval corpora, specialty guidance overlays, and data-quality weights. Other elements should update slowly and under strict controls: core model weights, specialty permissions, and safety thresholds. A clinical model should learn, but it should not drift casually.
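To make the fast/slow distinction operational, a governance layer might encode an update cadence and release gate per component. The component names, roles, and gating rule below are illustrative assumptions, not a proposed governance standard.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UpdatePolicy:
    component: str
    cadence: str             # "rapid" or "controlled"
    requires_reaudit: bool   # full benchmark re-run required before release
    sign_off: str            # who must approve the change (illustrative roles)

POLICIES = [
    UpdatePolicy("benchmark_sets",        "rapid",      False, "evaluation lead"),
    UpdatePolicy("retrieval_corpus",      "rapid",      False, "clinical librarian"),
    UpdatePolicy("error_taxonomy",        "rapid",      False, "safety team"),
    UpdatePolicy("core_model_weights",    "controlled", True,  "governance board"),
    UpdatePolicy("specialty_permissions", "controlled", True,  "governance board"),
    UpdatePolicy("safety_thresholds",     "controlled", True,  "governance board"),
]

def release_gate(component: str, reaudit_done: bool) -> bool:
    """A change ships only if its policy's re-audit requirement is met."""
    policy = next(p for p in POLICIES if p.component == component)
    return (not policy.requires_reaudit) or reaudit_done

print(release_gate("retrieval_corpus", reaudit_done=False))    # rapid path
print(release_gate("core_model_weights", reaudit_done=False))  # blocked
```

The design choice is asymmetry by construction: the rapid path can never touch components whose policy demands a fresh audit, so casual drift of core weights or safety thresholds is structurally blocked rather than merely discouraged.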
Why implementation is national
The conversation about this model can be international. The execution cannot.
A country that builds such a system would be making a sovereign decision about the secondary use of health data, physician access rules, ethical safeguards, and accountability structures. That is why the most realistic future is not one global medical LLM, but multiple national doctor-only systems built on a shared international conversation about methods.
Countries differ sharply in their ability to do this.
- China stands out for execution speed, central coordination, and active medical-AI deployment. Shanghai's 2025-2027 medical AI plan explicitly targets healthcare data infrastructure and medical AI development, while Shenzhen reported widespread deployment of AI medical products across healthcare institutions.[8][9] China's PIPL treats medical health data as sensitive personal information and requires specific purpose, necessity, and protective measures for processing, with separate consent generally required for sensitive personal information.[10]
- Singapore is unusually strong on governance clarity and health-data architecture. Its Health Information Bill and broader Health Information Act framework are designed to support coordinated care through broad contribution of key patient information to the National Electronic Health Record.[3][4]
- The United Kingdom benefits from the scale and integration of the NHS, and the Federated Data Platform is creating national infrastructure to connect health data under NHS control.[5][6]
- Australia has strong clinical capability but remains structurally more fragmented; the 2025 federal review concluded that the current regulatory environment for AI in health care is not fit for purpose in its present form.[7]
Taken together, these examples show why the question is less about who has the biggest model today and more about who can build the strongest national clinical evidence-conversion system around one.
Conclusion
The next major advance in medical AI may not come from larger general-purpose models alone. It may come from whether nations can transform non-public clinical experience into validated, continuously improving doctor-facing intelligence.
That transformation will not be achieved by simply feeding more secret data into a model. It will require lawful access, outcome linkage, structured episode construction, rotating physician adjudication, staged benchmarking, and ongoing patient-outcome measurement. The countries best placed to do this first will not necessarily be the ones with the best models today. They will be the ones able to build the best clinical evidence-conversion systems around those models.
This is not a claim that such systems already work. It is a claim that the path is visible, that the missing layer in current medical AI is increasingly obvious, and that the next serious conversation should be about how to build this responsibly at country level before someone builds it irresponsibly at scale.
Call for collaborators
I am seeking doctors, clinical researchers, medical informaticians, and health-policy experts interested in refining this framework into a journal-ready perspective or policy paper. If you work in clinical governance, registries, MDT workflows, EMR systems, medical ethics, or outcome evaluation, this article is meant to be challenged, improved, and made more rigorous.
References
1. World Health Organization. Ethics and governance of artificial intelligence for health: Guidance on large multimodal models. https://www.who.int/publications/i/item/9789240084759
2. OECD. Facilitating the secondary use of health data for public interest purposes across borders. https://www.oecd.org/en/publications/facilitating-the-secondary-use-of-health-data-for-public-interest-purposes-across-borders_d7b90d15-en.html
3. Singapore Ministry of Health. Health Information Bill to support coordinated care across Singapore's healthcare ecosystem. https://www.moh.gov.sg/newsroom/health-information-bill-to-support-coordinated-care-across-singapore-s-healthcare-ecosystem/
4. HealthInfo.gov.sg. Health Information Act overview. https://healthinfo.gov.sg/
5. NHS England. NHS Federated Data Platform FAQs. https://www.england.nhs.uk/digitaltechnology/nhs-federated-data-platform/fdp-faqs/
6. NHS England. Federated Data Platform. https://digital.nhs.uk/services/federated-data-platform
7. Australian Government Department of Health and Aged Care. Safe and Responsible Artificial Intelligence in Health Care: Legislation and Regulation Review final report. https://www.health.gov.au/sites/default/files/2025-07/safe-and-responsible-artificial-intelligence-in-health-care-legislation-and-regulation-review-final-report.pdf
8. Shanghai Municipal Bureau of Justice. Working Plan of Shanghai Municipality for the Development of Medical Artificial Intelligence (2025-2027). https://english.shanghai.gov.cn/en-Bulletin/20250307/79fb305d000e4ba7998857e6b331e7ee.html
9. Shenzhen Municipal Government. Shenzhen boosts healthcare with widespread AI adoption. https://www.sz.gov.cn/en_szgov/news/infocus/modern/news/content/post_12418898.html
10. DigiChina / Stanford. Translation: Personal Information Protection Law of the People's Republic of China. https://digichina.stanford.edu/work/translation-personal-information-protection-law-of-the-peoples-republic-of-china-effective-nov-1-2021/
11. Ramaswamy A, Tyagi A, Hugo H, et al. ChatGPT Health performance in a structured test of triage recommendations. Nature Medicine. Published February 23, 2026. https://www.nature.com/articles/s41591-026-04297-7