The corpus
We ingest continuously from a few hundred sources. As of this moment, the corpus stands at:
Language coverage spans the usual English-tier plus Arabic, Russian, Mandarin, Spanish, French, Portuguese, German, Hindi, Urdu, Persian, Turkish, Indonesian, Swahili, Korean, Japanese, Vietnamese, Bengali, Tamil, and a long tail of regional dailies. The 47-language figure is the count of ISO-639 codes we have active ingestion for — not a vanity claim about reading every language on Earth (see §7).
Publication coverage runs across wire services, named national and regional outlets, think tanks, open-source intelligence collectives, and social-surfaced primary sources. The top 20 by ingestion volume are published on request; the full list is in our internal config.
Cadence: continuous for wire-grade sources, hourly for named secondaries, daily-batched for the long tail. Retention: articles indefinitely, structured events indefinitely, entity-extraction cache 90 days.
The pipeline
Article becomes event becomes profile becomes report. The stack:
SOURCE (RSS · API · scrape)
│
▼
INGEST (normalize · dedupe · language-detect · translate-hint)
│
▼
ENTITY EXTRACTION (people · places · orgs · events · relations)
│
▼
STRUCTURED EVENT (geocoded · timestamped · typed · credibility-weighted)
│
▼
PROFILE AGGREGATION (country · conflict · person · publication)
│
▼
REPORT GENERATION (Sonnet 4.6 · Haiku · Llama-planner)
│
▼
EDITORIAL (human-authored · cited against the corpus)

Ingest normalizes HTML to canonical text, dedupes by URL and content hash, runs fastText language detection, and flags translation needs for downstream.
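The URL-plus-content-hash dedupe can be sketched as follows. This is illustrative only: `content_key`, the normalization, and the in-memory `seen` set are assumptions, not the production store.

```python
import hashlib

# In production this would be a persistent index; a set is enough for a sketch.
seen: set = set()

def content_key(canonical_text: str) -> str:
    # Hash the whitespace-normalized, lowercased canonical text so trivially
    # reformatted copies of the same article collide on the same key.
    normalized = " ".join(canonical_text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def is_duplicate(url: str, canonical_text: str) -> bool:
    # A document is a duplicate if we have already seen either its URL
    # or its content hash; otherwise record both keys and let it through.
    keys = {url, content_key(canonical_text)}
    if keys & seen:
        return True
    seen.update(keys)
    return False
```

Hashing normalized text rather than raw HTML is what lets the same wire story, syndicated under different URLs, collapse to one document.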
Entity extraction uses NER with a custom gazetteer for geopolitical proper nouns that off-the-shelf models misclassify. Geocoding resolves place mentions to ISO-3166 codes where the context is unambiguous; ambiguous references are left as text.
Structured event is our internal unit — a timestamped, geocoded, typed record of a thing that happened, with links back to the originating articles. Event typing is a closed taxonomy (~52 categories in V3, down from 8,200+ LLM-invented types in V2).
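As a sketch, a structured event record might look like the dataclass below. The field names and the example category string are invented for illustration; the production schema is not published here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

@dataclass
class StructuredEvent:
    event_type: str                 # one of the ~52 closed-taxonomy categories
    occurred_at: datetime           # timestamped
    country_iso: Optional[str]      # geocoded ISO-3166 code; None if ambiguous
    credibility_weight: float       # derived from source publication scores
    source_article_ids: List[str] = field(default_factory=list)  # links back

event = StructuredEvent(
    event_type="sanctions.announcement",   # hypothetical category name
    occurred_at=datetime(2025, 3, 14, 9, 30, tzinfo=timezone.utc),
    country_iso="IR",
    credibility_weight=0.82,
    source_article_ids=["a-10293", "a-10311"],
)
```

Leaving `country_iso` as `None` for ambiguous place mentions mirrors the geocoding rule above: resolve only when context is unambiguous.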
Profile aggregation rolls events into country and conflict rolling windows, plus person and publication profiles that feed credibility scoring.
Report generation (see §4) is the model-authored step. Always cited, always reviewed by a human before publish.
Editorial is human-authored on top of the same corpus, with the same citation discipline. We don’t publish model-written editorial.
The scoring
Credibility, confidence, stability — all run through explicit formulas, not model judgment.
Publication credibility score
A composite in [0, 100]:
- `age_score` · weight 0.15 · log of years-since-first-observed-byline, clipped at 30
- `volume_score` · weight 0.20 · log of distinct author count, clipped at 500
- `cross_ref_score` · weight 0.30 · fraction of our ingested articles that cite this publication by name
- `whitelist_bonus` · weight 0.20 · flat bonus for known-reliable outlets (wire services, academic presses)
- `recency_score` · weight 0.15 · how recently the publication has published in our window
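The composite can be sketched as a weighted sum. The weights are the published ones; the per-component scaling functions (`age_score`, `volume_score`) are assumptions made to keep every component in [0, 100], not the production formulas.

```python
import math

# Weights as published above; they sum to 1.0.
WEIGHTS = {
    "age_score": 0.15,
    "volume_score": 0.20,
    "cross_ref_score": 0.30,
    "whitelist_bonus": 0.20,
    "recency_score": 0.15,
}

def age_score(years_since_first_byline: float) -> float:
    # Log of years since first observed byline, clipped at 30, scaled to [0, 100].
    years = min(max(years_since_first_byline, 0.0), 30.0)
    return 100.0 * math.log1p(years) / math.log1p(30.0)

def volume_score(distinct_authors: int) -> float:
    # Log of distinct author count, clipped at 500, scaled to [0, 100].
    authors = min(max(distinct_authors, 0), 500)
    return 100.0 * math.log1p(authors) / math.log1p(500)

def credibility(components: dict) -> float:
    """Weighted composite in [0, 100]; each component already in [0, 100]."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)

score = credibility({
    "age_score": age_score(12),
    "volume_score": volume_score(240),
    "cross_ref_score": 35.0,   # citation fraction, expressed on a 0-100 scale
    "whitelist_bonus": 100.0,  # flat bonus: outlet is whitelisted
    "recency_score": 80.0,
})
```

Because the weights sum to 1 and each component lives in [0, 100], the composite is guaranteed to stay in [0, 100] without a final clamp.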
Weights are reviewed quarterly and published in the “State of the dataset” post (§8). When we change a weight, we announce it, restate the formula, and regenerate scores. We do not retroactively alter published reports; the score at time-of-publish is what stands.
Seven-pillar stability score
A country-level composite, weighted across political, security, economic, regulatory, operational, institutional, and societal pillars. Each pillar pulls from structured corpus signals over a rolling 90-day window. Known weakness: over-rewards authoritarian stability (Gulf monarchies score higher than we believe they should). Rewrite on the Q3 roadmap.
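Structurally, the pillar composite is another weighted sum over the seven named pillars. The equal weights below are a placeholder assumption; the production weights are not stated on this page.

```python
PILLARS = ("political", "security", "economic", "regulatory",
           "operational", "institutional", "societal")

def stability_score(pillar_scores: dict, weights: dict = None) -> float:
    # Each pillar score is assumed to be in [0, 100] over the 90-day window.
    if weights is None:
        # Placeholder: equal weights, since the real ones are unpublished.
        weights = {p: 1.0 / len(PILLARS) for p in PILLARS}
    return sum(weights[p] * pillar_scores[p] for p in PILLARS)

# A country scoring 50 on every pillar composites to 50 under equal weights.
score = stability_score({p: 50.0 for p in PILLARS})
```

The over-rewarding of authoritarian stability noted above is a property of the pillar inputs, not the weighting: a regime that suppresses visible political and societal churn scores well on those pillars regardless of underlying fragility.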
Report confidence
Classified HIGH · MEDIUM · LOW: HIGH requires external-source count ≥ 10 and total-citation count ≥ 20; MEDIUM requires external ≥ 5 or total ≥ 12; otherwise LOW. Always visible on the report detail page.
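The thresholds translate directly into a classifier; the function name is ours, but the cutoffs are exactly the published ones.

```python
def report_confidence(external_sources: int, total_citations: int) -> str:
    # HIGH needs both conditions; MEDIUM needs either; everything else is LOW.
    if external_sources >= 10 and total_citations >= 20:
        return "HIGH"
    if external_sources >= 5 or total_citations >= 12:
        return "MEDIUM"
    return "LOW"
```

Note the asymmetry: a report with 12 external sources but only 18 total citations is still MEDIUM, because HIGH requires both thresholds.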
The models
| Model | What we use it for | What we don’t |
|---|---|---|
| Claude Sonnet 4.6 | Primary report authoring · editorial-room suggestions · composer side-car | Fact-assertion without dataset citation · off-corpus reasoning at load-bearing weight |
| Claude Haiku | Fallback author for cost-sensitive batches · tag suggestion · short summaries | Anything shown to a paying customer under “Sonnet output” branding |
| Llama (planner) | Decomposing a report prompt into structured corpus queries | Writing the final report copy |
We use LLMs to assemble citations faster. We do not use LLMs to decide what is true.
Model choices get updated when the frontier moves. When we change the primary, we say so in the next “State of the dataset” post along with the eval deltas that justified the swap.
Sourcing hierarchy
Every article citation and report citation carries one of these five labels inline. Labels are visible to readers, not internal-only.
- Primary — the subject’s own statement, the raw document, the court filing, the satellite image
- Wire — Reuters, AP, AFP, Bloomberg — high-recency, factual, low-analysis
- Secondary — named regional publications with track records in our
publication_scorestable - Opinion — columnists, think-tank papers, blog posts — identified as opinion inline
- Counter-narrative — explicitly sourced from the perspective we’re arguing against, required in Red Team pieces
When a piece argues against a consensus, counter-narrative sources are non-negotiable. A Red Team piece without a named counter-narrative source doesn’t ship.
Corrections policy
Three commitments:
- Itemized, not aggregated. When we’re wrong, we say what we got wrong, in a post of its own or as a dated update block on the original piece. No stealth edits.
- Dated-update blocks. A correction to a report increments the
versionon the row and renders a visible “Updated — see change log” strip at top of the report page. - The “we were wrong” editorial. When a call was substantially wrong, we publish a standalone piece on /editorial analyzing why. These are pinned for 30 days.
What we don't cover
Named gaps. Not a comprehensive list — just the ones we get asked about.
- Minor-language coverage — Pashto, Uzbek, Tigrinya, Quechua, Māori, and about a dozen more. Roadmap Q3.
- Closed-regime internal dynamics — North Korea, Turkmenistan, Eritrea. We read what leaks and cite it; we do not pretend to insider knowledge.
- Tactical military analysis — we cover strategic posture, not battle-damage-assessment-on-the-hour. Other people do it better.
- Financial market calls — we read market signals as inputs to geopolitical analysis; we do not publish trading views.
- US domestic politics — except where it materially affects foreign-policy execution.
State of the dataset
Quarterly, we publish a long-form methodology post on /editorial that opens up the dataset. Each issue itemizes:
- Corpus deltas — articles, events, reports, authors · absolute + QoQ
- Language coverage delta — what was added, what’s still missing
- Publication credibility distribution — histogram, top additions and top demotions
- Scoring formula changes — any weight change in
publication_scoresor the stability score, with before/after and rationale - Model changes — primary · fallback · planner
- Corrections ledger — every correction issued this quarter, itemized, with links
- Forecast scoring — last quarter’s calls, graded against outcomes
- Gaps fixed · gaps still open
The first “State of the dataset” ships within 30 days of this page going live. It is the first major editorial piece in the queue after launch.
We publish these four times a year — mid-January, mid-April, mid-July, mid-October — pinned to /editorial for 14 days before sliding into the archive.
