Frontier Intelligence for AI Readiness

Before a compliance AI goes live, EKIP tells you which cells it's safe to trust.

A call-center pipeline that detects complaints, clusters themes, flags regulatory violations, and scores severity is a rare-event, four-stage cascade. The same EKIP layers decide what evaluation, RAG, and fine-tuning data to collect and turn "is the AI good enough?" into a per-cell answer.


The audit cascade · EN majority, ES minority

Detect complaint: complaint vs venting vs routine

Cluster themes: billing · service · mis-selling · conduct…

Flag potential violation: map to regulatory taxonomy

Cluster & score severity: minor → critical

Why this scenario is its own shape

Three properties drive every collection decision here.

The state space is the decision space the pipeline operates over: a cell is language × stage × theme × violation-type × severity × difficulty. But three properties make it behave unlike a demographic-coverage problem.

Property 1

Rare-event, asymmetric cost

Most calls aren't complaints; most complaints aren't violations; severe violations are very rare. The expensive error is missing a real violation, not over-flagging, so frequency-matched sampling under-collects exactly the cells that matter.

Property 2

Four-stage cascade

Errors compound. A complaint the stage-1 detector silently drops never reaches violation detection. Coverage and evaluation must be measured per stage and end-to-end, or cascade failures stay invisible.

Property 3

Three data products

"Data" isn't one thing. Evaluation, RAG, and fine-tuning have three different coverage frames, three failure modes, and three sequencing priorities, which is the heart of readiness.

A miss at any stage removes the call from every stage after it. This is why a stage-isolated test set looks healthy while end-to-end recall on violations quietly leaks, and why the coverage frame must be measured at each handoff.
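A minimal sketch of how per-stage misses compound. The stage names and recall values are hypothetical, and treating misses as independent across stages is a simplifying assumption; a real audit would measure end-to-end recall directly on labeled calls.

```python
from math import prod

def end_to_end_recall(stage_recalls):
    """Fraction of true violations that survive every stage.

    Assumes (for illustration only) that misses at each stage are
    independent, so the survival rates multiply.
    """
    return prod(stage_recalls)

# Hypothetical per-stage recalls on calls that contain a real violation.
stage_recalls = {
    "detect_complaint": 0.95,
    "cluster_theme": 0.97,
    "flag_violation": 0.92,
    "score_severity": 0.96,
}

e2e = end_to_end_recall(stage_recalls.values())
weakest = min(stage_recalls.values())

print(f"weakest single stage: {weakest:.2f}")
print(f"end-to-end recall:    {e2e:.3f}")
```

Every stage in this example clears 0.92 on its own, yet the cascade as a whole recovers noticeably fewer violations than the weakest stage suggests.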

The collection target splits in three

Evaluation, RAG, and fine-tuning need three different coverage frames.

Each answers a different question, fails in a different way, and is funded in a different order. Conflating them is the most common reason a compliance AI ships unsafe.

Evaluation

Gates go / no-go

The gold-labeled set that measures failure per cell: above all, recall on violations and severity calibration.

Coverage frame

The full decision space with the rare risk tail deliberately boosted, plus hard negatives (venting that isn't a complaint; complaints that aren't violations) to bound false-positive rate.

Fails when

Too few positives per cell to estimate error: you "have" the cell but can't claim a number for it.

RAG corpus

Supplies the rules

The retrievable knowledge the model reasons over: regulations, regulator guidance, internal policy, violation definitions, severity rubrics, adjudicated precedent.

Coverage frame

The knowledge surface, not the call population. Every in-scope violation type and severity rule needs an authoritative, current, retrievable source.

Fails when

A rule exists but retrieval doesn't surface it, or the corpus is stale after a regulation changed; both produce confidently wrong adjudications.

Fine-tuning

Teaches the boundary

Labeled examples that specialize each stage model and seed the theme / severity clusters.

Coverage frame

Enough labels per cell with the tail oversampled, weighted toward high-information boundary and disagreement-prone cases, not more easy-majority calls.

Fails when

The labeling budget is spent before evaluation proves where the model is actually weak, or sampling at the natural distribution starves the rare cells that carry the risk.

The four-question method

The same four moves (frame, find, measure, prioritize) applied across all three products.

Violation-type maps to your actual regulatory taxonomy (UDAAP / FDCPA / TCPA / Reg E / Reg Z for financial, HIPAA for healthcare, and so on); treat those as placeholders for your in-scope list.

What data to collect

Frame · EKIP Layer 1

Build the cell frame, then set a target per product:

Eval: full decision space, risk tail boosted, sized for power on per-cell recall.
RAG: one authoritative source per violation type, severity rule, and precedent.
Fine-tune: enough high-information labels per cell, tail oversampled.

Mine historical transcripts, stratify by predicted cell, and route the uncertain or high-stakes ones to expert labeling. One adjudicated case can serve all three products.
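The mining step can be sketched roughly as follows. The record fields (`predicted_cell`, `predicted_severity`, `confidence`), the thresholds, and the function name are all hypothetical; they stand in for whatever a first-pass model emits on historical transcripts.

```python
import random
from collections import defaultdict

def stratify_and_route(transcripts, per_cell=2, conf_threshold=0.8,
                       high_stakes=("severe", "critical"), seed=0):
    """Sample up to `per_cell` transcripts from each predicted cell, then
    split the sample into an expert queue and a model-assisted queue."""
    rng = random.Random(seed)

    # Group by the cell a first-pass model predicted (hypothetical field).
    by_cell = defaultdict(list)
    for t in transcripts:
        by_cell[t["predicted_cell"]].append(t)

    expert, model_assisted = [], []
    for items in by_cell.values():
        # Equal draw per cell, so rare cells are not swamped by common ones.
        for t in rng.sample(items, min(per_cell, len(items))):
            uncertain = t["confidence"] < conf_threshold
            risky = t["predicted_severity"] in high_stakes
            (expert if uncertain or risky else model_assisted).append(t)
    return expert, model_assisted

transcripts = [
    {"id": 1, "predicted_cell": "EN/billing", "predicted_severity": "minor", "confidence": 0.95},
    {"id": 2, "predicted_cell": "EN/billing", "predicted_severity": "minor", "confidence": 0.55},
    {"id": 3, "predicted_cell": "ES/mis-selling", "predicted_severity": "severe", "confidence": 0.90},
    {"id": 4, "predicted_cell": "ES/conduct", "predicted_severity": "minor", "confidence": 0.85},
]

expert, model_assisted = stratify_and_route(transcripts)
print("expert queue:", sorted(t["id"] for t in expert))
print("model-assisted queue:", sorted(t["id"] for t in model_assisted))
```

Sampling a fixed number per cell rather than proportionally is the design choice that keeps the rare cells in the queue.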

What is missing

Frontier Intelligence

Run gap detection per stage and per product:

Coverage gaps: e.g. severe-violation Spanish calls; a violation type never seen.
Sufficiency gaps: present but too few to bound error (the quiet killer here).
RAG gaps: retrievable-but-not-retrieved, and stale-after-rule-change.
Cascade gaps: only visible end-to-end, not in stage-isolated tests.

How to measure representation

Information Geometry

Profile production reality, then measure two things in parallel:

Risk-weighted coverage: every cell weighted by regulatory exposure × frequency, never frequency alone.
Distance: JS divergence and blind-spot mass for the audit.

RAG: recall@k per violation type · Eval: CI width on per-cell recall · severity calibration · explicit EN vs ES parity.
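A rough sketch of the two representation measures, using only the standard library. The cell names, probabilities, and exposure weights are made up for illustration. "Blind-spot mass" is taken here to mean the production probability sitting on cells the dataset never covers, and "covered" means "at least one example"; both are assumptions, since the definitions in use may differ.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2, so the result lies in [0, 1])
    between two distributions given as {cell: probability} dicts."""
    cells = set(p) | set(q)
    m = {c: 0.5 * (p.get(c, 0.0) + q.get(c, 0.0)) for c in cells}

    def kl(a):
        # KL(a || m); cells with zero mass in `a` contribute nothing.
        return sum(a[c] * math.log2(a[c] / m[c]) for c in a if a[c] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

def blind_spot_mass(production, dataset):
    """Production mass on cells with no dataset examples (assumed definition)."""
    return sum(pr for c, pr in production.items() if dataset.get(c, 0.0) == 0.0)

def risk_weighted_coverage(production, exposure, dataset):
    """Share of (exposure x frequency) weight that falls on covered cells."""
    weights = {c: production[c] * exposure[c] for c in production}
    covered = sum(w for c, w in weights.items() if dataset.get(c, 0.0) > 0)
    return covered / sum(weights.values())

# Hypothetical cell distributions: production traffic vs the eval set.
production = {"EN/minor": 0.70, "ES/minor": 0.25, "EN/severe": 0.04, "ES/severe": 0.01}
dataset = {"EN/minor": 0.80, "ES/minor": 0.15, "EN/severe": 0.05}

# Hypothetical regulatory exposure per cell (severe cells weighted 50x).
exposure = {"EN/minor": 1.0, "ES/minor": 1.0, "EN/severe": 50.0, "ES/severe": 50.0}
uniform = {c: 1.0 for c in production}

js = js_divergence(production, dataset)
blind = blind_spot_mass(production, dataset)
freq_cov = risk_weighted_coverage(production, uniform, dataset)
risk_cov = risk_weighted_coverage(production, exposure, dataset)

print(f"JS divergence:          {js:.4f}")
print(f"blind-spot mass:        {blind:.2%}")
print(f"frequency-only coverage: {freq_cov:.1%}")
print(f"risk-weighted coverage:  {risk_cov:.1%}")
```

In this toy example the missing ES/severe cell is only 1% of traffic, so frequency-only coverage looks nearly complete, while the risk-weighted figure drops because that cell carries a large share of the exposure.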

How to prioritize budget

Data Knob Intelligence

The scarce resource is expert (legal/compliance) reviewer time. Sequence by dependency:

1. RAG first: can't adjudicate without the rule; bounded, high-leverage.
2. Eval next: can't claim readiness without measuring; gates launch.
3. Fine-tune last: only where eval shows weakness and the cell is high-risk.

Within each: rank by (risk × lift) ÷ marginal cost, use model-assisted pre-labeling to concentrate expert time, then re-measure and re-prioritize each round.
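The ranking rule above, as a minimal sketch. The `Candidate` class, the cell names, and the risk, lift, and cost values are hypothetical; in practice risk would come from the regulatory exposure weighting, lift from the expected metric gain, and cost from expert-review hours.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    cell: str
    risk: float   # regulatory exposure weight for the cell (hypothetical scale)
    lift: float   # expected metric gain from the next batch of labels
    cost: float   # marginal expert-review cost for that batch, in hours

    @property
    def priority(self) -> float:
        # (risk x lift) / marginal cost, as in the ranking rule above.
        return self.risk * self.lift / self.cost

backlog = [
    Candidate("EN/minor/billing", risk=1.0, lift=0.02, cost=1.0),
    Candidate("ES/severe/UDAAP", risk=10.0, lift=0.08, cost=4.0),
    Candidate("EN/severe/TCPA", risk=8.0, lift=0.03, cost=2.0),
]

ranked = sorted(backlog, key=lambda c: c.priority, reverse=True)
for c in ranked:
    print(f"{c.cell:<20} priority={c.priority:.3f}")
```

Because lift and cost both shift as labels arrive, the scores would be recomputed each round rather than fixed once.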

Why "audit a random sample" doesn't work

The rare-event sufficiency math is the whole argument.

To claim a recall number on violations, you need enough positives in the eval set to estimate that proportion with a tight confidence interval. Rarity makes random sampling collapse.

Positives needed to bound recall (95% CI)

Target CI half-width	Violation positives needed
± 10%	~35
± 5%	~139
± 3%	~385

These are positives in the test set, not total calls. The question is how many calls you must review to find them.

At 0.5% severe-violation prevalence, for ±5%

27,800 calls reviewed if you sample randomly to collect ~139 positives

~556 candidates reviewed with a stratified pre-filter at ~25% yield

≈ 50× less expert-review effort for the same statistical confidence, and at 50% pre-filter yield it's ~100×. That leverage is Frontier Intelligence: target the sparse, high-impact cells instead of waiting for them to appear in a random draw.
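The arithmetic above can be reproduced with a short sketch. It uses the normal-approximation interval for a proportion, n = z² · p(1 − p) / w², with z = 1.96 and an assumed expected recall of p = 0.90. That value of p is an assumption, chosen because it reproduces the table's figures; a different expected recall changes the counts.

```python
import math

Z_95 = 1.96  # two-sided 95% normal quantile

def positives_needed(half_width, expected_recall=0.90, z=Z_95):
    """Positives required so the normal-approximation CI on recall has the
    given half-width. expected_recall=0.90 is an assumed planning value."""
    variance = expected_recall * (1.0 - expected_recall)
    return math.ceil(z ** 2 * variance / half_width ** 2)

def reviews_needed(positives, yield_rate):
    """Items an expert must review to find `positives` hits when a fraction
    `yield_rate` of reviewed items are true positives."""
    # The tiny epsilon guards against float noise pushing ceil() up by one.
    return math.ceil(positives / yield_rate - 1e-9)

for w in (0.10, 0.05, 0.03):
    print(f"half-width ±{w:.0%}: {positives_needed(w)} positives")

target = positives_needed(0.05)
random_reviews = reviews_needed(target, 0.005)    # 0.5% prevalence, random draw
filtered_reviews = reviews_needed(target, 0.25)   # pre-filter at 25% yield

print(f"random sampling: {random_reviews:,} calls")
print(f"pre-filtered:    {filtered_reviews:,} candidates")
print(f"effort ratio:    {random_reviews / filtered_reviews:.0f}x")
```

At small positive counts or recall near 1.0 the normal approximation becomes loose; an exact or Wilson interval would be the more careful choice for a final readiness claim.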

The corollary: the same logic governs fine-tuning. Random labeling spends almost the entire budget teaching the model cells it already handles. Frontier-targeted, model-assisted labeling spends it on the boundary cases that move the decision: fewer labels, faster lift, scarce expert time preserved.

The payoff

A per-cell production-readiness gate, not a vibe.

Running EKIP before launch produces a defensible criterion. The AI is cleared to auto-decide a cell only when all three gates pass; cells that fail don't block the launch, they route to human review in production.

Coverage isn't binary across the whole system; it's resolved cell by cell. The gate converts a frontier map into an operating policy: confident automation where the data supports it, human judgment where it doesn't, and a ranked collection backlog to expand the green zone over time.

Green cells

RAG source present, recall bounded tightly, metrics and EN/ES parity above threshold. The AI decides; humans spot-check.

Amber cells

Any gate unmet. The AI assists but a human adjudicates, and the cell enters the collection backlog, ranked by (risk × lift) ÷ cost.
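A minimal sketch of the green/amber policy as a function. The field names, the three gate labels, and every threshold are hypothetical placeholders; the actual cut-offs would be set per deployment and per cell risk level.

```python
from dataclasses import dataclass

@dataclass
class CellEvidence:
    rag_source_present: bool      # authoritative, current rule is retrievable
    recall: float                 # measured recall on violations in this cell
    recall_ci_half_width: float   # 95% CI half-width on that recall
    en_es_recall_gap: float       # absolute EN vs ES recall difference

def readiness_gate(cell, min_recall=0.90, max_half_width=0.05, max_parity_gap=0.03):
    """Return ("green" | "amber", list of failed gates). Thresholds are
    illustrative defaults, not recommended values."""
    gates = {
        "rag": cell.rag_source_present,
        "sufficiency": cell.recall_ci_half_width <= max_half_width,
        "performance": (cell.recall >= min_recall
                        and cell.en_es_recall_gap <= max_parity_gap),
    }
    failed = [name for name, passed in gates.items() if not passed]
    return ("green" if not failed else "amber"), failed

well_measured = CellEvidence(True, recall=0.94, recall_ci_half_width=0.04, en_es_recall_gap=0.02)
thin_cell = CellEvidence(True, recall=0.95, recall_ci_half_width=0.12, en_es_recall_gap=0.02)

print(readiness_gate(well_measured))
print(readiness_gate(thin_cell))
```

The second cell shows the sufficiency trap: its point estimate looks strong, but the interval is too wide to claim it, so the cell stays amber and goes to human review.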

Positioning

The same EKIP category, a higher-stakes vertical.

Compliance AI is exactly where "we have lots of data" hides the failures that matter. Framed as Frontier Intelligence for AI Readiness, EKIP gives a buyer a launch gate, a regulator-defensible measurement story, and a collection plan that spends scarce expert time where it changes the decision.

Next best action

Wire these into the prototype: cascade coverage map, three-product readiness scorecard, rare-event collection planner, and the per-cell auto-decide vs human-in-the-loop policy.
