Before a compliance AI goes live, EKIP tells you which cells it's safe to trust.
A call-center pipeline that detects complaints, clusters themes, flags regulatory violations, and scores severity is a rare-event, four-stage cascade. The same EKIP layers decide what evaluation, RAG, and fine-tuning data to collect and turn "is the AI good enough?" into a per-cell answer.
The audit cascade · EN majority, ES minority
Three properties drive every collection decision here.
The state space is the decision space the pipeline operates over: a cell is language × stage × theme × violation-type × severity × difficulty. But three properties make it behave unlike a demographic-coverage problem.
Rare-event, asymmetric cost
Most calls aren't complaints; most complaints aren't violations; severe violations are very rare. The expensive error is missing a real violation, not over-flagging, so frequency-matched sampling under-collects exactly the cells that matter.
Four-stage cascade
Errors compound. A complaint the stage-1 detector silently drops never reaches violation detection. Coverage and evaluation must be measured per stage and end-to-end, or cascade failures stay invisible.
Three data products
"Data" isn't one thing. Evaluation, RAG, and fine-tuning have three different coverage frames, three failure modes, and three sequencing priorities, which is the heart of readiness.
A miss at any stage removes the call from every stage after it. This is why a stage-isolated test set looks healthy while end-to-end recall on violations quietly leaks, and why the coverage frame must be measured at each handoff.
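A minimal sketch of the compounding, assuming for illustration that each stage's recall is measured on the calls that survived the earlier stages, so the stages multiply. The stage names and recall values are hypothetical, not measurements.

```python
import math

def end_to_end_recall(stage_recalls):
    """Recall on violations after every handoff.

    Assumes each stage's recall is conditional on the call having
    survived all earlier stages, so the per-stage values multiply.
    """
    return math.prod(stage_recalls)

# Hypothetical per-stage recalls; each looks healthy in isolation.
stages = {
    "complaint_detection": 0.95,
    "theme_assignment": 0.97,
    "violation_flagging": 0.90,
}

e2e = end_to_end_recall(stages.values())
print(f"end-to-end recall on violations: {e2e:.3f}")
```

With these illustrative inputs, three stages that each score 0.90 or better leak to roughly 0.83 end-to-end, which is the number a stage-isolated test set never shows.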
Evaluation, RAG, and fine-tuning need three different coverage frames.
Each answers a different question, fails in a different way, and is funded in a different order. Conflating them is the most common reason a compliance AI ships unsafe.
Evaluation
The gold-labeled set that measures failure per cell: above all, recall on violations and severity calibration.
Coverage frame
The full decision space with the rare risk tail deliberately boosted, plus hard negatives (venting that isn't a complaint; complaints that aren't violations) to bound false-positive rate.
Fails when
Too few positives per cell to estimate error: you "have" the cell but can't claim a number for it.
RAG corpus
The retrievable knowledge the model reasons over: regulations, regulator guidance, internal policy, violation definitions, severity rubrics, adjudicated precedent.
Coverage frame
The knowledge surface, not the call population. Every in-scope violation type and severity rule needs an authoritative, current, retrievable source.
Fails when
A rule exists but retrieval doesn't surface it, or the corpus is stale after a regulation changed; both produce confidently wrong adjudications.
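The retrievable-but-not-retrieved failure can be checked with recall@k per violation type. A minimal sketch, assuming each test query has a hand-labeled list of relevant source IDs; the document IDs and the helper `recall_at_k` are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant sources that appear in the top-k results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

# Hypothetical query for one violation type: the governing rule ("rule-17")
# is in the corpus but ranks sixth, so it is missed at k=5.
retrieved = ["memo-3", "rule-02", "faq-9", "rule-11", "memo-8", "rule-17"]
relevant = ["rule-17", "rule-02"]

print(recall_at_k(retrieved, relevant, k=5))   # 0.5
print(recall_at_k(retrieved, relevant, k=10))  # 1.0
```

The rule exists in the corpus in both cases; only the cutoff decides whether the model ever sees it.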
Fine-tuning
Labeled examples that specialize each stage model and seed the theme / severity clusters.
Coverage frame
Enough labels per cell with the tail oversampled, weighted toward high-information boundary and disagreement-prone cases, not more easy-majority calls.
Fails when
The budget is spent before evaluation proves where the model is actually weak, or the natural distribution starves the rare cells that carry the risk.
The same four moves (frame, find, measure, prioritize) applied across all three products.
Violation-type maps to your actual regulatory taxonomy (UDAAP / FDCPA / TCPA / Reg E / Reg Z for financial, HIPAA for healthcare, and so on); treat those as placeholders for your in-scope list.
What data to collect
Frame · EKIP Layer 1
Build the cell frame, then set a target per product:
- Eval: full decision space, risk tail boosted, sized for power on per-cell recall.
- RAG: one authoritative source per violation type, severity rule, and precedent.
- Fine-tune: enough high-information labels per cell, tail oversampled.
Mine historical transcripts, stratify by predicted cell, and route the uncertain or high-stakes ones to expert labeling. One adjudicated case can serve all three products.
What is missing
Frontier Intelligence
Run gap detection per stage and per product:
- Coverage gaps: e.g. severe-violation Spanish calls; a violation type never seen.
- Sufficiency gaps: present but too few to bound error (the quiet killer here).
- RAG gaps: retrievable-but-not-retrieved, and stale-after-rule-change.
- Cascade gaps: only visible end-to-end, not in stage-isolated tests.
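The first two gap types can be separated with a simple count check per cell. A rough sketch, assuming you already know how many positives a cell needs to bound error at your target interval width (139 here, the ±5% row of the sufficiency table); the cell keys and counts are made up for illustration.

```python
def classify_gap(n_positives, n_required):
    """Label a cell by how many gold-labeled positives it holds.

    n_required is the count needed to bound error at the target
    confidence-interval width (an input here, not computed).
    """
    if n_positives == 0:
        return "coverage_gap"      # cell never observed
    if n_positives < n_required:
        return "sufficiency_gap"   # observed, but too few to claim a number
    return "ok"

# Hypothetical positives per (language, severity) cell.
cells = {("EN", "severe"): 150, ("ES", "severe"): 0, ("ES", "moderate"): 22}
gaps = {cell: classify_gap(n, n_required=139) for cell, n in cells.items()}

for cell, label in gaps.items():
    print(cell, label)
```

RAG gaps and cascade gaps need their own checks (retrieval recall and end-to-end recall); a positive count alone cannot reveal them.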
How to measure representation
Information Geometry
Profile production reality, then measure two things in parallel:
- Risk-weighted coverage: every cell weighted by regulatory exposure × frequency, never frequency alone.
- Distance: JS divergence and blind-spot mass for the audit.
RAG recall@k per violation type · Eval CI width on per-cell recall · severity calibration · explicit EN vs ES parity.
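One way these measures could be computed over a list of cells is sketched below. The definitions are assumptions for illustration: risk-weighted coverage is taken as the covered share of exposure × frequency weight, and blind-spot mass as the production probability sitting on cells the dataset never touches. The source does not spell out exact formulas, and the example numbers are hypothetical.

```python
import math

def risk_weighted_coverage(cells):
    """cells: list of (exposure, frequency, covered) tuples; covered is bool."""
    weights = [exposure * freq for exposure, freq, _ in cells]
    covered = sum(w for w, (_, _, ok) in zip(weights, cells) if ok)
    return covered / sum(weights)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two discrete distributions."""
    def kl(a, b):
        # Terms with a == 0 contribute nothing; b is a mixture, so b > 0 here.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def blind_spot_mass(production, dataset):
    """Production probability mass on cells the dataset never covers."""
    return sum(p for p, d in zip(production, dataset) if d == 0)

# Hypothetical three-cell example: the rare cell has 50x the exposure.
cells = [(1, 0.90, True), (5, 0.08, True), (50, 0.02, False)]
production = [0.90, 0.08, 0.02]
dataset = [0.95, 0.05, 0.00]

print(f"risk-weighted coverage: {risk_weighted_coverage(cells):.3f}")
print(f"JS divergence (bits):   {js_divergence(production, dataset):.4f}")
print(f"blind-spot mass:        {blind_spot_mass(production, dataset):.2f}")
```

In this toy case frequency alone would report 98% coverage, while the risk-weighted figure is about 57%, because the one uncovered cell carries most of the exposure.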
How to prioritize budget
Data Knob Intelligence
The scarce resource is expert (legal/compliance) reviewer time. Sequence by dependency:
- 1. RAG first: can't adjudicate without the rule; bounded, high-leverage.
- 2. Eval next: can't claim readiness without measuring; gates launch.
- 3. Fine-tune last: only where eval shows weakness and the cell is high-risk.
Within each: rank by (risk × lift) ÷ marginal cost, use model-assisted pre-labeling to concentrate expert time, then re-measure and re-prioritize each round.
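The ranking rule can be sketched as a sort over backlog items. The `Candidate` class, the field units, and every number below are hypothetical; in practice risk, lift, and cost would come from your own exposure model and labeling estimates.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    risk: float           # regulatory exposure if the cell fails
    lift: float           # expected improvement from collecting here
    marginal_cost: float  # expert-review hours for the next batch

    @property
    def priority(self):
        return (self.risk * self.lift) / self.marginal_cost

backlog = [
    Candidate("ES severe violations", risk=9.0, lift=0.30, marginal_cost=6.0),
    Candidate("EN billing complaints", risk=3.0, lift=0.05, marginal_cost=1.0),
    Candidate("EN severe violations", risk=9.0, lift=0.10, marginal_cost=3.0),
]

# Highest (risk x lift) / cost first; re-run after each collection round.
ranked = sorted(backlog, key=lambda c: c.priority, reverse=True)
for c in ranked:
    print(f"{c.priority:.2f}  {c.name}")
```

Note that the cheapest item ranks last here: low cost does not help when both risk and expected lift are small.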
The rare-event sufficiency math is the whole argument.
To claim a recall number on violations, you need enough positives in the eval set to estimate that proportion with a tight confidence interval. Rarity makes random sampling collapse.
Positives needed to bound recall (95% CI)
| Target CI half-width | Violation positives needed |
|---|---|
| ± 10% | ~35 |
| ± 5% | ~139 |
| ± 3% | ~385 |
These are positives in the test set, not total calls. The question is how many calls you must review to find them.
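The table values are consistent with the standard normal-approximation sample size for a proportion, n = z² · p(1 − p) / h², at an anticipated recall of about 0.90. That anticipated recall is inferred from the numbers rather than stated, so treat it as an assumption in the sketch below.

```python
import math

Z_95 = 1.96  # two-sided 95% normal quantile

def positives_needed(half_width, expected_recall=0.90, z=Z_95):
    """Positives required so a normal-approximation CI on recall has the
    requested half-width: n = z^2 * p * (1 - p) / h^2, rounded up."""
    p = expected_recall
    return math.ceil(z**2 * p * (1 - p) / half_width**2)

for h in (0.10, 0.05, 0.03):
    print(f"±{h:.0%}: {positives_needed(h)} positives")
```

If you cannot assume recall near 0.90, plan for the worst case p = 0.5, where the ±5% requirement rises to 385 positives.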
At 0.5% severe-violation prevalence, for ±5%
Random review would need about 139 ÷ 0.005 ≈ 27,800 calls to surface those positives. Targeted pre-filtering takes ≈ 50× less expert-review effort for the same statistical confidence, and at 50% pre-filter yield it's ~100×. That leverage is Frontier Intelligence: target the sparse, high-impact cells instead of waiting for them to appear in a random draw.
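The arithmetic behind those multipliers, as a short sketch. The expected number of calls to review is positives needed divided by the yield of whatever you sample from. The 25% yield used for the 50× case is inferred from the figures (0.25 ÷ 0.005 = 50) and is not stated in the text.

```python
def calls_to_review(n_positives, yield_rate):
    """Expected calls an expert must review to find n_positives, where
    yield_rate is the fraction of reviewed calls that are true positives."""
    return n_positives / yield_rate

NEEDED = 139        # positives for a ±5% interval, from the table
PREVALENCE = 0.005  # severe-violation rate under random sampling

random_draw = calls_to_review(NEEDED, PREVALENCE)
targeted_25 = calls_to_review(NEEDED, 0.25)  # assumed pre-filter yield
targeted_50 = calls_to_review(NEEDED, 0.50)

print(f"random draw: {random_draw:,.0f} calls")
print(f"25% yield:   {targeted_25:,.0f} calls ({random_draw / targeted_25:.0f}x less)")
print(f"50% yield:   {targeted_50:,.0f} calls ({random_draw / targeted_50:.0f}x less)")
```

The reduction factor is simply yield ÷ prevalence, so it grows as the event gets rarer or the pre-filter gets sharper.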
The corollary: the same logic governs fine-tuning. Random labeling spends almost the entire budget teaching the model cells it already handles. Frontier-targeted, model-assisted labeling spends it on the boundary cases that move the decision: fewer labels, faster lift, scarce expert time preserved.
A per-cell production-readiness gate, not a vibe.
Running EKIP before launch produces a defensible criterion. The AI is cleared to auto-decide a cell only when all three gates pass; cells that fail don't block the launch, they route to human review in production.
Coverage isn't binary across the whole system; it's resolved cell by cell. The gate converts a frontier map into an operating policy: confident automation where the data supports it, human judgment where it doesn't, and a ranked collection backlog to expand the green zone over time.
RAG source present, recall bounded tightly, metrics and EN/ES parity above threshold. The AI decides; humans spot-check.
Any gate unmet. The AI assists, but a human adjudicates, and the cell enters the collection backlog, ranked by (risk × lift) ÷ cost.
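The two outcomes above could be encoded as a per-cell routing function. This is a sketch under stated assumptions: the `CellEvidence` fields and all three thresholds are hypothetical stand-ins for whatever your own gates measure.

```python
from dataclasses import dataclass

@dataclass
class CellEvidence:
    rag_source_present: bool
    recall_ci_half_width: float  # e.g. 0.04 means recall is known to ±4%
    recall: float
    en_es_parity_gap: float      # absolute EN-vs-ES recall difference

# Hypothetical thresholds; set these from your own risk appetite.
MAX_CI_HALF_WIDTH = 0.05
MIN_RECALL = 0.90
MAX_PARITY_GAP = 0.03

def route(cell: CellEvidence) -> str:
    """Return 'auto_decide' only when every gate passes."""
    gates = (
        cell.rag_source_present,
        cell.recall_ci_half_width <= MAX_CI_HALF_WIDTH,
        cell.recall >= MIN_RECALL and cell.en_es_parity_gap <= MAX_PARITY_GAP,
    )
    return "auto_decide" if all(gates) else "human_review"

green = CellEvidence(True, 0.04, 0.93, 0.01)
wide_ci = CellEvidence(True, 0.12, 0.95, 0.01)  # good recall, poorly bounded

print(route(green))    # auto_decide
print(route(wide_ci))  # human_review
```

The second cell shows the sufficiency point in policy form: a high recall estimate does not clear the gate when the interval around it is too wide to defend.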
The same EKIP category, a higher-stakes vertical.
Compliance AI is exactly where "we have lots of data" hides the failures that matter. Framed as Frontier Intelligence for AI Readiness, EKIP gives a buyer a launch gate, a regulator-defensible measurement story, and a collection plan that spends scarce expert time where it changes the decision.
Wire these into the prototype: cascade coverage map, three-product readiness scorecard, rare-event collection planner, and the per-cell auto-decide vs human-in-the-loop policy.