A single accuracy number is the easiest way to ship a model that's already drifting.
AI teams take a sample, label it, and report one number. EKIP turns that into a discipline: pick the data points that actually carry information, measure accuracy where it breaks, build living gold and fine-tuning sets, and control all of it with knobs, so data drift is caught before the customer feels it.
One model · "92% accurate"
One sample, one gold set, one average: three blind spots that all end in drift.
The standard loop (grab a sample, label it, compute accuracy) has three structural failures. Each one lets real-world performance fall while the dashboard stays green.
Random misses the tail
A random sample is dominated by the common cases. Edge cases, rare languages, and unusual document types (the places models fail) are barely present, so their accuracy is never really measured.
The average hides the slices
92% overall can be 96% on the easy majority and 71% on a segment that matters. A single figure can't tell you which, and stakeholders read it as "the model is 92% good everywhere."
A frozen gold set ages
Input patterns shift: new topics, new phrasing, new products, a new customer mix. A gold set built last quarter stops representing today's traffic, so accuracy looks stable while it silently decays. That gap is drift.
Pick the points that carry information, not the ones that happen to be common.
Every transcript or document becomes a point in an embedding space. EKIP selects across that space: covering it, oversampling the sparse frontier, and pulling the uncertain and the novel instead of drawing blindly from the dense center.
Same pool of documents, two ways to spend the same labeling budget. Random (left) concentrates picks where points are dense and never touches the sparse frontier; EKIP selection (right) spreads coverage and deliberately pulls the sparse, novel, and uncertain points where the model is most likely to be wrong.
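The source does not specify how EKIP implements selection, so the following is only a minimal sketch of one common coverage heuristic, k-center greedy (farthest-point) sampling, on a toy 2-D pool. The function name `k_center_greedy`, the toy data, and the budget of 20 are all illustrative assumptions, not EKIP's API.

```python
import numpy as np

def k_center_greedy(embeddings: np.ndarray, budget: int, seed: int = 0) -> list:
    """Pick up to `budget` row indices so every point is near some pick.

    Illustrative coverage heuristic only; not EKIP's documented method.
    """
    n = embeddings.shape[0]
    budget = min(budget, n)
    if budget <= 0:
        return []
    rng = np.random.default_rng(seed)
    first = int(rng.integers(n))
    picked = [first]
    # Distance from every point to its nearest picked point so far.
    min_dist = np.linalg.norm(embeddings - embeddings[first], axis=1)
    while len(picked) < budget:
        nxt = int(np.argmax(min_dist))  # the least-covered point
        picked.append(nxt)
        dist_to_new = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return picked

# Toy pool (assumed data): a dense head of 500 points, a sparse frontier of 10.
rng = np.random.default_rng(42)
dense = rng.normal(loc=0.0, scale=0.1, size=(500, 2))
sparse = rng.normal(loc=5.0, scale=0.1, size=(10, 2))
pool = np.vstack([dense, sparse])

picks = k_center_greedy(pool, budget=20)
# Indices >= 500 belong to the sparse frontier.
frontier_picks = sum(1 for i in picks if i >= 500)
```

Because each step takes the point farthest from everything already picked, the sparse frontier is reached within the first two picks here, whereas a uniform random draw of 20 from this pool would expect to land on it well under once. Uncertainty and novelty signals would be layered on top of a coverage term like this one.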
Select the points
Diversity, density, uncertainty, disagreement, and novelty selection across the embedding space: a small set, with maximum information per label.
Measure by slice
Compute accuracy per attribute and per geometric region, with confidence intervals. Surface the weak and the unmeasured slices, not one average.
Grow the gold set
Promote labeled, adjudicated points into a versioned gold set that is refreshed as new regions appear: a living benchmark, not a frozen one.
Target fine-tuning
Build the fine-tuning corpus from the weak and drifting slices, including synthesized edge cases, so training fixes what evaluation actually found.
Information geometry turns "92%" into a map of exactly where it breaks.
Slice by explicit attributes (language, channel, length, document type, segment, time) and by emergent regions in the embedding space the team never pre-defined. The aggregate from the hero, unpacked:
| Slice | Labeled n | Accuracy | Read |
|---|---|---|---|
| Overall (the headline number) | 1,000 | 92% | reassuring |
| English · billing (dense head) | 420 | 96% | strong |
| Spanish · billing (under-sampled) | 38 | 78% | weak |
| Long transcripts > 10 min (boundary region) | 64 | 71% | weak |
| New-product complaints (emerging cluster) | 6 | unmeasured | drift |
The two findings the average buried: a real weakness (Spanish billing and long transcripts, both well below the headline), and a drift frontier (an emerging cluster with only six labeled points, growing in production, that the model has never been properly evaluated or trained on). That last row is precisely the case the customer hits before anyone notices. Each slice also needs a minimum sample size before its number is trusted; "100% on 3 examples" is not a result.
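To make the "minimum sample size" point concrete, here is a small sketch that puts a 95% Wilson score interval around each slice in the table. The Wilson interval is a standard choice for a binomial proportion, but the source does not say which interval EKIP uses; the function names and the floor of 30 labeled points are assumptions for illustration.

```python
import math
from typing import Optional, Tuple

def wilson_interval(accuracy: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for an observed accuracy on n labeled points."""
    if n == 0:
        return (0.0, 1.0)  # nothing measured: the interval is the whole range
    denom = 1 + z**2 / n
    center = (accuracy + z**2 / (2 * n)) / denom
    half = z * math.sqrt(accuracy * (1 - accuracy) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

MIN_SLICE_N = 30  # hypothetical sufficiency floor, not a value from the source

# (slice name, labeled n, observed accuracy or None) taken from the table above.
SLICES = [
    ("English · billing", 420, 0.96),
    ("Spanish · billing", 38, 0.78),
    ("Long transcripts > 10 min", 64, 0.71),
    ("New-product complaints", 6, None),
]

def slice_report(slices, min_n: int = MIN_SLICE_N) -> dict:
    """Return name -> (low, high) interval, or None when the slice is unmeasured."""
    report = {}
    for name, n, acc in slices:
        if acc is None or n < min_n:
            report[name] = None  # too few labels to trust any number
        else:
            report[name] = wilson_interval(acc, n)
    return report

report = slice_report(SLICES)
tiny_low, tiny_high = wilson_interval(1.0, 3)  # "100% on 3 examples"
```

The interval for 38 labeled points is far wider than the one for 420, which is the reason a slice number needs its sample size next to it. For 3 correct out of 3, the lower bound of the 95% Wilson interval is only about 0.44.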
The knobs that select data, create data, and govern the loop.
Selection and creation aren't one-off scripts; they're parameterized controls a team tunes per use case and re-tunes as drift appears. Three families of knobs, shown at illustrative settings.
Selection knobs
Which existing points to pull for labeling, evaluation, or training.
How widely picks spread across the embedding space.
How hard to oversample low-density frontier regions vs the dense head.
Pull points where the model is least confident, near the decision boundary.
Pull points far from the current gold/training set: the drift frontier.
Minimums per language, channel, document type, length, segment.
Favor recent traffic so emerging patterns surface early.
Creation knobs
How to manufacture new points where real ones are too sparse to collect.
Paraphrase / perturbation intensity applied to real examples.
Share of generated data mixed with real in the corpus.
Which sparse / underperforming region to generate into.
How hard and boundary-pushing the generated edge cases are.
Generate for a specific language, segment, or document type.
Model proposes, expert adjudicates vs human-from-scratch.
Control knobs
How the loop measures, budgets, and governs itself.
How finely to cut subsets when computing accuracy.
Sufficiency floor before a slice's number is trusted.
Weight slices by business impact, not just how often they occur.
Allocation across evaluation, gold-refresh, and fine-tuning.
How much distribution shift triggers re-evaluation and refresh.
Accuracy floor for auto-decide vs routing a slice to humans.
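As a rough illustration of how three of the control knobs above could interact (the sufficiency floor, the auto-decide accuracy floor, and business-impact weighting), here is a minimal sketch. Every name and default value in it (`ControlKnobs`, `min_slice_n`, `auto_decide_floor`, the impact weights) is a hypothetical stand-in; the source does not list the actual knob names or settings.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

@dataclass(frozen=True)
class ControlKnobs:
    # Hypothetical names and defaults, for illustration only.
    min_slice_n: int = 30            # sufficiency floor before a number is trusted
    auto_decide_floor: float = 0.90  # below this, route the slice to humans

SliceStats = Dict[str, Tuple[int, Optional[float]]]  # name -> (labeled n, accuracy)

def route_slice(n: int, accuracy: Optional[float], knobs: ControlKnobs) -> str:
    """Decide what happens to one slice under the current knob settings."""
    if accuracy is None or n < knobs.min_slice_n:
        return "label_more"
    if accuracy < knobs.auto_decide_floor:
        return "route_to_humans"
    return "auto_decide"

def impact_weighted_accuracy(slices: SliceStats, impact: Dict[str, float],
                             knobs: ControlKnobs) -> float:
    """Average trusted slices by business impact instead of by frequency."""
    total, weight_sum = 0.0, 0.0
    for name, (n, acc) in slices.items():
        if acc is None or n < knobs.min_slice_n:
            continue  # unmeasured slices cannot contribute a number
        w = impact.get(name, 1.0)
        total += w * acc
        weight_sum += w
    if weight_sum == 0:
        raise ValueError("no slice meets the sufficiency floor")
    return total / weight_sum

# Slice figures from the table earlier on this page; impact weights are assumed.
slices: SliceStats = {
    "English · billing": (420, 0.96),
    "Spanish · billing": (38, 0.78),
    "Long transcripts > 10 min": (64, 0.71),
    "New-product complaints": (6, None),
}
impact = {"Long transcripts > 10 min": 3.0}

knobs = ControlKnobs()
routes = {name: route_slice(n, acc, knobs) for name, (n, acc) in slices.items()}
weighted = impact_weighted_accuracy(slices, impact, knobs)
```

With these assumed settings, only the dense English billing slice clears the floor for auto-decide; the two weak slices go to humans, and the six-point cluster is sent back for more labels instead of being scored.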
The gold and fine-tuning sets stay live, so drift is caught, not discovered by the customer.
The knobs run continuously. A drift monitor watches the production embedding distribution; when a region grows or shifts, it pulls new points, re-measures, refreshes the gold set, and queues targeted fine-tuning before the change reaches users.
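One simple way to approximate such a monitor, offered as a sketch and not as EKIP's actual mechanism, is a novelty rate: the share of production points that sit farther from the gold set than gold points typically sit from each other. The function names, the 0.95 quantile, and the synthetic data below are all assumptions.

```python
import numpy as np

def nearest_distance(points: np.ndarray, reference: np.ndarray,
                     exclude_self: bool = False) -> np.ndarray:
    """Distance from each point to its nearest neighbour in `reference`."""
    diffs = points[:, None, :] - reference[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    if exclude_self:
        # Only valid when points and reference are the same array.
        np.fill_diagonal(dists, np.inf)
    return dists.min(axis=1)

def novelty_rate(production: np.ndarray, gold: np.ndarray, quantile: float = 0.95):
    """Fraction of production points beyond the gold set's usual spacing."""
    threshold = np.quantile(nearest_distance(gold, gold, exclude_self=True), quantile)
    far = nearest_distance(production, gold) > threshold
    return float(far.mean()), np.flatnonzero(far)

# Synthetic embeddings (assumed): gold and stable traffic share a distribution;
# the shifted traffic adds 40 points from a region the gold set never covered.
rng = np.random.default_rng(7)
gold = rng.normal(0.0, 1.0, size=(200, 2))
stable = rng.normal(0.0, 1.0, size=(200, 2))
emerging = rng.normal(8.0, 0.3, size=(40, 2))
shifted = np.vstack([stable, emerging])

stable_rate, _ = novelty_rate(stable, gold)
shifted_rate, novel_idx = novelty_rate(shifted, gold)
```

The indices in `novel_idx` are the candidates to pull for labeling, which ties the monitor back to the selection knobs. A production version would compare the rate against a drift-trigger threshold and would need an approximate nearest-neighbour index, since the pairwise distance matrix here grows with the product of the two set sizes.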
The same selection and creation knobs that build the first evaluation also keep it current. Drift becomes a signal that triggers the loop, instead of a surprise that surfaces as a support ticket.
Stop reporting accuracy. Start controlling it.
This is DataKnobs at its most literal: the data points you evaluate, the data points you create, and the thresholds you govern by are all knobs, turned deliberately and re-tuned as the world shifts. The result is an evaluation that tells the truth and a model that doesn't drift out from under its own benchmark.
Run one model through the engine: intelligent sample, sliced-accuracy map, the drift frontier, and a starter knob configuration for the gold and fine-tuning sets.