Selection · Slicing · Gold sets · Drift

A single accuracy number is the easiest way to ship a model that's already drifting.

AI teams take a sample, label it, and report one number. EKIP turns that into a discipline: pick the data points that actually carry information, measure accuracy where it breaks, build living gold and fine-tuning sets, and control all of it with knobs, so data drift is caught before the customer feels it.

One model · "92% accurate"

Overall (random sample, 1,000 labeled): 92%
Spanish · billing (38 labeled): 78%
Long transcripts > 10 min (64 labeled): 71%
New-product complaints (6 labeled · emerging): ?

Why the number lies

One sample, one gold set, one average: three blind spots that all end in drift.

The standard loop (grab a sample, label it, compute accuracy) has three structural failures. Each one lets real-world performance fall while the dashboard stays green.

Blind spot 1

Random misses the tail

A random sample is dominated by the common cases. Edge cases, rare languages, and unusual document types (the places models fail) are barely present, so their accuracy is never really measured.

Blind spot 2

The average hides the slices

92% overall can be 96% on the easy majority and 71% on a segment that matters. A single figure can't tell you which, and stakeholders read it as "the model is 92% good everywhere."

Blind spot 3

A frozen gold set ages

Input patterns shift: new topics, new phrasing, new products, new customer mix. A gold set built last quarter stops representing today's traffic, so accuracy looks stable while it silently decays. That gap is drift.

Capability 1 · Intelligent selection

Pick the points that carry information, not the ones that happen to be common.

Every transcript or document becomes a point in an embedding space. EKIP selects across that space: covering it, oversampling the sparse frontier, and pulling the uncertain and the novel, instead of drawing blindly from the dense center.

[Figure: two panels over the same embedding space. Left, "Random sample": clusters in the dense head · edge cases unseen (edges missed). Right, "EKIP selection": spans the space · samples sparse + novel regions (edge cases caught).]

Same pool of documents, two ways to spend the same labeling budget. Random (left) concentrates picks where points are dense and never touches the sparse frontier; EKIP selection (right) spreads coverage and deliberately pulls the sparse, novel, and uncertain points where the model is most likely to be wrong.
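The coverage idea in the figure can be illustrated with greedy farthest-point (k-center) selection. This is a minimal sketch of one well-known coverage heuristic, not EKIP's actual selection method, which the page does not specify. The function name, the toy 2-D "embeddings", and the budget are all assumptions for illustration.

```python
import math
import random


def farthest_point_selection(points, budget, seed=0):
    """Greedy k-center: repeatedly pick the point farthest from all picks so far."""
    if budget <= 0 or not points:
        return []
    rng = random.Random(seed)
    first = rng.randrange(len(points))
    selected = [first]
    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = [math.dist(p, points[first]) for p in points]
    while len(selected) < min(budget, len(points)):
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Toy pool (hypothetical): a dense head of 200 points plus 3 sparse outliers.
rng = random.Random(42)
pool = [(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(200)]
pool += [(5.0, 5.0), (-5.0, 4.0), (4.0, -5.0)]

picks = farthest_point_selection(pool, budget=6)
outliers_picked = sum(1 for i in picks if i >= 200)  # indices 200..202 are the outliers
```

With a budget of six labels, this heuristic reaches all three outliers, whereas a uniform random draw of six from 203 points would usually reach none. Pure farthest-point selection over-favors outliers, which is one reason a real selector would blend it with density, uncertainty, and novelty signals.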

1

Select the points

Diversity, density, uncertainty, disagreement, and novelty selection across the embedding space: a small set, maximum information per label.

2

Measure by slice

Compute accuracy per attribute and per geometric region, with confidence intervals, to surface the weak and the unmeasured slices, not one average.

3

Grow the gold set

Promote labeled, adjudicated points into a versioned gold set that is refreshed as new regions appear: a living benchmark, not a frozen one.

4

Target fine-tuning

Build the fine-tuning corpus from the weak and drifting slices, including synthesized edge cases, so training fixes what evaluation actually found.

Capability 2 · Accuracy by subset & edge case

Information geometry turns "92%" into a map of exactly where it breaks.

Slice by explicit attributes (language, channel, length, document type, segment, time) and by emergent regions in the embedding space the team never pre-defined. The aggregate from the hero, unpacked:

| Slice | Labeled n | Accuracy | Read |
| --- | --- | --- | --- |
| Overall (the headline number) | 1,000 | 92% | reassuring |
| English · billing (dense head) | 420 | 96% | strong |
| Spanish · billing (under-sampled) | 38 | 78% | weak |
| Long transcripts > 10 min (boundary region) | 64 | 71% | weak |
| New-product complaints (emerging cluster) | 6 | ? | unmeasured · drift |

The two findings the average buried: a real weakness (Spanish billing and long transcripts well below the headline), and a drift frontier: an emerging cluster with only six labeled points, growing in production, that the model has never been properly evaluated or trained on. That last row is precisely the case the customer hits before anyone notices. Each slice also needs a minimum sample size before its number is trusted; "100% on 3 examples" is not a result.
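To make the minimum-sample-size point concrete, here is a small sketch of per-slice accuracy with a Wilson score interval and a sufficiency flag. The function names, the record shape, and the toy data are assumptions for illustration. The Wilson interval is a standard choice for small samples, not necessarily the interval EKIP uses.

```python
import math
from collections import defaultdict


def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if n == 0:
        return (0.0, 1.0)
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def accuracy_by_slice(records, min_n=30):
    """records: iterable of (slice_name, is_correct) pairs."""
    counts = defaultdict(lambda: [0, 0])  # slice -> [correct, total]
    for name, ok in records:
        counts[name][0] += int(ok)
        counts[name][1] += 1
    report = {}
    for name, (correct, n) in counts.items():
        report[name] = {
            "n": n,
            "accuracy": correct / n,
            "ci": wilson_interval(correct, n),
            "trusted": n >= min_n,  # sufficiency floor before the number is reported
        }
    return report


# Toy data (hypothetical): a well-sampled slice and a 3-example slice.
records = [("head", True)] * 96 + [("head", False)] * 4 + [("tiny", True)] * 3
report = accuracy_by_slice(records)
```

In this toy run the "tiny" slice scores 100% on 3 examples, but its interval's lower bound is roughly 0.44 and it is flagged as untrusted, which is the sense in which that number "is not a result."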

The control surface

The knobs that select data, create data, and govern the loop.

Selection and creation aren't one-off scripts; they're parameterized controls a team tunes per use case and re-tunes as drift appears. Three families of knobs, shown at illustrative settings.

Selection knobs

Which existing points to pull for labeling, evaluation, or training.

Coverage radius: wide

How widely picks spread across the embedding space.

Density target: boost sparse

How hard to oversample low-density frontier regions vs the dense head.

Uncertainty cutoff: low-conf

Pull points where the model is least confident, near the decision boundary.

Novelty distance: far from gold

Pull points far from the current gold/training set: the drift frontier.

Attribute quotas: on

Minimums per language, channel, document type, length, segment.

Recency weight: recent-tilt

Favor recent traffic so emerging patterns surface early.
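One way to picture how several selection knobs might combine is a per-candidate score. The sketch below is purely illustrative: the `Candidate` fields, the weights, the half-life, and the additive form are all assumptions, not EKIP's scoring rule.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    embedding: tuple   # hypothetical embedding vector
    confidence: float  # model's top-class confidence in [0, 1]
    age_days: float    # days since the item was seen in traffic


def selection_score(c, gold_embeddings, w_uncertainty=1.0, w_novelty=1.0,
                    w_recency=0.5, half_life_days=30.0):
    """Higher score means more worth labeling. All weights are hypothetical knobs."""
    uncertainty = 1.0 - c.confidence
    # Novelty distance: distance to the nearest point already in the gold set.
    novelty = min(math.dist(c.embedding, g) for g in gold_embeddings)
    # Recency weight: exponential decay with the given half-life.
    recency = 0.5 ** (c.age_days / half_life_days)
    return w_uncertainty * uncertainty + w_novelty * novelty + w_recency * recency


gold = [(0.0, 0.0)]
far_uncertain_recent = Candidate((3.0, 4.0), confidence=0.55, age_days=1)
near_confident_old = Candidate((0.1, 0.0), confidence=0.99, age_days=90)
ranked = sorted([near_confident_old, far_uncertain_recent],
                key=lambda c: selection_score(c, gold), reverse=True)
```

Novelty here is an unnormalized distance, so in practice it would need rescaling before being mixed with the bounded uncertainty and recency terms.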

Creation knobs

How to manufacture new points where real ones are too sparse to collect.

Augmentation strength: moderate

Paraphrase / perturbation intensity applied to real examples.

Synthetic ratio: 30% synth

Share of generated data mixed with real in the corpus.

Target region: weak slices

Which sparse / underperforming region to generate into.

Difficulty level: adversarial

How hard and boundary-pushing the generated edge cases are.

Attribute conditioning: ES · long

Generate for a specific language, segment, or document type.

Label source: model→human

Model proposes and expert adjudicates, vs human-from-scratch labeling.

Control knobs

How the loop measures, budgets, and governs itself.

Slice granularity: fine

How finely to cut subsets when computing accuracy.

Minimum-n per slice: ≥ 30

Sufficiency floor before a slice's number is trusted.

Risk weighting: impact > freq

Weight slices by business impact, not just how often they occur.

Budget split: eval / FT

Allocation across evaluation, gold-refresh, and fine-tuning.

Drift sensitivity: high

How much distribution shift triggers re-evaluation and refresh.

Automation threshold: per slice

Accuracy floor for auto-decide vs routing a slice to humans.
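As an illustration of how the minimum-n and automation-threshold knobs could interact, the sketch below routes each slice to one of three outcomes. The `SliceStat` type, the `route` function, the outcome labels, and the 0.90 accuracy floor are hypothetical. The slice counts and accuracies are the ones from the table earlier on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SliceStat:
    name: str
    n: int
    accuracy: Optional[float]  # None when the slice is unmeasured


def route(stat, min_n=30, accuracy_floor=0.90):
    """Decide what to do with a slice. Thresholds are illustrative knob settings."""
    if stat.accuracy is None or stat.n < min_n:
        return "label_more"    # too few labels to trust any number
    if stat.accuracy >= accuracy_floor:
        return "auto"          # clears the per-slice automation floor
    return "human_review"      # measured, but below the floor


slices = [
    SliceStat("English · billing", 420, 0.96),
    SliceStat("Spanish · billing", 38, 0.78),
    SliceStat("Long transcripts > 10 min", 64, 0.71),
    SliceStat("New-product complaints", 6, None),
]
decisions = {s.name: route(s) for s in slices}
```

A stricter variant could compare the lower bound of a confidence interval, rather than the point estimate, against the floor.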

Why it stays accurate

The gold and fine-tuning sets live, so drift is caught, not discovered by the customer.

The knobs run continuously. A drift monitor watches the production embedding distribution; when a region grows or shifts, it pulls new points, re-measures, refreshes the gold set, and queues targeted fine-tuning before the change reaches users.

[Figure: continuous loop. Watch traffic (embedding drift monitor) → Select points (novel · sparse · uncertain) → Measure slices (per region + CI) → Refresh gold (+ build fine-tune set) → Retrain / re-gate (before users feel it). Continuous: every cycle re-syncs the benchmark to reality.]

The same selection and creation knobs that build the first evaluation also keep it current. Drift becomes a signal that triggers the loop, instead of a surprise that surfaces as a support ticket.
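A drift monitor of the kind described above could, in its simplest form, compare how traffic is distributed across embedding regions in a reference window versus current production. The sketch below assumes each item has already been assigned a region id (for example by clustering its embedding). The function names and both thresholds are hypothetical stand-ins for a "drift sensitivity" setting.

```python
from collections import Counter


def region_shares(region_ids):
    """Fraction of traffic falling in each region."""
    counts = Counter(region_ids)
    total = sum(counts.values())
    return {region: count / total for region, count in counts.items()}


def drifting_regions(reference_ids, production_ids, min_ratio=2.0, min_share=0.01):
    """Flag regions that are new, or whose traffic share grew by at least min_ratio."""
    ref = region_shares(reference_ids)
    prod = region_shares(production_ids)
    flagged = []
    for region, share in prod.items():
        if share < min_share:
            continue  # too small to act on yet
        baseline = ref.get(region, 0.0)
        if baseline == 0.0 or share / baseline >= min_ratio:
            flagged.append(region)
    return sorted(flagged)


# Toy traffic (hypothetical): a new cluster appears in production.
reference = ["billing"] * 90 + ["shipping"] * 10
production = ["billing"] * 80 + ["shipping"] * 10 + ["new_product"] * 10
flagged = drifting_regions(reference, production)
```

Each flagged region would then feed the loop: pull new points from it, re-measure its slice, and refresh the gold set.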

Positioning

Stop reporting accuracy. Start controlling it.

This is DataKnobs at its most literal: the data points you evaluate, the data points you create, and the thresholds you govern by are all knobs, turned deliberately and re-tuned as the world shifts. The result is an evaluation that tells the truth and a model that doesn't drift out from under its own benchmark.

Next best action

Run one model through the engine: intelligent sample, sliced-accuracy map, the drift frontier, and a starter knob configuration for the gold and fine-tuning sets.

Review the knob catalog