Selection · Slicing · Gold sets · Drift

A single accuracy number is the easiest way to ship a model that's already drifting.

AI teams take a sample, label it, and report one number. EKIP turns that into a discipline: pick the data points that actually carry information, measure accuracy where it breaks, build living gold and fine-tuning sets, and control all of it with knobs, so data drift is caught before the customer feels it.


One model · "92% accurate"

Overall (random sample, 1,000 labeled): 92%

Spanish · billing (38 labeled): 78%

Long transcripts > 10 min (64 labeled): 71%

New-product complaints (6 labeled · emerging)

Why the number lies

One sample, one gold set, one average: three blind spots that all end in drift.

The standard loop (grab a sample, label it, compute accuracy) has three structural failures. Each one lets real-world performance fall while the dashboard stays green.

Capability 1 · Intelligent selection

Pick the points that carry information, not the ones that happen to be common.

Every transcript or document becomes a point in an embedding space. EKIP selects across that space: covering it, oversampling the sparse frontier, and pulling the uncertain and the novel instead of drawing blindly from the dense center.

Same pool of documents, two ways to spend the same labeling budget. Random (left) concentrates picks where points are dense and never touches the sparse frontier; EKIP selection (right) spreads coverage and deliberately pulls the sparse, novel, and uncertain points where the model is most likely to be wrong.
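One way to picture coverage-driven selection is greedy farthest-point (k-center) sampling. This is a minimal sketch under stated assumptions, not EKIP's actual selector: the 2-D toy embeddings, the function name `farthest_point_selection`, and the pool layout are all illustrative.

```python
import math
import random

def farthest_point_selection(points, budget, start=0):
    """Greedy k-center: keep adding the point farthest from everything picked so far."""
    if budget <= 0 or not points:
        return []
    selected = [start]
    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < min(budget, len(points)):
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected

# Hypothetical pool: a dense head of 200 points plus a sparse frontier of 5 points.
rng = random.Random(42)
dense = [(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(200)]
frontier = [(5.0 + i, 5.0) for i in range(5)]
pool = dense + frontier

picked = farthest_point_selection(pool, budget=10)
frontier_hits = sum(1 for i in picked if i >= len(dense))

# Random baseline with the same budget, for contrast.
random_picked = rng.sample(range(len(pool)), 10)
random_frontier_hits = sum(1 for i in random_picked if i >= len(dense))
```

With only 5 of 205 points on the frontier, a random draw of 10 usually lands entirely in the dense head, while the greedy selector reaches the frontier on its first steps. A production selector would presumably blend this with uncertainty and novelty signals rather than use distance alone.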

Select the points

Diversity, density, uncertainty, disagreement, and novelty selection across the embedding space: a small set with maximum information per label.

Measure by slice

Compute accuracy per attribute and per geometric region, with confidence intervals. Surface the weak and the unmeasured slices, not one average.

Grow the gold set

Promote labeled, adjudicated points into a versioned gold set that is refreshed as new regions appear: a living benchmark, not a frozen one.

Target fine-tuning

Build the fine-tuning corpus from the weak and drifting slices, including synthesized edge cases, so training fixes what evaluation actually found.

Capability 2 · Accuracy by subset & edge case

Information geometry turns "92%" into a map of exactly where it breaks.

Slice by explicit attributes (language, channel, length, document type, segment, time) and by emergent regions in the embedding space the team never pre-defined. The aggregate from the hero, unpacked:

Slice	Labeled n	Accuracy	Read
Overall (the headline number)	1,000	92%	reassuring
English · billing (dense head)	420	96%	strong
Spanish · billing (under-sampled)	38	78%	weak
Long transcripts > 10 min (boundary region)	64	71%	weak
New-product complaints (emerging cluster)	6		unmeasured · drift

The average buried two findings. The first is a real weakness: Spanish billing and long transcripts sit well below the headline. The second is a drift frontier: an emerging cluster with only six labeled points, growing in production, that the model has never been properly evaluated or trained on. That last row is precisely the case the customer hits before anyone notices. Each slice also needs a minimum sample size before its number is trusted; "100% on 3 examples" is not a result.
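To make the minimum-sample-size point concrete, here is a hedged sketch that attaches a Wilson score interval to each slice. The Wilson interval is one common choice; the page does not say which interval EKIP uses. The function names and the `min_n=30` default are assumptions for illustration, and the accuracies and labeled counts are the ones from the table.

```python
import math

def wilson_interval(accuracy, n, z=1.96):
    """Wilson score interval for a proportion; z=1.96 is roughly a 95% interval."""
    if n == 0:
        return (0.0, 1.0)
    z2 = z * z
    denom = 1 + z2 / n
    center = (accuracy + z2 / (2 * n)) / denom
    half = z * math.sqrt(accuracy * (1 - accuracy) / n + z2 / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

def read_slice(name, n, accuracy, min_n=30):
    """Report an interval only when the slice clears the sufficiency floor."""
    if accuracy is None or n < min_n:
        return {"slice": name, "n": n, "interval": None, "status": "unmeasured"}
    lo, hi = wilson_interval(accuracy, n)
    return {"slice": name, "n": n, "interval": (round(lo, 3), round(hi, 3)), "status": "measured"}

rows = [
    ("Overall", 1000, 0.92),
    ("English · billing", 420, 0.96),
    ("Spanish · billing", 38, 0.78),
    ("Long transcripts > 10 min", 64, 0.71),
    ("New-product complaints", 6, None),  # no accuracy reported in the table
]
report = [read_slice(name, n, acc) for name, n, acc in rows]

# "100% on 3 examples": the interval is far too wide to be a result.
tiny_lo, tiny_hi = wilson_interval(1.0, 3)
```

The same point estimate means very different things at different n: the interval around 92% on 1,000 labels is narrow, while the interval around 78% on 38 labels spans a wide range, and a perfect score on three examples is compatible with a true accuracy well below the headline.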

The control surface

The knobs that select data, create data, and govern the loop.

Selection and creation aren't one-off scripts. They're parameterized controls a team tunes per use case and re-tunes as drift appears. Three families of knobs, shown at illustrative settings.

Selection knobs

Which existing points to pull for labeling, evaluation, or training.

Coverage radius: wide

How widely picks spread across the embedding space.

Density target: boost sparse

How hard to oversample low-density frontier regions vs the dense head.

Uncertainty cutoff: low-conf

Pull points where the model is least confident near the boundary.

Novelty distance: far from gold

Pull points far from the current gold/training set, that is, from the drift frontier.

Attribute quotas: on

Minimums per language, channel, document type, length, segment.

Recency weight: recent-tilt

Favor recent traffic so emerging patterns surface early.
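The selection knobs above could be expressed as a small config plus a scoring rule. The sketch below is illustrative only: `SelectionKnobs`, `Candidate`, the linear `priority` score, and every field name are hypothetical, and the page does not describe how EKIP actually combines these signals.

```python
from dataclasses import dataclass

@dataclass
class SelectionKnobs:
    # All names and defaults are hypothetical stand-ins for the catalog entries.
    density_boost: float = 1.0       # weight on sparsity (low-density frontier)
    uncertainty_weight: float = 1.0  # weight on low model confidence
    novelty_weight: float = 1.0      # weight on distance from the gold set
    recency_weight: float = 0.5      # tilt toward recent traffic

@dataclass
class Candidate:
    doc_id: str
    language: str
    sparsity: float     # 0..1, higher means a sparser region
    uncertainty: float  # 0..1, for example 1 - model confidence
    novelty: float      # 0..1, normalized distance to nearest gold point
    recency: float      # 0..1, 1.0 is the newest traffic

def priority(c, knobs):
    """Linear blend of the four signals; one of many possible scoring rules."""
    return (knobs.density_boost * c.sparsity
            + knobs.uncertainty_weight * c.uncertainty
            + knobs.novelty_weight * c.novelty
            + knobs.recency_weight * c.recency)

def select(candidates, knobs, budget, min_per_language=None):
    ranked = sorted(candidates, key=lambda c: priority(c, knobs), reverse=True)
    chosen = []
    # Pass 1: honor attribute quotas, best-first within each language.
    for language, minimum in (min_per_language or {}).items():
        chosen.extend([c for c in ranked if c.language == language][:minimum])
    chosen = chosen[:budget]
    # Pass 2: spend the remaining budget on the highest-priority leftovers.
    taken = {c.doc_id for c in chosen}
    for c in ranked:
        if len(chosen) >= budget:
            break
        if c.doc_id not in taken:
            chosen.append(c)
            taken.add(c.doc_id)
    return chosen

pool = [
    Candidate("en-1", "en", 0.1, 0.1, 0.1, 0.2),
    Candidate("en-2", "en", 0.2, 0.6, 0.2, 0.5),
    Candidate("es-1", "es", 0.7, 0.4, 0.5, 0.3),
    Candidate("es-2", "es", 0.3, 0.2, 0.2, 0.1),
    Candidate("new-1", "en", 0.9, 0.8, 0.9, 1.0),
]
no_quota = [c.doc_id for c in select(pool, SelectionKnobs(), budget=3)]
with_quota = [c.doc_id for c in select(pool, SelectionKnobs(), budget=3,
                                       min_per_language={"es": 2})]
```

Turning the quota on changes what the same budget buys: without it the three highest-scoring points win, and with a Spanish minimum of two the lower-scoring Spanish point displaces an English one.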

Creation knobs

How to manufacture new points where real ones are too sparse to collect.

Augmentation strength: moderate

Paraphrase / perturbation intensity applied to real examples.

Synthetic ratio: 30% synth

Share of generated data mixed with real in the corpus.

Target region: weak slices

Which sparse / underperforming region to generate into.

Difficulty level: adversarial

How hard and boundary-pushing the generated edge cases are.

Attribute conditioning: ES · long

Generate for a specific language, segment, or document type.

Label source: model→human

Model proposes, expert adjudicates vs human-from-scratch.
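The synthetic ratio is easy to misread as "add 30% more data". Taken as a share of the final corpus, the arithmetic looks like the sketch below. The helper name `synthetic_count` and the 700-example figure are assumptions for illustration.

```python
def synthetic_count(n_real, synthetic_ratio):
    """Generated examples needed so they make up `synthetic_ratio` of the final corpus.

    Solves s / (n_real + s) = ratio for s, which gives s = n_real * ratio / (1 - ratio).
    """
    if not 0 <= synthetic_ratio < 1:
        raise ValueError("synthetic_ratio must be in [0, 1)")
    return round(n_real * synthetic_ratio / (1 - synthetic_ratio))

# Hypothetical corpus: 700 real examples at the illustrative 30% setting.
n_real = 700
n_synth = synthetic_count(n_real, 0.30)
share = n_synth / (n_real + n_synth)
```

At a 30% setting, 700 real examples call for 300 generated ones, not 210, because the ratio is measured against the mixed corpus rather than against the real data alone.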

Control knobs

How the loop measures, budgets, and governs itself.

Slice granularity: fine

How finely to cut subsets when computing accuracy.

Minimum-n per slice: ≥ 30

Sufficiency floor before a slice's number is trusted.

Risk weighting: impact > freq

Weight slices by business impact, not just how often they occur.

Budget split: eval / FT

Allocation across evaluation, gold-refresh, and fine-tuning.

Drift sensitivity: high

How much distribution shift triggers re-evaluation and refresh.

Automation threshold: per slice

Accuracy floor for auto-decide vs routing a slice to humans.
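Risk weighting is the knob whose effect is easiest to show with numbers. In the sketch below the slice accuracies come from the table earlier on this page, but the traffic shares and impact weights are invented for illustration, and `weighted_accuracy` is a hypothetical helper.

```python
def weighted_accuracy(slices, weight_key):
    """Average slice accuracy, weighted by the chosen column."""
    total = sum(s[weight_key] for s in slices)
    return sum(s["accuracy"] * s[weight_key] for s in slices) / total

# Accuracies are from the slice table; traffic and impact are made-up weights.
slices = [
    {"name": "English · billing", "accuracy": 0.96, "traffic": 0.80, "impact": 0.40},
    {"name": "Spanish · billing", "accuracy": 0.78, "traffic": 0.12, "impact": 0.30},
    {"name": "Long transcripts > 10 min", "accuracy": 0.71, "traffic": 0.08, "impact": 0.30},
]

by_frequency = weighted_accuracy(slices, "traffic")
by_impact = weighted_accuracy(slices, "impact")
```

Under these assumed weights the frequency-weighted figure stays close to the headline, while the impact-weighted figure drops noticeably, because the weak slices count for more than their traffic share suggests.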

Why it stays accurate

The gold and fine-tuning sets stay live, so drift is caught, not discovered by the customer.

The knobs run continuously. A drift monitor watches the production embedding distribution; when a region grows or shifts, it pulls new points, re-measures, refreshes the gold set, and queues targeted fine-tuning before the change reaches users.

The same selection and creation knobs that build the first evaluation also keep it current. Drift becomes a signal that triggers the loop, instead of a surprise that surfaces as a support ticket.
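A drift monitor of this kind can be sketched with the Population Stability Index (PSI) over region assignments. PSI is a swapped-in, commonly used statistic; the page does not say which measure EKIP uses. The region names, counts, and the `min_growth` trigger below are all hypothetical.

```python
import math
from collections import Counter

def region_shares(assignments, regions, eps=1e-6):
    """Share of traffic per region, floored at eps so the log stays defined."""
    if not assignments:
        raise ValueError("assignments must not be empty")
    counts = Counter(assignments)
    return {r: max(counts.get(r, 0) / len(assignments), eps) for r in regions}

def population_stability_index(reference, production):
    regions = sorted(set(reference) | set(production))
    ref = region_shares(reference, regions)
    prod = region_shares(production, regions)
    return sum((prod[r] - ref[r]) * math.log(prod[r] / ref[r]) for r in regions)

def growing_regions(reference, production, min_growth=2.0):
    """Regions whose production share is at least min_growth times the reference share."""
    regions = sorted(set(reference) | set(production))
    ref = region_shares(reference, regions)
    prod = region_shares(production, regions)
    return [r for r in regions if prod[r] / ref[r] >= min_growth]

# Hypothetical region assignments (for example, nearest cluster in embedding space).
reference = (["billing_en"] * 800 + ["billing_es"] * 150
             + ["long"] * 44 + ["new_product"] * 6)
production = (["billing_en"] * 700 + ["billing_es"] * 150
              + ["long"] * 50 + ["new_product"] * 100)

psi = population_stability_index(reference, production)
to_relabel = growing_regions(reference, production)
```

A commonly cited rule of thumb treats PSI above roughly 0.2 as a significant shift, though the right trigger depends on the drift-sensitivity setting. In this toy data almost all of the shift comes from one region growing from 0.6% to 10% of traffic, which is the region the loop would pull new points from first.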

Positioning

Stop reporting accuracy. Start controlling it.

This is DataKnobs at its most literal: the data points you evaluate, the data points you create, and the thresholds you govern by are all knobs, turned deliberately and re-tuned as the world shifts. The result is an evaluation that tells the truth and a model that doesn't drift out from under its own benchmark.

Next best action

Run one model through the engine: intelligent sample, sliced-accuracy map, the drift frontier, and a starter knob configuration for the gold and fine-tuning sets.
