

Orthogonal Data Knobs

In many domains, companies have to run thousands of experiments to find plausible candidates. A data science team has to code each experiment, but a team can realistically manage only 5-10, or perhaps 50, experiments. Running hundreds of experiments and comparing the resulting models becomes unmanageable. Moreover, the experiments stay hidden behind the data scientist's desk: when they leave and a new person joins, the whole effort starts over again.

Problems that require a large number of experiments should be managed through dials, or knobs. Using knobs, data scientists can apply their statistics and domain knowledge to validate or invalidate hypotheses. The outcome of every experiment is recorded, so even when the results are not fruitful, they grow the knowledge base.

Knobs for experimentation

We can define the experimentation problem as follows: we are given a pool of preprocessing methods, feature transformations, ML algorithms, and hyperparameters. Our goal is to select the combination of these data-processing and ML methods that produces the best model for a given data set.

The system should deal with the messiness and complexity of data, automate feature selection, and select a machine learning (ML) algorithm to train a model. It should do so in a manner that is efficient and robust, and that considers constraints not only on accuracy but also on memory, compute time, data needs, etc.

Because data patterns will continue to change and you want data scientists to make the decisions about features and model parameters, we define a solution in which data scientists can interact and explore in a semi-automated manner using orthogonal dials, or knobs.

Orthogonal knobs are dials that a data scientist or domain expert can tune. They can choose different features or normalize features in different ways; they can choose a different algorithm or a different loss function.

Knobs are similar to model hyperparameters, but model hyperparameters apply only to the model algorithm and are tied to that algorithm's code.

The philosophy behind data knobs is that they are parameters generated bottom-up from the data and from how that data is used in the process. They are a superset of hyperparameters: they let you choose features, feature transformations, data sources, loss-function computations, and so on.
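As a minimal sketch, a single knob setting can be captured as plain configuration rather than code. Everything below is illustrative rather than an existing API; the point is that data-level choices sit alongside classic model hyperparameters as independent dials:

```python
# Hypothetical knob configuration: each key is one orthogonal dial.
# Data-level knobs (data source, features, transformation, loss) sit
# alongside classic model hyperparameters, which is why data knobs
# are a superset of hyperparameters.
knobs = {
    "data_source": "transactions_2023",       # which dataset to ingest
    "features": ["age", "income", "tenure"],  # which columns to use
    "normalization": "z-score",               # or "min-max", "none"
    "algorithm": "gradient_boosting",         # which ML algorithm to train
    "loss": "log_loss",                       # which loss function to optimize
    "hyperparameters": {"learning_rate": 0.1, "n_estimators": 100},
}
```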

The problem can be represented mathematically as follows:

Model (M)

Input
  • Dataset {Xᵢ, Yᵢ}
  • Objective function J(f) to evaluate model performance
  • Constraints: data scientist time, accuracy, etc.

Output
  • A trained model in the form y = f(x)
  • We can describe this in the form y = f(x; α)
  • where the set α = [α₀, α₁, α₂, …, αₙ] contains the parameters of the model

Processing

Consider a vector θ that includes all possible operations on data (e.g. ingestion, transformation, feature engineering, modeling, hyperparameter tuning):

θ = [θ₁, θ₂, …, θₙ]

Note: For simplicity, we treat each θᵢ as a simple element operation. In more elaborate settings, trees and graphs can be used to represent dependencies/hierarchies of operations.
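To make θ concrete, the sketch below instantiates one θ vector as a scikit-learn Pipeline (the use of scikit-learn and the specific steps are assumptions for illustration). Each pipeline step plays the role of one θᵢ, and fitting the pipeline executes the vector and produces the model parameters α:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# One concrete instantiation of the theta vector as an ordered list
# of operations: theta_1 is a feature transformation and theta_2 is
# the ML algorithm.
theta = Pipeline(steps=[
    ("scale", StandardScaler()),      # theta_1: feature transformation
    ("model", LogisticRegression()),  # theta_2: ML algorithm
])
# theta.fit(X, y) executes the operations; afterwards
# theta.named_steps["model"].coef_ holds part of the learned alpha.
```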
Refined Problem Statement

We can define the problem statement as follows: we have a pool of preprocessing methods, feature-transformation methods, ML algorithms, and hyperparameters. The goal is to select the combination of knobs that produces the best results, and to identify these knobs so that different settings can be used when the data pattern changes.

Goal
  • Efficiently find the set of elements in θ that produces the best α
  • Enable building the orthogonal knobs O

Steps to implement
  • Intelligently and efficiently determine a set of values in θ that will produce results
  • Automate execution of the θ vector to produce α, and evaluate the result
  • Enable creating higher-level θs and build the dial controls O[] (a sketch of these steps follows this list)

Once we define the θ vector, modeling and data science work become much simpler. Data scientists and domain experts can focus on validating hypotheses; they no longer have to worry about whether someone took a shortcut in a feature transformation or made a mistake.
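A minimal sketch of these steps, again assuming scikit-learn (the knob names, the grid, and the synthetic data are all assumptions): each knob setting deterministically builds a θ pipeline, the pipeline is executed and scored against J(f), and the outcome is recorded.

```python
import itertools

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Stand-in for the dataset {Xi, Yi}; any tabular dataset would do.
X, y = make_classification(n_samples=500, random_state=0)

# Orthogonal dials O: each dial can be turned independently of the others.
O = {
    "normalization": {"z-score": StandardScaler(), "min-max": MinMaxScaler()},
    "algorithm": {
        "logistic": LogisticRegression(max_iter=1000),
        "forest": RandomForestClassifier(n_estimators=50, random_state=0),
    },
}

results = []
for norm, algo in itertools.product(O["normalization"], O["algorithm"]):
    # Build the theta vector from knob settings alone; no per-run code changes.
    theta = Pipeline([
        ("scale", O["normalization"][norm]),
        ("model", O["algorithm"][algo]),
    ])
    score = cross_val_score(theta, X, y, cv=3).mean()  # evaluate J(f)
    results.append({"normalization": norm, "algorithm": algo, "score": score})

best = max(results, key=lambda r: r["score"])
print(best)
```

Because each run is fully described by its knob settings, any result can be reproduced by replaying the same settings.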

You get the following benefits:

  • Ability to run a large number of experiments. Most experiments do not require code changes; you only change knob settings.
  • Ability to run reproducible experiments.
  • Ability to log experiment outcomes in a meaningful manner, as a set of knobs plus the outcome. If someone in the organization has run an experiment before, others will know it, and the team builds on each other's experiments (a logging sketch follows this list).
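For the logging benefit, one record per experiment only needs the knob settings and the outcome. A minimal sketch, in which the record format and file layout are assumptions; hashing the knobs gives a stable experiment id, so anyone can check whether a combination was already tried:

```python
import hashlib
import json
from datetime import datetime, timezone

def log_experiment(knobs: dict, outcome: dict,
                   path: str = "experiments.jsonl") -> str:
    """Append one experiment record: the knob settings plus the outcome."""
    # Stable id: the same knob settings always hash to the same id.
    exp_id = hashlib.sha1(
        json.dumps(knobs, sort_keys=True).encode()
    ).hexdigest()[:12]
    record = {
        "id": exp_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "knobs": knobs,
        "outcome": outcome,  # e.g. {"accuracy": 0.91, "train_seconds": 12.4}
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return exp_id

# Hypothetical usage:
log_experiment({"normalization": "z-score", "algorithm": "forest"},
               {"accuracy": 0.91})
```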


Learn about differential privacy in the Differential privacy blog, which covers the algorithms K-Anonymization, T-Closeness, L-Diversity, and Delta Presence, as well as a framework for applying differential privacy using data knobs.