Orthogonal Data Knobs
In many domains, companies have to run thousands of experiments to find plausible candidates. A data science team has to code each experiment, but a team can typically manage only 5 to 10, or perhaps 50, experiments. Running hundreds of experiments and comparing the resulting models becomes unmanageable. Moreover, the experiments stay hidden on the data scientist's desk. When that person leaves and a new team member joins, the whole process starts over.
Problems that require a large number of experiments should be managed through dials, or knobs. With knobs, data scientists can apply their statistics and domain knowledge to validate or invalidate hypotheses. The outcome of every experiment is recorded, so even when the results are not fruitful, they add to the knowledge base.
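As a minimal sketch of what "recording the outcome" could look like, the snippet below keeps one record per run, including runs that did not pan out. The class names (ExperimentRecord, ExperimentLog), the knob names, and the metric values are all hypothetical and chosen only for illustration; they are not part of any existing system.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ExperimentRecord:
    knobs: dict      # knob name -> chosen setting
    metric: float    # outcome of the run, e.g. a validation score
    note: str = ""   # hypothesis or observation in free text


class ExperimentLog:
    """In-memory log that keeps every run, including unsuccessful ones."""

    def __init__(self):
        self.records = []

    def add(self, knobs, metric, note=""):
        self.records.append(ExperimentRecord(dict(knobs), metric, note))

    def best(self):
        # Highest metric wins; assumes a larger metric is better.
        return max(self.records, key=lambda r: r.metric)

    def to_jsonl(self):
        # One JSON object per line, so the log can be shared and versioned.
        return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in self.records)


# Illustrative values only.
log = ExperimentLog()
log.add({"scaler": "zscore", "model": "tree"}, 0.71, "baseline")
log.add({"scaler": "minmax", "model": "tree"}, 0.64, "hypothesis rejected")
print(log.to_jsonl())
```

Because the rejected run stays in the log, a new team member can see which settings were already tried and why they were dropped.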
Knobs for experimentation
We can define the experimentation problem as follows: we are given a pool of preprocessing methods, feature transformations, ML algorithms, and hyperparameters. Our goal is to select the combination of these data processing and ML methods that produces the best model result for a given data set.
The system should deal with the messiness and complexity of data, automate feature selection, and select a machine learning (ML) algorithm to train a model. It should do so efficiently and robustly, and account for constraints not only on accuracy but also on memory, compute time, data requirements, etc.
Because data patterns will continue to change and you want data scientists to make the decisions (features, model parameters), we define a solution in which data scientists can interact and explore in a semi-automated manner using orthogonal dials, or knobs.
Orthogonal knobs are dials that a data scientist or domain expert can tune. They can choose different features or normalize a feature in a different manner, and they can choose a different algorithm or a different loss function.
They are similar to model hyperparameters, but model hyperparameters apply only to model algorithms and depend on the algorithm's code.
The philosophy behind data knobs is that they are parameters generated bottom-up, based on the data and how the data is used in the process. They are a superset of hyperparameters, as they let you choose features, feature transformations, data sources, loss function computations, etc.
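The selection problem described above can be sketched as an exhaustive search over a small pool. This is only an illustration under stated assumptions: the pool contents (KNOB_POOL) are made up, and evaluate is a hand-written placeholder score, where a real system would train and validate a model for each setting.

```python
from itertools import product

# Hypothetical pool: each key is one knob, each list its allowed settings.
KNOB_POOL = {
    "preprocessing": ["none", "drop_missing"],
    "transform": ["identity", "log"],
    "algorithm": ["linear", "tree"],
    "regularization": [0.0, 0.1],
}


def evaluate(setting):
    """Placeholder score; stands in for training and validating a model."""
    score = 0.5
    score += 0.1 if setting["preprocessing"] == "drop_missing" else 0.0
    score += 0.2 if setting["transform"] == "log" else 0.0
    score += 0.1 if setting["algorithm"] == "tree" else 0.0
    score -= setting["regularization"]
    return score


def search(pool, score_fn):
    """Try every combination of knob settings and keep the best one."""
    names = list(pool)
    best_setting, best_score = None, float("-inf")
    for values in product(*(pool[n] for n in names)):
        setting = dict(zip(names, values))
        s = score_fn(setting)
        if s > best_score:
            best_setting, best_score = setting, s
    return best_setting, best_score


best_setting, best_score = search(KNOB_POOL, evaluate)
print(best_setting, round(best_score, 3))
```

Exhaustive search is only practical for small pools, since the number of combinations is the product of the number of settings per knob. Larger pools would need random or guided search.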
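One simple way to account for constraints beyond accuracy is to filter out candidates that exceed a resource budget before comparing accuracy. The sketch below assumes each candidate has already been measured. The Candidate fields, the budgets, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    accuracy: float
    memory_mb: float
    train_seconds: float


def select(candidates, max_memory_mb, max_train_seconds):
    """Return the most accurate candidate within both budgets, or None."""
    feasible = [
        c for c in candidates
        if c.memory_mb <= max_memory_mb and c.train_seconds <= max_train_seconds
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c.accuracy)


# Illustrative measurements only.
candidates = [
    Candidate("large_ensemble", 0.93, 4096, 3600),
    Candidate("small_tree", 0.88, 256, 40),
    Candidate("linear", 0.84, 64, 5),
]
chosen = select(candidates, max_memory_mb=512, max_train_seconds=60)
print(chosen.name)
```

With these budgets the most accurate candidate is excluded for exceeding both limits, so the choice falls to the best model that fits.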
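To make the "superset of hyperparameters" idea concrete, the sketch below groups model hyperparameters as just one field among several knobs. The Knobs class, its field names, and the default values are hypothetical. The point is only that each knob can be turned without touching the others.

```python
from dataclasses import dataclass, field, replace


@dataclass
class Knobs:
    # Data-level knobs, which hyperparameters alone do not cover.
    data_sources: list = field(default_factory=lambda: ["source_a"])
    features: list = field(default_factory=lambda: ["f1", "f2"])
    transforms: dict = field(default_factory=lambda: {"f1": "log"})
    loss: str = "squared_error"
    # Classic model hyperparameters are one knob group among several.
    hyperparameters: dict = field(default_factory=lambda: {"max_depth": 3})


base = Knobs()
# Turn a single knob; every other setting is carried over unchanged.
variant = replace(base, loss="absolute_error")
print(variant)
```

Changing one field at a time keeps experiments comparable, because any difference in results can be attributed to that single knob.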
The problem can be mathematically represented as:
Constraints: data scientist time, accuracy, etc.
Consider a vector θ. It includes all possible operations on data (e.g. ingestion, transformation, feature engineering, modeling, hyperparameter tuning).
θ = [θ₁, θ₂, …, θₙ]
Note: For simplicity, we can consider all θᵢ as simple element operations. In elaborate settings, trees and graphs can be used to represent dependencies/hierarchy of operations.
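One possible way to write the selection problem in terms of θ is as a constrained optimization. The symbols below other than θ are assumptions introduced for illustration: Θ is the set of allowed knob settings, D is the data set, M is the chosen model quality metric, c_j are cost functions (for example data scientist time or compute time), and b_j are the corresponding budgets.

```latex
\theta^{*} = \arg\max_{\theta \in \Theta} \; M(\theta; D)
\quad \text{subject to} \quad
c_j(\theta) \le b_j, \qquad j = 1, \dots, m
```

Under this reading, a requirement such as a minimum accuracy can be written as one more constraint on θ, alongside the time and resource budgets.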
Refined Problem Statement
We can restate the problem as follows: we have a pool of preprocessing methods, feature transformation methods, ML algorithms, and hyperparameters. The goal is to select the combination of knobs that produces the best results, and to identify these knobs so that different settings can be used when the data pattern changes.
Once we define the θ vector, it simplifies modeling and data science work. Data scientists and domain experts can focus on validating hypotheses; they do not have to worry about whether someone took a shortcut in feature transformation or made a mistake.
You get the following benefits:
Differential privacy blog
Learn about differential privacy at the Differential privacy blog
Learn about algorithms: K-Anonymization, T-Closeness, L-Diversity, Delta-Presence
Learn about a framework to apply differential privacy using data knobs