Shifting Paradigms

Software vs. Data Products

Treating a Data Product like a Software Product just because both are called "products" is a sure way to set yourself up for failure. Data brings its own challenges: massive state, statistical unpredictability, and intricate lineage.

Software Product vs Data Product Differences

A Fundamental Divergence

Building a strong Data Product takes more than Software Engineering principles alone. The comparisons below show where the two disciplines diverge.

Software Product

Code is the Asset

The code sets the rules for behavior, while state is often temporary or stored in external transactional databases.

Data Product

Data is the Asset

The code is simply a tool, while the extensive, historically collected data is the true output.

Software Product

Deterministic Testing

We rely on unit and integration tests to ensure that inputs and outputs are consistently predictable.

Data Product

Probabilistic Testing

Data shapes keep changing, so testing has to rely on anomaly detection, schema contracts, and statistical bounds rather than exact expected outputs.
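As a rough illustration of a statistical bound, the sketch below compares one metric from the current run against its own history and flags values more than k standard deviations from the mean. The function name `within_bounds`, the default `k=3.0`, and the row counts are assumptions made for this example, not values from any particular tool.

```python
from statistics import mean, stdev


def within_bounds(history, observed, k=3.0):
    """Return True if `observed` is within k standard deviations of the historical mean."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs at least two points
    return abs(observed - mu) <= k * sigma


# Hypothetical daily row counts from earlier pipeline runs.
daily_row_counts = [1000, 1020, 980, 1010, 990]

print(within_bounds(daily_row_counts, 1005))  # a typical day
print(within_bounds(daily_row_counts, 400))   # a sudden drop in volume
```

Unlike a unit test, this check has no single correct answer: the acceptable range moves as the history grows, and the threshold is a tuning decision.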

Software Product

Loud Failures

Software typically breaks by throwing an exception, crashing, or returning a 500 error, resulting in an immediate and visible failure.

Data Product

Silent Failures

Pipelines often run to completion without a single error while the data itself drifts, fills with nulls, or gets duplicated. The result is hidden bugs that quietly corrupt downstream machine learning and business intelligence.
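A minimal sketch of how two of these silent failures could be surfaced, assuming rows arrive as plain dictionaries. The helpers `null_rate` and `duplicate_keys` and the sample rows are hypothetical names and data chosen for illustration.

```python
from collections import Counter


def null_rate(rows, column):
    """Fraction of rows in which `column` is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def duplicate_keys(rows, key):
    """Key values that occur more than once, in first-seen order."""
    counts = Counter(row.get(key) for row in rows)
    return [value for value, n in counts.items() if n > 1]


# Hypothetical batch: the job that produced it finished without any error.
rows = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": None},
    {"order_id": 2, "amount": None},
    {"order_id": 3, "amount": 7.5},
]

print(null_rate(rows, "amount"))         # half of the amounts are missing
print(duplicate_keys(rows, "order_id"))  # order 2 was loaded twice
```

Nothing in this batch would raise an exception on its own, which is why the checks have to be explicit.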

Software Product

Easily Replicable

Setting up a staging environment is simple. All you need to do is deploy the code to a fresh container with simulated data.

Data Product

Heavy State

Moving petabytes of production data into a development environment is not simple. It usually takes careful sampling or zero-copy cloning.
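One sampling approach, sketched here under stated assumptions, is to hash a shared key so that the same entities are selected in every table and the sample stays joinable. The function `in_sample`, the 10% rate, and the customer and order tables are hypothetical. Zero-copy cloning is a platform feature and is not shown.

```python
import hashlib


def in_sample(key, percent):
    """Deterministically decide whether `key` falls into a `percent`% sample."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < percent


# Hypothetical tables that share the key `customer_id`.
customers = [{"customer_id": i} for i in range(1000)]
orders = [{"order_id": i, "customer_id": i % 1000} for i in range(5000)]

sampled_customers = [c for c in customers if in_sample(c["customer_id"], 10)]
sampled_orders = [o for o in orders if in_sample(o["customer_id"], 10)]

print(len(sampled_customers), len(sampled_orders))
```

Because the decision depends only on the key, every sampled order still has its customer in the sample, which independent random sampling of each table would not guarantee.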

Operational Reality

Beyond CI/CD: The Need for Continuous Data Validation

Software engineering relies on Continuous Integration and Continuous Deployment (CI/CD) for code. Data Products need an additional dimension, because data flows constantly and changes independently of the code.

Continuous Data Validation (CDV) fills that gap: on every pipeline run, the shape, volume, and statistical distribution of the data are validated against predefined contracts before anything reaches the output ports.
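A validation gate of this kind could be sketched as follows. The `Contract` fields, the `validate` and `publish_if_valid` functions, and the sample batch are assumptions made for illustration; real data contract tooling will differ in shape and detail.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Contract:
    """Hypothetical contract covering shape, volume, and distribution."""
    required_columns: dict  # column name -> expected Python type
    min_rows: int
    max_rows: int
    mean_bounds: dict       # column name -> (low, high) for the batch mean


def validate(rows, contract):
    """Return a list of violations; an empty list means the batch passes."""
    violations = []
    # Volume: is the batch size plausible?
    if not contract.min_rows <= len(rows) <= contract.max_rows:
        violations.append(f"volume: {len(rows)} rows")
    # Shape: does every row carry the expected columns and types?
    for i, row in enumerate(rows):
        for col, typ in contract.required_columns.items():
            if not isinstance(row.get(col), typ):
                violations.append(f"shape: row {i}, column {col!r}")
    # Distribution: is the batch mean inside the agreed range?
    for col, (low, high) in contract.mean_bounds.items():
        values = [r[col] for r in rows if isinstance(r.get(col), (int, float))]
        if values and not low <= mean(values) <= high:
            violations.append(f"distribution: mean of {col!r} is {mean(values)}")
    return violations


def publish_if_valid(rows, contract, publish):
    """Call `publish` only when the batch satisfies the contract."""
    violations = validate(rows, contract)
    if violations:
        raise ValueError("; ".join(violations))
    publish(rows)


contract = Contract(
    required_columns={"order_id": int, "amount": float},
    min_rows=2,
    max_rows=1000,
    mean_bounds={"amount": (5.0, 50.0)},
)
good_batch = [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": 20.0}]
publish_if_valid(good_batch, contract, publish=print)
```

The gate turns a silent data problem into a loud failure: a batch that violates the contract raises before it can reach consumers.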

Lifecycle Evolution

Continuous Integration

Testing pipeline logic and SQL syntax.

Continuous Deployment

Deploying Airflow DAGs and dbt models.

Continuous Data Validation

Checking data contracts, null rates, and distribution drift at runtime.

Transition Your Engineering Approach

Stop applying purely software patterns to data problems. Embrace DataOps to manage the state, testing, and lifecycle of your Data Products.
