The Data Quality Problem

One of the central challenges in data science (DS) and machine learning (ML) is managing and maintaining data quality. As an ML engineer and practitioner who frequently constructs, cleans, explores, and models proprietary (i.e. non-benchmark) datasets, I've found that "bad data" makes the difference between accurate and misleading data visualizations, statistical inferences, and models. In this article I want to home in on three problems that are fairly unique to DS/ML practice when it comes to dealing with tabular data in the Python ecosystem using pandas, one of the de facto tools for data manipulation in the DS/ML toolchain.

Data Integrity Errors Fail Silently

Tooling for type safety is improving in the Python ecosystem with the broad adoption of the typing module and projects like mypy, which make it easier to write readable, reliable code.

However, for most DS/ML work this isn't quite sufficient: logical data types don't capture the statistical distributions of the variables under study, and capturing those distributions is key when, for example, your data distribution shifts as a result of a world-wide pandemic, causing ML models to break in unexpected ways.

Having systems in place that fail early (and loudly 🔊) when the data distribution is not what you assumed is one of the critical pieces of building reliable production systems, and we need better tooling for this.
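
Even without dedicated tooling, you can encode a distributional assumption as an explicit runtime check that raises immediately. Here's a minimal sketch (the check_mean_within helper and the transaction data are hypothetical, just for illustration):

import pandas as pd


def check_mean_within(series: pd.Series, lower: float, upper: float) -> None:
    # fail loudly if the sample mean drifts outside the assumed range
    mean = series.mean()
    if not lower <= mean <= upper:
        raise ValueError(f"distribution check failed: mean {mean:.2f} not in [{lower}, {upper}]")


# hypothetical example: transaction amounts that we expect to average between 20 and 80
transactions = pd.Series([25.0, 40.0, 32.5, 18.0, 55.0])
check_mean_within(transactions, lower=20, upper=80)  # passes; raises as soon as the assumption breaks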

Testing Software is Hard

Another important tool in the developer's arsenal is testing. There are many books, articles, and discussions available online about different types of testing techniques and how to put them into practice. I won't dive deeply into it here, but in short, testing your code makes it easier to change it and know when you've broken something while at the same time serving as documentation.

Even in exploratory or research contexts, it's a good idea to write tests for your code because it strengthens your confidence in the robustness of the insights that you're taking away from your analysis.

Testing Software and Statistical Distributions is Even Harder

The challenge with testing software compounds when processing data for the purpose of statistical analysis and modeling.

Consider a machine learning pipeline that creates a predictive model from survey responses. The barriers to testing the data and transformation code tend to be much higher than for the business logic that processes survey responses and stores raw values in a database, because the latter tends to be simpler and more atomic by design.

By atomic ⚛ I mean that each piece of data filled out by respondents and stored in the database can be tested in isolation, without having to analyze aggregate statistical patterns across a larger sample of responses. On the other hand, for my statistical analysis to make sense, the overall integrity of the statistical distribution 📊 of the responses needs to be taken into account.
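
To make the contrast concrete, here's a rough sketch (the survey fields and the 90% threshold are made up for illustration): the first check operates on a single response in isolation, while the second is only meaningful over an aggregate sample of responses.

import pandas as pd


def validate_response(response: dict) -> bool:
    # atomic check: a single response can be validated in isolation
    return response["age"] >= 18 and response["satisfaction"] in {1, 2, 3, 4, 5}


def validate_satisfaction_distribution(responses: pd.DataFrame) -> bool:
    # aggregate check: only meaningful across a sample of responses,
    # e.g. flag a broken survey widget if >90% of respondents pick the same score
    return responses["satisfaction"].value_counts(normalize=True).max() <= 0.9


responses = pd.DataFrame({"age": [25, 34, 41], "satisfaction": [4, 5, 3]})
assert validate_response(responses.iloc[0].to_dict())
assert validate_satisfaction_distribution(responses)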

Technical Debt: No Tests == Legacy Code

Because the effort that goes into exploring, cleaning, and figuring out how to test my dataset is so high, I'm discouraged from writing tests for my pipeline code. As the famous software development quip goes:

"legacy code is code without tests"

But you might think: "when I put my code in production, surely I – or one of my collaborators – will write tests then?". The thing is, regardless of who writes those tests or when they're written, someone'll have to do it at some point, so the sooner you start climbing the technical debt mountain 🏔 the better!

In the rest of this post I'll try to convince you that statistical typing gives us the tools we need to do just that.

What's Statistical Typing?

If you've used strong, statically-typed languages before, or the mypy static type checker with type-hinting in Python, you may have noticed that type definitions can often catch nasty type-related bugs 🐞 that render certain kinds of unit tests unnecessary. Other tools, like pydantic, enforce types at runtime via a data parsing model.
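
As a quick illustration of runtime type enforcement, here's a minimal pydantic sketch (the HouseRecord model is hypothetical): type errors surface when the record is parsed, not somewhere downstream.

from pydantic import BaseModel, ValidationError


class HouseRecord(BaseModel):
    square_footage: int
    n_bedrooms: int
    price: float


# values are parsed and coerced into the declared types at runtime
record = HouseRecord(square_footage="750", n_bedrooms=1, price=200000)

try:
    HouseRecord(square_footage="not a number", n_bedrooms=1, price=200000)
except ValidationError as exc:
    print(exc)  # the bad value is caught at parse time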

Beyond Logical Data Types

Statistical typing extends the concept of logical data types to the class of statistical data types and, ultimately, probability distributions. Statistical data types build on top of logical data types, and in fact there's considerable overlap between the two.

For example, the binary logical data type is also a statistical data type. The key difference is that statistical data types carry additional semantics that govern the kinds of statistical operations we can perform on variables of a particular type and the probability distributions that describe those variables.

Statistical Distributions as Schemas

What if you could specify the set of acceptable values that a variable can take, whether in terms of its data type, a set or range of values, or even the distribution that the variable is drawn from? This is the goal of statistical typing: to enumerate a practical set of constraints that specify what should be considered valid data for a particular dataset.

For example, we might want a categorical variable to be drawn somewhat uniformly from a set of values {A, B, C}. We can express this as a hypothesis test that causes our pipeline to fail if any one of the values occurs significantly more frequently than the others, given a pre-defined level of statistical significance.
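
As a sketch of what such a check might look like, here's a chi-square goodness-of-fit test using scipy (the helper name, the sample values, and the alpha level are assumptions for illustration):

import pandas as pd
from scipy import stats


def check_categorical_uniformity(series: pd.Series, alpha: float = 0.01) -> None:
    # chisquare defaults to a uniform expected distribution over the observed categories
    _, p_value = stats.chisquare(series.value_counts().values)
    if p_value < alpha:
        raise ValueError(f"categorical variable deviates from uniform (p={p_value:.4f})")


check_categorical_uniformity(pd.Series(["A", "B", "C", "A", "B", "C"]))  # passes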

Or we may want a real-valued variable to be drawn roughly from a normal distribution with mean µ and standard deviation σ, a requirement we can likewise test at a significance level (alpha) that we deem acceptable for a particular analysis.
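
A similar sketch for the normality assumption, using a Kolmogorov–Smirnov test from scipy (again, the helper and its parameters are just illustrative):

import numpy as np
from scipy import stats


def check_normal(values, mu: float, sigma: float, alpha: float = 0.01) -> None:
    # Kolmogorov-Smirnov test against a normal distribution with the assumed parameters
    _, p_value = stats.kstest(values, "norm", args=(mu, sigma))
    if p_value < alpha:
        raise ValueError(f"sample deviates from Normal({mu}, {sigma}) (p={p_value:.4f})")


rng = np.random.default_rng(42)
check_normal(rng.normal(loc=0, scale=1, size=100), mu=0, sigma=1)  # passes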

In essence, a statistical typing implementation involves specifying three kinds of metadata for a given set of variables:

  1. logical data types, e.g. int, str, float, etc.
  2. deterministic properties, e.g. categorical values and real-valued ranges
  3. probabilistic properties, e.g. sufficient statistics like mean and standard deviation

The challenge presented by item 3 is obvious: discovering the underlying probability distributions of real-world data is often non-trivial. However, even if we can only express or automatically infer these metadata up to item 2, we still get something quite powerful 💪: property-based testing of statistical analysis code.

Schemas as Generative Data Contracts 📜

With a statistically-typed dataframe, not only can we validate real-world data to ensure that our assumptions about them hold up, but we can also test our data transformation, analysis, and modeling code given valid samples according to our schema definition. Statistical typing effectively gives DS/ML practitioners the tools to easily isolate their code from real-world data, providing a convenient way of implementing unit tests.

Statistical Typing in Practice with pandera

Let me illustrate how these concepts would work in practice with a toy problem using pandera, a runtime data validation library for pandas dataframes that I've been developing over the last few years.

Suppose you're building a predictive model of house prices given features about different houses:

raw_data = """
square_footage,n_bedrooms,property_type,price
750,1,condo,200000
900,2,condo,400000
1200,2,house,500000
1100,3,house,450000
1000,2,condo,300000
1000,2,townhouse,300000
1200,2,townhouse,350000
"""

In the raw data above you can see that we have the following columns:

  • feature 1: square_footage
  • feature 2: n_bedrooms
  • feature 3: property_type
  • target: price

Our modeling pipeline will involve two steps:

def process_data(raw_data):  # step 1: prepare data for model training
    ...
    
def train_model(processed_data): # step 2: fit a model on processed data
    ...

Defining Schemas with pandera

At its core, pandera provides a flexible and expressive API for defining dataframe schemas and seamlessly integrating data validation logic into your data analysis pipelines, all while separating the concerns of data cleaning and validation.

import pandera as pa
from pandera.typing import Series


PROPERTY_TYPES = ["condo", "townhouse", "house"]


class BaseSchema(pa.SchemaModel):
    square_footage: Series[int] = pa.Field(in_range={"min_value": 0, "max_value": 3000})
    n_bedrooms: Series[int] = pa.Field(in_range={"min_value": 0, "max_value": 10})
    price: Series[float] = pa.Field(in_range={"min_value": 0, "max_value": 1000000})

    class Config:
        coerce = True


class RawData(BaseSchema):
    property_type: Series[str] = pa.Field(isin=PROPERTY_TYPES)


class ProcessedData(BaseSchema):
    property_type_condo: Series[int] = pa.Field(isin=[0, 1])
    property_type_house: Series[int] = pa.Field(isin=[0, 1])
    property_type_townhouse: Series[int] = pa.Field(isin=[0, 1])

In the code above, we define a BaseSchema containing the columns that are common to the raw and processed data. We're also making sure that the columns are coerced to the expected data types during validation.

RawData and ProcessedData inherit from BaseSchema, and just by looking at them we can see the difference that we expect between the raw and processed data: our process_data function should convert the property_type categorical variable into a set of dummy variables.

Runtime Data Testing

Now we use Python's type-hinting syntax to annotate the process_data and train_model functions, decorating them with @pa.check_types to make sure that the inputs and outputs are validated at runtime:

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame
from sklearn.linear_model import LinearRegression


@pa.check_types
def process_data(raw_data: DataFrame[RawData]) -> DataFrame[ProcessedData]:
    return pd.get_dummies(
        raw_data.astype({"property_type": pd.CategoricalDtype(PROPERTY_TYPES)})
    )

@pa.check_types
def train_model(processed_data: DataFrame[ProcessedData]):
    estimator = LinearRegression()
    targets = processed_data["price"]
    features = processed_data.drop("price", axis=1)
    estimator.fit(features, targets)
    return estimator

Now every time we run our pipeline our data is validated as it passes through the various transformations:

from io import StringIO


def run_pipeline(raw_data):
    processed_data = process_data(raw_data)
    estimator = train_model(processed_data)
    # evaluate model, save artifacts, etc...
    print("model training successful!")


run_pipeline(pd.read_csv(StringIO(raw_data.strip())))
model training successful!

So if we pass invalid data into run_pipeline, we should get an error:

invalid_data = """
square_footage,n_bedrooms,property_type,price
750,1,unknown,200000
900,2,condo,400000
1200,2,house,500000
"""

try:
    run_pipeline(pd.read_csv(StringIO(invalid_data.strip())))
except Exception as e:
    print(e)
error in check_types decorator of function 'process_data': <Schema Column: 'property_type' type=<class 'str'>> failed element-wise validator 0:
<Check isin: isin({'house', 'condo', 'townhouse'})>
failure cases:
   index failure_case
0      0      unknown

Here, pandera tells us exactly what went wrong: the property_type column contains an invalid category, unknown, at index 0.
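
As an aside, you can also validate a dataframe directly with the schema class. Here's a quick sketch, assuming pandera's lazy validation option, which collects all failure cases instead of raising on the first one:

invalid_df = pd.read_csv(StringIO(invalid_data.strip()))

try:
    RawData.validate(invalid_df, lazy=True)
except pa.errors.SchemaErrors as exc:
    print(exc.failure_cases)  # a dataframe summarizing every failed check, not just the first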

Property-based Testing

But wait, there's more! Since we've already defined our schemas, we can isolate the processing and model-training code from real-world data to test that each component in our pipeline is functioning as expected.

pandera builds on top of the hypothesis package to generate synthetic data from search strategies that try to find the simplest case that would falsify your tests:

import hypothesis

@hypothesis.given(RawData.strategy(size=3))
def test_process_data(raw_data):
    process_data(raw_data)

    
@hypothesis.given(ProcessedData.strategy(size=3))
def test_train_model(processed_data):
    estimator = train_model(processed_data)
    preds = estimator.predict(processed_data.drop("price", axis=1))
    assert len(preds) == processed_data.shape[0]

We could run this as a unittest or pytest suite, but for now we can just run the tests manually like so:

def run_test_suite():
    test_process_data()
    test_train_model()
    print("✅ tests successful!")
    
    
run_test_suite()
✅ tests successful!

So if we were to incorrectly implement any of the components in our pipeline, we'd see errors early on. To demonstrate, let's redefine process_data so that it just returns the raw data without the dummified property_type variable:

@pa.check_types
def process_data(raw_data: DataFrame[RawData]) -> DataFrame[ProcessedData]:
    return raw_data


try:
    run_test_suite()
except Exception as e:
    print(e)
Falsifying example: test_process_data(
    raw_data=   square_footage  n_bedrooms  price property_type
    0               0           0    0.0         condo
    1               0           0    0.0         condo
    2               0           0    0.0         condo,
)
error in check_types decorator of function 'process_data': column 'property_type_condo' not in dataframe
   square_footage  n_bedrooms  price property_type
0               0           0    0.0         condo
1               0           0    0.0         condo
2               0           0    0.0         condo

Here, our test suite catches the fact that property_type_condo doesn't exist in our processed data output.

We can get some more intuition about what's going on with the data synthesis strategies by interactively generating data using the example method.

RawData.example(size=3)

   square_footage  n_bedrooms          price property_type
0            2042           7  746370.512051     townhouse
1               9           5   40167.336181     townhouse
2             467           9    3707.778242     townhouse

ProcessedData.example(size=3)

   square_footage  n_bedrooms          price  property_type_condo  property_type_house  property_type_townhouse
0            1625           8  417650.777844                    0                    0                        0
1             317           0  788150.855590                    1                    1                        1
2            1937           7  710676.511681                    1                    0                        1

Under the hood, pandera collects all of the schema properties and converts them into a search strategy using the pandas-supported hypothesis strategies. Currently, one limitation you can see from the ProcessedData example above is that the generated data doesn't quite capture the joint distribution between the property_type_* dummy variables: the second row contains 1s for all of the property types. Depending on what exactly you're trying to test, this may or may not matter. Ultimately, it's still up to you to determine what to test and how 🤔.
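
If that joint constraint matters for what you're testing, one option is to encode it in the schema itself with a dataframe-level check. Here's a sketch (the OneHotProcessedData subclass is hypothetical, and whether the synthesis strategies respect custom checks depends on your pandera version):

class OneHotProcessedData(ProcessedData):

    @pa.dataframe_check
    def property_type_is_one_hot(cls, df: pd.DataFrame) -> Series[bool]:
        # exactly one property_type_* dummy variable should be active per row
        dummy_columns = [
            "property_type_condo", "property_type_house", "property_type_townhouse"
        ]
        return df[dummy_columns].sum(axis=1) == 1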

What's Next?

There's still a lot to do in pandera to fully realize the vision of statistical typing, but I think the main API ideas and features are already there to get started and reap the benefits:

  1. Runtime data validation when executing pipeline during development/production.
  2. Property-based unit testing by isolating transformation code from real data.
  3. Self-documenting pipelines that explicitly define the types and statistical properties of data as it flows through your pipeline.

There are also a few things on the roadmap that I'm excited about.

If you're interested in this project, please consider helping out with code contributions, feature requests, bug reports, documentation improvements, and support!