
Running End-to-End Tests for ML Features

Carrying an AI model into production is, of course, not the same as getting traditional software into production (you’ve probably read what Google has to say about it, but just in case…). In ML pipelines, the entire infrastructure is different: many teams are involved in the process, the data is central, performance is measured with ML-specific KPIs, and debugging is not a matter of stepping through ‘if/else’ code.

The unique nature of AI and the challenges it presents lead many Data Science teams to believe they should rethink production readiness, if not reinvent the readiness wheel from scratch. In reality, however, they could benefit from a closer look at some of the strong traditional software delivery methodologies that have evolved over the past decade.

In this post, we’ll discuss Testing for Reliability, one such classic method, and see how it can be adopted effectively for quality assurance of one of the primary AI building blocks: the features.

The Prediction Cross-Sell Challenge

Let’s borrow this health insurance example from Kaggle to illustrate the points in this post. An insurance company that provides health insurance to its clients is now looking into offering them vehicle insurance as well.

This classic cross-sell opportunity requires a new model to predict business results. As we can see, this is a binary classification model (a good fit for vehicle insurance or not), and it relies on the following tabular data. Fortunately, the raw data provides several good features that we can use.

Source: https://www.kaggle.com/anmolkumar/vehicle-insurance-eda-lgbm-vs-catboost-85-83

As we can see, the data consists of diverse types (categorical, numerical, binary, etc.); thus, the features will need to be tested separately, in different ways.

Is Success in the Eye of the Beholder?

Before starting to define tests and success criteria, it’s worth considering the different data perspectives of your stakeholders. More often than not, data is measured differently across the organization, so the need to align success criteria shouldn’t be overlooked.

In our insurance example, the data scientist is certain that outliers should be treated in a specific manner. The product manager, however, might have a different view due to her familiarity with specific segments representing different pillars of customers’ profiles, and she would expect the model to reflect this ‘business reality’. The engineering team transforms the features into production code almost blindly, implementing transformations according to the baseline defined by the data science department. The potential misalignments go on and on.

So the first action the insurance data science team should take, before planning any tests, is to collect input from all the relevant functions in the organization. It’s good practice to understand what each stakeholder expects of the data and to apply a few quick tests to make sure that everyone’s KPIs are covered and that the data is guardrailed to ensure that coverage.
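As a rough illustration, here is a minimal sketch of such quick guardrail checks in pandas. The column names (Age, Driving_License, Annual_Premium, Response) come from the Kaggle dataset, while the file name and the stakeholder_expectations mapping are hypothetical examples:

```python
import pandas as pd

# Quick stakeholder "guardrail" checks on the raw cross-sell data.
df = pd.read_csv("train.csv")  # hypothetical file name

stakeholder_expectations = {
    # Data science: the target and key features must be present and non-null
    "required_columns": ["Age", "Driving_License", "Annual_Premium", "Response"],
    # Product: the target and license flag must stay binary
    "binary_columns": ["Response", "Driving_License"],
}

for col in stakeholder_expectations["required_columns"]:
    assert col in df.columns, f"missing column: {col}"
    assert df[col].notna().all(), f"nulls found in: {col}"

for col in stakeholder_expectations["binary_columns"]:
    assert set(df[col].unique()) <= {0, 1}, f"{col} is not binary"

print("all stakeholder guardrail checks passed")
```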

What Are We Testing?

Once expectations of the data have been clarified, it’s time to prepare the data so you can dive into test implementation. Let’s have a look at one such data example that can drive many related tests; hopefully, you will find it relatable to your own models.

In the case of our insurance company, there was an unusual correlation between the age of health insurance customers and their driver’s license status:

  1. According to the CRM, potential vehicle insurance buyers should be 18–85 years old
  2. Most of the company’s health insurance customers were young
  3. Younger customers were more likely to hold a driver’s license than older ones

Source: https://www.kaggle.com/anmolkumar/vehicle-insurance-eda-lgbm-vs-catboost-85-83

0+ Tolerance

With so many tests available to implement, you want to be careful to apply the ones that provide the best coverage. Data, unlike traditional code, should be treated with tolerance for slight changes and differences. Choosing the right tests will eliminate the all-too-common spurious failures, where a test run fails even though nothing is actually wrong.

Here are a few examples of tests that are likely to be productive (a few of them are sketched in code after the list):

  1. Validating that missing values are imputed as expected (mean / default value)
  2. Testing the correlation between features and target variables (covariance)
  3. Ascertaining whether the value range is between x and y, with z% error tolerance
  4. Validating that feature values are not null and/or are within an acceptable range
  5. Testing whether each value matches a specific regex
  6. Checking that outliers are scaled/clipped
  7. Checking that numeric features are scaled correctly
  8. Checking that the standard deviation / mean / median is between x and y
  9. Checking the expected distributions of curated features

These tests are relatively basic but constitute a decent head start for making testing an integral part of your process.
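To make a few of these concrete, here is a minimal pandas sketch of tests #3, #4 and #8 above (a value range with an error tolerance, non-null validation, and a bound on a summary statistic). The thresholds and file name are hypothetical; the column names come from the Kaggle dataset:

```python
import pandas as pd

def check_no_nulls(series: pd.Series) -> bool:
    """Feature values are not null (test #4)."""
    return series.notna().all()

def check_range_with_tolerance(series: pd.Series, low, high, tolerance=0.0) -> bool:
    """Values fall between `low` and `high`, allowing a small fraction
    (`tolerance`) of rows to violate the range (test #3)."""
    violation_rate = ((series < low) | (series > high)).mean()
    return violation_rate <= tolerance

def check_std_between(series: pd.Series, low, high) -> bool:
    """The standard deviation lies between `low` and `high` (test #8)."""
    return low <= series.std() <= high

# Illustrative usage; the thresholds are made-up examples, not recommendations.
df = pd.read_csv("train.csv")
assert check_no_nulls(df["Annual_Premium"])
assert check_range_with_tolerance(df["Annual_Premium"], 2_000, 150_000, tolerance=0.02)
assert check_std_between(df["Age"], 10, 25)
```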

For the Age feature example, you can try to implement a few tests to support stakeholder expectations (see the sketch after the list):

  1. Age value range test: age must be between 18 and 85.
  2. Age distribution test: 50% of the data points are in the 24–36 age range, with a 6% tolerance.
  3. Age & driver’s license test: 95% of customers aged 20–40 have a driver’s license, 50% of those aged 40–60 have one, and 70% of those aged 60–85 do not, with a 7% tolerance.

It’s Time to Test

Instead of building each test from scratch, a few great tools can come to your aid. Start warming up with TensorFlow Data Validation and Great Expectations. In my next post, I will elaborate on how these tools can be used to maximum benefit.
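As a small taste, here is a sketch of the two age tests above in Great Expectations, using the classic pandas-dataset API (method names and return types vary between versions, so treat this as an outline rather than copy-paste code):

```python
import great_expectations as ge

# Classic (pre-1.0) Great Expectations API: wrap the raw CSV so expectation
# methods can be called directly on the dataframe.
df = ge.read_csv("train.csv")  # hypothetical file name

# The age range test from above, expressed as an expectation:
result = df.expect_column_values_to_be_between("Age", min_value=18, max_value=85)
print(result.success)

# The distribution test, approximated with `mostly`: at least 44% of rows
# (50% minus the 6% tolerance) must fall in the 24-36 range.
result = df.expect_column_values_to_be_between(
    "Age", min_value=24, max_value=36, mostly=0.44
)
print(result.success)
```

Follow me for more ML testing & monitoring inspiration and assistance.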
