Data Validation in a Big Data Environment on the Example of Apache Spark and Great Expectations

Table of Contents
  • What is data validation?
  • Our use case: reviewing data with Great Expectations
  • What is Great Expectations?
  • Using Great Expectations
  • Checkpoints in Great Expectations
  • Profiling capabilities of Great Expectations
    • 1. Code
    • 2. CLI
  • Extensibility of Great Expectations
  • Final thoughts on data validation in a big data environment

Every minute in 2020, Facebook users shared 150,000 messages, LinkedIn users applied for 69,444 job offers, and Instagram users posted 347,222 stories, according to stats from Domo.

Taking into account the huge quantities of data being used every day, ensuring data quality is becoming increasingly important in today’s world.

Since data is said to be “the new oil,” efficiently validating it can determine whether your business idea becomes successful or not.

By reading this article, you will learn the answers to the following questions:

  • What is data validation?
  • What is Great Expectations and why should you care about it?
  • What are some basic Great Expectations concepts?
  • What is a sample use of Great Expectations?
  • How do you write custom Expectations using real-life examples?

What is data validation?

Data validation is the process of ensuring that data has undergone data cleansing and meets quality standards, meaning that it’s both correct and useful. Data validation relies on routines, often called “validation rules,” “validation constraints,” or “check routines.” They check the correctness, meaningfulness, and security of the data that is input to the system.

There are many types of data validation, but the most common are:

  • type validation—used to check whether data is of a given type, for instance, that a number is an int, not a float;
  • range and constraint validation, e.g. human height is smaller than 300 cm and greater than 0, because historically we can tell that the tallest person who has ever lived was shorter than 300 cm;
  • code and cross-reference validation—checking that a given value belongs to a set of allowed values, e.g. [a, b, c];
  • structured validation (more complex validation);
  • consistency validation, to check, for example, whether the date of sales happens before the date of shipping.
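
To make these types more concrete, here is a minimal, purely illustrative sketch of such rules written as plain Python checks (the record fields are made up for the example):

def validate_record(record):
    errors = []
    # Type validation: age should be an int, not a float or a string.
    if not isinstance(record.get("age"), int):
        errors.append("age must be an integer")
    # Range and constraint validation: human height in centimeters.
    if not 0 < record.get("height_cm", -1) < 300:
        errors.append("height_cm must be between 0 and 300 cm")
    # Code and cross-reference validation: the value must come from an allowed set.
    if record.get("country") not in {"PL", "UK", "DE"}:
        errors.append("country must be one of PL, UK, DE")
    # Consistency validation: the sale must happen before the shipping.
    if record.get("sold_at") and record.get("shipped_at") and record["sold_at"] > record["shipped_at"]:
        errors.append("sold_at must not be later than shipped_at")
    return errors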

The term “data validation” covers a number of automated, rules-based processes aiming to identify, remove, or flag incorrect or faulty data. As a result of applying data validation, we get a clean set of data.

Data validation became crucial as more and more companies and organizations rely on bigger and more complex data sets, collected from different sources, to draw insights and make crucial business decisions.

Our use case: reviewing data with Great Expectations

For the purpose of this article, we will analyze Airbnb bookings data from London. We’ll focus on booking reviews; even though the reviews.csv file is only 443 MB in size, it will allow us to see some of the benefits of using Great Expectations (which from now on I’ll refer to as “GE”) with Spark.

We’ll try to find reviews for listings without profanities, meaning we’ll need to write a custom Expectation that checks for profanities using a simple profanity checker.

First, we need to format the reviews file. It comes with reviews separated by newline characters and looks like this:

[Image: the raw reviews.csv file with multi-line comments]

Spark will parse it to something like this:

[Image: the Airbnb bookings data from London as parsed by Spark]

As we can see above, some comments span multiple lines, and it would be quite hard to come up with a schema representing that dataset. So, to make our life easier, we need to remove the line breaks inside the comments, for example as sketched below.
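
One possible way to do that preprocessing is sketched below (file names are illustrative; Python’s csv module correctly handles the newlines embedded in quoted comment fields):

import csv

# Rewrite reviews.csv so that multi-line comments become single-line rows.
with open("reviews.csv", newline="", encoding="utf-8") as src, \
        open("reviews_formatted.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        # Replace the line breaks inside every field with a space.
        writer.writerow(
            [field.replace("\r", " ").replace("\n", " ") for field in row]
        )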

After that, we have:

[Image: the Airbnb bookings data from London after formatting]

Quick side note: all examples were run locally, but GE provides documentation describing how to deploy great_expectations on your cluster (see the deployment section).

Great Expectations also offers two versions of its API: V2 Batch Kwargs and V3 Batch Request. To my knowledge, V3 API is currently marked as experimental, so I’ll focus on V2.

What is Great Expectations?

According to its GitHub page, Great Expectations helps data teams eliminate pipeline debt through data testing, documentation, and profiling. Being one of the most popular validation tools and libraries in the Python environment (5,500 stars on GitHub), it’s certainly a good candidate to check out.

GE has extensive support for not only Pandas and Spark, but also many other data sources and pipeline frameworks. It allows you to work both from code and CLI, and we’ll try both of these approaches.

Apache Spark is a very fast and powerful analytics engine that has become the de facto standard when it comes to analyzing huge datasets. According to its website, 80% of companies from the Fortune 500 list use the engine in their pipelines.


We’ll cover the following steps required to work with GE:

  1. create Data Context (either from code or from CLI),
  2. configure Data Source,
  3. create Expectations Suite (also using built-in automated Profiler),
  4. write a custom Expectation,
  5. review Data Docs to check the quality of your data.

We can work with GE either from the CLI or from code.

Using Great Expectations

To start playing with GE after installation, we need to create a Data Context. This is GE’s central configuration point, storing all the validation rules and data source configurations.

Great Expectations provides two ways of interacting with its API:

  1. Code—allows us to make full use of GE.
  2. CLI—easy to use and quite powerful for most use cases; where some code is needed (for instance, when creating an Expectation Suite), the CLI will redirect you to a Jupyter Notebook.

Most of the time, we’ll stick with the CLI.

Typing the great_expectations init command will initialize a new Data Context.
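
Once the context exists, loading it from code is a one-liner; a minimal sketch (assuming we run it from the project root, so GE can find the great_expectations directory created by init):

from great_expectations.data_context import DataContext

# Load the project’s Data Context and list the configured data sources.
context = DataContext()
print(context.list_datasources())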


Creating the Data Source configuration, which stores all the data source information, can be done during the init phase or by running the great_expectations datasource new command.
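
The same thing can also be done from code; a hedged sketch (V2 API; the datasource name spark_data is illustrative):

# Register a Spark datasource on the Data Context loaded earlier.
context.add_datasource(
    "spark_data",
    class_name="SparkDFDatasource",
)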

[Screenshot: running the great_expectations datasource new command]

Next, let’s run the great_expectations datasource list command to list our data sources:

[Screenshot: output of the great_expectations datasource list command]

Now, we can create some new Expectations by using the suite scaffold command, as shown below:

great_expectations suite scaffold airbnb

After some processing, a prompt should redirect us to a new Jupyter Notebook:

[Screenshot: the prompt redirecting to a new Jupyter Notebook]

We can run the Expectations that were automatically generated by GE’s data profiling feature. However, they are not meant to be run in production: they are generated purely from the statistical properties of the data and probably won’t fulfill any meaningful purpose on their own. Instead, they should serve as a foundation for defining more meaningful Expectations.

We can do this by adding Expectations to the batch object, which is an instance of SparkDFDataset and holds a reference to a Spark DataFrame.

After running every cell, GE should launch Data Docs in a new tab of a browser.

Let us modify the above code in the following way:

 

I’ve replaced the default batch_kwargs by removing all parameters besides “datasource” and adding a DataFrame with a schema (I couldn’t find a way to load the schema directly into Great Expectations). This is done by moving the CSV parsing directly to the spark.read.csv method, rather than allowing GE to parse it all by itself. This way, we have access to properly labeled columns.
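
A sketch of what the modified cell might look like (V2 batch API; the schema, file path, and datasource name are illustrative, and expectation_suite_name is defined earlier in the scaffold notebook):

from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, IntegerType, StringType, DateType,
)

spark = SparkSession.builder.getOrCreate()

# An explicit schema for the reviews file, so the columns are properly labeled.
schema = StructType([
    StructField("listing_id", IntegerType()),
    StructField("id", IntegerType()),
    StructField("date", DateType()),
    StructField("reviewer_id", IntegerType()),
    StructField("reviewer_name", StringType()),
    StructField("comments", StringType()),
])

# Parse the CSV with Spark instead of letting GE do it.
df = spark.read.csv(
    "data/reviews.csv", schema=schema, header=True, quote='"', escape='"'
)

# Keep only the "datasource" parameter and hand the DataFrame over directly.
batch_kwargs = {"datasource": "spark_data", "dataset": df}
batch = context.get_batch(batch_kwargs, expectation_suite_name)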

[Screenshot: properly labeled columns]

Now, we can define some Expectations:
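
A hedged sketch of such Expectations (assuming the standard Inside Airbnb reviews.csv columns):

# Columns that should never be empty.
batch.expect_column_values_to_not_be_null("listing_id")
batch.expect_column_values_to_not_be_null("reviewer_id")
batch.expect_column_values_to_not_be_null("comments")

# Identifiers should be positive numbers.
batch.expect_column_values_to_be_between("listing_id", min_value=1)

# Persist the suite, as in the scaffold notebook.
batch.save_expectation_suite(discard_failed_expectations=False)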

 

After running the last cell, we will be redirected to the browser with a generated Data Doc.

[Screenshot: the browser with a generated Data Doc]

Here we can see some failed Expectations. In fact, all of them failed because each of the mentioned columns contains some null data. In order to proceed, we need to remove this null data.

Checkpoints in Great Expectations

Sometimes we want to validate multiple batches against one Expectation Suite and define actions to be taken on the validation results, such as alerts.

Checkpoints are abstractions made to fulfill that purpose. They are managed using a Data Context and have their own Store that is used to persist their configurations to YAML files.

Running checkpoints also lets us generate Data Docs, and we can generate a script that runs a checkpoint on batches of data in order to connect it to the pipeline that validates our data.
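
Conceptually, such a script boils down to something like the sketch below (V2 API; the batch_kwargs, datasource, and suite names are illustrative, and the details differ between GE versions):

from great_expectations.data_context import DataContext

context = DataContext()

# Build the batch to validate.
batch = context.get_batch(
    {"datasource": "spark_data", "path": "data/reviews.csv"},
    "airbnb",
)

# Validate the batch and run the configured actions (store results, send alerts, and so on).
results = context.run_validation_operator(
    "action_list_operator",
    assets_to_validate=[batch],
)

# Rebuild Data Docs so the latest validation results show up there.
context.build_data_docs()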

Profiling capabilities of Great Expectations

When new datasets come in, we may want to get a quick idea of what the data looks like. Although Great Expectations puts more focus on data validation than data profiling, its automatic profiling has the ability to create a summary statistics report which provides a comprehensive overview, including the number of variables, observations, missing cells, and data types.

Profiling then provides a summary of value counts and distribution for each column. GE can generate Expectations based on these statistics, but they are quite generic and would need some further tuning.

Profiling can be done during the init phase of working with the CLI, later while working on a dataset with the CLI, or from code. Below we show both approaches, with code and with the CLI:

1. Code (see the sketch below)
2. CLI:
  • during initialization,
  • during the later stages of analyzing the dataset,
  • by simply running great_expectations datasource profile airbnb.
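
From code, profiling might look like the sketch below (V2 API; the datasource name is illustrative):

from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler

# Profile the data assets in the datasource and generate draft Expectation Suites.
profiling_results = context.profile_datasource(
    "spark_data",
    profiler=BasicDatasetProfiler,
)

# Rebuild Data Docs so the profiling results are browsable.
context.build_data_docs()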

[Screenshot: profiling with the CLI]

As with other features, Profilers in Great Expectations are designed to be extensible. You can develop your own Profiler by subclassing DatasetProfiler, or the parent DataAssetProfiler class itself.

Extensibility of Great Expectations

Sometimes we simply can’t find an Expectation that meets our requirements. In this case, thanks to Great Expectations’ extensibility via plugins, we can write new Expectations by extending SparkDFDataset, which is GE’s wrapper for Spark’s DataFrame API.

In our example, we are working on reviews, and we would like to check whether any of the reviews in our dataset contain profanities. To do so, we will write a custom Expectation.

We’ll use a package called Profanity, with example usage shown below:
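
An example of its usage might look like this (assuming the profanity package from PyPI and its contains_profanity helper):

from profanity import profanity

# Returns True when the text contains a word from the package’s wordlist.
print(profanity.contains_profanity("The host was an absolute ass"))

# Returns False for clean text.
print(profanity.contains_profanity("Lovely flat, great location"))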

 

We will follow the steps described in this guide. Because we’re working with a file system and the CLI, our situation is quite simple; if you don’t have access to a file system (e.g. an AWS EMR Spark or Databricks cluster), you’ll need to take a different approach.

So, to create a new Expectation, we need to:

1. Write the code
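
A hedged sketch of such a custom Expectation (V2 API; the class, method, and file names are illustrative, and the decorator follows the pattern used by GE’s built-in Spark expectations):

# Saved as great_expectations/plugins/custom_sparkdf_dataset.py (illustrative path).
from profanity import profanity
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

from great_expectations.dataset import MetaSparkDFDataset, SparkDFDataset


class CustomSparkDFDataset(SparkDFDataset):
    _data_asset_type = "CustomSparkDFDataset"

    @MetaSparkDFDataset.column_map_expectation
    def expect_column_values_to_not_contain_profanity(
        self,
        column,
        mostly=None,
        result_format=None,
        include_config=True,
        catch_exceptions=None,
        meta=None,
    ):
        # "column" is a single-column Spark DataFrame; column[0] refers to that column.
        contains_profanity = udf(
            lambda text: profanity.contains_profanity(text or ""), BooleanType()
        )
        # Rows whose comments contain no profanities are marked as successful.
        return column.withColumn("__success", ~contains_profanity(column[0]))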

 

2. Put the code into our great_expectations/plugins directory

3. Update our DataSource definition so it uses our new custom SparkDFDataset subclass
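
The same change can be made by editing great_expectations.yml by hand; a hedged sketch of doing it from code (assuming the module above was saved as custom_sparkdf_dataset.py in the plugins directory and the datasource is named spark_data):

# Re-register the datasource so batches are created as our custom data asset type.
context.add_datasource(
    "spark_data",
    class_name="SparkDFDatasource",
    data_asset_type={
        "class_name": "CustomSparkDFDataset",
        "module_name": "custom_sparkdf_dataset",  # resolved from great_expectations/plugins/
    },
)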

 

4. Run great_expectations

Edit the suite (in this case, airbnb), choose the data on which you want to edit the suite, and add the new Expectation to the Expectations section in order to check whether all the steps were done correctly, as shown below.
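
In the suite edit notebook, using the new Expectation might look like this (a sketch; the method name comes from the illustrative custom class above, and comments is the reviews column):

# Check that no review comment contains profanities.
batch.expect_column_values_to_not_contain_profanity("comments")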

[Screenshot: validation results for the profanity Expectation]

The image above shows that there are, indeed, some comments that contain profanities. We need to add, though, that Profanity is not the most accurate profanity detector: in one of the comments shown above, it labeled the word “muffins” as profanity.

a) Save the checkpoint with the newly created Expectations to generate documentation


b) Display the documentation


Final thoughts on data validation in a big data environment

In this article, we weren’t trying to showcase all of the numerous possibilities and features of Great Expectations. Instead, we wanted to display a few of the most basic features in conjunction with Apache Spark, working on some real-life examples.

Data validation and Great Expectations are both quite extensive topics, but we hope that with this article, we made it a little easier to grasp.

Here at STX Next, we have a team of data engineers who are passionate about finding solutions to our clients’ problems. They’ve also written insightful blog posts to share their knowledge.

If you struggle with any data engineering problems and would like to chat with our specialist, feel free to drop us a message. We’d be happy to work with you to find the best solution!
