Safeguarding Personal Data: Best Practices for Managing PII and PHI in ETL Pipelines

Table of Contents
  • Approaches to PII and PHI Management: Tailoring Strategies to Specific Use Cases in ETL Pipelines
    • Identify
    • Classify
    • Data minimization
    • Encrypt
    • Control access
    • Data retention and deletion
    • Data masking
    • Auditing
  • Conclusions

Navigating the complexities of data management requires a keen focus on the handling of Personally Identifiable Information (PII) and Protected Health Information (PHI), particularly within Extract, Transform, Load (ETL) pipeline solutions. With regulations like GDPR, CCPA, and HIPAA setting stringent requirements, organizations must adapt their ETL processes to ensure compliance and protect sensitive data. As we stand on the brink of a major AI revolution, let's delve into the challenges and strategies for securing PII and PHI in ETL pipelines.

Approaches to PII and PHI Management: Tailoring Strategies to Specific Use Cases in ETL Pipelines

Consider the situation in a call center where staff members manage clients' personal data. For instance, a client may call to request the cancellation of a lost credit card. You want the call center representative to be able to identify the client, yet you do not want to show them the credit card number, even though the system contains this information. Nor do you want to reveal how much money is on the card. The data may be substituted with random digits, or masked out as ***, leaving only the last four digits for validation purposes.
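
As a minimal sketch of such masking (the function name and interface below are ours, not a standard API):

```python
def mask_card_number(card_number: str, visible_digits: int = 4) -> str:
    """Replace all but the last `visible_digits` digits with '*'."""
    digits = card_number.replace(" ", "").replace("-", "")
    return "*" * (len(digits) - visible_digits) + digits[-visible_digits:]

print(mask_card_number("4111 1111 1111 1234"))  # ************1234
```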

Imagine that the same data set has to be sent to the analytics team to analyze credit card usage by age group and geographic location. You can't train a model on fictitious data, yet you must still hide the actual values. You want to transform the debit amounts in such a way that they remain statistically relevant: whatever model you train on the masked data must also work well on real data.

The crux of personally identifiable information (PII) management is the careful handling of sensitive data once it has been identified. What's more, every element of the process must work together seamlessly.

Organizations are obliged to implement a rigorous and detailed process for dealing with PII within their Extract, Transform, Load (ETL) workflows. This process can be broken down into coherent steps or categories, each serving to bolster compliance and safeguard sensitive data.

Identify

The first step is to identify what constitutes PII within your dataset. This usually depends on which specific regulation you need to adhere to, as each one classifies data differently. Potential categories include names, addresses, phone numbers, email addresses, social security numbers, and any other data that can be used to directly or indirectly identify an individual.
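
A simplified, rule-based sketch of this step might look as follows; the regex patterns are deliberately naive and illustrative only, and production pipelines typically combine them with dictionaries and NER models:

```python
import re

# Illustrative patterns for a few common PII categories.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def find_pii(text: str) -> dict[str, list[str]]:
    """Return every match per PII category found in the given text."""
    return {name: pattern.findall(text) for name, pattern in PII_PATTERNS.items()}

print(find_pii("Reach John at john.doe@example.com or 555-123-4567."))
```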

Classify

Classify data elements as either PII or non-PII. Create a comprehensive catalog of all PII fields in your data sources, including metadata such as data types and sensitivity levels.
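
One possible shape for such a catalog, with illustrative field names and sensitivity levels rather than any particular standard:

```python
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    source: str        # data source, e.g. a table name
    column: str        # column holding the data
    data_type: str     # declared type in the source system
    is_pii: bool       # classification result
    sensitivity: str   # e.g. "low", "medium", "high"

catalog = [
    CatalogEntry("customers", "email", "varchar", True, "high"),
    CatalogEntry("customers", "signup_date", "date", False, "low"),
]
```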

Data minimization

Limit the collection and storage of PII to only what is necessary for your business processes. Avoid collecting excessive data: as in our introductory example, call center employees gain nothing from seeing the exact card numbers. Each digit might be mapped to a different one (substitution), and the same can happen with a home address (pseudonymization). Call center representatives need certain client attributes to validate that they're talking to the right person, but that's about it.
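
A minimal sketch of deterministic digit substitution; the fixed seed and mapping scheme are illustrative assumptions, and note that such a simple mapping is easily reversible, so it only suits low-risk display scenarios:

```python
import random

# Fixed seed -> reproducible mapping: the same input always yields
# the same pseudonym. NOT cryptographically strong; illustration only.
rng = random.Random(42)
digits = list("0123456789")
shuffled = digits[:]
rng.shuffle(shuffled)
SUBSTITUTION = dict(zip(digits, shuffled))

def pseudonymize_digits(value: str) -> str:
    """Substitute every digit according to the fixed mapping."""
    return "".join(SUBSTITUTION.get(ch, ch) for ch in value)

print(pseudonymize_digits("4111-1111-1111-1234"))
```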

Encrypt

Implement encryption mechanisms for both data in transit and data at rest. Ensure that PII data is encrypted during ETL operations and while stored in data repositories.
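
For example, symmetric encryption at rest with the widely used `cryptography` package might look like the sketch below; key management (a KMS, rotation) is the genuinely hard part and is out of scope here:

```python
from cryptography.fernet import Fernet  # pip install cryptography

# In practice, load the key from a secrets manager, never generate it inline.
key = Fernet.generate_key()
fernet = Fernet(key)

ciphertext = fernet.encrypt(b"4111111111111234")
print(fernet.decrypt(ciphertext))  # b'4111111111111234'
```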

Control access

Limit access to PII data to only authorized employees who need it to perform specific tasks. Implement role-based access control (RBAC) and regularly check and audit access permissions.

Modern solutions often offer fine-grained controls - you can use column-level security (CLS) to mark specific columns as sensitive (credit card number), row-level security (RLS) to do the same with rows of data (customers from a specific country), or a combination of both. This way, some users will be limited to a subset of columns when querying the data and may not be able to see some records.
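
A sketch of what application-level RBAC combined with column filtering can look like; the role names and column lists are illustrative assumptions:

```python
# Each role sees only the columns it needs for its task.
ROLE_COLUMNS = {
    "call_center": {"name", "last_four", "city"},
    "analytics": {"age_group", "region", "masked_amount"},
}

def filter_row(row: dict, role: str) -> dict:
    """Return only the columns the given role is allowed to see."""
    allowed = ROLE_COLUMNS.get(role, set())
    return {k: v for k, v in row.items() if k in allowed}

row = {"name": "Jane", "last_four": "1234", "city": "Oslo", "card_number": "4111111111111234"}
print(filter_row(row, "call_center"))  # card_number is dropped
```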

But even if you apply additional measures, you can never be completely safe. Sometimes the database is downloaded as a full copy to local computers, where it may reside indefinitely. And end users often behave carelessly, downloading random applications from the Internet. This "shadow IT" is a serious threat and can put data at risk.

Provide training to all employees and contractors who handle PII data. Inform them about data privacy policies, security best practices, and the importance of protecting sensitive information.

Data retention and deletion

Establish clear policies and procedures for data retention and deletion. Ensure that PII data is retained only for as long as necessary, and implement secure deletion mechanisms when data is no longer needed. Sometimes such procedures are even mandated by law.
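
A minimal sketch of a retention check; the 365-day window is an illustrative value, not a legal recommendation:

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=365)  # illustrative policy window

def is_expired(created_at: datetime, now: datetime | None = None) -> bool:
    """True if the record has outlived the retention policy."""
    now = now or datetime.now(timezone.utc)
    return now - created_at > RETENTION

record_date = datetime(2022, 1, 1, tzinfo=timezone.utc)
print(is_expired(record_date))  # records this old should be securely deleted
```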

Data masking

The essence of the whole approach rests on various data masking methods. There are several layers, from simple obfuscation and shuffling to full, irreversible anonymization. Where possible, use data masking or anonymization techniques to replace or hide PII with fictitious values or aliases during ETL processing. This ensures that sensitive information is not unnecessarily disclosed.

If you plan to share the full data set with the data science team, you need to make sure that the PII columns retain their original statistical properties. Basic measures such as the standard deviation and median must not deviate too much from the original data, otherwise the masked values are statistically worthless.

It's worth noting that you don't need real PII data to build the model, and if the "copy" retains the original "shape," the trained model will work just as well when plugged back into the original production data.
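
A minimal sketch of what retaining the "shape" can mean for numeric amounts; the 5% noise scale is our illustrative assumption, and simple noise addition alone does not amount to formal anonymization:

```python
import random
import statistics

def mask_amounts(amounts: list[float], seed: int = 0) -> list[float]:
    """Add zero-mean Gaussian noise: individual values are hidden,
    but mean and standard deviation stay roughly intact."""
    rng = random.Random(seed)
    scale = statistics.stdev(amounts) * 0.05  # illustrative noise scale
    return [a + rng.gauss(0, scale) for a in amounts]

original = [120.0, 45.5, 310.0, 87.25, 199.99, 54.1]
masked = mask_amounts(original)
print(statistics.mean(original), statistics.mean(masked))
print(statistics.stdev(original), statistics.stdev(masked))
```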

There is a long-running debate about whether hashing provides a strong enough mechanism for anonymizing data. Skeptics raise serious concerns, pointing out that many secrets can be cracked by brute force: as long as you know the hashing algorithm used, you can recover a credit card number simply by hashing every possible combination of digits - not so unlikely in a world where quantum computing is becoming a real possibility. Defenders argue that hashing more than one category of PII together mitigates the risk: say you combine a credit card number, full name, and full address into one long string, and only then hash its contents.
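
A sketch of the combined-fields approach the defenders describe; the delimiter and field choice are illustrative:

```python
import hashlib

def hash_identity(card_number: str, full_name: str, address: str) -> str:
    """Concatenate several PII attributes before hashing, so brute-forcing
    any single field (e.g. a 16-digit card number) is far less practical."""
    combined = "|".join([card_number, full_name, address])
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()

print(hash_identity("4111111111111234", "Jane Doe", "1 Main St, Oslo"))
```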

A common misconception is that adding custom code makes your solution bulletproof. For example, after processing the identified PII values, a function is applied that adds a salt - a small random value that modifies the original one. However, the salt produced by such custom code is often not truly random, merely non-deterministic: at the lowest level it still relies on something like the internal clock, and such a function can potentially be reverse-engineered.
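
By contrast, drawing the salt from a cryptographically secure source avoids the predictable-clock problem; a minimal sketch using Python's standard `secrets` module:

```python
import hashlib
import secrets

def salted_hash(value: str) -> tuple[str, str]:
    """Hash a value with a CSPRNG-generated salt (not time-based)."""
    salt = secrets.token_hex(16)
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return salt, digest  # the salt must be stored alongside the digest to re-verify

print(salted_hash("4111111111111234"))
```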

A famous idea, called Kerckhoffs's principle, states that one should not rely on so-called security by obscurity - simply put, one should always assume that everything outside the private key will at some point leak and become known to everyone. Any code base, no matter how well hidden, will be leaked, stolen, or simply placed in some unsecured repository or sent in an unencrypted message by one of your employees (the latest variant being code shared with one of the artificial intelligence models, such as ChatGPT). If your data is safe only as long as hackers cannot deduce the algorithm, you're doing something wrong. If your masking algorithm obscures the values to the point that reverse engineering is impossible even when the function that did the job is visible, your data is safe. And most often this is the right way to handle such a situation.

Auditing

Maintain detailed audit logs of all PII data activities. This includes tracking data movement, transformation, and access. Audit logs are essential for monitoring and ensuring compliance. Implement data quality control and validation rules in the ETL pipeline to ensure the accuracy and integrity of PII data. Reject or quarantine data that does not meet these criteria.
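
A minimal sketch of structured audit logging for PII activities; the record fields are illustrative, not a compliance standard:

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
audit_logger = logging.getLogger("pii_audit")

def audit(user: str, action: str, table: str, columns: list[str]) -> None:
    """Emit one structured audit record per PII data activity."""
    audit_logger.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,        # e.g. "read", "transform", "delete"
        "table": table,
        "columns": columns,
    }))

audit("etl_job_7", "transform", "customers", ["email", "card_number"])
```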

If you engage third-party vendors or cloud providers in your ETL pipeline, make sure they also adhere to strict data privacy and security standards. Follow common data management practices to monitor who has access to what. Perform due diligence and establish clear agreements for handling PII data.

Cloud providers offer several options for dealing with PII. AWS Glue allows you to detect PII and to mask or completely remove data flowing through the pipeline; you can also apply a cryptographic hash function.

Azure relies primarily on Presidio, deployable through Azure App Service as an HTTP endpoint, which uses both regex pattern recognition and natural language methods (named entity recognition, NER). You can feed it your own PII data categories that you are interested in.
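
Presidio can also be used directly as a Python library; a short example based on its documented quickstart pattern (it additionally needs a spaCy model such as en_core_web_lg for the NER step):

```python
# pip install presidio-analyzer presidio-anonymizer
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "My name is Jane Doe and my phone number is 555-123-4567."

analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text, language="en")  # detect PII entities

anonymizer = AnonymizerEngine()
print(anonymizer.anonymize(text=text, analyzer_results=results).text)
```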

GCP recently introduced a framework (the Data Loss Prevention API) within BigQuery for detecting, classifying, and tagging PII data, with permissions manageable for end users. A so-called inspection job scans tables using an inspection template, and the results are materialized in the selected storage type.

Conclusions

We are just scratching the surface here. Without a doubt, data should be secure throughout the full processing cycle, not only in transit but also at rest. Modern databases already give us many ways to enforce strict controls; that discussion alone is material for another article, but it is well worth exploring too.

In summary, handling PII data in an ETL pipeline requires a comprehensive approach that combines technology, policies, and procedures. Prioritizing data privacy and security throughout the data lifecycle is essential to protect both the sensitive information of individuals and the organization’s reputation. Regularly review and update your PII data handling practices to adapt to changing regulations and security threats.

And if you're looking for development experts who always put the security of your and your customers' data first, contact us. We can audit your existing software or help you build one from scratch.
