Safeguarding Personal Data: Best Practices for Managing PII and PHI in ETL Pipelines

Table of Contents
  • Approaches to PII and PHI Management: Tailoring Strategies to Specific Use Cases in ETL Pipelines
    • Identify
    • Classify
    • Data minimization
    • Encrypt
    • Control access
    • Data retention and deletion
    • Data masking
    • Auditing
  • Conclusions

Navigating the complexities of data management requires a keen focus on the handling of Personally Identifiable Information (PII) and Protected Health Information (PHI), particularly within Extract, Transform, Load (ETL) pipeline solutions. With regulations like GDPR, CCPA, and HIPAA setting stringent requirements, organizations must adapt their ETL processes to ensure compliance and protect sensitive data. As we stand on the brink of a major AI revolution, let's delve into the challenges and strategies for securing PII and PHI in ETL pipelines.

Approaches to PII and PHI Management: Tailoring Strategies to Specific Use Cases in ETL Pipelines

Consider the situation in a call center where staff members manage clients' personal data. For instance, a client may call to request the cancellation of a lost credit card. You want the call center representative to be able to identify the client, yet you do not want to show them the credit card number, even though the system contains this information. Nor do you want to reveal how much money is on the card. The data may be substituted with random digits, or masked out as ***, leaving only the last four digits for validation purposes.
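
As a minimal sketch of such masking (the function name and interface below are ours, not a standard API):

```python
def mask_card_number(card_number: str, visible_digits: int = 4) -> str:
    """Replace all but the last `visible_digits` digits with '*'."""
    digits = card_number.replace(" ", "").replace("-", "")
    return "*" * (len(digits) - visible_digits) + digits[-visible_digits:]

print(mask_card_number("4111 1111 1111 1234"))  # ************1234
```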

Imagine that the same data set has to be sent to the analytics team to analyze credit card usage by age group and geographic location. You can't train a model on fictitious data, yet you must still hide the actual values. You want to transform the debit amounts in such a way that they remain statistically relevant: whatever model you train on the masked data must also work well on real data.

The crux of personally identifiable information (PII) management is the careful handling of sensitive data once it has been identified. What's more, every element of the process must work together seamlessly.

Organizations are obliged to implement a rigorous and detailed process for dealing with PII within their Extract, Transform, Load (ETL) workflows. This process can be broken down into coherent steps or categories, each serving to bolster compliance and safeguard sensitive data.

Identify

The first step is to identify what constitutes PII within your dataset. This usually depends on which specific regulation you need to adhere to, as each one classifies data differently. Potential categories include names, addresses, phone numbers, email addresses, social security numbers, and any other data that can be used to directly or indirectly identify an individual.
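
A simplified, rule-based sketch of this step might look as follows; the regex patterns are deliberately naive and illustrative only, and production pipelines typically combine them with dictionaries and NER models:

```python
import re

# Illustrative patterns for a few common PII categories.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def find_pii(text: str) -> dict[str, list[str]]:
    """Return every match per PII category found in the given text."""
    return {name: pattern.findall(text) for name, pattern in PII_PATTERNS.items()}

print(find_pii("Reach John at john.doe@example.com or 555-123-4567."))
```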

Classify

Classify data elements as either PII or non-PII. Create a comprehensive catalog of all PII fields in your data sources, including metadata such as data types and sensitivity levels.
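
One possible shape for such a catalog, with illustrative field names and sensitivity levels rather than any particular standard:

```python
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    source: str        # data source, e.g. a table name
    column: str        # column holding the data
    data_type: str     # declared type in the source system
    is_pii: bool       # classification result
    sensitivity: str   # e.g. "low", "medium", "high"

catalog = [
    CatalogEntry("customers", "email", "varchar", True, "high"),
    CatalogEntry("customers", "signup_date", "date", False, "low"),
]
```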

Data minimization

Limit the collection and storage of PII to only what is necessary for your business processes. Avoid collecting excessive data: as in our introductory example, call center employees gain nothing from seeing the exact card numbers. Each digit might be mapped to a different one (substitution), and the same can happen with a home address (pseudonymization). Call center representatives need certain client attributes to validate that they're talking to the right person, but that's about it.
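
A minimal sketch of deterministic digit substitution; the fixed seed and mapping scheme are illustrative assumptions, and note that such a simple mapping is easily reversible, so it only suits low-risk display scenarios:

```python
import random

# Fixed seed -> reproducible mapping: the same input always yields
# the same pseudonym. NOT cryptographically strong; illustration only.
rng = random.Random(42)
digits = list("0123456789")
shuffled = digits[:]
rng.shuffle(shuffled)
SUBSTITUTION = dict(zip(digits, shuffled))

def pseudonymize_digits(value: str) -> str:
    """Substitute every digit according to the fixed mapping."""
    return "".join(SUBSTITUTION.get(ch, ch) for ch in value)

print(pseudonymize_digits("4111-1111-1111-1234"))
```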

Encrypt

Implement encryption mechanisms for both data in transit and data at rest. Ensure that PII data is encrypted during ETL operations and while stored in data repositories.
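
For example, symmetric encryption at rest with the widely used `cryptography` package might look like the sketch below; key management (a KMS, rotation) is the genuinely hard part and is out of scope here:

```python
from cryptography.fernet import Fernet  # pip install cryptography

# In practice, load the key from a secrets manager, never generate it inline.
key = Fernet.generate_key()
fernet = Fernet(key)

ciphertext = fernet.encrypt(b"4111111111111234")
print(fernet.decrypt(ciphertext))  # b'4111111111111234'
```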

Control access

Limit access to PII data to only authorized employees who need it to perform specific tasks. Implement role-based access control (RBAC) and regularly check and audit access permissions.

Modern solutions often offer fine-grained controls - you can use column-level security (CLS) to mark specific columns as sensitive (credit card number), row-level security (RLS) to do the same with rows of data (customers from a specific country), or a combination of both. This way, some users will be limited to a subset of columns when querying the data and may not be able to see some records.
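
A sketch of what application-level RBAC combined with column filtering can look like; the role names and column lists are illustrative assumptions:

```python
# Each role sees only the columns it needs for its task.
ROLE_COLUMNS = {
    "call_center": {"name", "last_four", "city"},
    "analytics": {"age_group", "region", "masked_amount"},
}

def filter_row(row: dict, role: str) -> dict:
    """Return only the columns the given role is allowed to see."""
    allowed = ROLE_COLUMNS.get(role, set())
    return {k: v for k, v in row.items() if k in allowed}

row = {"name": "Jane", "last_four": "1234", "city": "Oslo", "card_number": "4111111111111234"}
print(filter_row(row, "call_center"))  # card_number is dropped
```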

But even if you apply additional measures, you can never be completely safe. Sometimes the database is downloaded as a full copy to local computers, where it may reside indefinitely. And end users often behave carelessly, downloading random applications from the Internet. This "shadow IT" is a serious threat and can put data at risk.

Provide training to all employees and contractors who handle PII data. Inform them about data privacy policies, security best practices, and the importance of protecting sensitive information.

Data retention and deletion

Establish clear policies and procedures for data retention and deletion. Ensure that PII data is retained only for as long as necessary, and implement secure deletion mechanisms when data is no longer needed. Sometimes such procedures are even mandated by law.
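
A minimal sketch of a retention check; the 365-day window is an illustrative value, not a legal recommendation:

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=365)  # illustrative policy window

def is_expired(created_at: datetime, now: datetime | None = None) -> bool:
    """True if the record has outlived the retention policy."""
    now = now or datetime.now(timezone.utc)
    return now - created_at > RETENTION

record_date = datetime(2022, 1, 1, tzinfo=timezone.utc)
print(is_expired(record_date))  # records this old should be securely deleted
```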

Data masking

The essence of the whole approach rests on various data masking methods. There are several layers, from simple obfuscation and shuffling to full, irreversible anonymization. Where possible, use data masking or anonymization techniques to replace or hide PII with fictitious values or aliases during ETL processing. This ensures that sensitive information is not unnecessarily disclosed.

If you plan to share the full data set with the data science team, you need to make sure that the PII columns retain their original statistical properties. Basic measures such as the standard deviation and median must not deviate too much from the original data, otherwise the masked values are statistically worthless.

It's worth noting that you don't need real PII data to build the model, and if the "copy" retains the original "shape," the trained model will work just as well when plugged back into the original production data.
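
A minimal sketch of what retaining the "shape" can mean for numeric amounts; the 5% noise scale is our illustrative assumption, and simple noise addition alone does not amount to formal anonymization:

```python
import random
import statistics

def mask_amounts(amounts: list[float], seed: int = 0) -> list[float]:
    """Add zero-mean Gaussian noise: individual values are hidden,
    but mean and standard deviation stay roughly intact."""
    rng = random.Random(seed)
    scale = statistics.stdev(amounts) * 0.05  # illustrative noise scale
    return [a + rng.gauss(0, scale) for a in amounts]

original = [120.0, 45.5, 310.0, 87.25, 199.99, 54.1]
masked = mask_amounts(original)
print(statistics.mean(original), statistics.mean(masked))
print(statistics.stdev(original), statistics.stdev(masked))
```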

There is a long-running debate about whether hashing provides a strong enough mechanism for anonymizing data. Skeptics raise serious concerns, pointing out that many secrets can be cracked by brute force: as long as you know the hashing algorithm used, you can recover a credit card number simply by hashing every possible combination of digits - not so unlikely in a world where quantum computing is becoming a real possibility. Defenders argue that hashing more than one category of PII together mitigates the risk: say you combine a credit card number, full name, and full address into one long string, and only then hash its contents.
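
A sketch of the combined-fields approach the defenders describe; the delimiter and field choice are illustrative:

```python
import hashlib

def hash_identity(card_number: str, full_name: str, address: str) -> str:
    """Concatenate several PII attributes before hashing, so brute-forcing
    any single field (e.g. a 16-digit card number) is far less practical."""
    combined = "|".join([card_number, full_name, address])
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()

print(hash_identity("4111111111111234", "Jane Doe", "1 Main St, Oslo"))
```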

A common misconception is that adding custom code makes your solution bulletproof. For example, after processing the identified PII values, a function is applied that adds a salt - a small random value that modifies the original one. However, the salt produced by such custom code is often not truly random, merely non-deterministic: at the lowest level it still relies on something like the internal clock, and such a function can potentially be reverse-engineered.
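
By contrast, drawing the salt from a cryptographically secure source avoids the predictable-clock problem; a minimal sketch using Python's standard `secrets` module:

```python
import hashlib
import secrets

def salted_hash(value: str) -> tuple[str, str]:
    """Hash a value with a CSPRNG-generated salt (not time-based)."""
    salt = secrets.token_hex(16)
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return salt, digest  # the salt must be stored alongside the digest to re-verify

print(salted_hash("4111111111111234"))
```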

A famous idea, called Kerckhoffs's principle, states that one should not rely on so-called security by obscurity - simply put, one should always assume that everything outside the private key will at some point leak and become known to everyone. Any code base, no matter how well hidden, will be leaked, stolen, or simply placed in some unsecured repository or sent in an unencrypted message by one of your employees (the latest variant being code shared with one of the artificial intelligence models, such as ChatGPT). If your data is safe only as long as hackers cannot deduce the algorithm, you're doing something wrong. If your masking algorithm obscures the values to the point that reverse engineering is impossible even when the function that did the job is visible, your data is safe. And most often this is the right way to handle such a situation.

Auditing

Maintain detailed audit logs of all PII data activities. This includes tracking data movement, transformation, and access. Audit logs are essential for monitoring and ensuring compliance. Implement data quality control and validation rules in the ETL pipeline to ensure the accuracy and integrity of PII data. Reject or quarantine data that does not meet these criteria.
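
A minimal sketch of structured audit logging for PII activities; the record fields are illustrative, not a compliance standard:

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
audit_logger = logging.getLogger("pii_audit")

def audit(user: str, action: str, table: str, columns: list[str]) -> None:
    """Emit one structured audit record per PII data activity."""
    audit_logger.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,        # e.g. "read", "transform", "delete"
        "table": table,
        "columns": columns,
    }))

audit("etl_job_7", "transform", "customers", ["email", "card_number"])
```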

If you engage third-party vendors or cloud providers in your ETL pipeline, make sure they also adhere to strict data privacy and security standards. Follow common data management practices to monitor who has access to what. Perform due diligence and establish clear agreements for handling PII data.

Cloud providers offer several options for dealing with PII. AWS Glue allows you to detect PII and to mask or completely remove data flowing through the pipeline; you can also apply a cryptographic hash function.

Azure relies primarily on Presidio, deployable through Azure App Service as an HTTP endpoint, which uses both regex pattern recognition and natural language methods (named entity recognition, NER). You can feed it your own PII data categories that you are interested in.
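
Presidio can also be used directly as a Python library; a short example based on its documented quickstart pattern (it additionally needs a spaCy model such as en_core_web_lg for the NER step):

```python
# pip install presidio-analyzer presidio-anonymizer
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "My name is Jane Doe and my phone number is 555-123-4567."

analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text, language="en")  # detect PII entities

anonymizer = AnonymizerEngine()
print(anonymizer.anonymize(text=text, analyzer_results=results).text)
```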

GCP recently introduced a framework (the Data Loss Prevention API) within BigQuery for detecting, classifying, and tagging PII data, with permissions manageable for end users. A so-called inspection job scans tables using an inspection template, and the results are materialized in the selected storage type.

Conclusions

We are just scratching the surface here. Without a doubt, data should be secure throughout the full processing cycle, not only in transit but also at rest. Modern databases already give us many ways to enforce strict controls; that discussion alone is material for another article, but it is well worth exploring too.

In summary, handling PII data in an ETL pipeline requires a comprehensive approach that combines technology, policies, and procedures. Prioritizing data privacy and security throughout the data lifecycle is essential to protect both the sensitive information of individuals and the organization’s reputation. Regularly review and update your PII data handling practices to adapt to changing regulations and security threats.

And if you're looking for development experts who always put the security of your and your customers' data first, contact us. We can audit your existing software or help you build one from scratch.
