Data Anonymization 101
What is Anonymization?
Anonymization (also called de-identification) is the process of protecting the identities of the individuals represented in a dataset. It has been an area of active research for many years, and there are several well-known methods, but every anonymization method removes information from the raw dataset and therefore risks lowering its utility. In this article, we explore some of the standard anonymization techniques used in the industry, for context and background.
Tip: Anonymization protects individuals in your data when that data is released. It is a good complement to other pillars of data security and governance, which limit access to data.
How Individuals Get Identified in Data
Some data contains Direct Identifiers (DIDs), which can, by themselves, be used to identify an individual. DIDs include names, addresses, Social Security numbers, biometrics, IP addresses, and much more.
In addition, nearly all data includes indirect or Quasi-identifiers (QIDs). Groups (tuples) of QIDs can be combined to uniquely identify an individual. To someone with "external knowledge" (such as an acquaintance, or an attacker holding a third-party dataset), nearly any observable trait can serve as a QID. Famously, 87% of Americans are uniquely identifiable by their date of birth, ZIP code, and gender. Other linkage attacks have used public IMDb data and paparazzi photos to re-identify individuals in released datasets.
Removing Direct Identifiers
Removing Direct Identifiers from a dataset is known as pseudonymization. It is a necessary, but insufficient, step towards anonymizing data.
There are several ways to suppress identifiers:
- Redacting: completely deleting, removing, or omitting the identifier.
- Masking: hiding all or part of the identifier, e.g. using only the last four digits of a credit card number.
- Faking: replacing the identifier with randomly-generated but format-preserving content.
- Hashing: using a one-way function to map the identifier to a new unique token that cannot easily be linked to an external dataset (a keyed or salted hash resists brute-force reversal).
- Encrypting: encoding an identifier in a way that allows for the recovery of the raw value only by using a private key or password.
The methods above can be used interchangeably or combined, depending on the use case.
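As a minimal sketch (assuming plain Python with only the standard library, and a hypothetical `record` layout), here is how redaction, masking, and keyed hashing might look in practice:

```python
import hashlib
import hmac
import secrets

# Hypothetical secret key; in practice, store it outside the released data.
SECRET_KEY = secrets.token_bytes(32)

def mask_card(card_number: str) -> str:
    """Masking: keep only the last four digits."""
    return "*" * (len(card_number) - 4) + card_number[-4:]

def hash_id(value: str) -> str:
    """Hashing: a keyed (HMAC) one-way mapping; without the key, the
    token cannot be rebuilt by brute-forcing the input space."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()

record = {"name": "Jane Doe", "card": "4111111111111111", "ssn": "123-45-6789"}
pseudonymized = {
    "name": None,  # Redaction: the identifier is removed entirely
    "card": mask_card(record["card"]),
    "ssn": hash_id(record["ssn"]),
}
print(pseudonymized)
```

For faking, libraries such as Faker can generate realistic, format-preserving replacement values; format-preserving encryption does the same reversibly.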
Treating Quasi-identifiers
Full anonymization (also called de-identification) also requires the treatment of quasi-identifiers. QIDs can be suppressed (like DIDs), but other methods for treating QIDs may preserve more data utility than suppression.
Options for treating quasi-identifiers include:
- Generalization: reducing the granularity of the data, e.g., using a year instead of an exact date.
- Perturbation: changing the values of QIDs by adding random noise, swapping values, or sampling from a synthetic dataset (modeled population).
- Aggregation: a form of Perturbation that replaces values with an aggregate (e.g., the median or mode) computed over a group or cluster of related records in the dataset.
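To make these concrete, here is a minimal sketch (plain Python; the column names, bucket width, and noise scale are illustrative assumptions) of generalization and perturbation:

```python
from datetime import date
import random

def generalize_date(d: date) -> int:
    """Generalization: reduce a full date to just its year."""
    return d.year

def bucket_age(age: int, width: int = 10) -> str:
    """Generalization: replace an exact age with a coarse range."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def perturb(value: float, scale: float = 1.0) -> float:
    """Perturbation: add zero-mean random noise to a numeric value."""
    return value + random.gauss(0.0, scale)

print(generalize_date(date(1984, 7, 14)))  # 1984
print(bucket_age(37))                      # "30-39"
print(perturb(182.5))                      # e.g. 183.1
```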
Naive approaches to protecting QIDs can destroy the utility of data and obscure important relationships. However, state-of-the-art techniques can effectively de-identify datasets while preserving most of the data's utility, especially in large datasets.
Uniqueness and k-anonymity
k-anonymity is a measure of uniqueness in a dataset. A dataset is k-anonymous if every record is indistinguishable from at least k-1 other records with respect to its quasi-identifiers. For example, a dataset containing two quasi-identifiers (age and gender) is 5-anonymous if no combination (tuple) of age and gender appears in the data fewer than 5 times.
Generalizing QIDs to achieve a k-anonymity target is a common approach to creating privacy-safe data.
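To make the definition concrete, here is a minimal sketch (plain Python; the list-of-dicts layout and column names are illustrative assumptions) that computes a dataset's k as the size of its smallest QID group:

```python
from collections import Counter

def k_anonymity(records: list[dict], qids: list[str]) -> int:
    """Return k: the size of the smallest group of records that share
    the same quasi-identifier tuple."""
    groups = Counter(tuple(r[q] for q in qids) for r in records)
    return min(groups.values())

data = [
    {"age": "30-39", "gender": "F", "diagnosis": "flu"},
    {"age": "30-39", "gender": "F", "diagnosis": "cold"},
    {"age": "40-49", "gender": "M", "diagnosis": "flu"},
]
# k = 1 here: the (40-49, M) tuple is unique, so that record is exposed.
print(k_anonymity(data, ["age", "gender"]))
```

A typical workflow generalizes the QIDs (e.g., widening the age buckets) and recomputes k until the target is reached.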
Differential Privacy
This article is mostly concerned with de-identifying raw datasets. Sometimes it is necessary to protect published statistics of the data instead. For this, Global Differential Privacy (DP) is the gold-standard approach. DP adds carefully calibrated noise to computed statistics, which protects individuals in the source data from inference attacks.
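As a sketch of the core idea (not a production DP implementation), the Laplace mechanism below adds noise scaled to the query's sensitivity divided by the privacy budget epsilon; a counting query has sensitivity 1, because adding or removing one person changes the count by at most 1:

```python
import random

def dp_count(true_count: int, epsilon: float) -> float:
    """Global DP via the Laplace mechanism: add Laplace(0, 1/epsilon)
    noise to a count (sensitivity = 1)."""
    scale = 1.0 / epsilon
    # The difference of two independent exponential draws with rate
    # 1/scale is Laplace-distributed with the desired scale.
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_count + noise

print(dp_count(1042, epsilon=0.5))  # e.g. 1039.7; smaller epsilon, more noise
```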
Global Differential Privacy is not appropriate when access to row-level data is required, and is not a good fit for exploratory data analysis. Local Differential Privacy is a form of Perturbation that is used for this purpose, but the utility of the treated data is typically lower than that from other de-identification approaches.
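The classic Local DP mechanism is randomized response, sketched below (plain Python; the probabilities are illustrative): each individual's yes/no answer is randomized before collection, making any single report deniable, while the aggregate rate can still be estimated by inverting the known noise.

```python
import random

def randomized_response(truth: bool, p: float = 0.75) -> bool:
    """Local DP: report the truth with probability p, otherwise report
    a uniformly random answer."""
    return truth if random.random() < p else random.random() < 0.5

def estimate_rate(reports: list[bool], p: float = 0.75) -> float:
    """Unbias the observed rate: observed = p * true + (1 - p) * 0.5."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p) * 0.5) / p

truths = [random.random() < 0.3 for _ in range(10_000)]  # true rate ~30%
reports = [randomized_response(t) for t in truths]
print(estimate_rate(reports))  # close to 0.30
```

The per-record noise is why Local DP typically costs more utility than the dataset-level treatments described above.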
Protecting Sensitive Attributes
Beyond identifiers, sensitive attributes are features in data that cannot be used to identify someone, but that could provide new and valuable information to an attacker, and therefore increase the risk associated with a disclosure or breach.
Using group-based privacy measures like k-anonymity may not be sufficient to protect sensitive attributes. For example, in a k-anonymized dataset of medical diagnoses, an attacker may be able to narrow an individual down to one of several subjects in a group. If the entire group shares a positive diagnosis, the attacker learns the individual's diagnosis anyway; this is known as a homogeneity attack.
When a dataset contains sensitive attributes, take additional care to minimize the risk of sensitive-attribute disclosure, for example by perturbing the sensitive attributes (possibly using Local Differential Privacy) and increasing k.
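A quick diagnostic for this risk is to count the distinct sensitive values within each QID group; in the research literature this count is the "l" in l-diversity. A minimal sketch (plain Python, same illustrative layout as the k-anonymity example above):

```python
from collections import defaultdict

def l_diversity(records: list[dict], qids: list[str], sensitive: str) -> int:
    """Return l: the smallest number of distinct sensitive values found
    within any quasi-identifier group."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[q] for q in qids)].add(r[sensitive])
    return min(len(values) for values in groups.values())

data = [
    {"age": "30-39", "gender": "F", "diagnosis": "flu"},
    {"age": "30-39", "gender": "F", "diagnosis": "flu"},
]
# l = 1: everyone in the (30-39, F) group shares a diagnosis, so placing
# someone in that group reveals their diagnosis even though k = 2.
print(l_diversity(data, ["age", "gender"], "diagnosis"))
```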
The Benefits of Privacy-safe Data
There are two important considerations when using de-identified data:
- There are no "perfectly safe" datasets, especially rich datasets with many features. For example, subjects may be able to re-identify themselves in an anonymized dataset. However, anonymization can drastically increase the amount of external knowledge an attacker needs to re-identify individuals, and therefore lower the risk of a large-scale re-identification of the subjects in your data.
- Any de-identification treatment removes information from the source dataset. Depending on what information is removed (and how), this treatment has a negative effect on the utility of the data.
In other words, there is an explicit tradeoff between privacy and utility; depending on your use case, you may need to target more- or less-private data to achieve your business objectives.
However, once data has been appropriately anonymized, it typically is not subject to restrictions imposed by regulations like GDPR, HIPAA, CCPA, and others. This makes it easier and more ethical for you to share and use data in your organization for a wide variety of applications, like analytics, software testing, and much more.