Graham Thompson
Anonymization vs Pseudonymization: The Key Differences
Data anonymization and data pseudonymization have fundamental differences that organizations need to be aware of to meet their obligations under modern data protection laws.
Data protection can be a minefield for the uninitiated, but understanding the fundamentals is crucial to ensuring your business is, and remains, compliant with modern security standards.
Data is one of the world’s most valuable currencies. When properly structured and analyzed, data allows organizations to extract value for marketing purposes, consumer profiling, and identifying patterns that help them streamline operations and uncover new revenue streams.
However, the introduction of stringent data protection standards, such as the EU’s General Data Protection Regulation (GDPR), underscores the importance of understanding your duties as a data controller and how you should handle and manage the data you store.
To achieve this, data controllers must understand key definitions, including the differences between ‘anonymized’ data and ‘pseudonymized’ data. While the terms are sometimes used interchangeably, whether data would legally be considered anonymized or pseudonymized changes how it needs to be protected.
Below, we explain the fundamental differences between anonymized data and pseudonymized data, why GDPR matters, and how data should be processed to adhere to modern data protection compliance standards.
What is anonymized data?
Anonymized data is information that has been processed to eradicate markers that could be used to identify individual subjects.
Markers can identify a person directly or indirectly, and include names, aliases, physical addresses, email addresses, dates of birth, gender, telephone numbers, and biometric records, such as fingerprint or retina scans.
Anonymized (also often referred to as de-identified) data can be used for research purposes and in academia and is often extremely valuable for analytical purposes. For example, a business could process an anonymized dataset to identify patterns in consumer habits, or a researcher could use records stripped of personal information to analyze demographic disease patterns.
True anonymization, however, can be difficult. Even if an organization removes direct markers, it must also consider the indirect markers that may still exist within datasets and that could unmask ‘anonymized’ data if multiple sources of information are linked together.
The most common indirect markers that attackers use to re-identify individuals are date of birth, gender, ethnicity, and zip code.
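The linking risk described above can be illustrated with a minimal sketch. All of the data, names, and field names here are hypothetical; the point is only that a release stripped of names can still be joined to another source on date of birth, gender, and zip code.

```python
# Hypothetical "de-identified" release: names removed, indirect markers kept.
released = [
    {"dob": "1971-07-07", "gender": "M", "zip": "10005", "diagnosis": "Diabetes"},
    {"dob": "1985-03-12", "gender": "F", "zip": "10005", "diagnosis": "Asthma"},
]

# Hypothetical public source that still carries names.
public_register = [
    {"name": "Alex Doe", "dob": "1971-07-07", "gender": "M", "zip": "10005"},
    {"name": "Sam Roe", "dob": "1990-01-01", "gender": "F", "zip": "10011"},
]

INDIRECT_MARKERS = ("dob", "gender", "zip")


def link(released_rows, register_rows, keys=INDIRECT_MARKERS):
    """Return (name, diagnosis) pairs where the markers match exactly one person."""
    index = {}
    for person in register_rows:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])

    matches = []
    for row in released_rows:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the record
            matches.append((names[0], row["diagnosis"]))
    return matches


reidentified = link(released, public_register)
```

In this toy data, one of the two released records matches a single register entry, so its diagnosis can be attached to a name even though no name was ever released.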
There may also be borderline cases where particular datasets or information belonging to individual subjects would be difficult to anonymize, such as when rare medical conditions are recorded. In these cases, effectively obfuscating already narrow parameters could be a challenging and complicated prospect.
The goal of anonymization processes is to irrevocably change a dataset to prevent individual subjects from being identified. If organizations intend to use, store, or release such datasets, they must take reasonable steps to prevent the re-identification of individuals connected to such information.
What is pseudonymized data?
Pseudonymization is a temporary and reversible way of removing identifying markers from personal information, and pseudonymized data remains subject to the privacy and security rules in all major data privacy regulations, including GDPR, HIPAA, and CPRA.
Pseudonymization is a privacy-enhancing measure designed to reduce the chance of attributing held information to human data subjects, but there still may be an element of risk in certain situations.
This can be achieved in various ways, including the use of pseudonyms—hence, the name—removing personal categories of data, and by using cryptographic hashes and encryption.
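As one illustration of the cryptographic route, the sketch below derives a stable pseudonym from an identifier using a keyed hash (HMAC-SHA256) from Python's standard library. The key value, the `P-` prefix, and the function name are assumptions made for the example. A keyed hash is used rather than a plain hash because identifiers drawn from a small set of possible values can otherwise be guessed by hashing every candidate.

```python
import hashlib
import hmac

# Assumption: a real key would come from a secrets manager, not source code.
SECRET_KEY = b"example-key-for-illustration-only"


def pseudonym_for(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Derive a stable pseudonym from an identifier with HMAC-SHA256."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return "P-" + digest[:16]  # truncated only to keep the example readable


first = pseudonym_for("patient@example.com")
second = pseudonym_for("patient@example.com")
other = pseudonym_for("someone.else@example.com")
```

The same identifier always maps to the same pseudonym, so records can still be joined for analysis. Anyone holding the key can recompute the mapping, which is why the output remains pseudonymized rather than anonymized.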
Organizations most commonly opt for pseudonymization rather than full anonymization because it preserves their ability to recover the original data. There’s only one problem: pseudonymized data doesn’t meet the regulatory requirements driving the effort in the first place! At best, pseudonymized data can be a helpful tool for achieving data minimization, where personal data is only accessed when absolutely necessary.
Article 4 of the EU’s GDPR defines pseudonymization as “the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information.” This is often a code, token, or hash value.
Additional information that may unmask personal information has to be kept separately and in a secure manner, and this needs to be achieved through the use of appropriate technologies and organizational security practices.
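A minimal sketch of that separation, assuming a simple token-to-value lookup: the `TokenVault` class is hypothetical and in-memory, standing in for a store that would in practice sit in its own secured environment with its own access controls.

```python
import secrets


class TokenVault:
    """Hypothetical stand-in for a separately secured lookup table."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        """Return the existing token for a value, or issue a new random one."""
        if value not in self._value_to_token:
            token = "tok_" + secrets.token_hex(8)
            self._value_to_token[value] = token
            self._token_to_value[token] = value
        return self._value_to_token[value]

    def detokenize(self, token: str) -> str:
        """Recover the original value; only possible with access to the vault."""
        return self._token_to_value[token]


vault = TokenVault()
record = {"name": "Jane Roe", "diagnosis": "Diabetes"}  # made-up record
working_copy = {
    "name": vault.tokenize(record["name"]),
    "diagnosis": record["diagnosis"],
}
```

Analysts would see only `working_copy`. Because the tokens are random, they reveal nothing on their own, and re-identification depends entirely on who can reach the vault.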
Overall, pseudonymized data has been scrubbed of identifying markers, but it is still possible to identify subjects if additional datasets, knowledge, and tools are applied. It is not a silver bullet allowing data owners free rein on protected personal information.
What is the difference between anonymization and pseudonymization?
The difference between anonymization and pseudonymization lies in one central question: can the dataset be transformed from non-identifiable to identifiable with additional information or the right tools and technologies?
Anonymized data should have been processed to remove any markers that would revert records back to personal, identifiable information, and this should be a permanent change. However, it may be possible for pseudonymized data, which is still considered ‘personal’ under many privacy laws, to be altered to re-identify individual subjects.
A pseudonymized dataset can be linked to further identifiers that unmask individuals, whereas this is not possible with anonymized data. Now, we could go down the academic rabbit hole here and talk about all of the data an attacker may have access to in order to re-identify someone, but this is a situation where taking a reasonable, defensible, and logical approach is best.
It’s worth noting that the introduction of consumer artificial intelligence (AI) changes the calculation of what an attacker may have access to, and it’s certainly a strong argument for increasing the level of anonymization applied to any datasets published on the public internet. However, given current technological limitations, data protection laws cannot and do not hold data controllers and their organizations accountable for future possibilities.
The techniques used in anonymization and pseudonymization
Below are the most common methods employed to anonymize or pseudonymize datasets:
- Removing identifiers: Often, the first step in anonymization or pseudonymization is to remove or redact personally identifiable information (PII) that could reveal the identity of data subjects. HIPAA’s Safe Harbor method, for example, lists 18 identifier types; common examples include names, aliases, personal identification numbers, street addresses, email addresses, phone numbers, biometric data, and IP/MAC addresses.
- Generalization: Anonymization requires data to be generalized. Removing identifiers is the first step, but for true anonymization, the remaining data points must be coarsened (for example, replacing a date of birth with an age range) until they no longer single out an individual. k-anonymity is a model that applies mathematical rules to generalization: each record must be indistinguishable from at least k-1 others on its indirect identifiers, which reduces re-identifiability.
- Pseudonyms: Another method is replacing fields and identifiers with pseudonyms, improving the chance of keeping an identity hidden while also reducing the need to omit valuable data points. For example, a business could remove an individual’s name and replace it with a random number assignment.
- Obfuscation: Data controllers can also obfuscate personal indicators by replacing them with other values, making it difficult to understand the dataset or extract identifiers without prior knowledge and an understanding of the variable changes.
- Noise: Sensitive information can also be masked by adding randomized data or variables, thereby scrambling the original dataset and enhancing subject privacy.
- Splitting: Organizations can perform pseudonymization by splitting datasets and storing different variables and identifiers in secure environments. Applying the principle of least privilege to these environments can bolster existing data protection standards.
- Hashing: Applying cryptographic hashes to mask sensitive data, such as Social Security numbers, is often considered a “one way” process and can make it more difficult for casual viewers to decipher while masking the original dataset.
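To make the generalization technique from the list above more concrete, here is a small sketch that coarsens two indirect identifiers and then measures k, the size of the smallest group of records sharing the same values. The rows, the field names, and the choice of 10-year age bands and 3-digit zip prefixes are all assumptions for illustration; suitable generalization levels depend on the dataset.

```python
from collections import Counter


def generalize(row):
    """Coarsen indirect identifiers: 10-year age band, first three zip digits."""
    decade = (row["age"] // 10) * 10
    return {
        "age_band": f"{decade}-{decade + 9}",
        "zip3": row["zip"][:3],
        "gender": row["gender"],
    }


def k_anonymity(rows, keys):
    """Return the size of the smallest group sharing the same values for keys."""
    groups = Counter(tuple(row[k] for k in keys) for row in rows)
    return min(groups.values())


# Made-up rows in which every record is unique before generalization.
rows = [
    {"age": 52, "zip": "10005", "gender": "M"},
    {"age": 57, "zip": "10011", "gender": "M"},
    {"age": 34, "zip": "10005", "gender": "F"},
    {"age": 38, "zip": "10023", "gender": "F"},
]

k_before = k_anonymity(rows, ("age", "zip", "gender"))
generalized = [generalize(row) for row in rows]
k_after = k_anonymity(generalized, ("age_band", "zip3", "gender"))
```

Before generalization each row stands alone (k is 1); afterwards every row shares its values with one other row (k is 2). A larger k means each individual hides in a larger group, at the cost of less precise data.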
Real-world examples of anonymization and pseudonymization
Let’s explore the following example of a patient record held at a hospital. The original record contains the patient’s full, personal healthcare details; it then undergoes pseudonymization and, finally, anonymization:
Full record: David Smith | Address: 1 Stone Street, New York, NY 10005 | Telephone number: +0 857 356 | DOB: July 7, 1971 | Male | Diabetes
Pseudonymized: Patient123 | Address: [token] New York, NY 10005 | Telephone number: [token] | DOB: July 7, 1971 | Male | Diabetes
Anonymized: Patient123 | Zip: 10005 | Age: 52 | Male | Diabetes
The pseudonymized record removes important variables that could be used to identify the individual and also uses a pseudonym instead of the patient’s name. However, with access to the original dataset or other connected data points, it is possible to re-identify the individual.
With the anonymized record, by contrast, it should not be possible to identify the subject, provided the record sits within a sufficiently large, generalized dataset.
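The two transformations in the hospital example could be sketched roughly as follows. The field names, the `[token]` placeholder handling, and the reference date used to compute the age are assumptions; the record values are modelled loosely on the example above. Unlike that example, this sketch leaves the patient label out of the anonymized output, since a stable label could be matched against the pseudonymized copy.

```python
from datetime import date

full_record = {
    "name": "David Smith",
    "street": "1 Stone Street",
    "zip": "10005",
    "phone": "+0 857 356",
    "dob": date(1971, 7, 7),
    "gender": "Male",
    "condition": "Diabetes",
}


def age_on(dob: date, today: date) -> int:
    """Whole years between dob and today."""
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    return today.year - dob.year - (0 if had_birthday else 1)


def pseudonymize(record: dict, pseudonym: str) -> dict:
    """Swap direct identifiers for a pseudonym or token; keep everything else."""
    out = dict(record)
    out["name"] = pseudonym
    out["street"] = "[token]"
    out["phone"] = "[token]"
    return out


def anonymize(record: dict, today: date) -> dict:
    """Keep only generalized fields; the exact date of birth becomes an age."""
    return {
        "zip": record["zip"],
        "age": age_on(record["dob"], today),
        "gender": record["gender"],
        "condition": record["condition"],
    }


reference_date = date(2024, 3, 1)  # assumed date of processing
pseudonymized = pseudonymize(full_record, "Patient123")
anonymized = anonymize(full_record, reference_date)
```

The pseudonymized copy still carries the exact date of birth, which is one reason it remains linkable, while the anonymized copy keeps only an age and the coarser fields.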
Let’s look at another example of pseudonymizing and anonymizing a record. This one involves a student’s school record.
Full record: Claire Smith | Address: 2 Stone Street | Gender: Female | DOB: Jan 27, 2008 | GPA: 4.7 | Favorite subject: Math
Pseudonymized: Student234 | DOB: Jan 27, 2008 | Gender: Female | GPA: 4.7 | Favorite subject: Math
Anonymized: Student234 | Age: 16 | Gender: Female | GPA: 4.5 (adjusted) | Favorite subject: Math
It would be possible to track down the name of the student when the data has been pseudonymized if you had access to the original school records. In this instance, the school might opt for pseudonymization rather than anonymization, as the dataset retains valuable variables for an analysis of the school’s performance, such as whether age and gender correlate with GPA.
In comparison, the anonymized dataset is not linkable, containing only generalized values such as age and an adjusted GPA, but has lost valuable variables for academic research and analysis.
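The article does not say how the “adjusted” GPA in the student example was produced. One possibility, shown here purely as an assumption, is rounding each value to a coarser step so that exact scores no longer distinguish students. The `coarsen` helper and the 0.5 step are hypothetical.

```python
import math


def coarsen(value: float, step: float = 0.5) -> float:
    """Round a value to the nearest multiple of step (halves round up)."""
    return math.floor(value / step + 0.5) * step


adjusted_gpa = coarsen(4.7)
```

With a 0.5 step, a GPA of 4.7 becomes 4.5. Every student whose GPA falls between 4.25 and 4.75 would land on the same value, which is what makes the figure less identifying and also less precise for analysis.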
How anonymized and pseudonymized data fall under GDPR
Understanding whether or not data is anonymized or pseudonymized will impact if and how data is protected under GDPR and whether or not an organization is permitted to collect, store, and use the personal information.
GDPR took effect in 2018 and was developed to replace a 1995 data protection directive that, due to its age and the rapid technological innovation of the last few decades, no longer applied well to modern data and privacy concerns.
GDPR provides guidance to organizations on their responsibilities as data custodians—but anonymized data does not tend to fall under the GDPR framework.
While true anonymization can be challenging to achieve, time-consuming, complicated, and may devalue the subject data, taking this approach to data protection and management also loosens legal restrictions on data usage and sharing.
According to the European Data Protection Board (EDPB), anonymization is defined as “a process that consists in using a set of techniques to make personal data anonymous in such a way that it becomes impossible to identify the person by any means that are reasonably likely to be used.”
Furthermore, the latest guidance issued by the UK Information Commissioner's Office (ICO) states, “effective anonymization reduces identifiability risk to a sufficiently remote level.”
Today’s data controllers do not need to consider theoretical ways that anonymization could be broken, such as through future technologies, but they must consider the “means reasonably likely to be used” when assessing whether a subject could become identifiable.
“You should base this on objective factors such as the costs and time required to identify, the available technologies, and the state of technological development over time,” the ICO adds. “When considering releasing anonymous information to the world at large, you may have to implement more robust techniques to achieve effective anonymization than when releasing to particular groups or individual organizations.”
Pseudonymized data, however, is a different story.
Compliance burdens are higher with pseudonymized data, which is more likely to fall under GDPR data protection standards.
According to GDPR’s Recital 26, “The principles of data protection should apply to any information concerning an identified or identifiable natural person. Personal data which has undergone pseudonymization, which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person.”
GDPR attempts to bridge the gap between the protection that anonymized information, by design, should uphold to protect subject privacy and rights—and what may be lacking in pseudonymized datasets.
If datasets have undergone pseudonymization, they are still treated as personal and identifiable—via either direct or indirect identifiers—and so organizations are responsible for applying reasonable measures to safeguard such information.
They should also step carefully if they intend to share pseudonymized information with partners. As highlighted by a ruling issued by the General Court of the European Union in 2023, SRB v EDPS (1, 2), there is still some confusion surrounding data subject consent and the transmission of what could be considered as ‘personal’ data.
The motivated intruder test
Risk plays a part in how businesses should treat data and whether or not anonymization or pseudonymization should be applied.
To assist data protection officers in deciding what protections apply to stored data, the UK Information Commissioner's Office (ICO) has released guidelines for a “motivated intruder” test.
Unfortunately, there is often a chance that a cyberattacker, insider, or even an accidental insider could compromise data privacy and thereby leave an organization open to compliance issues and penalties. This test can be used to assess whether a motivated intruder is ‘likely’ to be successful in identifying an individual from “whose personal data the anonymous information is derived.”
The test assumes an intruder is reasonably competent, has access to appropriate resources—such as public documents and the internet—and is able to use investigative techniques.
If it is likely that data subjects could be identified by an unauthorized party, organizations may wish to revisit the data anonymization or pseudonymization techniques they have implemented and consider more stringent data anonymization practices.
Should organizations anonymize or pseudonymize their data?
Having explored the fundamental differences between anonymized and pseudonymized data, organizations can be better prepared to decide how to treat their data repositories.
Anonymized data is currently the gold standard for subject privacy, and allows organizations to protect the rights of data subjects while being permitted to hold anonymized datasets indefinitely. However, the anonymization process will, by design, remove valuable information, which can reduce the utility of such data to its controllers.
GDPR and similar data protection standards were introduced to steer organizations into adopting ‘data privacy by design’ frameworks for data management.
GDPR ensures that reasonable security and privacy conditions are met if pseudonymization is the preferred option. Organizations may be able to more effectively capitalize on data processed in this way, but on the condition they meet a higher data protection compliance burden, and they may also be subject to data retention limitations.
Data is one of the world’s most valuable commodities and one that is creating new opportunities for business growth and expansion worldwide. As such, organizations should ensure that the data they are responsible for, whether anonymized or pseudonymized, is adequately protected.