How to Assess the Quality of De-Identified Data

Ted Conbeer

02.24.23 · 4 min read

In this tutorial, you will learn how to assess the quality of de-identified data.

When our customers first hear that Privacy Dynamics anonymizes data by perturbing quasi-identifiers like dates, locations, and demographics, some react with "No, that won't work for me. I need to use that data!"

To that, we say, "Of course you do! And you can!" Our proprietary, k-anonymity-based, micro-aggregation approach introduces only minimal distortion, especially for larger datasets. That means that you can use treated data in data pipelines, for advanced analytics, for dev and test data, and almost anywhere else you would use the sensitive data.

But don't take our word for it. You can sign up today and treat your first dataset within minutes. Then you can use the techniques below to evaluate the quality of the anonymized data, and compare it to what you get from other approaches, like synthetic data.

How We Change Your Data

Our algorithm first suppresses direct identifiers, like names and email addresses, by redacting, faking, masking, or hashing them. Then it creates a very large number of very small clusters of subjects with similar quasi-identifiers, like ages and zip codes, and replaces individual subjects' values with an average from the cluster. At scale, over thousands or millions of records, only a small percentage of subjects are unique and need their values changed. And in a large dataset, each unique subject is likely to be similar to another subject, so the changes, which we call "distortion", will be small.

Here's a toy example of roughly how our algorithm would treat a small dataset:

Raw Data

id	name	email	age	gender	zip
1	Abe	abe@example.com	36	m	80210
2	Brad	brad@example.com	38	m	80301
3	Charlie	charlie@example.com	27	f	10013
4	Dorothy	dorothy@example.com	32	f	10007
...	...	...	...	...	...

Treated Data

id	name	email	age	gender	zip
1	(Redacted)	fake1234@example.com	37	m	80210
2	(Redacted)	fake2345@example.com	37	m	80210
3	(Redacted)	fake3456@example.com	30	f	10013
4	(Redacted)	fake4567@example.com	30	f	10013
...	...	...	...	...	...
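To make the clustering-and-averaging idea concrete, here is a minimal Python sketch that reproduces the quasi-identifier columns of the toy tables above. It is not our production algorithm: the `microaggregate` helper, the choice to cluster within gender by sorted age, the half-up rounding, and the rule of sharing the first member's zip are all assumptions made for illustration. Direct identifiers (name, email) are left out because they are handled separately.

```python
from collections import defaultdict

# Quasi-identifier columns from the toy "Raw Data" table above.
raw = [
    {"id": 1, "age": 36, "gender": "m", "zip": "80210"},
    {"id": 2, "age": 38, "gender": "m", "zip": "80301"},
    {"id": 3, "age": 27, "gender": "f", "zip": "10013"},
    {"id": 4, "age": 32, "gender": "f", "zip": "10007"},
]


def microaggregate(rows, k=2):
    """Hypothetical helper: form clusters of about k similar rows and share values."""
    by_gender = defaultdict(list)
    for row in rows:
        by_gender[row["gender"]].append(row)

    treated = []
    for group in by_gender.values():
        group = sorted(group, key=lambda r: r["age"])
        # Chunk neighbours into clusters of k; fold a short tail into the previous cluster.
        # (A gender group with fewer than k rows stays small in this simplified sketch.)
        clusters = [group[i:i + k] for i in range(0, len(group), k)]
        if len(clusters) > 1 and len(clusters[-1]) < k:
            tail = clusters.pop()
            clusters[-1].extend(tail)
        for cluster in clusters:
            mean_age = sum(r["age"] for r in cluster) / len(cluster)
            shared_age = int(mean_age + 0.5)   # assumption: round half up
            shared_zip = cluster[0]["zip"]     # assumption: reuse the first member's zip
            for r in cluster:
                treated.append({**r, "age": shared_age, "zip": shared_zip})
    return sorted(treated, key=lambda r: r["id"])


treated = microaggregate(raw)
for row in treated:
    print(row)
```

With only four records every value moves; the point of the sections below is to measure how much smaller that movement becomes on a realistic dataset.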

Assessing Distortion

For a treated dataset, we want to understand how many records have changed, and how big those changes are.

We are going to use a sample users dataset from a fake fitness company. This dataset contains only 10k records, which will exaggerate distortion (compared to a larger dataset). That dataset looks like this:

We treated the data using Privacy Dynamics. We faked names and email addresses, redacted the stripe_id, and very conservatively treated categorical columns as potential quasi-IDs. The resulting data looks like this:

It's easy to eyeball the data and see that our direct identifiers were treated as we intended, but what about the quasi-IDs? For that, we'll need a bit of SQL.

Univariate Distortion

While your first thought might be to join the post records onto the pre records, visualizing the changes is actually easier if we union the two datasets together, and add a treatment_status column:

select
    id,
    gender,
    age,
    weight_lbs,
    height_in,
    completed_workouts_last_month,
    favorite_activity,
    'Raw Data' as treatment_status
from pre
union all
select
    id,
    gender,
    age,
    weight_lbs,
    height_in,
    completed_workouts_last_month,
    favorite_activity,
    'Privacy Dynamics' as treatment_status
from post

Then, with that query saved as a CTE or view named unioned, we can generate a histogram for any single trait with:

select
    <trait>,
    treatment_status,
    count(*) as cnt,
    count(*) * 1.0 / sum(count(*)) over (partition by treatment_status) as pct_total
from unioned
group by 1, 2
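If you want to try the union-and-histogram pattern without a warehouse, the sketch below runs it end to end in an in-memory SQLite database from Python. The `pre` and `post` tables here hold only the four toy records from earlier (not the fitness dataset), the `histogram` helper is a hypothetical name, and the counting step is split into its own CTE so the window function only sees plain columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table pre (id integer, gender text, age integer);
    create table post (id integer, gender text, age integer);
    -- Ages from the toy example: raw values in pre, treated values in post.
    insert into pre values (1, 'm', 36), (2, 'm', 38), (3, 'f', 27), (4, 'f', 32);
    insert into post values (1, 'm', 37), (2, 'm', 37), (3, 'f', 30), (4, 'f', 30);
""")


def histogram(conn, trait):
    """Hypothetical helper: per-status counts and shares for one column.

    `trait` is interpolated into the SQL text, so pass only trusted column names.
    """
    query = f"""
        with unioned as (
            select id, gender, age, 'Raw Data' as treatment_status from pre
            union all
            select id, gender, age, 'Privacy Dynamics' as treatment_status from post
        ),
        counts as (
            select {trait} as trait_value, treatment_status, count(*) as cnt
            from unioned
            group by {trait}, treatment_status
        )
        select
            trait_value,
            treatment_status,
            cnt,
            cnt * 1.0 / sum(cnt) over (partition by treatment_status) as pct_total
        from counts
        order by treatment_status, trait_value
    """
    return conn.execute(query).fetchall()


age_hist = histogram(conn, "age")
for row in age_hist:
    print(row)
```

Each row is one histogram bar: the trait value, which dataset it came from, the count, and that count as a share of its dataset. Feeding these rows to any charting tool, with one line per treatment_status, gives the overlaid view.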

Plotting multiple histograms overlaid as line charts is very effective at gauging distortion.

The plots above give a good sense of how univariate metrics for the whole dataset change. Note that the treatment here introduces minimal distortion and preserves discontinuities in the source data. However, you may wonder how individual records change when the microaggregation process is applied. To see that, we join the two datasets together:

select pre.age, post.age as age_post, pre.height_in, post.height_in as height_in_post
from pre
left join post on pre.id = post.id

And then we can produce a confusion matrix, either as a heatmap or a spike plot:

And now we see that not only are the top-level aggregates preserved, but where quasi-IDs do change, those changes tend to be small (the points are tightly grouped around the line y = x).
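The same record-level join can also answer the two questions we started with, how many records changed and by how much, as plain numbers. The sketch below is one way to do that, again using in-memory SQLite from Python. The first four rows are the toy records from earlier; the fifth, unchanged row is a made-up addition so that not every record moves. An inner join is used here (an assumption that simplifies the arithmetic) so records missing from `post` are dropped rather than showing up as nulls.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table pre (id integer, age integer);
    create table post (id integer, age integer);
    -- ids 1-4 follow the toy example; id 5 is a hypothetical unchanged record.
    insert into pre values (1, 36), (2, 38), (3, 27), (4, 32), (5, 45);
    insert into post values (1, 37), (2, 37), (3, 30), (4, 30), (5, 45);
""")

# Each output row is one cell of the confusion matrix: (raw age, treated age, count).
cells = conn.execute("""
    select pre.age, post.age as age_post, count(*) as n
    from pre
    join post on pre.id = post.id
    group by pre.age, post.age
    order by pre.age, post.age
""").fetchall()

total = sum(n for _, _, n in cells)
changed = sum(n for age, age_post, n in cells if age != age_post)
pct_changed = changed / total
mean_abs_change = sum(abs(age - age_post) * n for age, age_post, n in cells) / total

print(f"{changed} of {total} records changed ({pct_changed:.0%})")
print(f"mean absolute change in age: {mean_abs_change:.2f} years")
```

The `cells` list is exactly what a heatmap needs (x, y, and intensity), so the same query can feed both the plot and the summary numbers.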

Multivariate Distortion

It is important in many analyses that the relationships between variables are maintained. Here, again, the microaggregation process shines. Even after drilling down into multiple dimensions, we see that statistics remain virtually unchanged (even for this small dataset):
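One simple way to check a multivariate relationship yourself is to compute the same statistic for every combination of two dimensions, once on the raw data and once on the treated data, and put the results side by side. The sketch below does this for average age by gender and favorite_activity in in-memory SQLite. The ages and genders are the toy records from earlier; the favorite_activity values are invented for illustration, and this is only one of many possible drill-downs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table pre (id integer, gender text, favorite_activity text, age integer);
    create table post (id integer, gender text, favorite_activity text, age integer);
    -- favorite_activity values are hypothetical; ages follow the toy example.
    insert into pre values
        (1, 'm', 'run', 36), (2, 'm', 'run', 38),
        (3, 'f', 'yoga', 27), (4, 'f', 'yoga', 32);
    insert into post values
        (1, 'm', 'run', 37), (2, 'm', 'run', 37),
        (3, 'f', 'yoga', 30), (4, 'f', 'yoga', 30);
""")

# Pivot the unioned data so each group has its raw and treated average on one row.
comparison = conn.execute("""
    with unioned as (
        select gender, favorite_activity, age, 'Raw Data' as treatment_status from pre
        union all
        select gender, favorite_activity, age, 'Privacy Dynamics' as treatment_status from post
    )
    select
        gender,
        favorite_activity,
        avg(case when treatment_status = 'Raw Data' then age end) as avg_age_raw,
        avg(case when treatment_status = 'Privacy Dynamics' then age end) as avg_age_treated
    from unioned
    group by gender, favorite_activity
    order by gender, favorite_activity
""").fetchall()

for gender, activity, avg_raw, avg_treated in comparison:
    print(gender, activity, avg_raw, avg_treated, abs(avg_raw - avg_treated))
```

Because the grouping columns are themselves quasi-IDs, a group can gain or lose members after treatment. On a real dataset, a group that appears on only one side would show a null average here and is worth inspecting separately.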

Wrapping Up

In this post, we saw how to assess the distortion introduced when anonymizing quasi-identifiers with Privacy Dynamics. If you would like to learn more about Privacy Dynamics, you can try it for free or book a demo and we will help you design a solution for your organization.