Ted Conbeer
· 4 min read
How to Assess the Quality of De-Identified Data
In this tutorial, you will learn how to assess the quality of de-identified data.
When our customers first hear that Privacy Dynamics anonymizes data by perturbing quasi-identifiers like dates, locations, and demographics, some react with "No, that won't work for me. I need to use that data!"
To that, we say, "Of course you do! And you can!" Our proprietary, k-anonymity-based microaggregation approach introduces only minimal distortion, especially for larger datasets. That means you can use treated data in data pipelines, for advanced analytics, for dev and test data, and almost anywhere else you would use the sensitive data.
But don't take our word for it. You can sign up today and treat your first dataset within minutes. Then you can use the techniques below to evaluate the quality of the anonymized data, and compare it to what you get from other approaches, like synthetic data.
How We Change Your Data
Our algorithm first suppresses direct identifiers, like names and email addresses, by redacting, faking, masking, or hashing them. Then, it creates a very large number of very small clusters of subjects in your data with similar quasi-identifiers, like ages and zip codes. We then replace individual subjects' values with an average from the cluster. At scale, over thousands or millions of records, only a small percentage of subjects are unique and need their values changed. And with a large dataset, each unique subject is likely to be similar to another subject, so the changes, which we call "distortion", will be small.
Here's a toy example of roughly how our algorithm would treat a small dataset:
Raw Data
id | name | email | age | gender | zip |
---|---|---|---|---|---|
1 | Abe | abe@example.com | 36 | m | 80210 |
2 | Brad | brad@example.com | 38 | m | 80301 |
3 | Charlie | charlie@example.com | 27 | f | 10013 |
4 | Dorothy | dorothy@example.com | 32 | f | 10007 |
... | ... | ... | ... | ... | ... |
Treated Data
id | name | email | age | gender | zip |
---|---|---|---|---|---|
1 | (Redacted) | fake1234@example.com | 37 | m | 80210 |
2 | (Redacted) | fake2345@example.com | 37 | m | 80210 |
3 | (Redacted) | fake3456@example.com | 30 | f | 10013 |
4 | (Redacted) | fake4567@example.com | 30 | f | 10013 |
... | ... | ... | ... | ... | ... |
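To make the clustering-and-averaging idea concrete, here is a minimal Python sketch that reproduces the toy tables above. It is not the Privacy Dynamics algorithm, which is proprietary; it assumes a deliberately naive strategy (group by gender, sort by age, cut into clusters of `k` subjects, then share a rounded mean age and the first member's zip code). The function name `microaggregate` and the choice of the first zip as the cluster's representative are illustrative assumptions.

```python
from collections import defaultdict


def microaggregate(records, k=2):
    """Toy microaggregation: cluster by gender, then by sorted age in groups of k.

    Illustrative only; a real implementation would cluster on all
    quasi-identifiers at once and choose clusters far more carefully.
    """
    by_gender = defaultdict(list)
    for rec in records:
        by_gender[rec["gender"]].append(rec)

    treated = {}
    for group in by_gender.values():
        group.sort(key=lambda r: r["age"])
        # Cut into clusters of k; fold a short tail into the previous cluster
        # so that no cluster ends up with fewer than k members.
        clusters = [group[i:i + k] for i in range(0, len(group), k)]
        if len(clusters) > 1 and len(clusters[-1]) < k:
            clusters[-2].extend(clusters.pop())
        for cluster in clusters:
            # Round half up so that 29.5 becomes 30, as in the toy table.
            mean_age = int(sum(r["age"] for r in cluster) / len(cluster) + 0.5)
            shared_zip = cluster[0]["zip"]  # assumption: first member represents the cluster
            for r in cluster:
                treated[r["id"]] = {**r, "age": mean_age, "zip": shared_zip}
    # Return rows in their original order without mutating the input.
    return [treated[r["id"]] for r in records]


raw = [
    {"id": 1, "age": 36, "gender": "m", "zip": "80210"},
    {"id": 2, "age": 38, "gender": "m", "zip": "80301"},
    {"id": 3, "age": 27, "gender": "f", "zip": "10013"},
    {"id": 4, "age": 32, "gender": "f", "zip": "10007"},
]
treated = microaggregate(raw, k=2)
for row in treated:
    print(row)
```

After treatment, every combination of age, gender, and zip is shared by at least `k` subjects, which is the k-anonymity property the approach is built on.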
Assessing Distortion
For a treated dataset, we want to understand how many records have changed, and how big those changes are.
We are going to use a sample users dataset from a fake fitness company. This dataset contains only 10k records, which will exaggerate distortion (compared to a larger dataset). That dataset looks like this:
We treated the data using Privacy Dynamics. We faked names and email addresses, redacted the stripe_id, and very conservatively treated categorical columns as potential quasi-IDs. The resulting data looks like this:
It's easy to eyeball the data and see that our direct identifiers were treated as we intended, but what about the quasi-IDs? For that, we'll need a bit of SQL.
Univariate Distortion
While your first thought might be to join the post records onto the pre records, visualizing the changes is actually easier if we union the two datasets together and add a treatment_status column:
select
id,
gender,
age,
weight_lbs,
height_in,
completed_workouts_last_month,
favorite_activity,
'Raw Data' as treatment_status
from pre
union all
select
id,
gender,
age,
weight_lbs,
height_in,
completed_workouts_last_month,
favorite_activity,
'Privacy Dynamics' as treatment_status
from post
Then, to assess the distortion in any single trait, we can generate histograms simply with:
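-- "unioned" is the union all query above, saved as a CTE or view;
-- replace <trait> with the column to inspect, e.g. age
select
    <trait>,
    treatment_status,
    count(*) as cnt,
    -- repeat count(*) rather than reusing the cnt alias, which not every warehouse allows
    count(*) * 1.0 / sum(count(*)) over (partition by treatment_status) as pct_total
from unioned
group by 1, 2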
Plotting multiple histograms overlaid as line charts is very effective at gauging distortion.
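If you want a single number to go with the overlaid histograms, one option is the total variation distance between the raw and treated distributions: half the sum of the absolute differences in each value's share of records. The sketch below is an assumption on our part rather than a metric the product reports, and the sample values are made up for illustration. A result of 0 means the two histograms are identical; 1 means they do not overlap at all.

```python
from collections import Counter


def histogram(values):
    """Return each distinct value's share of the total (like pct_total in the SQL)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def total_variation_distance(raw_values, treated_values):
    """Half the summed absolute difference between the two normalized histograms."""
    p = histogram(raw_values)
    q = histogram(treated_values)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in set(p) | set(q))


# Made-up ages: one subject moves from 33 to 32 after treatment.
raw_ages = [30, 30, 31, 32, 32, 33]
treated_ages = [30, 30, 31, 32, 32, 32]

tvd = total_variation_distance(raw_ages, treated_ages)
print(f"total variation distance: {tvd:.3f}")
```

Very small samples exaggerate this distance just as they exaggerate visual distortion, so it is most useful for comparing treatments of the same dataset.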
From the plots above, we get a good sense of how our univariate metrics for the whole dataset may change. Note that the treatment here introduces minimal distortion and preserves discontinuities in the source data. However, you may wonder how individual records change when the microaggregation process is applied. For this, we will join the two datasets together:
select pre.age, post.age as age_post, pre.height_in, post.height_in as height_in_post
from pre
left join post on pre.id = post.id
And then we can produce a confusion matrix, either as a heatmap or a spike plot:
And now we see that not only are the top-level aggregates preserved, but where quasi-IDs do change, those changes tend to be small (the line is tightly grouped around x=y).
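The same record-level comparison can be summarized numerically. The Python sketch below pairs raw and treated values by id, counts each (raw, treated) pair to form the confusion matrix, and reports how many records changed and by how much. The function name `record_level_distortion` is hypothetical, and the sample rows are the ages from the toy tables earlier in this post. Unlike the left join above, it skips raw rows that have no treated counterpart.

```python
from collections import Counter


def record_level_distortion(pre, post, column):
    """Pair each raw value with its treated value by id and summarize the changes."""
    post_by_id = {row["id"]: row for row in post}
    # Inner join on id: raw rows with no treated counterpart are skipped.
    pairs = [
        (row[column], post_by_id[row["id"]][column])
        for row in pre
        if row["id"] in post_by_id
    ]
    if not pairs:
        raise ValueError("no matching ids between pre and post")
    return {
        # Counts per (raw, treated) pair: the cells of the confusion matrix.
        "confusion": Counter(pairs),
        # Share of records whose value changed at all.
        "pct_changed": sum(a != b for a, b in pairs) / len(pairs),
        # Average size of the change, in the column's own units.
        "mean_abs_change": sum(abs(a - b) for a, b in pairs) / len(pairs),
    }


pre = [{"id": 1, "age": 36}, {"id": 2, "age": 38}, {"id": 3, "age": 27}, {"id": 4, "age": 32}]
post = [{"id": 1, "age": 37}, {"id": 2, "age": 37}, {"id": 3, "age": 30}, {"id": 4, "age": 30}]

summary = record_level_distortion(pre, post, "age")
print(summary["pct_changed"], summary["mean_abs_change"])
```

In a four-row toy example every record changes; on a dataset with thousands of records, most pairs should sit on the diagonal where raw and treated values are equal.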
Multivariate Distortion
It is important in many analyses that the relationships between variables are maintained. Here, again, the microaggregation process shines. Even after drilling down into multiple dimensions, we see that statistics remain virtually unchanged (even for this small dataset):
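One way to check this yourself is to compute the same grouped statistic on both datasets and compare the results side by side. The Python sketch below does this for the mean of one column across combinations of other columns. The helper names and sample rows are hypothetical; only the column names (gender, favorite_activity, weight_lbs) come from the dataset above.

```python
from collections import defaultdict


def grouped_means(rows, group_cols, value_col):
    """Mean of value_col for every combination of values in group_cols."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, row count]
    for row in rows:
        key = tuple(row[col] for col in group_cols)
        sums[key][0] += row[value_col]
        sums[key][1] += 1
    return {key: total / n for key, (total, n) in sums.items()}


def compare_grouped_means(pre, post, group_cols, value_col):
    """Map each group to (raw mean, treated mean); None marks a group missing on one side."""
    pre_means = grouped_means(pre, group_cols, value_col)
    post_means = grouped_means(post, group_cols, value_col)
    keys = sorted(set(pre_means) | set(post_means))
    return {key: (pre_means.get(key), post_means.get(key)) for key in keys}


# Made-up rows for illustration only.
pre = [
    {"gender": "m", "favorite_activity": "run", "weight_lbs": 180},
    {"gender": "m", "favorite_activity": "run", "weight_lbs": 190},
    {"gender": "f", "favorite_activity": "yoga", "weight_lbs": 130},
    {"gender": "f", "favorite_activity": "yoga", "weight_lbs": 140},
]
post = [
    {"gender": "m", "favorite_activity": "run", "weight_lbs": 185},
    {"gender": "m", "favorite_activity": "run", "weight_lbs": 185},
    {"gender": "f", "favorite_activity": "yoga", "weight_lbs": 135},
    {"gender": "f", "favorite_activity": "yoga", "weight_lbs": 135},
]

comparison = compare_grouped_means(pre, post, ["gender", "favorite_activity"], "weight_lbs")
for group, (raw_mean, treated_mean) in comparison.items():
    print(group, raw_mean, treated_mean)
```

Because the grouping columns are themselves treated as quasi-IDs, a few records may move between groups; a `None` on either side flags a group that exists in only one of the datasets.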
Wrapping Up
In this post, we saw how to assess the distortion introduced when anonymizing quasi-identifiers with Privacy Dynamics. If you would like to learn more about Privacy Dynamics, you can try it for free or book a demo and we will help you design a solution for your organization.