Ted Conbeer
Five Questions to Ask If You Are Considering Synthetic Data
Do you want to anonymize the data you use for analysis, lower environments, or pre-sales trials? Our mission at Privacy Dynamics is to empower innovative and ethical data teams like yours, and we are so happy that you found us.
We often get asked about how our solution compares to synthetic data, and are surprised to find that most synthetic data companies aren't straightforward about the risks and tradeoffs of their approach. Before choosing synthetic data, be sure to ask these five questions of your vendor:
1. How long does it take to treat a new dataset, and what does that cost?
Training a large, complex generative model (the kind used in best-of-breed synthetic data generators) is computationally expensive and can rarely be done without any human intervention (in the form of hyperparameter tuning and other configuration). Compounding the problem, the models will need to be retrained regularly (as new data are added) and will not reflect schema changes in the source data without additional retraining. And unlike other types of ML models, the data-generating step can also be very computationally expensive -- generating even a modest number of records from a trained model can take even longer than training the model.
This can make synthetic data solutions expensive and slow to deliver value, compared to other anonymization methods.
We tested a leading synthetic data provider, using a demo dataset of 100k simulated users. Training a standard synthetic generator took over 14 minutes, and generating 100k synthetic records took another 32 minutes (using a single worker). A privacy-safe generator model took 43 minutes to train, and an additional 80 minutes to generate 100k synthetic records.
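Privacy Dynamics can anonymize the same dataset (using a single worker) in under 60 seconds. That is at least 46x faster than the standard generator (46 minutes end to end) and more than 120x faster than the privacy-safe one (123 minutes end to end).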
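The speedup can be checked directly from the timings reported above. The short sketch below simply adds training and generation time and divides by the anonymization time. It treats "over 14 minutes" as 14 and "under 60 seconds" as 1 minute, so the ratios it prints are lower bounds; the constant and function names are ours, chosen for illustration.

```python
# Timings reported in this post, in minutes. "over 14" and "under 60 seconds"
# are rounded to 14 and 1, so the ratios below are lower bounds.
STANDARD_TRAIN_MIN = 14
STANDARD_GENERATE_MIN = 32
PRIVATE_TRAIN_MIN = 43
PRIVATE_GENERATE_MIN = 80
ANONYMIZE_MIN = 1


def speedup(train_min: float, generate_min: float, baseline_min: float) -> float:
    """Return how many times longer train + generate takes than the baseline."""
    return (train_min + generate_min) / baseline_min


standard = speedup(STANDARD_TRAIN_MIN, STANDARD_GENERATE_MIN, ANONYMIZE_MIN)
private = speedup(PRIVATE_TRAIN_MIN, PRIVATE_GENERATE_MIN, ANONYMIZE_MIN)
print(f"standard generator: at least {standard:.0f}x slower")
print(f"privacy-safe generator: at least {private:.0f}x slower")
```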
2. Can you benchmark the utility of synthetic data vs. other anonymization methods?
In our research and that conducted by academics and third parties, we have found that most synthetic data is badly distorted, even along a single dimension.
Our demo dataset has an `age` column; if we plot the percentage of users of each age, we see the raw dataset has a smooth distribution around 35 years of age. While synthetic datasets should be sampling from this distribution, we see that even with 100k records generated, the age distribution in the generated dataset is badly distorted.
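If you want a number to go with the plot, one simple option is the total variation distance between the two age distributions. The sketch below is our illustration, not the benchmark we ran: the function names are hypothetical and the two sample lists are made up purely to show the call.

```python
from collections import Counter


def age_shares(ages):
    """Map each age to the fraction of records with that age."""
    counts = Counter(ages)
    total = len(ages)
    return {age: n / total for age, n in counts.items()}


def total_variation_distance(raw_ages, synthetic_ages):
    """Half the sum of absolute differences between the two share-per-age maps.

    0.0 means the distributions are identical; 1.0 means they do not overlap.
    """
    raw = age_shares(raw_ages)
    syn = age_shares(synthetic_ages)
    all_ages = set(raw) | set(syn)
    return 0.5 * sum(abs(raw.get(a, 0.0) - syn.get(a, 0.0)) for a in all_ages)


# Tiny made-up samples, only to show the call; these are not the demo dataset.
raw_sample = [34, 35, 35, 36]
synthetic_sample = [35, 35, 35, 50]
print(total_variation_distance(raw_sample, synthetic_sample))
```

Asking a vendor to report a distance like this for each column, on your own data, is one way to make a utility benchmark concrete.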
Even the most sophisticated data generators cannot capture all of the relationships inherent in your data. This makes synthetic data mostly useless for analytics and machine learning.
3. How do you maintain referential integrity in normalized and denormalized datasets?
Many synthetic data generators cannot capture all of the relationships in your data, especially when those relationships are explicit. Some solutions provide tools for maintaining primary and foreign key relationships in normalized tables, but those tools often require significant configuration and investment to prevent broken data.
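A quick way to test a vendor's output for broken key relationships is to look for foreign key values that have no matching primary key. The sketch below assumes rows are plain dictionaries; the `users` and `orders` tables and their column names are hypothetical examples, not part of our demo dataset.

```python
def orphaned_foreign_keys(child_rows, fk_column, parent_rows, pk_column):
    """Return the sorted foreign key values in child_rows with no matching parent."""
    parent_keys = {row[pk_column] for row in parent_rows}
    child_keys = {row[fk_column] for row in child_rows}
    return sorted(child_keys - parent_keys)


# Hypothetical generated tables: order 11 points at a user that does not exist.
users = [{"user_id": 1}, {"user_id": 2}]
orders = [
    {"order_id": 10, "user_id": 1},
    {"order_id": 11, "user_id": 7},
]
orphans = orphaned_foreign_keys(orders, "user_id", users, "user_id")
print(orphans)
```

An empty result means every child row still joins to a parent; anything else is broken data.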
It gets even worse with denormalized data. Our sample users dataset has some address fields: `zip`, `city`, and `state`. We know these fields have semantic meaning, and are deeply related to one another, but many synthetic data generators can only infer the relationships using statistics. This leads the Synthetic Data Generator we tested to create nonsense data, where the same zip code is shared by cities that are thousands of miles apart, like Phoenix and Anchorage:
zip | city | state |
---|---|---|
03103 | Manchester | New Hampshire |
03103 | Manchester | New York |
21234 | Parkville | Maryland |
21234 | Parkville | Georgia |
28269 | Charlotte | North Carolina |
28269 | Charlotton | Ohio |
28277 | Charlotte | North Carolina |
28277 | Charlotton | Ohio |
38654 | Olive Branch | Mississippi |
38654 | Olathe | Kansas |
57701 | Rapid City | South Dakota |
57701 | Las Vegas | Nevada |
85032 | Phoenix | Arizona |
85032 | Anchorage | Alaska |
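You can run this kind of check on any generated dataset: a zip code should map to exactly one state, so any zip that appears with more than one is a red flag. The sketch below is a minimal illustration using a few rows from the table above; the function name is ours.

```python
from collections import defaultdict


def zips_with_multiple_states(rows):
    """Return {zip: sorted states} for every zip that appears with more than one state."""
    states_by_zip = defaultdict(set)
    for zip_code, _city, state in rows:
        states_by_zip[zip_code].add(state)
    return {z: sorted(s) for z, s in states_by_zip.items() if len(s) > 1}


# A few (zip, city, state) rows copied from the table above, plus one
# consistent zip code to show that it is not flagged.
rows = [
    ("03103", "Manchester", "New Hampshire"),
    ("03103", "Manchester", "New York"),
    ("85032", "Phoenix", "Arizona"),
    ("85032", "Anchorage", "Alaska"),
    ("21234", "Parkville", "Maryland"),
]
conflicts = zips_with_multiple_states(rows)
for zip_code, states in conflicts.items():
    print(zip_code, states)
```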
4. How do you prevent leaking sensitive information into your synthetic data?
In practice, real datasets always contain outliers, and any model trained on the data must capture these outliers too. Researchers have shown a real risk of leaking sensitive information in overfit synthetic datasets, and have suggested that "synthetic" does not mean "private" unless additional anonymizing techniques (like differential privacy) are used. Those techniques inherently decrease the utility of the synthetic dataset even further (as we have seen above). Stadler et al. declare, "synthetic data either does not prevent inference attacks or does not retain data utility."
The "Privacy Safe" generator we tested struggled with both privacy and utility, especially with our location data. Not only did it suffer from the relational issue described above (placing the same zip code in as many as 17 states), the model's attempt to add differentially-private noise to zip codes and place names resulted in a large number of nonsense places, like the following state names:
state |
---|
Florido |
South Varolina |
Irdiana |
Leussees |
Lissilssippi |
Mennessee |
Colina |
Vassachusetts |
Lississippi |
Lissinsippi |
Wiston |
Leorgia |
Some synthetic generators provide configuration for limiting the cardinality of generated data, but these validators require manual tuning and maintenance and make the data generation process even slower. Without careful fine-tuning and upfront design (which is a lot of work!), including validating the privacy of a trained model using canaries, synthetic data usually falls short of its promise.
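To give a sense of what that validation work involves, here is a minimal sketch of two checks. The first flags generated values that never appear in an allowed set (for example, the real values of a `state` column). The second looks for canaries, which we take here to mean unique marker strings planted in the training data so that you can see whether the model reproduces them verbatim. Both function names, the allowed set, and the canary string are hypothetical; this is not any vendor's validator API.

```python
def invalid_values(generated, allowed):
    """Return the sorted generated values that are not in the allowed set."""
    return sorted(set(generated) - set(allowed))


def leaked_canaries(generated_records, canaries):
    """Return the canaries that appear verbatim inside any generated record."""
    return [c for c in canaries if any(c in record for record in generated_records)]


# Illustrative allowed set; in practice this would come from the source column.
allowed_states = {"Florida", "Indiana", "Georgia"}
generated_states = ["Florido", "Indiana", "Leorgia", "Georgia"]
bad_states = invalid_values(generated_states, allowed_states)
print(bad_states)

# Hypothetical canary planted in the training data before fitting the model.
canaries = ["canary-7f3a"]
generated_emails = ["jane@example.com", "canary-7f3a@example.com"]
leaks = leaked_canaries(generated_emails, canaries)
print(leaks)
```

A non-empty result from either check means the generator needs more tuning before its output can be trusted.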
5. Can you provide an attestation of compliance (or any proof that the synthetic data will meet expert determination standards)?
The neural networks that produce synthetic datasets are effectively a black box. They may even be nondeterministic (when trained multiple times on the same data). This makes it nearly impossible for compliance officers to evaluate the privacy "guarantees" of synthetic data. To quote again the research from Stadler et al., "[b]ecause it is impossible to predict what signals a synthetic dataset will preserve and what information will be lost, synthetic data leads to a highly variable privacy gain and unpredictable utility loss."
Now we have a question for you:
Are You Interested in a Better Alternative?
Learn more about Privacy Dynamics and our microaggregation-based k-anonymizer. We think if you compare us to an approach using synthetic data, you'll see that we can provide better utility, better privacy guarantees, faster time-to-value, and lower cost.