Ted Conbeer
DuckDB and PII Security: A Data Governance Challenge
DuckDB is an incredibly powerful tool enabling analysts and data scientists to move faster and process more data locally. Does this cause a problem for data governance and compliance?
DuckDB is an incredible technology that is threatening to upend the workflows of data analysts and data scientists everywhere. As a pip-installable, in-process analytical database, DuckDB couples the developer experience of SQLite with the power of a cloud data warehouse. Many workloads that previously needed to run on a multi-node cluster using Hadoop or Spark can now run on any analyst’s laptop after just a minute of setup.
For many analysts and data scientists, this means huge improvements in productivity and power. But it can also mean storing more data locally on a laptop, in familiar CSV and Parquet files as well as DuckDB’s own compressed storage format. This proliferation of data to devices dramatically increases the attack surface and the complexity of governing and securing the data. When that data contains PII, this can present a major problem for compliance and information security.
Is this really a new problem?
Microsoft Excel has been the dominant analytics tool for nearly four decades. In most companies, spreadsheets are ubiquitous for both analytical and operational use cases, and plenty of those spreadsheets include regulated Personal Information.
And it’s not just Excel — data scientists, marketing analysts, and other human systems integrators rely on downloading CSVs to their laptop to do their jobs.
The Achilles heel of cloud data governance is the “Download” button, and that is not new. What is new is the scale of data that can be processed locally with DuckDB, and the workflows that it enables. Excel has a limit of about 1M rows (1,048,576), but most analysts know to keep it closer to 100k for decent performance. DuckDB can easily store and process 100M or 1B records. The direct implication is that a breach of a DuckDB database file could expose 1,000 times more records than a breach of a spreadsheet (100M rows versus 100k). But the problem is even larger because of the workflow changes that DuckDB enables.
Previously, if an analyst wanted to work with data in Excel, they would carefully craft a query and pre-aggregate the data as much as possible in the cloud data warehouse. This limited the data downloaded, which was essential for performance and had the ancillary benefit of limiting the information security risk. Now, analysts can just download entire tables from the warehouse and do everything in DuckDB. So instead of a pre-aggregated metrics table with a count of new signups per month, the starting point for local analysis is the raw signups table, which, if the analyst is not careful, will include a lot of PII.
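To make the difference concrete, here is a minimal sketch of the two extracts. It uses Python's built-in sqlite3 module as a stand-in for the warehouse so it runs without any setup; the `signups` table, its columns, and the sample rows are all made up for illustration, and a real warehouse would use its own date functions in place of SQLite's `strftime`.

```python
import sqlite3

# In-memory SQLite stands in for the cloud warehouse (assumption for this sketch).
# The table name, columns, and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signups (email TEXT, full_name TEXT, signup_date TEXT)")
conn.executemany(
    "INSERT INTO signups VALUES (?, ?, ?)",
    [
        ("ana@example.com", "Ana A.", "2023-01-05"),
        ("bo@example.com", "Bo B.", "2023-01-20"),
        ("cy@example.com", "Cy C.", "2023-02-11"),
    ],
)

# Pre-aggregated extract: one row per month, no personal columns are downloaded.
monthly_signups = conn.execute(
    """
    SELECT strftime('%Y-%m', signup_date) AS signup_month,
           COUNT(*) AS new_signups
    FROM signups
    GROUP BY signup_month
    ORDER BY signup_month
    """
).fetchall()

# Raw extract: every row, including names and emails, lands on the laptop.
raw_signups = conn.execute("SELECT * FROM signups").fetchall()

print(monthly_signups)
print(len(raw_signups), "raw rows with", len(raw_signups[0]), "columns each")
```

The first query is the old Excel-era habit; the second is what a "download the whole table" workflow pulls down.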
To make things worse, against a backdrop of ever-increasing regulation, like the new California Privacy Rights Act (CPRA) and the draft American Data Privacy and Protection Act (ADPPA), this proliferation of raw, personal data onto analyst devices is riskier than ever before.
What to do?
Nearly every “data person” understands the incredible responsibility of having access to an organization’s (and its customers’) data, but many may not fully understand or appreciate the increase in risk from large-scale local analytics workflows.
Rather than handcuff and antagonize the responsible members of your Data team by banning DuckDB, you should engage their sense of responsibility and accountability. Educate them on the problem, involve them in a solution, and make it easy for them to do the right thing. Lead a discussion with your team about the risks, with questions like:
- How do we keep our devices and local data safe?
- If we get a DSAR (data subject access request) that requires deletion, how will we as a team ensure that the customer's data is deleted from our local files?
- When is it appropriate or inappropriate to include direct identifiers like name and email addresses in local database dumps? What about quasi-identifiers like birthdate and gender?
- Should we create separate database roles for local and notebook connections, so that access controls automatically keep this information out of DuckDB?
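One way to ground the discussion about identifiers is to look at what a minimization step might do to a single row before it is written to a local file. The sketch below is only an illustration: the column names, the `DIRECT_IDENTIFIERS` set, and the choice to coarsen birthdate to birth year are all assumptions, and the real policy should come out of the team conversation.

```python
from datetime import date

# Hypothetical policy: columns that should never reach a local extract.
DIRECT_IDENTIFIERS = {"name", "email"}


def minimize_row(row: dict) -> dict:
    """Drop direct identifiers and coarsen birthdate to birth year."""
    out = {key: value for key, value in row.items() if key not in DIRECT_IDENTIFIERS}
    birthdate = out.pop("birthdate", None)
    if birthdate is not None:
        # Keep only the year: a coarser quasi-identifier than the full date.
        out["birth_year"] = birthdate.year
    return out


# Made-up example row.
row = {
    "name": "Ana A.",
    "email": "ana@example.com",
    "birthdate": date(1990, 4, 12),
    "gender": "F",
    "plan": "pro",
}
local_row = minimize_row(row)
print(local_row)
```

Note that gender and birth year are still quasi-identifiers after this step, so dropping names and emails alone does not make the data anonymous.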
Anonymized Data: A Silver Bullet?
Properly anonymized, or de-identified, data cannot be tied back to an individual, and therefore is not regulated like Personal Information. Anonymization is therefore an alternative to more traditional minimization and governance strategies like role-based access controls and audit trails.
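A common way to reason about whether a row can be "tied back to an individual" is k-anonymity: every combination of quasi-identifier values must be shared by at least k rows. The sketch below checks that property on made-up data. It is a generic illustration, not a description of any particular product, and k-anonymity on its own is not a complete test for de-identification. The function names and columns are hypothetical.

```python
from collections import Counter


def smallest_group_size(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values.

    Returns 0 for an empty dataset.
    """
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination appears in at least k rows."""
    return smallest_group_size(rows, quasi_identifiers) >= k


# Made-up rows: two people share (1990, "F"), one person is alone in (1985, "M").
rows = [
    {"birth_year": 1990, "gender": "F", "plan": "pro"},
    {"birth_year": 1990, "gender": "F", "plan": "free"},
    {"birth_year": 1985, "gender": "M", "plan": "pro"},
]

print(smallest_group_size(rows, ["birth_year", "gender"]))
print(is_k_anonymous(rows, ["birth_year", "gender"], k=2))
```

Here the single (1985, "M") row is unique on its quasi-identifiers, so the dataset is not 2-anonymous even though it contains no names or emails.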
Historically, anonymization has been a difficult, expensive, and bespoke process, but with Privacy Dynamics, you can automate anonymization of assets in your data warehouse or data lake in just a few minutes. Our anonymizer uses group-based methods to ensure total de-identification while minimizing the distortion and maximizing the utility of the anonymized dataset, making it perfect for local analytics using DuckDB. You can encourage (or require) your team to use anonymized data for local exploration, reserving sensitive data for production pipelines in a governed cloud environment.
If you want to de-risk your team’s local analytics workflow, book a demo today.