Ted Conbeer
Data Minimization in Analytics Using dbt and Privacy Dynamics
Data minimization is an important security, privacy, and compliance strategy, but applying it in an analytics context is hard. In this post, we show you how you can minimize the use of personal information in the Modern Data Stack, using dbt, Snowflake, and Privacy Dynamics.
What is Data Minimization?
Data Minimization has long been a tenet of responsible data practices, but it has more recently been codified in data privacy regulations like GDPR and HIPAA, and in other widespread standards, like SOC 2. Essentially, data minimization is the principle that an organization should only collect, store, and process personal data that is directly relevant to its specific business goals. An organization should not collect personal data because it might become useful someday: unless there is a specific purpose that the data is directly relevant and necessary to accomplish, the collection and storage of that data may be unlawful.
Furthermore, data minimization applies to access and sharing of data, even within organizations. HIPAA explicitly covers this with its Minimum Necessary standard, which requires companies "to take reasonable steps to limit the use or disclosure of, and requests for, protected health information to the minimum necessary to accomplish the intended purpose."
Data Minimization Tactics for Data Teams
Data minimization should be applied at every step of the data "value chain." Many teams seek to "shift privacy to the left," or farther upstream, since it can be simpler to manage sensitive information at its source, before it proliferates through numerous models and tools. With that in mind, there are five independent tactics that data teams can use to achieve data minimization:
- Minimize collection. Data teams should work with their Product and Engineering teams to ensure that only necessary personal information is being collected by their product. By bringing business context into these conversations, analysts can ask questions like "Do we really need our customers' birthdays?" or "Who will be using the IP addresses we're planning on collecting?" When designing product tracking plans, analysts should limit the use of PII (like name and email) in event properties, and encourage the use of user traits instead.
- Minimize access and consumption. Role-based access control (RBAC) is a critical component of any data minimization effort. Data teams need to understand the various personas, or access needs, of their stakeholders, create roles in their data warehouses, BI tools, and identity providers that map to those personas, and define policies for each role that limit access to sensitive data that is not required by that persona. Policies may limit access to entire datasets or to subsets of data, like individual fields or records. The implementation will depend on the data platform, and may take advantage of vendor-specific features, like Snowflake's Dynamic Data Masking (see the sketch after this list). Alternatively, sequestering PII in a "privacy vault" and tokenizing data that is stored in most systems can help centralize the RBAC problem. Sophisticated teams may wish to go further than limiting access by limiting consumption. This may require time-boxing role authorization, reviewing or auditing queries made by authorized users, and documenting reasons for every query of sensitive data.
- Minimize sharing. When partnering with Marketing teams, data team members should engage in audience-building and attribution efforts and ensure that data sharing with third parties is minimized and that, where possible, data processing by third parties is restricted.
- Minimize retention. The complement of collection is retention. Data that is old, stale, or no longer relevant should be deleted or made inaccessible to most end-users. Being realistic about the value of old data is important. Analysts should be mindful of changing business contexts (like product features, marketing budgets, and market sizes) and be willing to part with data from an earlier era.
- Minimize data in lower environments. Data engineers, analytics engineers, data scientists, and data analysts often interact with sensitive data in contexts where it is not required, like in the development of data pipelines, data models, ML models, and even dashboards or operational reports. Even if these same individuals are trusted with this data in other contexts, and even if sensitive data will flow through the production asset, it violates the tenets of minimization to use that same data in a development context. Data teams must develop tactics for minimizing personal data in development environments, which could include sampling or subsetting records; pseudonymizing data by masking, hashing, or faking direct identifiers; or manufacturing test or synthetic data.
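As a minimal sketch of the field-level masking mentioned in the second tactic above (the policy, schema, role, and table names here are illustrative), a Snowflake Dynamic Data Masking policy might look like this:

-- redact a direct identifier for everyone except privileged roles
create masking policy pii_admin.email_mask as (val string)
  returns string ->
    case
      when current_role() in ('SYSADMIN', 'LOADER') then val
      else '***MASKED***'
    end;

-- attach the policy to the column that contains the identifier
alter table raw.app.users
  modify column email set masking policy pii_admin.email_mask;

Once attached, the policy is enforced on every query against that column, regardless of the tool issuing it.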
If that sounds like a lot of work, that's because it is! Data minimization can be a burden for data teams of all sizes. And while new tooling can help with automation, a lot of the complexity of data minimization (and RBAC in particular) is driven by organizational and even political factors that are unlikely to be automated away.
Anonymization as an Alternative to Minimization
Under most regulations, standards, and ethical frameworks, data minimization is only important for personal data. While every regulation has its own definition of "personal data," nearly all limit it to (from GDPR) "information relating to an identified or identifiable natural person."
Accordingly, anonymization (or de-identification) provides an alternative to data minimization. If your data is truly anonymous, then it is safe to store, process, and use however you like.
However, anonymization is hard, because it requires treating both direct identifiers (like name and email address) and quasi-identifiers: traits like age, gender, and zip code that can be combined and linked with external data to re-identify individuals. Historically, anonymization has required redacting or badly distorting sensitive data, which limited its usefulness for analytics. However, modern methods of anonymization, like those employed by Privacy Dynamics, minimize distortion in treated data, which enables even complex analytics on anonymized data.
Designing a Solution for Data Minimization in the Modern Data Stack
The "Modern Data Stack" is a loose category of tools that integrate with cloud data warehouses, typically in an ELT (extract-load-transform) and batch-processing paradigm. At the center of the Modern Data Stack is dbt, an open-source tool that automates data transformations in a data warehouse. dbt has adapter plug-ins that allow it to integrate with nearly any database, data warehouse, or data lake.
In the rest of this post, we'll share a design for using dbt and Privacy Dynamics with Snowflake to minimize the use of personal data while minimizing operational complexity and maximizing the data's analytical value.
Loading Minimal Data
In a typical dbt project, raw data is loaded into the warehouse before it is transformed. Many third-party tools, like Fivetran, Stitch, and Airbyte, provide a huge number of "connectors" to nearly any database, object store, or SaaS API.
Some of these tools support data minimization directly by making it easy to configure which tables and fields should be replicated. Some go further and provide basic privacy transformations before loading your data; for example, Fivetran can hash fields before loading.
In Snowflake, we recommend loading all raw data into a database called RAW. Each data source gets its own schema in the RAW database.
Access to the RAW database should be strictly limited. Many data team members will need the usage privilege on the RAW database, but ideally select should be limited just to administrators (the SYSADMIN role) and service accounts for production pipelines.
Grants for such a scheme would look like:
-- service account role for production loaders: can create schemas and load data in RAW
create role loader;
grant usage on database RAW to role loader;
grant create schema on database RAW to role loader;

-- developers can see that RAW exists, but cannot select from it
create role developer;
grant usage on database RAW to role developer;
Anonymizing Raw Data
We will use anonymized data for all model development. We will achieve this by maintaining a schema-preserving copy of the RAW database in a new database, called RAW_SAFE.
The simplest way to create a RAW_SAFE database is to use Privacy Dynamics. After connecting to Snowflake, we can use the UI to select tables that need to be anonymized ("treated") and those that can simply be replicated without anonymization ("passed through").
After optionally fine-tuning the anonymization, we can set up a schedule to run the anonymizer every hour. On each run, the anonymizer will detect any schema changes to the source data, and if new tables are added to the RAW database, it will automatically replicate those with the default anonymization settings.
After repeating that process for each source, we are finished! We now have a RAW_SAFE database for our team to use.
If a large amount of data is being "passed through," we should consider other approaches that avoid storing duplicate copies of our raw data. Those could include:
- Using a dedicated dbt project (with stricter access controls than our main project) to create views in RAW_SAFE that simply select * from each table in RAW that does not require anonymization. This code could be generated programmatically by the dbt-codegen package or a bespoke macro that introspects the INFORMATION_SCHEMA. This dbt project would only have to run when new tables are added to RAW (see the sketch after this list).
- Using Snowflake's zero-copy clones for the same purpose. Unlike views, clones would not be populated with new data, so the dbt project or cron job that creates the clones would have to run more frequently.
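As a rough sketch of both options (the schema and table names are illustrative), the generated pass-through objects might look like this:

-- option 1: a pass-through view that exposes a non-sensitive RAW table unchanged
create or replace view raw_safe.app.order_items as
  select * from raw.app.order_items;

-- option 2: a zero-copy clone; a point-in-time copy that must be recreated
-- whenever fresh data is needed
create or replace table raw_safe.app.order_items clone raw.app.order_items;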
Without using Privacy Dynamics, it is difficult to treat quasi-identifiers and truly anonymize data. However, pseudonymizing data (removing direct identifiers) is a form of data minimization and is far better than doing nothing. We would again use a dedicated dbt project for the transformation from RAW to RAW_SAFE, and use a dbt package like dbt-privacy or dbt-snow-mask to manually mask and hash direct identifiers.
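As a minimal sketch of that pseudonymization step (assuming a users source table with email and full_name columns; the column names and hashing choice are illustrative, whether written by hand or generated by a package macro), a model in the dedicated project might look like this:

-- models/users.sql (illustrative): pseudonymize direct identifiers from the raw table
select
    user_id,
    sha2(lower(email), 256) as email_hash,  -- a salted hash or surrogate key would be stronger
    -- full_name is dropped entirely
    created_at,
    country
from {{ source('app', 'users') }}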
With any of these approaches, appropriate care and oversight (through code reviews or similar) is necessary to keep the data in RAW_SAFE truly safe. However, any efforts here will pay for themselves with a simplified governance model downstream (this is one form of "shifting privacy to the left").
Using Anonymized Data for Development and Testing
One of the most powerful features of dbt is its support for multiple environments, or "targets." A dbt user can run the same code locally (on their laptop) as in production, but by default a runtime parameter called target will cause the local run to write data to a different database or schema than the production run.
Targets are configured using a profiles.yml file and can be given arbitrary names, but we'll use dev to represent the development case, and prod to represent the automated, production runs of dbt. A simple profile with dev and prod targets follows:
my-analytics-profile:
  # the target key sets the default target at runtime
  target: dev
  # each key under outputs becomes a selectable target
  outputs:
    dev:
      type: snowflake
      account: <accountid>
      user: <dev username>
      password: <password>
      role: DEVELOPER
      warehouse: DEVELOPING
      database: DEV
      # schema is required by the Snowflake adapter; developers build here by default
      schema: <dev schema>
    prod:
      type: snowflake
      account: <accountid>
      user: DBT_PROD
      password: <password>
      role: TRANSFORMER
      warehouse: TRANSFORMING
      database: ANALYTICS
      # schema is required by the Snowflake adapter; production models build here by default
      schema: <prod schema>
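When invoking dbt, the default target (dev above) is used unless it is overridden at the command line with the --target flag (for example, dbt run --target prod); production schedulers typically select the prod output this way.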
dbt uses an abstraction called "sources" for database relations that are inputs to a dbt project, and that are created by a process outside of dbt. Sources are defined in a YAML file. To build a dbt project on our anonymized data, we will create a source for each schema in our RAW_SAFE database (note that by default on Snowflake, database identifiers are not case-sensitive):
sources:
  - name: app
    database: raw_safe
    tables:
      - name: users
      - name: orders
  - name: stripe
    database: raw_safe
    tables: ...
While we believe most teams can use anonymized data for production analytics, teams seeking to only use anonymized data for development can configure their sources to select from a different database, depending on the target. We can do this because dbt supports jinja templating in its .sql and, to a limited extent, its .yml files.
To only use anonymized data for our dev target, we can substitute a jinja expression for the database name we used above:
sources:
  - name: app
    database: "{{ 'raw' if target.name == 'prod' else 'raw_safe' }}"
    tables:
      - name: users
      - name: orders
  - name: stripe
    database: "{{ 'raw' if target.name == 'prod' else 'raw_safe' }}"
    tables: ...
Now, when run against the dev target, dbt will select from the anonymized RAW_SAFE database and write to the DEV database. When run against the prod target, dbt will read from the RAW database and write to the ANALYTICS database.
With this setup, you could lock down access to identifiable personal data by granting different privileges to the DEVELOPER and TRANSFORMER roles. For example, the DEVELOPER role gets nearly zero privileges on RAW:
create role developer;
-- need usage on RAW to use views in RAW_SAFE that select from RAW
grant usage on database RAW to role developer;
grant usage on all schemas in database RAW to role developer;
-- read-only access to the anonymized copy
grant usage on database RAW_SAFE to role developer;
grant usage on all schemas in database RAW_SAFE to role developer;
grant usage on future schemas in database RAW_SAFE to role developer;
grant select on all tables in database RAW_SAFE to role developer;
grant select on future tables in database RAW_SAFE to role developer;
grant select on all views in database RAW_SAFE to role developer;
grant select on future views in database RAW_SAFE to role developer;
-- developers build their dbt models in the DEV database
grant usage on database dev to role developer;
grant create schema on database dev to role developer;
While the production TRANSFORMER role gets nothing on RAW_SAFE (to reduce the opportunity for dbt config mistakes):
create role transformer;
-- the production role reads identifiable raw data...
grant usage on database RAW to role transformer;
grant usage on all schemas in database RAW to role transformer;
grant usage on future schemas in database RAW to role transformer;
grant select on all tables in database RAW to role transformer;
grant select on future tables in database RAW to role transformer;
grant select on all views in database RAW to role transformer;
grant select on future views in database RAW to role transformer;
-- ...and owns the ANALYTICS database that it writes to
grant ownership on database analytics to role transformer;
If your team uses automated testing in continuous integration ("CI") and the CI runners (and target databases) are suitably secure, you could run your CI tests using either the anonymized or untreated data (or both!).
Anonymizing Modeled Data
If you preserve foreign key relationships, anonymizing raw data may not be sufficient to fully protect the subjects of your analysis from re-identification. When tables are joined, it is possible for k-anonymized quasi-identifiers in multiple tables to become unique and enable linkage attacks.
If your quasi-identifiers are spread across many source tables or systems, you should consider anonymizing modeled data, in addition to anonymizing the raw data. You should also anonymize any data asset that will be shared with a wider audience, either internally at your company or externally, for example, with marketing partners or even the broader public. Truly public datasets should be anonymized to a higher standard (a larger k value) to protect subjects as much as possible.
If you use Privacy Dynamics, the setup is largely the same as anonymizing your raw data. Simply select your production ANALYTICS database as a source, and choose which tables to anonymize and store in a PUBLIC database. Privacy Dynamics can run on any schedule and be configured to run immediately after your dbt project completes.
Anonymize, Don't Delete
Data retention is an important component of data minimization, but most teams hesitate before deleting any old data, believing it may be useful for future analysis or training an especially data-hungry ML model.
Because anonymized data is not personal data, anonymization can serve as an alternative to deletion for satisfying data retention policies. There are a few different ways to operationalize this approach:
- For append-only datasets, old data can be deleted at the source if desired. The RAW_SAFE database becomes the only source for old data (and should be backed up and managed accordingly).
- Row access policies can be used to strictly limit access to raw data before a certain date (or a dynamic date, defined with an interval), as sketched below.
- Views can be created to filter out "old" data from the identifiable datasets and union them with old data from the anonymized datasets.
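As a minimal sketch of the row access policy approach (the policy, table, and column names are illustrative, and the two-year window is arbitrary), the Snowflake DDL might look like this:

-- hide identifiable rows older than two years from all but administrators
create row access policy governance.orders_retention as (order_date date)
  returns boolean ->
    current_role() in ('SYSADMIN')
      or order_date >= dateadd(year, -2, current_date());

alter table raw.app.orders
  add row access policy governance.orders_retention on (order_date);

Older rows then remain available only through the anonymized copies in RAW_SAFE.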
Pulling It All Together
In this post, we showed how you can use Privacy Dynamics and dbt to minimize personal data in any analytics environment. We covered tactics for minimizing access and consumption, sharing, and retention, and for minimizing data in lower environments.
If you would like to start minimizing personal data in your analytics stack, reach out and we will get you started today on a free trial of Privacy Dynamics.