Brett Westover
How to Use Pre-Configured Development Environments with Okteto
Development teams are using pre-configured environments to increase efficiency and minimize inconsistencies, but this hasn't solved the problem of getting representative and useful data to development teams. Privacy Dynamics, along with modern developer tools like Okteto, can safely bring production data into your development and test environments.
Safely using production data in development
Using anonymized data for development with Privacy Dynamics can be set up in a variety of ways, but most setups share a few common requirements:
- We need a source of data (typically this is a production database, or cloud data warehouse)
- We need a destination that development teams can access
- We need a way to periodically refresh the data, and effectively share or copy the data into dev environments
In this post we'll demonstrate how to do this with a PostgreSQL database using the Okteto open source movies example application. The Movies App is a microservices demonstration of a movies rental service. It has a service for reading a catalog of movies, services to respond to requests for rentals and move the data through a queue, and of course a frontend application that the users will interact with. It also has an "admin" interface that shows the list of users who are signed up. We'll focus on the user data which lives in the PostgreSQL instance, since it contains sensitive information that needs to be anonymized.
Create our data source and destination environments
Using Okteto's CLI we'll create a pair of environments with a few simple commands. From a local checkout of the movies repository:
- Create a deployment of the movies app to use as a source for the data. When this comes up the users data is loaded automatically. We'll pretend this is production data for this demo:
  - Create a namespace:
    okteto namespace create movies-source
  - Deploy the application:
    okteto deploy
- Create a deployment of the movies app to use as a target for the anonymized data. In this case we'll skip loading the data on startup since we're going to populate the database with anonymized data from Privacy Dynamics. We can do that by passing an environment variable:
  - Create a namespace:
    okteto namespace create movies-target
  - Deploy the application:
    okteto deploy --var API_LOAD_DATA=false
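The two environments above follow the same pattern, so the steps can be wrapped in a small shell function. This is only a sketch: `deploy_movies` is a hypothetical helper name (not part of the Okteto CLI), and it assumes it is run from the movies repository checkout with the same commands shown above.

```shell
#!/bin/sh
# Hypothetical helper: create a namespace and deploy the movies app into it.
# Usage: deploy_movies <namespace> [load_data]
#   load_data defaults to "true"; pass "false" to skip loading the sample users.
deploy_movies() {
  namespace="$1"
  load_data="${2:-true}"

  # Same command as in the steps above.
  okteto namespace create "$namespace" || return 1

  if [ "$load_data" = "false" ]; then
    # Skip the startup data load; the database is filled by Privacy Dynamics later.
    okteto deploy --var API_LOAD_DATA=false
  else
    okteto deploy
  fi
}

# Example:
#   deploy_movies movies-source
#   deploy_movies movies-target false
```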
Load Balancer setup
To use the cloud version of Privacy Dynamics we need a way to access the source and target databases from the internet. We'll use an AWS load balancer here, but you can do this with the NGINX Ingress Controller or any TCP-capable load balancer in your cluster.
Create a service for the database in the source and target namespaces and use the AWS Load Balancer Controller annotations to make it accessible via the internet.
postgres-public-target.yml:
apiVersion: v1
kind: Service
metadata:
  labels:
    app: postgresql
  name: postgres-public-target
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
spec:
  # The AWS Load Balancer Controller only acts on Services of type LoadBalancer
  type: LoadBalancer
  ports:
    - port: 19876
      targetPort: 5432
  selector:
    app.kubernetes.io/component: primary
    app.kubernetes.io/instance: postgresql
    app.kubernetes.io/name: postgresql
Apply this config:
kubectl -n movies-source apply -f postgres-public-target.yml
kubectl -n movies-target apply -f postgres-public-target.yml
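Once the Services are applied, each namespace should get its own public hostname. The sketch below shows one way to look it up and build a connection URI for the public port (19876) defined in the Service above. `lb_hostname` and `pg_uri` are hypothetical helper names, and the user and database values are placeholders you would replace with your own. The hostname may be empty for a few minutes while the load balancer is provisioned.

```shell
#!/bin/sh
# Hypothetical helper: print the load balancer hostname of the
# postgres-public-target Service in the given namespace.
lb_hostname() {
  kubectl -n "$1" get service postgres-public-target \
    -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
}

# Hypothetical helper: build a PostgreSQL connection URI for the public port.
# $1 = hostname, $2 = user (placeholder), $3 = database (placeholder)
pg_uri() {
  printf 'postgresql://%s@%s:19876/%s\n' "$2" "$1" "$3"
}

# Example (requires psql and real credentials):
#   psql "$(pg_uri "$(lb_hostname movies-source)" <user> <database>)" -c 'SELECT 1'
```

The same host and port are what you would enter when creating the source and target connections in Privacy Dynamics.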
Let's anonymize that data!
Now that our "production" environment is running, we can see the users data. It clearly contains privacy-sensitive information (or it would, if this were a real application; this data is actually fake).
To anonymize this data we'll create a connection to the source of the data in the Privacy Dynamics application. We'll also create a connection to the target environment, which will be the destination for our anonymized data.
Next we'll create a project. There are just a few choices we need to make:
- Which source and destination connections to use - we'll use the ones we just created that point to our Okteto movies-source and movies-target databases.
- Which tables to treat - There are only two in this demo: "rentals" and "users." We can go ahead and treat them both, but we're focused on "users" here.
- By default Privacy Dynamics will treat all identifiers. We'll choose to use "Realistic" data for direct identifiers (like the name and phone numbers) and to "Anonymize" the indirect identifiers (like City, State, and Zip Code).
We can also choose to set up a scheduled job. This is a great option for development environments since you'll likely want to periodically refresh with updated anonymized data from production. For our demo we'll skip that, and simply run the project.
This job only takes a few seconds to anonymize 10,000 users. We can see that it detected the direct identifiers automatically, replacing them with fake data. Privacy Dynamics treated the indirect identifiers using our surgical process to maximize utility while minimizing privacy risk. We provide a report that shows statistics on the distribution of values before and after treatment.
Taking a peek at our users in the anonymized target environment we can see that the names and phone numbers have been changed completely. The other fields have changed slightly in some cases, or in cases where there was no detectable risk to privacy, not changed at all. Privacy Dynamics carefully avoids large changes to the distributions of indirect identifier data, allowing for development environments that are far more representative of production.
Volume Snapshots for faster and repeatable builds
Now that we have this anonymized data in our movies-target environment, we can use the Volume Snapshots feature in our Self-Hosted Okteto cluster to create a snapshot. Then we can use the data repeatedly in development or preview environments.
After setting up the cluster to support Kubernetes VolumeSnapshots and enabling the feature, let's create a snapshot of the PostgreSQL data volume in our movies-target environment.
Apply a bit of YAML to create the snapshot, specifying a source namespace, and the name of the PersistentVolumeClaim for the database:
snapshot.yml:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  namespace: movies-target
  name: dbdata-snapshot
spec:
  volumeSnapshotClassName: okteto-snapshot-class
  source:
    persistentVolumeClaimName: data-postgresql-0
Apply this with kubectl -n movies-target apply -f snapshot.yml
It might take a minute or two for the snapshot to be created. You can see when it's ready:
$ kubectl -n movies-target get volumesnapshot
NAME              READYTOUSE   SOURCEPVC           SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS           SNAPSHOTCONTENT                                    CREATIONTIME   AGE
dbdata-snapshot   true         data-postgresql-0                           5Gi           okteto-snapshot-class   snapcontent-37d54cf5-a990-460c-81da-c4b5aa6a7e9e   60s            61s
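If you are scripting this step, you may not want to check the READYTOUSE column by hand. A minimal sketch of a polling loop follows; `wait_for_snapshot` is a hypothetical helper name, and the default of 30 attempts at 5 second intervals is an arbitrary assumption you can tune.

```shell
#!/bin/sh
# Hypothetical helper: poll a VolumeSnapshot until status.readyToUse is "true".
# Usage: wait_for_snapshot <namespace> <snapshot-name> [attempts] [delay-seconds]
wait_for_snapshot() {
  namespace="$1"
  name="$2"
  attempts="${3:-30}"
  delay="${4:-5}"

  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Treat a failed lookup the same as "not ready yet".
    ready=$(kubectl -n "$namespace" get volumesnapshot "$name" \
      -o jsonpath='{.status.readyToUse}' 2>/dev/null) || ready=""
    if [ "$ready" = "true" ]; then
      echo "snapshot $name is ready"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done

  echo "timed out waiting for snapshot $name" >&2
  return 1
}

# Example:
#   wait_for_snapshot movies-target dbdata-snapshot
```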
Now this snapshot is ready for use in Okteto development or preview environments.
Preview environments using the anonymized snapshot
Following the docs in the movies repository, we'll make a small code change there to tell the Okteto GitHub Action to use our new snapshot.
name: pr-${{ github.event.number }}-cindylopez
scope: global
-
+ file: "okteto-with-volumes.yaml"
+ variables: "API_LOAD_DATA=false,DB_SNAPSHOT_NAME=dbdata-snapshot,DB_SNAPSHOT_NAMESPACE=movies-target"
Once this change is in place we'll get preview environments on all PRs with the anonymized snapshot that we specified.
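The same manifest and variables can, in principle, be used for a development environment deployed from the command line rather than from a PR. The following is a sketch under assumptions: `deploy_from_snapshot` is a hypothetical helper name, and it assumes that `okteto deploy` accepts the manifest path via `-f` and that the variables behave the same way as they do in the GitHub Action.

```shell
#!/bin/sh
# Hypothetical helper: deploy the movies app with its database volume
# restored from an existing snapshot.
# Usage: deploy_from_snapshot <snapshot-name> <snapshot-namespace>
deploy_from_snapshot() {
  snapshot_name="$1"
  snapshot_namespace="$2"

  # API_LOAD_DATA=false keeps the app from reloading the sample users on
  # startup, so the data restored from the snapshot is what ends up in use.
  okteto deploy -f okteto-with-volumes.yaml \
    --var API_LOAD_DATA=false \
    --var DB_SNAPSHOT_NAME="$snapshot_name" \
    --var DB_SNAPSHOT_NAMESPACE="$snapshot_namespace"
}

# Example (run from the movies repository checkout):
#   deploy_from_snapshot dbdata-snapshot movies-target
```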
Privacy Dynamics + Okteto = Awesome
You can sign up for a free trial of Privacy Dynamics at https://www.privacydynamics.io/ and start anonymizing data in minutes. Okteto offers a powerful and fast development tool to supercharge your development process; check out a free trial at https://www.okteto.com/. Together, we are excited to offer an awesome option for using anonymized prod data in your development environment.