Using Apache Spark and Differential Privacy for 2020 Census Data Protection

Offered By: Databricks via YouTube

Tags

Apache Spark Courses, Cloud Computing Courses, Data Security Courses, Data Privacy Courses, Differential Privacy Courses

Course Description

Overview

Explore how Apache Spark and differential privacy are used to protect respondent confidentiality for the 2020 US Census in this 29-minute talk. Examine the challenge of balancing accuracy against privacy protection for data that is used to distribute $675 billion in federal funds and to apportion the US House of Representatives. Learn about the custom-built Spark application that performs millions of optimizations using mixed integer linear programs on a massive cluster. Discover the design of this differential privacy application and the monitoring systems implemented in Amazon's GovCloud to oversee multiple clusters and thousands of application runs. Gain insights into the TopDown Algorithm (TDA) and into the key challenges of monitoring Spark. Understand how the Disclosure Avoidance System lets the Census Bureau enforce global confidentiality protections for census data.

Syllabus

Intro
Abstract
Outline
Privacy and the Decennial Census
2010 Census: Summary of Publications (approximate counts)
We performed a database reconstruction and re-identification attack for all 308,745,538 people in the 2010 Census
The basic idea of differential privacy: Uncertainty (noise) protects privacy
The Census Bureau is using differential privacy for the 2020 Census.
How much noise do we add? That's a policy decision.
We planned to create a Disclosure Avoidance System that dropped into the Census production system.
The Disclosure Avoidance System allows the Census Bureau to enforce global confidentiality protections
Our DP mechanism protects histograms of person types. Census "block"
Running the block-by-block algorithm with Spark
In 2018 we invented the TopDown Algorithm (TDA)
Key challenges in monitoring Spark
We created our own monitoring framework
Cluster List
Each DAS run is a "mission"
Mission Report
System Load
Free Memory
In Summary


Taught by

Databricks

Related Courses

Introduction to Data Analytics for Business
University of Colorado Boulder via Coursera
Digital and the Everyday: from codes to cloud
NPTEL via Swayam
Systems and Application Security
(ISC)² via Coursera
Protecting Health Data in the Modern Age: Getting to Grips with the GDPR
University of Groningen via FutureLearn
Teaching Impacts of Technology: Data Collection, Use, and Privacy
University of California, San Diego via Coursera