YoVDO

Using Apache Spark and Differential Privacy for 2020 Census Data Protection

Offered By: Databricks via YouTube

Tags

Apache Spark Courses, Cloud Computing Courses, Data Security Courses, Data Privacy Courses, Differential Privacy Courses

Course Description

Overview

Explore how Apache Spark and differential privacy are used to protect respondent confidentiality in the 2020 US Census in this 29-minute talk. Examine the challenge of balancing accuracy against privacy protection in data used to distribute $675 billion in federal funds and to apportion the US House of Representatives. Learn about the custom-built Spark application that performs millions of optimizations using mixed integer linear programs on a massive cluster. Discover the design of this differential privacy application and the monitoring systems implemented in Amazon's GovCloud to oversee multiple clusters and thousands of application runs. Gain insights into the TopDown Algorithm (TDA) and the key challenges in monitoring Spark. Understand how the Disclosure Avoidance System lets the Census Bureau enforce global confidentiality protections for census data.

Syllabus

Intro
Abstract
Outline
Privacy and the Decennial Census
2010 Census: Summary of Publications (approximate counts)
We performed a database reconstruction and re-identification attack for all 308,745,538 people in the 2010 Census
The basic idea of differential privacy: Uncertainty (noise) protects privacy
The Census Bureau is using differential privacy for the 2020 Census.
How much noise do we add? That's a policy decision.
We planned to create a Disclosure Avoidance System that dropped into the Census production system.
The Disclosure Avoidance System allows the Census Bureau to enforce global confidentiality protections
Our DP mechanism protects histograms of person types. Census "block"
Running the block-by-block algorithm with Spark
In 2018 we invented the TopDown Algorithm (TDA)
Key challenges in monitoring Spark
We created our own monitoring framework
Cluster List
Each DAS run is a "mission"
Mission Report
System Load
Free Memory
In Summary


Taught by

Databricks

Related Courses

Software as a Service
University of California, Berkeley via Coursera
Software Defined Networking
Georgia Institute of Technology via Coursera
Pattern-Oriented Software Architectures: Programming Mobile Services for Android Handheld Systems
Vanderbilt University via Coursera
Web-Technologien
openHPI
Données et services numériques, dans le nuage et ailleurs
Certificat informatique et internet via France Université Numerique