I am a researcher at the Harvard John A. Paulson School of Engineering and Applied Sciences. I work on machine learning and statistical methods for broad problems in quantitative social science, such as privacy preservation, missing data, measurement error and human-guided machine learning. I also run an exceptional software team building tools to implement these solutions.
Differential privacy is the gold standard definition of privacy protection. The WhiteNoise project aims to connect theoretical solutions from the academic community with the practical lessons learned from real-world deployments, to make differential privacy broadly accessible to future deployments. Specifically, we provide several basic building blocks that can be used by people involved with sensitive data, with implementations based on vetted and mature differential privacy research. In WhiteNoise Core, we provide a pluggable open source library of differentially private algorithms and mechanisms for releasing privacy preserving queries and statistics, as well as APIs for defining an analysis and a validator for evaluating these analyses and composing the total privacy loss on a dataset.
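To make the idea of mechanisms and composed privacy loss concrete, here is a minimal sketch in Python. It is not WhiteNoise Core code: the names `PrivacyBudget`, `laplace_noise`, and `dp_mean` are hypothetical, the mechanism shown is the textbook Laplace mechanism, and the accounting is basic sequential composition (the epsilons of successive releases add). It assumes the number of records is public and that neighboring datasets differ in one record. Ordinary floating-point sampling like this is for illustration only; vetted libraries take extra care here.

```python
import random


class PrivacyBudget:
    """Hypothetical accountant using basic sequential composition."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        # Under basic composition the privacy losses of releases simply add up.
        if self.spent + epsilon > self.total:
            raise ValueError("privacy budget exceeded")
        self.spent += epsilon


def laplace_noise(scale, rng):
    # The difference of two independent exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_mean(values, lower, upper, epsilon, budget, rng):
    """Release a noisy mean of `values` clamped to [lower, upper]."""
    if epsilon <= 0 or not values:
        raise ValueError("need epsilon > 0 and at least one value")
    budget.charge(epsilon)
    clamped = [min(max(v, lower), upper) for v in values]
    n = len(clamped)
    # Assumption: n is public, so changing one record moves the mean
    # by at most (upper - lower) / n.
    sensitivity = (upper - lower) / n
    return sum(clamped) / n + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)
budget = PrivacyBudget(total_epsilon=1.0)
ages = [23, 35, 41, 58, 62, 29, 47]
print(dp_mean(ages, lower=0, upper=100, epsilon=0.5, budget=budget, rng=rng))
print("epsilon spent so far:", budget.spent)
```

A second release at epsilon 0.5 would exhaust this budget, and any further query would be refused, which is the behavior a validator that tracks total privacy loss is meant to enforce.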
OpenDP is a community effort to build a trustworthy suite of open-source tools for enabling privacy-protective analysis of sensitive personal data, focused on a library of algorithms for generating differentially private statistical releases. The target use cases for OpenDP are to enable government, industry, and academic institutions to safely and confidently share sensitive data to support scientifically oriented research and exploration in the public interest. We aim for OpenDP to flexibly grow with the rapidly advancing science of differential privacy, and be a pathway to bring the newest algorithmic developments to a wide array of practitioners.
TwoRavens is a Web-based platform for statistical analysis. The goal is to allow a domain expert, working in concert with our system, to complete a high-quality, predictive, and interpretable model without the help of a statistical expert. To do so, the system facilitates intuitive machine learning and model interpretation, model discovery, and data exploration. Our intelligent back end automatically seeks interesting relationships in the data and builds models to predict outcomes, while researchers contribute substantive knowledge about their data and their own research questions to guide this automated assistance. We call this interactive paradigm human-guided machine learning.
This prototype system allows researchers with sensitive datasets to make differentially private statistics about their data available through data repositories, such as the Dataverse platform. The system allows researchers to: [1] upload private data to a secured Dataverse archive, [2] decide what statistics they would like to release about that data, and [3] release privacy-preserving versions of those statistics to the repository. The released statistics [4] can be explored through a curator interface without releasing the raw data, including through [5] interactive queries. This system was created by the Privacy Tools for Sharing Research Data project. Differential privacy is a mathematical framework for enabling statistical analysis of sensitive datasets while ensuring that individual-level information cannot be leaked.
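For readers who want the formal statement behind that last sentence, the standard definition of pure differential privacy can be written as follows. The symbols are the usual textbook ones rather than notation taken from this system: $M$ is a randomized release mechanism, $D$ and $D'$ are any two datasets that differ in one individual's record, $S$ is any set of possible outputs, and $\varepsilon \ge 0$ is the privacy-loss parameter.

```latex
\Pr\bigl[ M(D) \in S \bigr] \;\le\; e^{\varepsilon} \, \Pr\bigl[ M(D') \in S \bigr]
\qquad \text{for all neighboring } D, D' \text{ and all output sets } S .
```

A smaller $\varepsilon$ means the distribution of released statistics changes less when any one person's data is added, removed, or altered, which is the sense in which individual-level information is protected.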
Amelia II "multiply imputes" missing data in a single cross-section (such as a survey), in a time series (such as variables collected for each year in a country), or in a time-series-cross-sectional data set (such as data collected by year for each of several countries). The program also generalizes existing approaches by allowing for trends in time series across observations within a cross-sectional unit, as well as priors that allow experts to incorporate beliefs they have about the values of missing cells in their data. Amelia II also includes useful diagnostics of the fit of multiple imputation models. The program works from the R command line or via a graphical user interface that does not require users to know R.
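Once a data set has been multiply imputed, the same analysis is usually run on each completed data set and the results are pooled. The sketch below shows that pooling step using Rubin's rules, in Python rather than R. It is an illustration of the standard combining formulas, not code from Amelia II, and the function name `pool_rubin` is hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed data sets with Rubin's rules.

    estimates: the point estimate from each imputed data set
    variances: the squared standard error from each imputed data set
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need matching lists from at least two imputations")
    pooled = mean(estimates)
    within = mean(variances)       # average sampling variance
    between = variance(estimates)  # sample variance across imputations (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Made-up results from five imputed data sets, for illustration only.
est, tot = pool_rubin([1.9, 2.1, 2.0, 2.3, 1.7], [0.04, 0.05, 0.04, 0.06, 0.05])
print(est, tot ** 0.5)  # pooled estimate and its standard error
```

The between-imputation term is what carries the extra uncertainty due to the missing values; analyzing a single imputed data set would leave it out.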
Zelig is a framework that brings together an abundance of common statistical models found across R packages into a unified interface, and provides a common architecture for estimation and interpretation, as well as bridging functions to absorb more and more models into the collective library. Zelig allows each individual package, for each statistical model, to be accessed by a common, uniformly structured call and set of arguments. Moreover, Zelig automates all the surrounding building blocks of a statistical workflow: procedures and algorithms that may be essential to one user's application but which the original package developer did not use in their own research and might not themselves support. These include bootstrapping, jackknifing, and reweighting of data. In particular, Zelig automatically generates predicted and simulated quantities of interest (such as relative risk ratios, average treatment effects, first differences, and predicted and expected values) to interpret and visualize complex models.
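As a rough picture of what a simulated quantity of interest is, the sketch below computes a first difference in predicted probability for a logit model by drawing coefficients from their estimated sampling distribution. It is written in Python, not R, and is not Zelig's implementation: the function name `simulate_first_difference` is hypothetical, and it makes the simplifying assumption that coefficients are drawn independently (a diagonal covariance matrix) rather than from the full estimated covariance.

```python
import math
import random
from statistics import mean


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def simulate_first_difference(coefs, std_errs, x_low, x_high, n_sims=1000, seed=0):
    """Simulate the change in predicted probability from x_low to x_high.

    Returns (mean_difference, (lower, upper)), where the pair is an
    approximate 95% interval taken from the simulation percentiles.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_sims):
        # Assumption: independent normal draws, one per coefficient.
        draw = [rng.gauss(b, se) for b, se in zip(coefs, std_errs)]
        p_low = inv_logit(sum(b * x for b, x in zip(draw, x_low)))
        p_high = inv_logit(sum(b * x for b, x in zip(draw, x_high)))
        diffs.append(p_high - p_low)
    diffs.sort()
    lower = diffs[int(0.025 * n_sims)]
    upper = diffs[int(0.975 * n_sims) - 1]
    return mean(diffs), (lower, upper)


# Made-up intercept and slope, for illustration only.
fd, interval = simulate_first_difference(
    coefs=[-0.5, 0.8], std_errs=[0.2, 0.15],
    x_low=[1.0, 0.0], x_high=[1.0, 1.0],
)
print(fd, interval)
```

The point of the simulation is that uncertainty in the coefficients carries through to the quantity a reader actually cares about, here a change in probability, instead of stopping at a table of coefficients.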