March 2015
Once released to the public, data cannot be taken back. As time passes, data analytic techniques improve and additional datasets become public that can reveal information about the original data. It follows that released data will get increasingly vulnerable to re-identification—unless methods with provable privacy properties are used for the data release. We review and draw lessons from the history of re-identification demonstrations; explain why the privacy risk of data that is protected by ad hoc de-identification is not just unknown, but unknowable; and contrast this situation with provable privacy techniques like differential privacy. We then offer recommendations for practitioners and policymakers. Because ad hoc de-identification methods make the probability of a privacy violation in the future essentially unknowable, we argue for a weak version of the precautionary approach, in which the idea that the burden of proof falls on data releasers guides policies that incentivize them not to default to full, public releases of datasets using ad hoc de-identification methods. We discuss the levers that policymakers can use to influence data access and the options for narrower releases of data. Finally, we present advice for six of the most common use cases for sharing data. Our thesis is that the problem of “what to do about re-identification” unravels once we stop looking for a one-size-fits-all solution, and each of the six cases we consider a solution that is tailored, yet principled.