
Panel 2: Statistical Disclosure Control and the HIPAA Privacy Rule Protections

The second panel consisted of five presentations.

5.1. Latanya Sweeney – Carnegie Mellon University

Dr. Sweeney framed her presentation around lessons learned over the past 12 years and the improvements under HIPAA that they suggest. The first part of her remarks addressed the identifiability of data, HIPAA “certification,” and the risk assessment server approach used by Privacert Inc. The second portion of her presentation addressed re-identification experiments and successes in multi-party computation.

Dr. Sweeney briefly reviewed her pre-Privacy Rule work on re-identification. When the Privacy Rule was first issued, there was no definition of the very small risk required under the Expert Determination approach, and she noted that legal experts advised against relying on this risk criterion as a result. To address these concerns, she explained, she developed a Minimal Risk Standard, which requires the identifiability of the data to be no greater than that of data prepared under the Safe Harbor provision (0.4 percent for demographic data), even if the data include fields not addressed by Safe Harbor.

She then stated that the risk assessment server approach identifies publicly and semi-publicly available data, uses an inference engine to determine potential linkage pathways for re-identification, and requires the maintenance of an accurate population model. Dr. Sweeney illustrated how the risk assessment server works through a project with New York bioterrorism surveillance data. To bring the biosurveillance data into HIPAA compliance, dates of birth were limited to month and year and diagnosis codes were converted into syndrome or sub-syndrome codes. She suggested that this sort of data should be maintained in a registry and require a public notice when the data is shared.
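To make these transformations concrete, the following minimal sketch (with hypothetical field names and an invented diagnosis-to-syndrome mapping; it is not the Privacert inference engine) shows how a surveillance record could be coarsened in the manner described:

```python
from datetime import date

# Hypothetical mapping from diagnosis codes to broad syndrome categories.
SYNDROME_MAP = {"J06.9": "respiratory", "A09": "gastrointestinal"}

def generalize_record(record: dict) -> dict:
    """Coarsen a surveillance record: keep only month and year of birth and
    replace the diagnosis code with a broad syndrome category."""
    dob: date = record["date_of_birth"]
    return {
        "birth_month_year": f"{dob.year}-{dob.month:02d}",
        "syndrome": SYNDROME_MAP.get(record["diagnosis_code"], "other"),
    }

print(generalize_record({"date_of_birth": date(1957, 3, 14),
                         "diagnosis_code": "J06.9"}))
# {'birth_month_year': '1957-03', 'syndrome': 'respiratory'}
```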

She indicated that the most significant area of concern with the risk assessment approach is business associate agreements. Problems associated with these agreements include loss of control of data, errors in data, and the unbounded nature of the data once it leaves the institution.

Dr. Sweeney also addressed whether comparison to the Safe Harbor standard is sufficient. Using the 2006 Violence Against Women Act reauthorization (VAWA), which is similar to HIPAA and requires certification that data is not re-identifiable, she showed how thinking has shifted toward “provable privacy”. The approach developed by Dr. Sweeney used secure multi-party computation and enabled the construction of longitudinal records from multiple sources without encryption or a common hash function. She indicated that if the standard for HIPAA moved closer to that of VAWA, it would push the proof of anonymity into a formalism open to public inspection. This allows the data to remain “local” (close to where it was created); the anonymous query results are shared, not the actual data.

Dr. Sweeney indicated that one lesson learned is that technology design dictates policy. As a result, technology designers need to better understand the context in which they work, and policy needs to be reoriented to take advantage of technology that keeps data local.


5.2. Daniel Barth-Jones – Columbia University and Wayne State University

Dr. Barth-Jones presented an overview of the statistical, or Expert Determination, de-identification approach and provided examples of the importance of balancing privacy protection and statistical accuracy. He noted it is essential that researchers be able to conduct accurate research and analyses, using the information expected to come from the widespread adoption of EHRs, in order to improve decision making and science.

He explained that records are re-identified through linkage on data fields common to multiple sources. Resolution of the linkages relies on the number of matching fields available and the level of detail contained within these fields. He noted that it is important to consider the role of information that is part of the public record, or considered reasonably available, as quasi-identifiers[1] when applying the HIPAA statistical de-identification approach. He anticipated that the volume of this information will increase significantly as the amount of readily accessible records expands.
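As a rough illustration of this linkage process, the sketch below joins a hypothetical de-identified release to an invented public record on three commonly cited quasi-identifiers; all names, values, and field names are fabricated:

```python
deidentified = [
    {"zip": "02138", "dob": "1945-07-31", "sex": "F", "diagnosis": "hypertension"},
    {"zip": "02139", "dob": "1980-01-02", "sex": "M", "diagnosis": "asthma"},
]
public_record = [{"name": "Jane Doe", "zip": "02138", "dob": "1945-07-31", "sex": "F"}]

QUASI_IDENTIFIERS = ("zip", "dob", "sex")

def link(released, external):
    """Return (name, diagnosis) pairs where all quasi-identifiers match exactly."""
    index = {tuple(r[q] for q in QUASI_IDENTIFIERS): r for r in released}
    matches = []
    for person in external:
        key = tuple(person[q] for q in QUASI_IDENTIFIERS)
        if key in index:
            matches.append((person["name"], index[key]["diagnosis"]))
    return matches

print(link(deidentified, public_record))  # [('Jane Doe', 'hypertension')]
```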

Dr. Barth-Jones reviewed the notion of “k-anonymity”, under which a dataset is considered protected when each released record is indistinguishable from at least k-1 other records with respect to a set of quasi-identifying fields.[2] However, he noted that the majority of records that are unique in a data set will turn out to be non-unique in a larger dataset. This implies that researchers are probably over de-identifying data, which poses risks to various statistical characteristics. Along these lines, he noted it is essential to distinguish between sample and population uniques. Sample uniques are individuals whose characteristics occur only once in a data set; population uniques are those whose characteristics occur only once in the entire population. Records that are unique in both the sample and the population pose the greatest disclosure risk: with extra effort, such uniques could potentially be identified (risk levels vary depending on the relationship between sample and population uniques).
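A minimal sketch of these ideas, using invented records and assumed quasi-identifier names, computes the k achieved by a small sample and flags sample uniques; note that population uniqueness cannot be read off the sample itself and requires a population model or census data:

```python
from collections import Counter

def equivalence_class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

sample = [
    {"age_band": "60-69", "zip3": "021", "sex": "F"},
    {"age_band": "60-69", "zip3": "021", "sex": "F"},
    {"age_band": "30-39", "zip3": "021", "sex": "M"},   # a sample unique
]
sizes = equivalence_class_sizes(sample, ("age_band", "zip3", "sex"))
k = min(sizes.values())                                  # k achieved by this release
sample_uniques = [combo for combo, n in sizes.items() if n == 1]
print(k, sample_uniques)   # 1 [('30-39', '021', 'M')]
```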

As an alternative, he noted that the statistical, or Expert Determination de-identification provision allows for certification by a qualified statistician that de-identified data poses a very small risk of re-identification when used alone or in combination with other reasonably available data. As a result, statistical de-identification can be used to release some Safe Harbor information (e.g., geography, dates of service) if appropriate disclosure analyses have been conducted.

Dr. Barth-Jones described several approaches to reduce disclosure risks. The simplest methods – reducing linking key resolution and increasing reporting unit populations – tend to be the most practical for protecting privacy in data streams. Reducing linking key resolution is accomplished by reducing the number of quasi-identifiers released or by reducing the number of categories or values within a quasi-identifier. Successful solutions balance the protection of privacy with the preservation of the utility and accuracy of the statistical analyses. He pointed out that some approaches to de-identification degrade the accuracy of the analyses, which results in both bad science and bad decisions.[3]
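The following sketch, with invented values, illustrates the effect of coarsening a single quasi-identifier (exact age into ten-year bands): the number of distinct linkage keys available for matching shrinks, which is what makes linkage to external files harder:

```python
def age_band(age: int) -> str:
    """Generalize an exact age into a ten-year band, e.g. 34 -> '30-39'."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

records = [{"age": a, "zip3": z} for a, z in
           [(34, "021"), (35, "021"), (36, "021"), (61, "191"), (67, "191")]]

fine_keys   = {(r["age"], r["zip3"]) for r in records}            # full-resolution keys
coarse_keys = {(age_band(r["age"]), r["zip3"]) for r in records}  # generalized keys
print(len(fine_keys), len(coarse_keys))   # 5 2
```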

Dr. Barth-Jones also addressed several attacks on privacy that Safe Harbor may not mitigate. One challenge is that “family linkage” attacks require only ZIP code, age, and gender information to re-identify family members. The risk of re-identification by this method can be reduced by simply increasing the population size within a geographic area, taking into account the underlying population density. Another approach is the imposition of minimum population size thresholds for geographic reporting units (statistical disclosure risk analyses should be performed to ensure that the thresholds are appropriate for the set of variables to be reported). Geographic censoring, which does not report data for individuals with high disclosure risks, and geographic masking, which aggregates geographic units with high re-identification risks into larger population units, are useful in more limited circumstances. Overlapping reporting geographies can also create disclosure risks by creating small populations (less than 10,000).
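As a toy illustration of the family linkage risk, the sketch below matches an invented household, described only by ZIP code, ages, and genders, against a small released file; all values are fabricated, and increasing the population covered by the geographic unit would make such a match far less conclusive:

```python
# Released de-identified file, grouped by household identifier.
released_households = {
    "H1": [("02138", 44, "F"), ("02138", 46, "M"), ("02138", 12, "F")],
    "H2": [("02138", 71, "F")],
}
# What an intruder might know about one family from public information.
known_family = sorted([("02138", 46, "M"), ("02138", 44, "F"), ("02138", 12, "F")])

matches = [hid for hid, members in released_households.items()
           if sorted(members) == known_family]
print(matches)   # ['H1'] -- the household is re-identified
```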

A second challenge is that “geoproxy attacks” use geographic information systems to determine likely locations for individuals based on the location of their healthcare providers. These attacks have become easier with the availability of Web mapping tools on the Internet. Three-digit ZIP code information can be combined with provider data to reveal more information than originally thought possible. Dr. Barth-Jones suggested that the Safe Harbor requirement to recode the initial three digits of ZIP codes to “000” for areas containing 20,000 or fewer people can inadvertently reveal the three-digit ZIP codes that were supposed to be protected. He suggested that this be corrected in any future guidance.
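The Safe Harbor recoding rule he critiqued can be expressed in a few lines; in this sketch the population figures are invented for illustration:

```python
# Assumed three-digit ZIP area populations (invented figures).
ZIP3_POPULATION = {"036": 12_000, "100": 1_500_000}

def safe_harbor_zip3(zip_code: str) -> str:
    """Release the initial three digits only if the area they define has more
    than 20,000 people; otherwise recode them to '000'."""
    zip3 = zip_code[:3]
    return zip3 if ZIP3_POPULATION.get(zip3, 0) > 20_000 else "000"

print(safe_harbor_zip3("03609"), safe_harbor_zip3("10025"))   # 000 100
```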

In conclusion, Dr. Barth-Jones stressed that de-identified health data is an invaluable public good with numerous uses, including driving health systems improvements. Future progress in healthcare reform will rely on the use of de-identified data. Statistical de-identification is a reliable method for balancing privacy protection and the utility and accuracy of the data. He recommended that the statistical de-identification provisions continue to support very small risk thresholds and that the Safe Harbor provisions be revised to better address geographic attacks.


5.3. Jerome Reiter – Duke University

Dr. Reiter provided a statistician’s perspective on data confidentiality and access. He noted that data access involves a tradeoff between usefulness and disclosure risks, and he recommended that policies depend on the level of trust the organization has in the data users.

Policies for highly trusted users, such as credentialed researchers, should focus on preserving utility and place few restrictions on data access. Approaches suggested by Dr. Reiter included stronger support for licensing programs and access to data via virtual enclaves, where data can be seen but not saved. The enclave would allow researchers to preview data to determine its utility to their projects.

Policies for low-trust users, such as the general public, should emphasize protection of confidentiality and more tightly restrict access to data. Disclosure risks in this category relate to data such as geography, race, sex, and other data elements that might not seem significant at present but might gain significance in the future.

Dr. Reiter explained how statistical disclosure control specialists determine risks of disclosure through a three-part process. The process involves:

  1. Determining all variables that might be known to an intruder,
  2. Computing probabilities concerning what the intruder knows (e.g., whether the target is part of the sample), and
  3. Determining if the resulting risk level is small enough relative to the utility of the data.
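A toy numerical example of steps 2 and 3, under strong simplifying assumptions (the intruder knows the target's quasi-identifier values, the target's equivalence class in the released sample is of a known size, and all figures are invented), might look like the following:

```python
p_in = 0.6          # intruder's probability that the target appears in the sample
f = 3               # released records sharing the target's quasi-identifier values
p_reident = p_in * (1.0 / f)   # chance the intruder picks the target's record correctly

risk_threshold = 0.05          # what the data holder deems "very small" for this release
acceptable = p_reident <= risk_threshold
print(f"re-identification probability = {p_reident:.2f}, acceptable = {acceptable}")
# re-identification probability = 0.20, acceptable = False
```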

In conclusion, Dr. Reiter noted that risks are specific to each data set. He believes this makes the development and application of a generic Safe Harbor-like policy undesirable from a disclosure control perspective. He recommended that HHS encourage the development of Centers of Excellence that maintain best practices and assist researchers in the development of disclosure protocols.


5.4. Fritz Scheuren – National Opinion Research Center (NORC)

Dr. Scheuren contended that the de-identification community does not yet know enough to develop standards. He was, however, very positive about the use of synthetic data for situations outside HIPAA coverage, and he supported the use of the data enclaves and Centers of Excellence alluded to by Dr. Reiter.

Dr. Scheuren explained how NORC has undertaken approximately 55 de-identification/certification projects since the advent of the HIPAA Privacy Rule. He noted that, over time, there has been great improvement in data quality. At the same time, the issues surrounding the data have become more subtle in both subject matter and disclosure consequences. He believed this demands the formation of a community in which members can share best practices (currently, there are only multiple smaller communities addressing these issues). Dr. Scheuren expressed his opinion that standards could not be developed until such a large community has been formed.

With regard to reasonable risk, Dr. Scheuren indicated that the lack of specificity in the Privacy Rule is actually a strength. The Safe Harbor concept forms the basis for approaches that are reasonable (not absolute) and operational. He noted that the notion of reasonableness is a moving target: it changes over time and requires constant reevaluation. As a result, application of de-identification processes (e.g., hashing and encryption) requires ongoing modifications to address changing conditions.

He also remarked that data quality poses challenges to effective de-identification. Assessments can only be made based on what is reported, not what is missing. He noted that missing data and misreporting rates have improved over time, but he advocated for the discontinuation of public use files and the increased use of synthetic data.

He next stated that certification processes must be based on an understanding of the underlying data. Certifiers (of de-identified data sets) must address issues such as reverse engineering, individual data items (i.e., uniques), user training, and systems security. He was adamant that statistical checks are not sufficient protection and that much more focus needs to be given to security issues and business relationships of the data providers and users.

Dr. Scheuren stressed the importance of developing a community through conferences and workshops (such as this one) and other means to ensure a culture of privacy and the sharing of best practices. He also advocated for the development of checklists (not standards) and “buyer beware” tutorials to educate data providers about the qualifications of statisticians who certify data privacy risks.


5.5. Lawrence Cox – National Center for Health Statistics

Dr. Cox also addressed the topic of statistical disclosure limitation (SDL). He indicated that SDL toolkits are valuable and that many methods have been implemented and shared through them. With respect to the needs of HIPAA, he stated that solutions should be based on best practices and work 80% of the time; the remaining 20% are special cases that will need to be solved by the community.

He noted that twenty years ago, SDL was done post-production and “on the fly.”  It was not approached in a comprehensive manner. Unfortunately, this is still the case in many situations, although recent research shows some promise for improving practices.

Like previous speakers, he stressed the balance between disclosure limits and data value. The Safe Harbor provisions illustrate how SDL degrades quality, but he noted that HIPAA population thresholds are significantly lower than in other sectors.

Dr. Cox briefly reviewed various types of risks that should be accounted for. He noted that there are deterministic risks related to populations, samples, “matchables” to known files, and insider disclosure. He also remarked on probabilistic (or statistical) risks, such as a data recipient’s ability to infer certain information, including bounds on maximums and averages and an increased probability of knowledge about categories. In his opinion, a combined approach would be the ideal method, but it would be very difficult to achieve. It is the responsibility of the community to determine at what point risk becomes unacceptable along the continuum of disclosure and usability.

Dr. Cox indicated his belief that “signatures” within records (such as a series of blood pressure readings) are of concern under HIPAA because they are unique to the individual. He suggested rounding as a means of addressing this uniqueness. Another problem relates to “entropy”: the more data released now, the less data can be released later, because individuals can be tracked longitudinally.
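As a simple illustration of the rounding he suggested, the sketch below coarsens an invented series of systolic readings to the nearest 5 mmHg; both the readings and the rounding base are arbitrary choices for illustration:

```python
def round_to(value: float, base: int = 5) -> int:
    """Round a reading to the nearest multiple of `base`."""
    return int(base * round(value / base))

systolic_series = [127, 133, 121, 138]
released = [round_to(x) for x in systolic_series]
print(released)   # [125, 135, 120, 140]
```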

Moving forward, Dr. Cox urged that the community begin thinking about common definitions and methods regarding quality assessment. There is an opportunity at the present to address these issues in an orderly and comprehensive manner.



5.6. Discussion after Panel 2

To ensure the workshop remained on time, the discussion regarding this panel was held over to Panel 3.



[1] A quasi-identifier corresponds to the set of attributes in a data set that could be linked with external information to identify an individual. This definition is paraphrased from:

Dalenius T. Finding a needle in a haystack – or identifying anonymous census records. Journal of Official Statistics. 1986; 2: 329-336.

[2] Samarati P, Sweeney L. Generalizing data to provide anonymity when disclosing information (abstract). Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. 1998: 188.

Sweeney L. k-anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness, and Knowledge-Based Systems. 2002; 10: 557-570.

[3] Aggarwal C. On k-anonymity and the curse of dimensionality. Proceedings of the 31st International Conference on Very Large Data Bases. 2005: 901-909.