Presenter: Haley Hunter-Zinck, PhD
Haley Hunter-Zinck joined BIDS and BCHSI/UCSF as a Data Science Health Innovation Fellow in the Innovate For Health program.
This week's guest speaker, Haley Hunter-Zinck, Data Science Health Innovation Fellow in the Innovate For Health program, will present and lead a discussion on synthetic data.
Synthetic data, that is, data that mimics realistic patterns but does not correspond to actual data records, has the potential to provide high-utility datasets while preserving the privacy of contributors to real datasets. This concept is especially relevant to sensitive data, such as patient electronic health records, and to generators trained on real datasets (e.g., via generative adversarial networks). However, how to appropriately quantify the risk that a given synthetic dataset poses to patient privacy, and how to formally certify a synthetic dataset as low risk, remain open questions. Previous research has validated the privacy-preserving qualities of synthetic datasets through simple checks for memorization or in the context of membership or attribute inference attacks. A few studies have integrated differential privacy techniques into the training procedure of the synthetic data generator.
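To make the idea of a "simple check for memorization" concrete, here is a minimal sketch in Python. It is not the speaker's method or one taken from any particular study. It assumes each record can be represented as a tuple of hashable values, and the function name `exact_match_rate` and the toy records are hypothetical, chosen only for illustration.

```python
from typing import Hashable, Iterable, Tuple

Record = Tuple[Hashable, ...]


def exact_match_rate(real: Iterable[Record], synthetic: Iterable[Record]) -> float:
    """Return the fraction of synthetic records that exactly duplicate a real record.

    Hypothetical helper: it only detects verbatim copies, not near-duplicates.
    """
    real_set = {tuple(r) for r in real}          # set gives O(1) membership lookups
    synthetic_list = [tuple(s) for s in synthetic]
    if not synthetic_list:
        return 0.0                               # no synthetic records, nothing copied
    copied = sum(1 for s in synthetic_list if s in real_set)
    return copied / len(synthetic_list)


# Toy, made-up records: (age, sex, diagnosis)
real = [(34, "F", "asthma"), (61, "M", "diabetes"), (47, "F", "hypertension")]
synthetic = [
    (34, "F", "asthma"),        # verbatim copy of a real record
    (52, "M", "asthma"),
    (47, "F", "diabetes"),
    (29, "M", "hypertension"),
]

print(exact_match_rate(real, synthetic))  # 0.25
```

A rate of zero from a check like this would not by itself show that a synthetic dataset is low risk, since it says nothing about near-duplicates or about membership and attribute inference attacks.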
We would welcome discussion and expertise on any of the following or related questions. Are these techniques necessary and sufficient for quantifying the risk of synthetic datasets to patient privacy? What other techniques should be applied? What level of risk is acceptable? How does the certification procedure for synthetic datasets relate to certifying de-identified data? How open should synthetic data access be?
About the Working Group
The goal of this working group is to understand issues around sensitive/restricted use research data from a variety of viewpoints, especially those of Berkeley researchers who need and use such data and of the staff and units who support them. We will also seek to develop concrete solutions and products, whether environments, model security plans, data use agreements, or compendia of local data or resources. A third goal of this group is to provide input to IT and other organizations that are developing a set of suggested solutions to present to campus leadership. We expect to cover:
- potential sources for these research data
- the legal environment and constraints under which data can be shared
- models for hosting restricted use data (RUD)
- language used in data protection plans for various providers
- places on campus that can house restricted use data
- OPHS/IRB concerns and resources
- ... and topics you might want to see covered.