We will create synthetic data for virtual patients, whose demographics and genome data are generated from the Personal Genome Project (PGP) and the HapMap Project. The challenge will solicit privacy-preserving algorithms in two tracks, handling genomic data and combined genomic/demographic data, respectively.
The first track uses only genomic data and comprises two tasks.

(a) Privacy-preserving feature selection that meets the differential privacy criterion. Participating teams develop algorithms on the synthetic samples (e.g., to select the 10 most significant SNPs among the 5,000 SNPs in the provided sample data) with differential privacy enforced, and deliver the algorithms to the challenge organizers, who will test their scalability and generalizability. The best solution is the one that allows a data user to conduct the computation at the minimum level of information exposure (i.e., privacy budget) while maintaining the maximum level of accuracy in the results (i.e., the best coverage of significant SNPs). One standard approach is sketched after these task descriptions.

(b) Privacy-preserving data dissemination that is robust to known attack models. Given the allele frequencies of SNPs at specific loci in a synthetic case group, participating teams must produce perturbed data that cannot be used to re-identify case participants, while best preserving utility (i.e., a set of association tests that is not given to the participants). Note that we will not enforce differential privacy in this task; instead, we will use statistical tests (e.g., the likelihood ratio test, illustrated below) to evaluate re-identification risk.
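As a concrete illustration of task (a), the sketch below applies the exponential mechanism k times with a simple sequential-composition budget split, one standard way to release the top-k SNPs under differential privacy. It is a minimal sketch under stated assumptions, not the required method: the function name dp_top_k_snps, the per-pick budget split, and the assumption that each per-SNP score has a known sensitivity bound are all illustrative choices.

    import numpy as np

    def dp_top_k_snps(scores, k, epsilon, sensitivity, rng=None):
        """Select k SNP indices under epsilon-differential privacy by
        applying the exponential mechanism k times without replacement.

        scores      : per-SNP association scores (e.g., chi-square
                      statistics); higher means more significant.
        k           : number of SNPs to release (e.g., 10).
        epsilon     : total privacy budget for the selection.
        sensitivity : assumed upper bound on how much one individual's
                      record can change any single score.
        """
        rng = rng or np.random.default_rng()
        scores = np.asarray(scores, dtype=float)
        eps_per_pick = epsilon / k  # sequential composition over k picks
        available = np.arange(scores.size)
        chosen = []
        for _ in range(k):
            # Exponential-mechanism weights; subtract the max for
            # numerical stability before exponentiating.
            logits = eps_per_pick * scores[available] / (2.0 * sensitivity)
            logits -= logits.max()
            probs = np.exp(logits)
            probs /= probs.sum()
            idx = rng.choice(available.size, p=probs)
            chosen.append(available[idx])
            available = np.delete(available, idx)
        return np.array(chosen)

    # Example with fabricated demo scores: pick 10 of 5,000 SNPs with a
    # total budget of epsilon = 1.0.
    demo_scores = np.random.default_rng(0).chisquare(df=1, size=5000)
    selected = dp_top_k_snps(demo_scores, k=10, epsilon=1.0, sensitivity=1.0)

Tighter budget allocations (e.g., one-shot top-k selection mechanisms) trade accuracy against privacy differently; the challenge scores exactly this trade-off.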
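For task (b), a likelihood ratio test of the kind used to assess re-identification risk compares an individual's alleles against the released pool frequencies versus reference-population frequencies; a large positive statistic suggests pool membership. The sketch below is a simplified haploid version with hypothetical names (lr_statistic, pool_freq, ref_freq); the organizers' actual test statistic may differ.

    import numpy as np

    def lr_statistic(alleles, pool_freq, ref_freq, eps=1e-9):
        """Log-likelihood-ratio membership statistic for one individual,
        testing released case-pool frequencies against a reference
        population (in the style of published LR re-identification
        attacks on GWAS summary statistics).

        alleles   : 0/1 array, the individual's allele at each locus
                    (one haplotype; apply per haplotype for diploid data).
        pool_freq : released (possibly perturbed) case-pool frequencies.
        ref_freq  : reference-population allele frequencies.
        """
        p = np.clip(np.asarray(pool_freq, dtype=float), eps, 1 - eps)
        q = np.clip(np.asarray(ref_freq, dtype=float), eps, 1 - eps)
        a = np.asarray(alleles, dtype=float)
        return np.sum(a * np.log(p / q) + (1 - a) * np.log((1 - p) / (1 - q)))

A perturbation scheme withstands this check when the statistic's distribution for true case members overlaps that for non-members enough that membership cannot be called with confidence.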
For the second track, which involves both genomic data and demographics, we will use the same criteria as in the first track, except that we will also test the algorithms at the subpopulation level (i.e., stratified by age, gender, etc.); a stratified utility check is sketched below. An ideal algorithm should therefore also balance privacy and utility on subpopulations. For the data-dissemination task in both tracks, participating teams evaluate data privacy and utility through a web service to be provided by the challenge organizers, which measures the level of protection against attacks and how well utility is preserved. For the feature-selection task in both tracks, participating teams must use no more than a privacy budget determined by the organizers, and utility can be verified locally with the association tests provided by the challenge organizers.
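To make the subpopulation criterion concrete, here is a minimal sketch of a per-stratum allelic chi-square association test, the kind of utility check a team could run locally. The data layout (genotypes, labels, strata) is a hypothetical encoding, and the tests actually provided by the organizers may differ.

    import numpy as np
    from scipy.stats import chi2_contingency

    def stratified_association(genotypes, labels, strata):
        """Allelic chi-square test within each stratum, so utility can
        be checked on subpopulations as well as on the full cohort.

        genotypes : (n,) minor-allele counts {0, 1, 2} at one SNP.
        labels    : (n,) case/control indicator {0, 1}.
        strata    : (n,) subgroup key (e.g., gender or age band).
        Returns {stratum: p-value}.
        """
        genotypes = np.asarray(genotypes)
        labels = np.asarray(labels)
        strata = np.asarray(strata)
        results = {}
        for s in np.unique(strata):
            m = strata == s
            # 2x2 allele-count table: rows = control/case,
            # columns = major/minor allele.
            minor = np.array([genotypes[m & (labels == 0)].sum(),
                              genotypes[m & (labels == 1)].sum()])
            total = np.array([2 * np.sum(m & (labels == 0)),
                              2 * np.sum(m & (labels == 1))])
            table = np.column_stack([total - minor, minor])
            if np.any(table.sum(axis=0) == 0) or np.any(table.sum(axis=1) == 0):
                continue  # skip degenerate strata (e.g., no cases)
            _, p_value, _, _ = chi2_contingency(table)
            results[s] = p_value
        return results

Comparing these per-stratum p-values on the data before and after perturbation indicates whether privacy protection degrades utility unevenly across subgroups.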
Each solution should be submitted by a participating team as a software package, which will be evaluated by an independent evaluation committee organized by the iDASH center against two criteria: (1) how well the privacy risks are mitigated; and (2) how well the utility of the data is preserved. The utility functions (i.e., association tests) will not be revealed to the participants for the data-dissemination task (although they can be exercised through the web service), but will be shared for the feature-selection task so that participating teams can also evaluate utility locally.