FAQ:

Question: What is the configuration of the VM?

Reply: The VM image (link) has 4 cores with 4GB memories and 200 GB disk, which can be launched using VMware player.

 

Question: What are the criteria when judging the submissions?

Reply: The implemented programs will be tested on a VM and the performance will be measured by the overall running time (including communication for task 2) and data exchange (for task 2 and 3) as well as the precision of calculated statistics.

 

For task 1:

We will evaluate each algorithm based on the maximum number of correct queries that it can respond before any individual can be re-identified by the Bustamante attack. We provide a training beacon and the code (based on the general public allele frequency, e.g., obtained from 1000 genome) for participating teams to evaluate their model.

 

For tasks 2 and 3:

Accuracy: each solution must achieve the minimum required accuracy, i.e., to be useful for biomedical studies.

Security: each solution must fulfill the minimum-security protection standard (i.e., under the semi-honest, adversary model with at least 80-bit security level). 80-bit security level means that it will take a computer, on average, approximately 280 operations to break the encryption. A semi-honest (honest but curious) adversary model is defined as each party following the protocol while also trying to discover more sensitive information about the underlying data.

Performance in terms of complexity: The execution time of each solution was measured under the same or comparable software and hardware setups. The evaluation includes the running time for encryption, computation over encrypted data, and decryption.

Performance in terms of storage: As data encryption increases the size of the data, we measured the storage efficiency of each solution after applying their proposed cryptographic protocols.

Performance in terms of communication cost: Cryptographic protocols involve the communication of encrypted data between computational nodes. We measured the communication cost in terms of network bandwidth consumption.

 

Question: What is the hardware specification used for the judgment? Does it support hardware AES-NI? how many cores in the CPU? What is the network bandwidth?

Reply: We provide a VM template (link). The image has 4 cores with 4 GB memory and 200 GB disk. It does support hardware AES-NI and the bandwidth is upto 10Mbps.

 

Question:: It is asked to use homomorphic encryption: does this mean specifically *fully* (or "somewhat" fully) homomorphic encryption, semi-homomorphic, or either?

Reply: The participating team can use any kind of homomorphic encryption schemes (e.g., fully, semi, or partial homomorphic encryption schemes).

 

Question: More details about Task 3?

Reply:For all cases (i.e., a full match or not), we need to return a vector with the same size, so that there is no potential leakage. For example, you can return a binary vector (0 unmatched, 1 matched) for each case, where the size is the same as the number of people in the database. Or, a vector of fixed size (which represents the maximum number of possible matched records, e.g., 100 or 1000). There will be multiple VCF files corresponding to different people in the online database. We allow client to perform a small amount of computation in plaintext (retrieved from cloud in encrypted format), but it is limited up to 100 comparison operations. We also limited the total number of positions (less than 20) to be retrieved from the server (in an encrypted format). Our test queries will be just a few SNPs (less than 5)
it is important to have only a very limited amount of additional data retrieved together with the answer of a query: e.g., for a specific SNP, no more than 20 other SNPs can be downloaded. Also keep in mind, though such additional data download won’t disqualify a participating team, it certainly put the team in a relatively disadvantage position with regard to other teams that do not have this requirement, for the sake of fairness. To retrieve these SNPs, Also importantly, no data access pattern should be exposed during the query: e.g., once a SNP is accessed, from the cloud’s perspective, the data block containing the SNP should look no more likely to be accessed than other blocks when the same SNP is queried again. Note that the purpose of this competition is to understand the progress of crypto techniques in terms of their capability to support real-world genome data analysis. The specific task given here just serves as an example. Therefore, our evaluation will balance performance, security guarantee and importantly the generality of the solution (after all, the technique we are looking for is expected to work on other, more complicated genetic tests than those involving just exact match).

 

Question: Beside the initial encryption, are any interaction allowed between the parties (random share protocols etc)?

Reply: The task 3 focuses on the Homomorphic encryption. Interaction between the data owner and remote cloud are only limited to upload the data and obtain the result.  In contrast, task 2 allows each participating team to use any SMC protocols.

 

Question:  Finally, we are not sure what is meant by "80-bit security level": which could refer to many different things, entirely dependent on the choice of crypto system.

Reply:  80-bit security level means that it will take a computer, on average, approximately 2^80 operations to break the encryption. For different crypto systems, the participant teams should find a set of parameters to achieve the equivalent security level.

 

Question: For task 1, is it that the data provided is the data used in the judgment?

Reply: No, we will use reserved data to evaluate submitted algorithms; but the format and scale stay the consistent with provided data.

 

Question: How many nodes are allowed for SMC?

Reply: We will allow up to two remote nodes, instantiated from our VM templates and the third node should carry as little computation as possible.

 

Question: Execution. Is it correct to assume that institution 1 will have its input data stored in a file (or files) and institution 2 will have its input data stored in a file when the execution begins, but they can't exchange input-dependent data prior to that?

Reply: Yes.

 

Question: Will the output need to be produced by both input parties or is having one of them produce the output going to suffice?

Reply: One party is enough.

 

Question: Will the programs start in all distributed locations at exactly the same time for timing purposes or otherwise what would the mechanism for discounting idle time while waiting for other parties to start be? (We perhaps can include time measurements of meaningful work from inside the program, but there may need to be a signaling mechanism to indicate that all parties are now online.)

Reply: We will evaluate the execution time on both sides.  We acknowledge that there will be variations due to synchronization and network delays, and therefore plan to rerun the programs to suppress the randomness. Most importantly, we will evaluate all participants' code in the same way, through multiple trials, to ensure the fairness.

 

Question: Can you provide information about how each program can locate other parties (e.g., by including their IP addresses in the program), that would ensure that the parties can compute together. (IP addresses are not needed at this point, but rather the knowledge that parties will know IP addresses of each other.)

Reply: We will adjust it through the evaluation. BTW, we will use domain name instead of IP addresses, which can be dynamic.

 

Question: Will we be allowed to compute any information "offline" (modeling computation done before the two parties interact, or before the genomes are received).

Reply: We can tolerant some minimum client participation after encrypted outcomes are received (to decrypt it and do some minimum calculation, e.g., division to obtain the final results). But preference will be given to solutions that can do everything in the encrypted form.

 

Question: Should our submissions be in a specific format in order to interface with your evaluation framework? In addition are we allowed to amend the packages installed on the VM (which would ultimately be required for installation on the test machine as well)?

Reply: You can install any package in the VM. We will evaluate the performance of your submission on another testing data in the VM. Therefore, we would prefer if you could upload your pre-configured VM for evaluation

 

Question: Are both parties intended to run on the same virtual machine for task 3? Will the evaluation be done on a single machine as well?

Reply: For the task 3, we can put both the 'client' and 'server' in the same VM, where the 'client' encrypts the data using Homomorphic encryption and the server calculates over the encrypted data. Finally, the client decrypts the results.  The VM will be a single VM running in a physics machine. However, the communication between server and client won’t be free. We expect that the client and the server, even residing in the same VM, communicate with each other through TCP, so we can measure the amount of data transferred between the two processes.

 

 

For Task 2:

 

Question: Is the data distributed among various organization? That is are say 100 records held by Hospita1, 100 records held by Hospital2 and 100 records held by Hospital3

and we need to return the top k records across the union of all the databases?

Reply: Yes, the data are distributed among different organizations (hospitals). In this task, we only consider a two-party scenario. One party (hospital A) has one or more query genes from different patients and the other party (hospital B) holds a target gene database (containing genomes from M patients). The goal is to find the K-nearest genes in the target database (held by hospital B) to each query gene (input by hospital A) and to return their IDs (not their gene sequences) to the corresponding queries in a secure manner.

 

Question: If the data is split between different organization how is this split done? Does a patient appear only in a single database? 

Reply: Please see our answer above. Note that the query genomes are not present in the database and each genome in the target database is distinct.

 

Question: Is the query asking party available for interaction?

Reply: Yes, the computation requires the interaction between parties. But there are only two parties in this scenario.

 

Question: How much leakage is allowed from the computation?

Reply: The only information leakage allowed in the computation is the final outcome of the computation, i.e., the IDs of the K-nearest genes (not the their sequences). No other information is allowed to be leaked. We assume both parties are honest-but-curirous.

 

Question: What is the scale of things? How many organizations are holding the records? How many records in each database? How long is the size of the genome sequence?

Reply: There are only two parties. The length of the gene sequence is expected to be as long as several thousands of SNPs. We hope to have 500 genes in the target database and there may be several hundreds of query genes.

 

Question: Is the query genome secret? 

Reply: yes, the query genome is secret from the other party.

  

Question: If you have a full description of a scenario that would be great.

Reply: We provided some sample codes (in the plaintext along with the sample data) to demonstrate the computation task.

 

Question: Would it be possible to elaborate on the quality requirement of the edit-distance?

Reply: we are looking for accurate edit distance calculation or better approximation than last year's approach. We intentionally select the example of genes whose edit distances cannot be approximated well based on last year's approximate algorithm. You can use either the rigorous (DP-based) edit distance calculation or any approximation you devised. We will evaluate your algorithm to see if the sequences in the databased with the smallest edit distance with the query sequence can be identified by your algorithm. You can test your algorithm first on the example we provide to see how the algorithm works.

 

 

For Task 3:

 

Question: Do we only consider exact match in Task 3?

Reply: Yes, We only consider the exact match for each variation in task 3.

 

Question: Do we consider information leakage based on run time of an execution in Task 3?

Reply: No, Running time related leakage is not considered.

 

Question: Which parties exactly hold the VCF file ? And the query information ? Are the two parties involved the "data owner" and the "cloud service” ?

Reply: Yes, the two parties are the data owner and the cloud service. The VCF files are encrypted in the cloud service using Homomorphic encryption. The query is also encrypted during the querying phase. The data owner can outsource the query computation and storage to the public cloud service provider through homomorphic encryption.

 

Question: Is the expected output simply the absence/presence of the specified biomarkers ? It is unclear otherwise how to determine the "probability of genetic disease”.

Reply: Yes, the outcome is the absence/presence of the specified biomarkers. The user can send multiple queries to get the ratio between matched queries and total queries.

 

Question: Is the time required for encryption of the genome part of the evaluation ? Is any other pre-computation allowed ?

Reply: Yes, we will evaluate the encryption, computation and decryption time. You can design your own data encoding algorithm for the VCF file before the encryption.

 

Question: Do the queries consist of a single character or sequences ?

Reply: Yes, It can be both individual characters or a short sequences (<100).

 

Question: more details about the scenario?

Reply: The data owner D owns the unencrypted vcf files. D encrypts these and transmits to the cloud service C. At a later date, D obtains a cleartext query which he encrypts and sends to C. C responds with an encryption which D can use to obtain the boolean result [match]/[no match]. In this scenario C should not learn anything about the query or the genome.

 

Question: Regarding the evaluation, what is the tradeoff between speed and storage space in the judging criteria ?

Reply: For the tradeoff between speed and storage, the storage will be taken into account, if and only if the speed performances end in a tie among different teams?