FAQ

  • What does each data field mean in "snp_pvalue.txt"?
    • Each SNP entry in the file spans four lines. The first line gives the SNP ID, its location, and the allele frequencies in controls. The second line gives the genotypes of the cases. The third line gives the allele frequencies in cases. The fourth line is the p-value.

  • Which statistical test do the p-values correspond to?
    • They are calculated using the chi-square test function in SciPy.
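      As a rough illustration of what a chi-square p-value means for one SNP, here is a minimal sketch. It is an assumption, not the organizers' code: the answer above does not say which SciPy routine was used or how the table was built, so this pure-Python version runs a Pearson chi-square test without continuity correction on a hypothetical 2x2 table of allele counts (cases vs. controls). The function name and the example counts are made up.

```python
import math


def chi_square_2x2(case_minor, case_major, ctrl_minor, ctrl_major):
    """Pearson chi-square test (no continuity correction) on a 2x2 table.

    Rows are cases and controls, columns are minor and major allele counts.
    Assumes every row and column total is non-zero.
    """
    table = [[case_minor, case_major], [ctrl_minor, ctrl_major]]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)

    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            stat += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, P(X > stat) equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts: 30 minor / 70 major in cases, 10 minor / 90 major in controls.
stat, p_value = chi_square_2x2(30, 70, 10, 90)
print(f"chi2 = {stat:.2f}, p = {p_value:.2e}")
```

      A smaller p-value means a larger allele frequency difference between cases and controls relative to what chance alone would produce.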

  • Is there control data for task 2?
    • In task 2, you can treat the genotypes of the controls as unknown. The cases are generated from the Personal Genome Project (PGP). If needed, you may use the genotypes of the CEU population in HapMap as a reference.

  • For the purpose of designing the software package, what should be the format of the input and output?
    • For task 2, the input to the software package will be similar to "snp_value.txt", and the output should be the top K SNPs (one ID per row) based on the chi-square test. We will run your algorithm on a reserved set of SNPs, which is much larger, to evaluate its performance on top-K feature selection.
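      The output step described above can be sketched as follows. This assumes that "top K based on chi square test" means the K SNPs with the smallest p-values, and that p-values are already available as a mapping from SNP ID to p-value; both are assumptions. The function names and the SNP IDs in the example are hypothetical.

```python
import heapq


def top_k_snps(pvalues, k):
    """Return the k SNP IDs with the smallest p-values.

    pvalues maps SNP ID -> p-value. Ties are broken by SNP ID so the
    result is deterministic.
    """
    ranked = heapq.nsmallest(
        k, pvalues.items(), key=lambda item: (item[1], item[0])
    )
    return [snp_id for snp_id, _ in ranked]


def format_output(snp_ids):
    """Format the result with one SNP ID per row."""
    return "\n".join(snp_ids) + "\n"


# Hypothetical p-values for three SNPs.
example = {"rs1": 0.2, "rs2": 0.001, "rs3": 0.05}
print(format_output(top_k_snps(example, 2)), end="")
```

      heapq.nsmallest avoids sorting the whole set, which may help when the reserved SNP set is much larger than K.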

  • How can I evaluate data privacy through the web service for task 1?
    • Go to "Service" from the "LOGIN" page and submit a test file for either chr2 or chr10. The test file consists of a single line of minor allele frequencies, with the SNPs in the same order as in the case files. The page will return a picture. See the "Note" on the "Service" page for details about the result pictures.
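      A test file of this shape could be produced along these lines. This is only a sketch under several assumptions not confirmed above: values are separated by single spaces, four decimal places are acceptable, and the minor allele is simply the less frequent of the two alleles in your own data. Check the "Service" page for the actual requirements. The function name and the counts are hypothetical.

```python
def maf_line(allele_counts, sep=" "):
    """Build one line of minor allele frequencies.

    allele_counts is a list of (count_allele_a, count_allele_b) pairs,
    already in the same SNP order as the case file.
    """
    values = []
    for count_a, count_b in allele_counts:
        total = count_a + count_b
        maf = min(count_a, count_b) / total  # less frequent allele
        values.append(f"{maf:.4f}")
    return sep.join(values)


# Hypothetical allele counts for three SNPs.
line = maf_line([(30, 70), (50, 50), (90, 10)])
print(line)
```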