Required Input

If you have any questions or problems, feel free to open an issue or email Drew Neavin directly (d.neavin @ garvan.org.au).

This table illustrates the input that the pipeline requires to run and whether it is provided or needs to be prepared and provided by the user.

Input	Category	User-provided	Developer-provided
Genotyped SNPs plink2 pfiles (on hg19)	Study Data	✔️	✖️ (Example dataset provided)
1000G hg38 imputation reference	Reference	✖️	✔️
Singularity image	Software	✖️	✔️
Snakemake	Software	✖️	✔️
scipy	Software	✖️	✔️

Reference

We use the 1000G hg38 reference to impute the data and have prepared it for you. You can download it by running:

wget https://www.dropbox.com/s/l60a2r3e4vo78mn/eQTLGenImpRef.tar.gz
wget https://www.dropbox.com/s/eci808v0uepqgcz/eQTLGenImpRef.tar.gz.md5

After downloading the reference, it is best to make sure the md5sum of the eQTLGenImpRef.tar.gz file matches the md5sum in the eQTLGenImpRef.tar.gz.md5:

md5sum eQTLGenImpRef.tar.gz > downloaded_eQTLGenImpRef.tar.gz.md5
diff -s eQTLGenImpRef.tar.gz.md5 downloaded_eQTLGenImpRef.tar.gz.md5

which should return:

Files eQTLGenImpRef.tar.gz.md5 and downloaded_eQTLGenImpRef.tar.gz.md5 are identical

If you get anything else, the download was probably incomplete and you should try to download the file again. Then, unpack the contents of the file:

tar xvzf eQTLGenImpRef.tar.gz

Note

Some HPCs limit how long a command can run on a head node, which can cause it to stop or fail partway through, so it is best to untar the file from a submission script.
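As a rough sketch of such a submission script, the checksum comparison and the unpacking step can be wrapped in one shell function. The function name verify_and_unpack is hypothetical and not part of the pipeline. Scheduler directives (for example for SLURM or SGE) are left out because they depend on your HPC; add whatever your cluster requires at the top of the script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare the md5 hash of an archive with the hash
# stored in its .md5 file, and unpack the archive only if they match.
verify_and_unpack() {
    local archive="$1"
    local md5_file="$2"
    local expected
    local actual

    # Assumption: the .md5 file is in md5sum format, so the hash is the first field.
    expected=$(awk '{print $1; exit}' "$md5_file")
    actual=$(md5sum "$archive" | awk '{print $1}')

    if [ "$expected" != "$actual" ]; then
        echo "Checksum mismatch for $archive; try downloading it again." >&2
        return 1
    fi

    tar xvzf "$archive"
}

# Only runs when both downloaded files are in the current directory.
if [ -f eQTLGenImpRef.tar.gz ] && [ -f eQTLGenImpRef.tar.gz.md5 ]; then
    verify_and_unpack eQTLGenImpRef.tar.gz eQTLGenImpRef.tar.gz.md5
fi
```

Comparing only the hash field means the check does not depend on the file name recorded inside the .md5 file.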

Now you should have the references that are needed to impute the SNP genotype data. You will have the following directory structure:

hg38
├── imputation
├── phasing
│   ├── genetic_map
│   └── phasing_reference
├── ref_genome_QC
└── ref_panel_QC
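If you want to confirm that the archive unpacked completely, the directory layout above can be checked with a few lines of shell. This is only a sketch: check_reference_dirs is a hypothetical helper, and it checks that the listed directories exist, not their contents.

```shell
# Hypothetical helper: report any directory from the expected layout
# that is missing under the given root (defaults to ./hg38).
check_reference_dirs() {
    local root="${1:-hg38}"
    local missing=0
    local dir

    for dir in imputation phasing/genetic_map phasing/phasing_reference \
               ref_genome_QC ref_panel_QC; do
        if [ ! -d "$root/$dir" ]; then
            echo "Missing directory: $root/$dir" >&2
            missing=1
        fi
    done

    return "$missing"
}

# Example call (run from the directory where you unpacked the archive):
#   check_reference_dirs hg38 && echo "Reference layout looks complete"
```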

Data

We have provided a test dataset for testing the pipeline and have built it into the Singularity image (below). It is used in the example below. You can also download it directly from https://www.dropbox.com/s/uy9828g1r1jt5xy/ImputationTestDataset_plink.tar.gz and check that the download is complete with https://www.dropbox.com/s/q49gppt7uu75wxr/ImputationTestDataset_plink.tar.gz.md5

For your own dataset, you will need to make sure you have all the following files in the correct formats. You can check the test dataset for an example.

Plink2 reference SNP genotype pfiles

Your reference SNP genotype data will need to be supplied in the plink2 format, which consists of 3 files: data.pgen, data.psam and data.pvar.
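A quick way to confirm that all three files are present for a given prefix is sketched below. The helper check_pfiles is hypothetical, and the prefix "data" is simply the one used in this section; substitute the prefix of your own files.

```shell
# Hypothetical helper: check that <prefix>.pgen, <prefix>.psam and
# <prefix>.pvar all exist and are not empty.
check_pfiles() {
    local prefix="$1"
    local missing=0
    local ext

    for ext in pgen psam pvar; do
        if [ ! -s "$prefix.$ext" ]; then
            echo "Missing or empty file: $prefix.$ext" >&2
            missing=1
        fi
    done

    return "$missing"
}

# Example call:
#   check_pfiles data && echo "All three pfiles found"
```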

Important

Your chromosome encoding in the data.pvar file must not use ‘chr’. For example, chromosome 1 must be encoded as ‘1’, not ‘chr1’. The pipeline checks for this before running and will not run if it finds ‘chr’ chromosome encoding.
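You can run a similar check by hand before starting the pipeline. The sketch below is not the pipeline's own check. The helper pvar_uses_chr_prefix is hypothetical and assumes that header lines in data.pvar start with ‘#’ and that the first column holds the chromosome.

```shell
# Hypothetical helper: exit status 0 if any variant line in the .pvar file
# has a chromosome starting with "chr", and 1 otherwise.
pvar_uses_chr_prefix() {
    awk '
        # Skip header lines; stop at the first chromosome with a chr prefix.
        !/^#/ && $1 ~ /^chr/ { found = 1; exit }
        END { exit !found }
    ' "$1"
}

# Example call:
#   if pvar_uses_chr_prefix data.pvar; then
#       echo "data.pvar uses chr-prefixed chromosomes; recode it before running"
#   fi
```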

Important

The data.psam file needs to be in a specific format since it will be important for:

Comparing reported sexes with SNP-genotype predicted sexes
Comparing reported ancestries with 1000 Genomes-projected ancestry predictions
Creating a per-individual meta-data file for use in WG3 (eQTL detection)

The psam must be tab-separated and must contain the following headers. The contents should look like this:

#FID	IID	PAT	MAT	SEX	Provided_Ancestry	genotyping_platform	array_available	wgs_available	wes_available	age	age_range	Study	smoking_status	hormonal_contraception_use_currently	menopause	pregnancy_status
113	113	0	0	1	EUR	IlluminaInfiniumGlobalScreeningArray	Y	N	N	78	70	OneK1K	NA	NA	NA	NA
349	350	0	0	1	EUR	IlluminaInfiniumGlobalScreeningArray	Y	N	N	81	80	OneK1K	NA	NA	NA	NA
352	353	0	0	2	EUR	IlluminaInfiniumGlobalScreeningArray	Y	N	N	89	80	OneK1K	NA	NA	NA	NA
39	39	0	0	2	EUR	IlluminaInfiniumGlobalScreeningArray	Y	N	N	56	50	OneK1K	NA	NA	NA	NA
40	40	0	0	2	EUR	IlluminaInfiniumGlobalScreeningArray	Y	N	N	53	50	OneK1K	NA	NA	NA	NA
41	41	0	0	1	EUR	IlluminaInfiniumGlobalScreeningArray	Y	N	N	63	60	OneK1K	NA	NA	NA	NA
42	42	0	0	2	EUR	IlluminaInfiniumGlobalScreeningArray	Y	N	N	76	70	OneK1K	NA	NA	NA	NA
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…

Key for column contents:

#FID: Family ID
IID: Within-family ID
PAT: Within-family ID of father (‘0’ if father isn’t in dataset)
MAT: Within-family ID of mother (‘0’ if mother isn’t in dataset)
SEX: Sex code (‘1’ = male, ‘2’ = female, ‘0’ = unknown)
Provided_Ancestry: reported ancestry (‘AFR’ = African, ‘AMR’ = Admixed American, ‘EAS’ = East Asian, ‘EUR’ = European, ‘SAS’ = South Asian). If you don’t know, use ‘NA’.
genotyping_platform: the array that genotyping was done on
array_available: ‘Y’ or ‘N’; whether SNP genotype array is available for this sample
wgs_available: ‘Y’ or ‘N’; whether whole genome sequencing is available
wes_available: ‘Y’ or ‘N’; whether whole exome sequencing is available
age: age in years as an integer; ‘NA’ if unknown.
age_range: lower bound of the donor’s age decade (for example, ‘70’ for a donor aged 78); ‘NA’ if unknown.
Study: name of the study this donor was included in.
smoking_status: Whether the donor smokes or smoked in the past. Options are: ‘yes’ (smokes at time of sample collection), ‘past’ (smoked in the past but not at time of sample collection), ‘no’ (never smoked) or ‘NA’ (unknown smoking status).
hormonal_contraception_use_currently: whether the donor is currently using hormonal contraception. Options are: ‘yes’ (currently using hormonal contraception), ‘no’ (not currently using hormonal contraception) or ‘NA’ (unknown status of contraception use). Note that male donors must be coded as ‘NA’.
menopause: Donor menopause status at the time of sample collection. Options are: ‘pre’ (have not yet gone through menopause), ‘menopause’ (currently going through menopause), ‘post’ (completed menopause) or ‘NA’ (unknown menopause status or male). Note that male donors must be coded as ‘NA’.
pregnancy_status: Donor pregnancy status at the time of sample collection. Options are: ‘yes’ (pregnant at time of sample collection), ‘no’ (not pregnant at time of sample collection) or ‘NA’ (unknown pregnancy status or male). Note that male donors must be coded as ‘NA’.
Any additional metadata can be added as additional columns
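Before running the pipeline, it may help to check your psam against the column key above. The following is a minimal sketch, not the pipeline's own validation. The helper check_psam is hypothetical and makes three assumptions: the 17 required columns come first and in the order shown above, SEX is one of 0, 1 or 2, and age_range equals age rounded down to the decade (as in the example rows).

```shell
# Hypothetical helper: print any problems found in a tab-separated .psam file
# and return a non-zero exit status if there are any.
check_psam() {
    awk -F'\t' '
    BEGIN {
        n = split("#FID IID PAT MAT SEX Provided_Ancestry genotyping_platform " \
                  "array_available wgs_available wes_available age age_range " \
                  "Study smoking_status hormonal_contraception_use_currently " \
                  "menopause pregnancy_status", required, " ")
    }
    NR == 1 {
        # Assumption: required columns appear first, in this order.
        for (i = 1; i <= n; i++) {
            if ($i != required[i]) {
                printf "Column %d: expected %s, found %s\n", i, required[i], $i
                bad = 1
            }
        }
        ncol = NF
        next
    }
    NF != ncol {
        printf "Line %d: %d fields, but the header has %d\n", NR, NF, ncol
        bad = 1
    }
    $5 !~ /^[012]$/ {
        printf "Line %d: SEX must be 0, 1 or 2, found %s\n", NR, $5
        bad = 1
    }
    # Assumption: age_range is the age rounded down to the decade.
    $11 != "NA" && $12 != "NA" && $12 != int($11 / 10) * 10 {
        printf "Line %d: age_range %s does not match age %s\n", NR, $12, $11
        bad = 1
    }
    END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Example call:
#   check_psam data.psam && echo "psam looks consistent with the key"
```

Additional metadata columns after the required ones are accepted, since only the first 17 header fields are compared.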

Important

The data.psam file will be used to generate a per-individual meta-data file for use in WG3 (eQTL detection) and will be uploaded to a shared own cloud. As such, it is important that you carefully consider whether any individual IDs need to be anonymized.

Next Steps

Now that you have the required inputs organized, you can move on to the Required Software for the imputation pipeline.