Required Input

If you have any questions or issues, feel free to open an issue or directly email Drew Neavin (d.neavin @ garvan.org.au)

This table illustrates the input that the pipeline requires to run and whether it is provided or needs to be prepared and provided by the user.

Input

Category

User-provided

Developer-provided

Genotyped SNPs plink2 pfiles (on hg19)

Study Data

✔️

✖️

(Example dataset provided)

1000G hg38 imputation reference

Reference

✖️

✔️

Singularity image

Software

✖️

✔️

Snakemake

Software

✖️

✔️

scipy

Software

✖️

✔️

Reference

We will be using the 1000G hg38 reference to impute the data and have prepared the reference. You can access it by running

wget https://www.dropbox.com/s/l60a2r3e4vo78mn/eQTLGenImpRef.tar.gz
wget https://www.dropbox.com/s/eci808v0uepqgcz/eQTLGenImpRef.tar.gz.md5

After downloading the reference, it is best to make sure the md5sum of the eQTLGenImpRef.tar.gz file matches the md5sum in the eQTLGenImpRef.tar.gz.md5:

md5sum eQTLGenImpRef.tar.gz > downloaded_eQTLGenImpRef.tar.gz.md5
diff -s eQTLGenImpRef.tar.gz.md5 downloaded_eQTLGenImpRef.tar.gz.md5

which should return:

Files eQTLGenImpRef.tar.gz.md5 and downloaded_eQTLGenImpRef.tar.gz.md5 are identical

If you get anything else, the download was probably incomplete and you should try to download the file again. Then, unpack the contents of the file:

tar xvzf eQTLGenImpRef.tar.gz

Note

Some HPCs limit the amount of time that a command can run on a head node, causing it to stop/fail part way through so it is best to untar by using a submission script.

Now you should have the references that are needed to impute the SNP genotype data. You will have the following directory structure:

hg38
├── imputation
├── phasing
│   ├── genetic_map
│   └── phasing_reference
├── ref_genome_QC
└── ref_panel_QC

Data

We have provided a test dataset that can be used to test the pipeline and we have built it in to the singularity image (below). It will be used for the example below and can be used to test the pipeline. You can also download it directly from https://www.dropbox.com/s/uy9828g1r1jt5xy/ImputationTestDataset_plink.tar.gz and check complete download with https://www.dropbox.com/s/q49gppt7uu75wxr/ImputationTestDataset_plink.tar.gz.md5

For your own dataset, you will need to make sure you have all the following files in the correct formats. You can check the test dataset for an example.

Plink2 reference SNP genotype pfiles

Your reference SNP genotype data will need to be supplied in the plink2 format which includes 3 files: data.pgen, data.psam, data.pvar

Important

Your chromosome encoding in the data.pvar file must not use ‘chr’. For example, chromosome 1 would be encoded as ‘1’, not ‘chr1’. The pipeline will check for this before running and will not run if it finds ‘chr’ chromsome encoding.

Important

The data.psam file needs to be in a specific format since it will be important for:

  • Comparing reported sexes with SNP-genotype predicted sexes

  • Comparing reported ancestries with 1000 Genomes-projected ancestry predictions

  • Creating a per-individual meta-data file for use in WG3 (eQTL detection)

The psam must be tab separated with the following headers and contents should look like this (and requires these headings):

#FID

IID

PAT

MAT

SEX

Provided_Ancestry

genotyping_platform

array_available

wgs_available

wes_available

age

age_range

Study

smoking_status

hormonal_contraception_use_currently

menopause

pregnancy_status

113

113

0

0

1

EUR

IlluminaInfiniumGlobalScreeningArray

Y

N

N

78

70

OneK1K

NA

NA

NA

NA

349

350

0

0

1

EUR

IlluminaInfiniumGlobalScreeningArray

Y

N

N

81

80

OneK1K

NA

NA

NA

NA

352

353

0

0

2

EUR

IlluminaInfiniumGlobalScreeningArray

Y

N

N

89

80

OneK1K

NA

NA

NA

NA

39

39

0

0

2

EUR

IlluminaInfiniumGlobalScreeningArray

Y

N

N

56

50

OneK1K

NA

NA

NA

NA

40

40

0

0

2

EUR

IlluminaInfiniumGlobalScreeningArray

Y

N

N

53

50

OneK1K

NA

NA

NA

NA

41

41

0

0

1

EUR

IlluminaInfiniumGlobalScreeningArray

Y

N

N

63

60

OneK1K

NA

NA

NA

NA

42

42

0

0

2

EUR

IlluminaInfiniumGlobalScreeningArray

Y

N

N

76

70

OneK1K

NA

NA

NA

NA

Key for column contents:

  • #FID: Family ID

  • IID: Within-family ID

  • PAT: Within-family ID of father (‘0’ if father isn’t in dataset)

  • MAT: Within-family ID of mother (‘0’ if mother isn’t in dataset)

  • SEX: Sex code (‘1’ = male, ‘2’ = female, ‘0’ = unknown)

  • Provided_Ancestry: reported ancestry (‘AFR’ = African, ‘AMR’ = Ad Mixed American, ‘EAS’ = East Asian, ‘EUR’ = European, ‘SAS’ = South Asian). If you don’t know, use ‘NA’.

  • genotyping_platform: array genotyping was done on

  • array_available: ‘Y’ or ‘N’; whether SNP genotype array is available for this sample

  • wgs_available: ‘Y’ or ‘N’; whether whole genome sequencing is available

  • wes_available: ‘Y’ or ‘N’; whether whole exome sequencing is available

  • age: age in years of integer, NA if unknown.

  • age_range: age in decades - lower bound, NA if unknown.

  • Study: name of the study this donor was included in.

  • smoking_status: Whether the donor smokes or smoked in the past. Options are:’yes’: smokes at time of sample collection, ‘past’: smoked in the past but not at time of sample collection, ‘no’: never smoked, ‘NA’: unknown smoking status.

  • hormonal_contraception_use_currently: whether the donor is currently using hormonal contraception. Options are: ‘yes’ (currently using hormonal contraception), ‘no’ (not currently using hormonal contraception) or ‘NA’ (unknown status of contraception use). Note that male donors must be coded as ‘NA’.

  • menopause: Donor menopause status at the time of sample collection. Options are ‘pre’ (have not yet gone through menopause), ‘menopause’ (currently going through menopause), ‘post’ (completed menopause) or ‘NA’ (unknown menopause status or male). Note: that male donors must be coded as ‘NA’.

  • pregnancy_status: Donor pregnancy status at the time of sample collection. Options are ‘yes’ (pregnant at time of sample collection), ‘no’ (not pregnant at time of sample collection) or ‘NA’ (unknown pregnancy status or male). Note: that male donors must be coded as ‘NA’.

  • Any additional metadata can be added as additional columns

Important

The data.psam file will be used to generate a per-individual meta-data file for use in WG3 (eQTL detection) and will be uploaded to a shared own cloud. As such, it is important that you carefully consider whether any individual IDs need to be anonymized.

Next Steps

Now that you have the required inputs organized, you can move on to the Required Software for the imputation pipeline.