SHAPEIT SHAPEIT Home Citations What's new? Download and Licence Contact Getting started Algorithm parameters Model parameters Input file options Output file options Pre-phasing Imputation Using family information Using a reference panel Phasing X chromosomes Read aware phasing Using phasing uncertainty Genotype calling Options list File format specification Version history Last update: 17 January 2018 SHAPEIT SHAPEIT is a fast and accurate method for estimation of haplotypes (aka phasing) from genotype or sequencing data. SHAPEIT has primarily been developed by Dr Olivier Delaneau through a collaborative project between the research groups of Prof Jean-Francois Zagury at CNAM and Prof Jonathan Marchini at Oxford. Funding for this project has been received from several sources : CNAM, Peptinov, MRC, Leverhulme, The Wellcome Trust. SHAPEIT has several notable features: - Linear complexity with the number of SNPs and conditioning haplotypes. - Whole chromosome GWAS scale datasets can be phased in a single run. - Phasing individuals with any level of relatedness - Phasing is multi-threaded to tailor computational times to your resources. - Handles X chromosomes phasing. - Phasing using a reference panel (eg.1,000 Genomes) to aid phasing - Ideal for pre-phasing imputation together with IMPUTE2 SHAPEIT is free for academic use only. For commercial use please see here Citations The methods implemented in the SHAPEIT software are described in the following papers. Please cite one or more of these papers if you use our software in your study. O. Delaneau, J. Marchini, JF. Zagury (2012) A linear complexity phasing method for thousands of genomes. Nat Methods. 9(2):179-81. doi: 10.1038/nmeth.1785 [LINK] O. Delaneau, JF. Zagury, J. Marchini (2013) Improved whole chromosome phasing for disease and population genetic studies. Nat Methods. 10(1):5-6. doi: 10.1038/nmeth.2307 [LINK] O. Delaneau, B. Howie, A. Cox, J-F. Zagury, J. Marchini (2013) Haplotype estimation using sequence reads. American Journal of Human Genetics 93 (4) 787-696 [LINK] J. O'Connell, D. Gurdasani, O. Delaneau, et al. (2014) A general approach for haplotype phasing across the full spectrum of relatedness. PLoS Genetics [LINK] O. Delaneau, J. Marchini, The 1000 Genomes Project Consortium (2014) Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel. Nature Communications 5 3934 [LINK] What's new? The latest software release is v2 (r900) Download and Licence SHAPEIT is freely available for academic use. To see rules for non-academic use see below. A LICENCE file is also included with each software download. Pre-compiled SHAPEIT binaries and example files can be downloaded from the links below. The latest software release is v2 (r900). We support only the most recent version. Platform GLIBC Type File Linux (x86_64) v2.12 Static shapeit.v2.r904.glibcv2.12.linux.tar.gz Linux (x86_64) v2.17 Static shapeit.v2.r904.glibcv2.17.linux.tar.gz To unpack the files on a Linux computer, use a command like this: tar -zxvf shapeit.v2.r900.glibcv2.17.linux.tar.gz This will create a directory of the same name as the downloaded file, minus the '.tgz' suffix. Inside this directory you will find an executable called shapeit, a LICENCE file, and an example/ directory that contains example data files. Commercial License Agreement A specific license must be obtained for any commercial or for-profit organization or for any web-diffusion purpose. For more information you should contact both Prof. Zagury (zagury@cnam.fr) and Prof. Marchini (marchini@stats.ox.ac.uk). Academic License Agreement The bioinformatics department of Conservatoire National des Arts et Metiers (CNAM) has developed a new algorithm for a faster computation of hidden Markov models, based on graph representations. This algorithm has been notably applied for the reconstruction of haplotypes from population genotypic data leading to the SHAPEIT software. This algorithm and its applications, including SHAPEIT, are patent pending. The Conservatoire National des Arts et Metiers (CNAM), Prof. Jean-Francois ZAGURY and his group of the bioinformatics department (the developers), give permission for you and your laboratory (Institution) to use SHAPEIT. CNAM and the developers allow researchers at your Institution to copy and modify SHAPEIT for internal, non-profit research purposes, on the following conditions: The SHAPEIT software remains at your Institution and is not published, distributed, or otherwise transferred or made available to other than Institution employees and students involved in research under your supervision. If you wish to obtain SHAPEIT for any commercial purposes or for diffusion through the internet, you will need to execute a separate licensing agreement with CNAM and pay a fee. This includes, but is not limited to, using SHAPEIT to provide services to outside parties for a fee. In that case please contact : Pr. Zagury, CNAM. Tel : 33 1 58 80 88 20 Mail : zagury at cnam.fr The software is distributed under the following conditions of use You retain in SHAPEIT and any modifications to SHAPEIT, the copyright, trademark, or other notices pertaining to SHAPEIT as provided by CNAM. You provide the developers with feedback on the use of SHAPEIT in your research, and that the Developers and CNAM are permitted to use any information you provide in making changes to the SHAPEIT software. All bug reports and technical questions shall be sent to the mail list here https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=OXSTATGEN You acknowledge that the developers, CNAM and its licensees may develop modifications to SHAPEIT that may be substantially similar to your modifications of SHAPEIT, and that the developers, CNAM and its licensees shall not be constrained in any way by you in CNAM's or its licensees' use or management of such modifications. You acknowledge the right of the developers and CNAM to prepare and publish modifications to SHAPEIT that may be substantially similar or functionally equivalent to your modifications and improvements, and if you obtain patent protection for any modification or improvement to SHAPEIT you agree not to allege or enjoin infringement of your patent by the Developers, CNAM or by any of CNAM's licensees obtaining modifications or improvements to SHAPEIT from the CNAM or the Developers. If utilisation of the SHAPEIT software results in outcomes which will be published, please specify the version of SHAPEIT you used and cite one of the following publications. O. Delaneau, J. Marchini, JF. Zagury (2012) A linear complexity phasing method for thousands of genomes. Nat Methods. 9(2):179-81. doi: 10.1038/nmeth.1785 O. Delaneau, JF. Zagury, J. Marchini (2013) Improved whole chromosome phasing for disease and population genetic studies. Nat Methods. 10(1):5-6. doi: 10.1038/nmeth.2307 O. Delaneau, B. Howie, A. Cox, J-F. Zagury, J. Marchini (2013) Haplotype estimation using sequence reads. American Journal of Human Genetics 93 (4) 787-696 J. O'Connell, D. Gurdasani, O. Delaneau, et al. (2014) A general approach for haplotype phasing across the full spectrum of relatedness. PLoS Genetics O. Delaneau, J. Marchini, The 1000 Genomes Project Consortium (2014) Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel. Nature Communications. Any risk associated with using the SHAPEIT software at your institution is with you and your Institution. SHAPEIT is experimental in nature and is made available as a research courtesy "AS IS," without obligation by CNAM to provide accompanying services or support. CNAM AND THE AUTHORS EXPRESSLY DISCLAIM ANY AND ALL WARRANTIES REGARDING THE SOFTWARE, WHETHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO WARRANTIES PERTAINING TO MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Contact To ask a question about SHAPEIT please subscribe to the OXSTATGEN mailing list. If you experience any problems with SHAPEIT, before emailing the mail list please: Make sure you are using the latest version since your problem may have already been fixed Check carefully the screen output and the log file. The problem may be reported here since they both contain many details about what is going on. If the problem persists, ask your question on the OXSTATGEN mailing list and we will be happy to answer! Getting started SHAPEIT is primarily a tool for inferring haplotypes from SNP genotypes. It takes as input a set of genotypes and a genetic map, and produces as output, either a single set of estimated haplotypes, or a haplotype graph that encapsulates the uncertainty about the underlying haplotypes. The example dataset (gwas.ped / gwas.map) provided with the SHAPEIT software package contains 1760 SNPs typed in X trios, Y duos and Z unrelateds. To phase this data using default settings, use this command line: shapeit --input-bed gwas.bed gwas.bim gwas.fam \ --input-map genetic_map.txt \ --output-max gwas.phased.haps gwas.phased.sample The meaning of the arguments are: --input-bed gwas.bed gwas.bim gwas.fam specifies the filenames and the format of the genotypes that need phasing. --input-map genetic_map.txt specifies the filename of the genetic map needed to improve phasing quality. --output-max gwas.phased.haps gwas.phased.sample specifies the files where to write the haplotypes estimated by SHAPEIT. Speeding up phasing using multi-threading We strongly advice to use the multi-threading capabilities of SHAPEIT. For example, to phase the example dataset by taking advantage of a 8-core computer, use this command line: shapeit --input-bed gwas.bed gwas.bim gwas.fam \ --input-map genetic_map.txt \ --output-max gwas.phased.haps gwas.phased.sample \ --thread 8 More details can be found here. Short option names You can specify most options using short option names and their arguments using implicit file extensions to reduce command line length. See the option list for more details. As a short example, the previous command line can now be rewritten as: shapeit -B gwas \ -M genetic_map.txt \ -O gwas.phased \ -T 8 This command line will have extactly the same behaviour as the previous one. Note that in the rest of the documentation, we will use this compact form when possible. Algorithm parameters Multi-threading (--thread) You can change the number of threads that SHAPEIT uses. This option was designed to speed up SHAPEIT when running on multi-core computers. For example, if you have a 4-core computer, then you can run the following command: shapeit -B gwas \ -M genetic_map.txt \ -O gwas.phased \ --thread 4 If there are 1,000 individuals in the dataset, SHAPEIT using 4 threads will phase 4 individuals simultaneously conditional upon the 1000 - 4 = 996 others individuals. It allows to divide the running time of SHAPEIT by almost 4 in this case, so 4 hours becomes 1! This option is recommended only if you have a large number of individuals in your dataset. Note that using a number of threads bigger than the number of available cores will not further decrease the running times. Iteration number (--burn, --prune and --main) You can change the default number of MCMC iterations with the following options: --burn X : where X is the number of the burn-in iterations. The default value is 7 which is sufficient for most application. Those iterations are used to find a good starting haplotype estimate. --prune Y : where Y is the number of iterations of the pruning stage. The default value is 8. Transition probabilities are stored across pruning iterations and used to prune the genotype graphs in order to get more parsimonious graphs. --main Z : where Z is the number of main iterations. The default value is 20. The final haplotype estimate is found by averaging across the main iterations. Hereafter, how to run SHAPEIT with 10 burn-in iterations, 10 pruning iterations and 50 main iterations: shapeit -B gwas \ -M genetic_map.txt \ -O gwas.phased \ --burn 10 \ --prune 10 \ --main 50 Seed specification (--seed) Since SHAPEIT is a stochastic algorithm, the only way to reproduce twice exactly the same results is to use the same seed value for the random number generator. You can specify the seed of the random number generator with the --seed option as follows: shapeit -B gwas \ -M genetic_map.txt \ -O gwas.phased \ --seed 123456789 By default, SHAPEIT initialises the seed by calling the stdlib function time(NULL). Note that the seed value can be found in the log file. Note also that you cannot reproduce two multi-threaded runs even if you specify the seed. Model parameters Conditioning states K (--states) You can increase accuracy of SHAPEIT by increasing the number of conditioning states on which haplotype estimation is based. By default, SHAPEIT uses 100 states per SNP across the dataset which gives good accuracy while maintaining reasonable running times. The complexity of the algorithm is linear with the number of conditioning states, so feel free to increase this number if your computational resources allow it. For instance, if you set this number to 200, it will take times longer than with 100 states. You can set the number of conditioning states with the --states option as follows: shapeit -B gwas \ -M genetic_map.txt \ -O gwas.phased \ --states 200 Window size W (--window) Another important parameter in the SHAPEIT2 model is the window size. For GWAS dataset, we recommend a value of 2Mb, which should be fine for most application. By default, SHAPEIT2 uses windows of ~2Mb on average. However, for sequence data, our experiments suggest that a value of 0.5Mb may give better results. The window size can be changed to 0.5Mb by using the --window option as follows: shapeit -B gwas \ -M genetic_map.txt \ -O gwas.phased \ --window 0.5 Using the graphical model of SHAPEIT version 1 (--model-version1) We have retained within the software the model from SHAPEIT v1 which uses a representation of the conditioning haplotypes within a graph. To use this model you can use the --model-version1 flag. In this case, the number of conditioning states is controlled by the --states options as described above. Therefore, to run SHAPEIT1 model on the example dataset, just use this command line: shapeit -B gwas \ -M genetic_map.txt \ -O gwas.phased \ --model-version1 Genetic map Recombination rates between SNPs are provided via a genetic map that specifies the genetic position of the SNPs in cM. To specify to SHAPEIT to use such a genetic map for phasing, you have to use the option --input-map as follows: shapeit -B gwas \ --input-map chr20.gmap.gz \ -O gwas.phased We recommend that most of the SNPs of your dataset should have a genetic position specified into this file. For the SNPs that don't have a genetic position, SHAPEIT internally determines its genetic position using linear interpolation. If the intersection between your SNP map (BIM, MAP or GEN file) and your genetic map (GMAP file) is poor, you should verify that the positions in both files are from the same build of the Human genome (b36 or b37 for example). For genetic map for human populations see here. If no genetic map is provided, SHAPEIT assumes a constant recombination rates between SNPs. This is unrealistic in human populations and thus may harm accuracy. The default value is 0.0004. You can change this value to 0.001 with the option --rho as follows: shapeit -B gwas \ -O gwas.phased \ --rho 0.001 Effective population size To specify the effective population size, use the option --effective-size. The default value for the effective population size is 15,000 which should be fine for most Human populations. A command line example that specifies an effective population size of 11,418: shapeit -B gwas \ -M genetic_map.txt \ -O gwas.phased \ --effective-size 11418 For information, the effective population sizes estimated for the HapMap phase II populations are: European (CEU) : 11,418 African (YRI) : 17,469 Asian (CHB+JPT) : 14,269 Input file options Filter SNPs and individuals with a poor calling rate It is recommended to remove SNPs and individuals with a poor calling rate since they slow down computations, increase memory usage and are likely to be error-prone. To remove all SNPs and individuals with a poor call rate, use Plink if your dataset is encoded in a PED/MAP or BED/BIM/FAM format and QCTOOL if it is encoded in the GEN/SAMPLE format. Note that SHAPEIT throws an error if a completely missing SNP or individual (call rate = 0%) is included in the dataset. Split the dataset by chromosome You must split the dataset by chromosomes prior to phasing since SHAPEIT phases only one chromosome at a time. To do so, you can use the following Plink command for example: for chr in $(seq 1 22); do plink --file myGwasData \ --chr $chr \ --recode \ --out myGwasData.chr$chr ; done You can also proceed as follows with a GEN/SAMPLE file: for chr in $(seq 1 22); do cat myGwasData.gen | awk -v c=$chr '{ if ($1 == c) { print $0 } }' > chr$chr\.unphased.gen ; done Input file formats You can specify the unphased genotypes to SHAPEIT with 4 different file formats: Plink PED/MAP format: The file format is described here. Use as follows the --input-ped or -P options to specify unphased genotypes in this format: shapeit --input-ped gwas.ped gwas.map \ -M genetic_map.txt \ -O gwas.phased In this file format, SHAPEIT considers "0" as missing data. To change the missing data character to "N" for example, use --missing-code options as follows: shapeit --input-ped gwas.ped gwas.map \ -M genetic_map.txt \ -O gwas.phased \ --missing-code N Plink BED/BIM/FAM format: The file format is described here. Use as follows the --input-bed or -B options to specify unphased genotypes in this format: shapeit --input-bed gwas.bed gwas.bim gwas.fam \ -M genetic_map.txt \ -O gwas.phased Oxford GEN/SAMPLE format: The file format is described here. Use as follows the --input-gen or -G options to specify unphased genotypes in this format: shapeit --input-gen gwas.gen gwas.sample \ -M genetic_map.txt \ -O gwas.phased In this file format, you may have some probabilistic genotypes as shown in this small example: 7 SNP1 123 A G 1 0 0 0 1 0 0.1 0.9 0 0 1 0 7 SNP2 456 T C 0 1 0 0 0 1 1 0 0 0 1 0 7 SNP3 789 A T 0.2 0.3 0.5 0 1 0 0 1 0 0 0 1 SHAPEIT does not take into account uncertainty in the genotypes. Instead, it fixes the genotypes when their maximum probability exceeds a given threshold and leaves as missing the others. By default, SHAPEIT uses a threshold of 0.9 which means that the example dataset described above is considered by SHAPEIT as: 7 SNP1 123 A G 1 0 0 0 1 0 0 1 0 0 1 0 7 SNP2 456 T C 0 1 0 0 0 1 1 0 0 0 1 0 7 SNP3 789 A T 0 0 0 0 1 0 0 1 0 0 0 1 Note that IND1 at SNP3 is missing since none of the genotype probabilities is above the threshold of 0.9. You can modify the threshold value to 0.8 for example by adding the option --input-thr to the SHAPEIT command line as follows: shapeit --input-gen gwas \ -M genetic_map.txt \ -O gwas.phased \ --input-thr 0.8 VCF Variant Call Format: The file format is described here. Use as follows the --input-vcf or -V options to specify unphased genotypes in this format: shapeit --input-vcf gwas.vcf \ -M genetic_map.txt \ -O gwas.phased Implicit filenames If all input files have their default file extensions, that is: *.ped and *.map for PED/MAP format, *.bed, *.bim and *.fam for BED/BIM/FAM format, *.gen and *.sample for GEN/SAMPLE format. *.vcf for VCF format. Then, they can be specified to SHAPEIT using the following shorter forms: shapeit --input-ped gwas -M genetic_map.txt -O gwas.phased shapeit --input-bed gwas -M genetic_map.txt -O gwas.phased shapeit --input-gen gwas -M genetic_map.txt -O gwas.phased shapeit --input-vcf gwas -M genetic_map.txt -O gwas.phased Read GZIP or BZIP2 input files SHAPEIT can read all input files compressed using GZIP or BZIP2 as soon as the correct file extension is specified. SHAPEIT recognises a GZIP file if the corresponding filename is of the form filename.gz and a BZIP2 file if the filename is of the form filename.bz2. Given the correct file extension, SHAPEIT will internally uncompress the file in order to read it. For example, if your GEN and GMAP files are gzipped, you can read them using the command: shapeit -G gwas.gen.gz gwas.sample \ -M genetic_map.txt \ -O gwas.phased Data checks SHAPEIT automaltically checks the input genotype data prior to phasing. It throws warnings when it detects: Individuals or sites with a rate of missing data higher than 5%. Monomorphic or singleton SNPs since they are not informative for phasing And errors when an individual or a site has a missing data rate of 100%. In addition, SHAPEIT generates 2 additional log files to facilitate the detection of any problems in the input data. More specifically, if the log file is myLogFile.log, then the 2 following additional files will be generated: myLogFile.ind.mm contains the individual missing rates (N lines). myLogFile.snp.mm contains the SNP missing rates and allele frequencies (L lines). Here is an example of the myLogFile.ind.mm content: id missing HG00098 54 HG00105 0 HG00107 3 HG00115 2 HG00132 1 HG00145 7 HG00153 5 HG00157 0 HG00181 1 ... The first column gives the individual ID and the second the number of sites with missing data the individual contains. Here is an example of the myLogFile.snp.mm: id position A B missing count_A_main count_B_main count_A_founder count_B_founder SNP20-8948779 9000779 T C 0 132 1930 104 1646 rs6140828 9000854 C T 0 568 1494 464 1286 SNP20-8950466 9002466 C T 0 921 1141 817 933 SNP20-8951666 9003666 G A 2 127 1931 97 1649 SNP20-8956851 9008851 A G 0 5 2057 4 1746 rs8118170 9009163 G A 0 62 2000 49 1701 rs6108236 9009406 C A 0 170 1892 137 1613 SNP20-8957905 9009905 G T 0 155 1907 125 1625 SNP20-8959047 9011047 G A 2 683 1375 562 1184 ... The columns are: id: The SNP id (ex: rs id) position: The SNP physical position in bp A: The first allele (ref allele in VCF) B: The second alleles (alt allele in VCF) missing: The missing genotype count. That is the number of samples with missing data at this site. count_A_main: The allele A count. The number of haplotypes carrying the A allele. count_B_main: The allele B count. The number of haplotypes carrying the B allele. count_A_founder: The allele A count in founder haplotypes only. count_B_founder: The allele B count in founder haplotypes only. In this file, you should have: N = missing + (count_A_main + count_B_main)/2 The two files can be easily post-processed in R using the following commands: ind_data <- read.table("myLogFile.ind.mm", hea=T) snp_data <- read.table("myLogFile.snp.mm", hea=T) Importantly, you can generate these two files without performing any phasing. To do so, just use the following SHAPEIT command: shapeit -check -B gwas --output-log gwas.stats It will generate the files gwas.stats.ind.mm and gwas.stats.snp.mm. Subsets of unphased genotypes Read the genotypes in a window To only phase the variant sites located in a particular window, use the two following options: --input-from : Lower bound of the window in bp. The default value is 0. --input-to : Upper bound of the window in bp. The default value is 1e9. Here is an example to phase sites only within the range [9.1, 9.6] Mb: shapeit -B gwas \ -M genetic_map.txt \ --input-from 9.1e6 \ --input-to 9.6e6 \ -O gwas.phased Excluding/Including SNPs To exclude some particular variant sites from the phasing process, use the option --exclude-snp together with a file listing the site position you want to exclude. Example: shapeit -B gwas \ -M genetic_map.txt \ --exclude-snp gwas.subset.site \ -O gwas.phased Where each line of the file gwas.subset.site corresponds to a physical position that has to be excluded. Example: 9000779 9000854 9002466 9003666 9008851 9009163 9009406 ... Similarly, you can use the option --include-snp to only phase a particular set of variants: shapeit -B gwas \ -M genetic_map.txt \ --include-snp gwas.subset.site \ -O gwas.phased Excluding/Including individuals To exclude individuals from the phasing process, use the option --exclude-ind together with a file listing the IDs of the individuals you want to exclude. Example: shapeit -B gwas \ -M genetic_map.txt \ --exclude-ind gwas.subset.inds \ -O gwas.phased Where each line of the file gwas.subset.inds corresponds to an individual that needs to be excluded. Example: HG00098 HG00105 HG00107 HG00115 HG00132 HG00145 HG00153 HG00157 ... Similarly, you can use the option --include-ind to only phase a particular set of samples: shapeit -B gwas \ -M genetic_map.txt \ --include-ind gwas.subset.inds \ -O gwas.phased Output file options Log file Each time SHAPEIT is run, the screen output is copied in a log file. By default, the name of the log file is defined as being: shapeit_[date]_[time]_UUID.log For example, a log file shapeit_23112012_14h11m23s_51b01c25-336f-4eef-98bc-c1c5ac97a4cb.log tells that SHAPEIT was run the 23th of November at 14:11. The UUID (Universal Unique IDentifier) avoids any collision between log filenames. You can specify your own log file name by using the --output-log option: shapeit -B gwas \ -M genetic_map.txt \ -O gwas.phased \ --output-log gwas.phased.log Output best haplotypes in HAPS/SAMPLE format To output the most likely pairs of haplotypes for each sample in Impute2 format, use the option --output-max: shapeit -B gwas \ -M genetic_map.txt \ --output-max gwas.phased.haps gwas.phased.sample Or in short form: shapeit -B gwas \ -M genetic_map.txt \ -O gwas.phased Haplotypes can also be compressed using GZIP or BZIP2 just by adding correct file extension on the filenames: shapeit -B gwas \ -M genetic_map.txt \ -O gwas.phased.haps.bz2 gwas.phased.sample Output haplotype graphs to capture phasing uncertainty To output the haplotype graphs that store phasing uncertainty in a compact way, use the option --output-graph: shapeit -B gwas \ -M genetic_map.txt \ --output-graph gwas.graph Then, you can extract the most likely pairs of haplotypes from the graphs by using: shapeit -convert \ --input-graph gwas.graph \ -O gwas.phased This generates the two files gwas.phased.haps and gwas.phased.sample. Alternatively, you can just sample a pair of haplotypes for each indivudual using the --output-sample option: shapeit -convert \ --input-graph chr20.graph \ --output-sample gwas.phased Subsets of phased haplotypes SHAPEIT can extract subsets (sites/samples) of haplotypes specified in HAPS/SAMPLE format. This option builds on the following base command line: shapeit -convert \ --input-haps gwas.phased \ --output-haps gwas.phased.subset Then, to exclude/include sites/samples from gwas.phased.haps and gwas.phased.sample, use it in combination of the following options: --output-from INT: Output haplotypes only for SNPs with positions >= INT (Default = 0). --output-to INT: Output haplotypes only for SNPs with positions <= than INT (Default = 10e9). --exclude-ind FILE.inds: Output haplotypes only for individuals not in FILE. --include-ind FILE.inds: Output haplotypes only for individuals in FILE. --exclude-snp FILE.snps: Output haplotypes only for SNPs not in FILE. --include-snp FILE.snps: Output haplotypes only for SNPs in FILE. For example, this command extracts the haplotypes of the individuals listed in gwas.subset.inds only at sites included in gwas.subset.site: shapeit -convert \ --input-haps gwas.phased \ --output-haps gwas.phased.subset \ --include-ind gwas.subset.inds \ --include-ind gwas.subset.site Where in the files gwas.subset.inds and gwas.subset.site have the same format than those described above in the section "Subsets of unphased genotypes". Convert the SHAPEIT haplotypes to other formats SHAPEIT can convert haplotypes in HAPS/SAMPLE format to two different formats. To convert haplotypes to the HAP/LEGEND/SAMPLE usually used for reference panels of haplotypes in Impute2: shapeit -convert \ --input-haps gwas.phased \ --output-ref gwas.phased.hap gwas.phased.leg gwas.phased.sam Alternatively, you can convert the haplotypes to VCF format using: shapeit -convert \ --input-haps gwas.phased \ --output-vcf gwas.phased.vcf Prephasing for imputation A major use of phasing is haplotype estimation of GWAS samples in order to speed up imputation from large reference panel of haplotypes such as 1000 Genomes. The current recommendation is that GWAS samples are first 'pre-phased' using the most accurate method available. The subsequent imputation step (which involves imputing alleles from one set of haplotypes into another set) is fast. As new haplotype reference sets become available imputation can be re-run much more efficiently. The approach we recommend is: 1. Phase the GWAS samples with SHAPEIT 2. Impute non-typed SNPs into SHAPEIT haplotypes with IMPUTE2. Step1: Alignment of the SNPs SNP positions in build 37 The most recent 1,000 genomes haplotypes are defined at sites that use build37 coordinate system of the Human Genome. Thus., you have to make sure that your GWAS sites are also in the same build. If it is not the case, you can use the UCSC liftOver tool to perform the conversion to build37 coordinates. Strand alignment This is a crucial step of prephasing/imputation to make sure that the GWAS dataset is well aligned with the reference panel of haplotypes. You can check SNP alignment in two steps with Plink (step1 and step2) or with GTOOL. You can also check strand alignment using SHAPEIT as described in detail in the section "Using Reference panel". Step2: Phasing the GWAS samples Once you GWAS dataset is correctly aligned to the reference panel, we strongly recommend to phase each chromosome in a single run instead of making chunks. It makes the procedure much easier and increase downstream imputation quality. To do so, use the following SHAPEIT command line: shapeit -B gwas \ -M genetic_map.txt \ -O gwas.phased Notes for cluster users: Suppose that you want to prephase your GWAS on a cluster where each node has X CPU cores. In this case, the approach we recommend is to first reserve a complete cluster node for each SHAPEIT job, and then to run each SHAPEIT job with X threads to fully load the CPU-cores of a node. Notes for server users: Suppose that you want to prephase your GWAS on a server with Y CPU cores (usually, Y=8,12,16,32,48 or 64). In this case, the approach we recommend is to phase the chromosomes in decreasing size order (the biggest one to the smallest one, that is from chr1 to chr22 for Humans). Then, to use several parallel SHAPEIT jobs (J), each using several threads (T) such that the server is fully loaded (JxT=#CPU-cores). Specifically, we recommend: Y #jobs (J) #threads (T) 8 1 8 12 2 6 16 2 8 32 4 8 48 8 6 64 8 8 To set up this approach, xargs is a really useful Linux command. To use it, you can proceed in 2 steps: Step1: Generate all the command lines (1 per job): for i in $(seq 1 22); do echo "-B gwas.chr$i -M gmap.chr$i\.txt -O gwas.chr$i\.phased -T T" >> myCommands.txt done Step2: Run all the jobs using xargs: cat myCommands.txt | xargs -PJ -n8 shapeit & Then, you have J jobs running in parallel on your server, each one using T threads, and your server is loaded at 100%. Step3: Imputation of the GWAS samples Once SHAPEIT has produced haplotype estimates, use IMPUTE2 to impute untyped genotypes with the latest release of the 1000 Genomes haplotypes. On the previous example, this is done for the chunk [9.1, 9.6] Mb with the command: impute2 -use_prephased_g \ -known_haps_g gwas.phased.haps \ -h reference.haplotypes.gz \ -l reference.legend.gz \ -m genetic_map.txt \ -int 9.1e6 9.6e6 \ -Ne 20000 \ -o gwas.from9.1e6_to9.6e6.imputed We strongly advice to use the latest IMPUTE2 version available here. Several comments on the previous command line: The flag -use_prephased_g is required to set IMPUTE2 in the prephasing mode Prephased GWAS haplotypes are specified using -known_haps_g The option -int 9.1e6 9.6e6 is used to specify the region to be imputed. If you are imputing across the genome then you need to split the genome up into small chunks, and impute each one separately using a separate job. Please refer to the IMPUTE2 website here for more information on how to do this. Note that if you deal with X chromsome imputation, you should carry out SHAPEIT pre-phasing and IMPUTE2 imputation using the option --chrX for both. Using family information We provide two sets of options for estimating haplotypes when there is relatedness between samples Very general functionality to estimate haplotypes in any sized pedigrees More focussed functionality that allows phasing of mother-father-child trios and parent-child duos. Phasing in general pedigrees In the following paper we investigated the use of SHAPEIT for phasing related samples. J. O'Connell, D. Gurdasani, O. Delaneau, et al. (2014) A general approach for haplotype phasing across the full spectrum of relatedness. PLoS Genetics [LINK] The conclusions of this paper are that SHAPEIT can be an accurate method of phasing across a whole spectrum of relatedness, from explicitely related families or pedigrees, through closely related samples such as those found in isolated populations, to unrelated samples. Our new approach has two steps. Firstly, all the samples are phased ignoring the pedigree information. This phasing is very accurate despite not using the pedigree information. The reason is that the sharing of long-range haplotypes between related samples actually helps the phasing within the MCMC algorithm that SHAPEIT implements. Secondly, we have developed a new method (called duoHMM) that takes the estimated haplotype from the first step, and combines it with the known pedigree information to further improve the phasing. This method will ensure that the haplotypes are consistent with the pedigree structure, leading to increased accuracy. To carry out both these steps internally within SHAPEIT you should simply use the --duohmm. This option will report any Mendel errors in the pedigree. Such genotypes will be set to missing, and then imputed during the phasing. You should be aware of this as it means some genotypes may change between input and output datasets. We have also found it to be (slightly) advantageous to use a larger than default window when large amounts of IBD sharing are present hence we use the -W 5 here shapeit -B gwas-nomendel \ -M genetic_map.txt \ --duohmm \ -W 5 \ --output-max gwas-duohmm \ --output-graph gwas-duohmm.graph A side product of the duoHMM approach is that it can be used to detect recombination events and non-mendelian genotyping errors. If this is the type of analysis that you want to carryout then we provide a stand alone implementation of the duoHMM method here. Phasing of mother-father-child trios and parent-child duos You can specify parent-child relationships between individuals as in Plink, that is using using the PED/MAP and BED/BIM/FAM file formats. Note that conversely to Plink, SHAPEIT does not use the family ID to build up the pedigrees, which explains why it is important that each sample has an unique ID. You can also use the GEN/SAMPLE format to specify parent-child as long as you add two additional columns in the SAMPLE file that specify the relationships between individuals. These two additional columns need to be called ID_father and ID_mother in the header line of the SAMPLE file in order for SHAPEIT to recognize them. Example: ID_1 ID_2 missing ID_father ID_mother 0 0 0 0 0 IND1 IND1 0 0 0 IND2 IND2 0 0 0 TRIOF TRIOF 0 0 0 TRIOM TRIOM 0 0 0 TRIOC TRIOC 0 TRIOF TRIOM DUOM DUOM 0 0 0 DUOC DUOC 0 0 DUOM Then, SHAPEIT builds internally the mother-father-child trios using the IDs of the father and mother of each individual. Additionally to the basic data checks done on site frequencies and calling rates, SHAPEIT also checks Mendel errors when dealing with family data. To run the checks, do: shapeit -check \ -B gwas \ --output-log gwas.checks It throws warnings when high rates of Mendel errors are detected at the family or site level and generates 2 log files to easily spot any problems: gwas.checks.ind.me contains the Mendel errors at the individual level (N lines). gwas.checks.snp.me contains the Mendel errors at the SNP level (L lines). Hereafter an example of gwas.checks.snp.me: id position mendel total SNP20-8948779 9000779 0 156 rs6140828 9000854 0 156 SNP20-8950466 9002466 0 156 SNP20-8951666 9003666 0 156 SNP20-8956851 9008851 0 156 rs8118170 9009163 0 156 rs6108236 9009406 0 156 ... The columns are variant site ID and position, number of mendel errors, total number of duos-trios. If you divide the 3rd column by the last, you get the Mendel error rate. An example of gwas.checks.ind.me: id father_id father_mendel mother_id mother_mendel father_mother_mende famID HG00098 0 0 0 0 0 0 HG00105 0 0 0 0 0 1 HG00107 0 0 0 0 0 2 ... HG01502 HG01500 1 HG01501 1 2 157 HG01503 0 0 0 0 0 158 HG01504 0 0 0 0 0 158 HG01505 HG01503 0 HG01504 1 1 158 HG01506 0 0 0 0 0 159 HG01507 0 0 0 0 0 159 ... Where the columns are: id: The individual id father_id: The father id for this individual father_mendel: The number of Mendel error when considering only the duo father/child mother_id: The mother id for this individual mother_mendel: The number of Mendel error when considering only the duo mother/child father_mother_mendel: The number of Mendel error when considering the full trio father/mother/child famID: An unique id internally assigned by SHAPEIT to each distinct family (useful to determine how many unrelated families there are). These files are quite useful to filter out badly called variant sites and identify spurious trios or duos. To do so, you can use R to open the 2 log files as follows: ind_mendel <- read.table("gwas.checks.ind.me", hea=T) snp_mendel <- read.table("gwas.checks.snp.me:", hea=T) Then, you can filter out SNPs according to the Mendel error rate mendel/total. And you can easily identify spurious duos and trios by only examining the father_mother_mendel column; a large value suggesting that there is an error in the father/mother assignment to the child. You can also examine father_mendel and mother_mendel columns to determine if one of them or both were incorrectly set as parent. As soon as family information is correctly specified, SHAPEIT automatically uses it when running with the standard command line that follows: shapeit -B gwas \ -M genetic_map.txt \ -O gwas.phased If you want to disable this feature, use the --noped flag: shapeit -B gwas \ -M genetic_map.txt \ -O gwas.phased \ --noped This makes SHAPEIT consider all samples in the data as unrelated. Phasing with a reference panel SHAPEIT can use publicly available reference panel of haplotypes, such as the one provided by the 1,000 Genomes project, to help phasing. This approach is particularly useful when phasing typically less than 100 individuals. Step1: Download the 1,000 Genomes phase 1 reference panel of haplotypes You can go here to download the 1,000 Genomes phase1 haplotype set in the correct format. File format is described in details here. Step2: Alignment of the SNPs between the GWAS dataset and the reference panel Let assume you need to phase a GWAS dataset defined on a set of SNP A using a reference panel of haplotypes defined on a set of SNP B. SHAPEIT can only phase SNPs that are in both sets A and B with the same REF/ALT alleles. Otherwise: Case1: It throws an error when SNPs found in both A and B have incompatible allele types (strand issues) Case2: It automatically removes SNPs that are in B and not in A Case3: It throws an error when it finds SNPs that are in A but not in B Assuming you use 1,000 Genomes reference panel; these constraints should not be a major problem since most of the chip SNPs are expected to be found. When specifying a reference panel, SHAPEIT proceeds with several checks in order to make sure that alignment between SNPs in A and B is OK. When it encounters any problems; two additional log files are generated describing those. If the main log file is myLogFile.log, then the two following files are generated: myLogFile.snp.strand that describes in detail all Case1 and Case3 problems found. myLogFile.snp.strand.exclude that gives a list of the physical positions of all Case1 and Case3 problems found for easy exclusion using --exclude-snp option. A data check can be run without carrying out phasing using: shapeit -check \ -B gwas \ -M genetic_map.txt \ --input-ref reference.haplotypes.gz reference.legend.gz reference.sample \ --output-log gwas.alignments A content example of the file gwas.alignments.strand when there is an issue: type pos main_id main_A main_B main_flippable ref_id ref_A ref_B ref_flippable Missing Missing 9065164 SNP20-9013164 G T 1 NA NA NA 1 Missing Missing 9068825 SNP20-9016825 T C 1 NA NA NA 1 Missing Missing 9106084 SNP20-9054084 G T 1 NA NA NA 1 Missing Missing 9226397 SNP20-9174397 C T 1 NA NA NA 1 Missing Missing 9288143 SNP20-9236143 C T 1 NA NA NA 1 Missing Missing 9326235 SNP20-9274235 C T 1 NA NA NA 1 Strand Strand 9342728 SNP20-9290728 G T 1 rs76626604 T C 0 Missing Missing 9505264 SNP20-9453264 A C 1 NA NA NA 1 Missing Missing 9533375 SNP20-9481375 C T 1 NA NA NA 1 This file contains one line for each alignment problem found between A and B. The columns are the following: type: the type of the alignment problem. It can be missing for Case3 problems or strand for Case1 problems pos: the physical position of the SNP that has an alignment problem main_id: SNP id in the GWAS dataset main_A: first allele in the GWAS dataset main_B: second allele in the GWAS dataset ref_id: SNP id in the reference panel ref_A: first allele in the reference panel ref_B: second allele in the reference panel A case1 and case3 are highlighted in red and blue respectively in the above example. The case3 example tells you that SNP20-9174397 at position 9226397 is not in the reference panel. The case1example tells that SNP at position 9342728 has different allele types between the gwas and reference panels. You can input this file into R for analysis using: ref_data <- read.table("chr20.alignments.snp.strand", hea=T) In addition to gwas.alignments.strand, another file gwas.alignments.strand.exclude is provided to easily remove the problematic sites from the analysis. Hereafter the file corresponding to the above example: 9065164 9068825 9106084 9226397 9288143 9326235 9342728 9505264 9533375 This file is basically the second column of the previous file without including any header. It can be used in all subsequent SHAPEIT command lines using the --exclude-snp option: shapeit -check \ -B gwas \ -M genetic_map.txt \ --input-ref reference.haplotypes.gz reference.legend.gz reference.sample \ --exclude-snp gwas.alignments.strand.exclude Step3A: Phasing the GWAS dataset using the reference panel of haplotypes Once solved the site alignment issues between the reference and the GWAS panels, you can then proceed with the phasing in itself. To do so, use this command: shapeit -B gwas \ -M genetic_map.txt \ --input-ref reference.haplotypes.gz reference.legend.gz reference.sample \ -O gwas.phased.with.ref All standard phasing options described in the algorithm parameters and model parameters sections are available if you want to to tweak the phasing step. If you want to use only a subset of the haplotypes of the reference panel in the phasing process, you can do it by using the two options --exclude-grp and --include-grp. As described here, the SAMPLE file of the reference panel has two additional columns population and group. You can only use a subset of the reference haplotypes by filtering on these columns. For example, if you want to only use the European reference haplotypes (EUR), you have to first generate an inclusion list containing the groups or populations you want to include: echo "EUR" > group.list And then, you can run SHAPEIT like this: shapeit -B gwas \ -M genetic_map.txt \ --input-ref reference.haplotypes.gz reference.legend.gz reference.sample \ --exclude-snp gwas.alignments.strand.exclude \ --include-grp group.list \ -O gwas.phased.with.ref Step3B: Using the reference panel of haplotypes and no MCMC iterations Alternatively to the previous approaches that relies on MCMC iterations, you can phase each GWAS sample in turn without using any MCMC iteration. To do so, SHAPEIT uses the reference haplotypes as conditioning set and ignores the GWAS haplotypes. We recommend to use this approach only if you have to phase a small number of GWAS sample (typically less than 10). One of the advantage of this approach is that it greatly speeds up the computations. To disable the MCMC algorithm, use the --no-mcmc flag as follows: shapeit -B gwas \ -M genetic_map.txt \ --input-ref reference.haplotypes.gz reference.legend.gz reference.sample \ --exclude-snp gwas.alignments.strand.exclude \ --include-grp group.list \ -O gwas.phased.with.ref \ --no-mcmc Phasing X chromosomes Input file formats for X chromosome data To phase X chromosomes, you need to specify the sex of each sample: A male is coded as "1" A female is coded as "2". In the PED/FAM file format, this is encoded as follows (in purple): IND247 IND247 0 0 1 1 IND248 IND248 0 0 1 1 IND249 IND249 0 0 2 1 IND250 IND250 0 0 2 1 In the SAMPLE file format, an additional column is needed to specify sex. Itneeds to be called sex or gender in the header as shown below: ID_1 ID_2 missing sex 0 0 0 0 IND247 IND247 0 1 IND248 IND248 0 1 IND249 IND249 0 2 IND250 IND250 0 2 Note that SHAPEIT considers all individuals as unrelated when phasing X chromosomes. In theory, it is possible to extend the SHAPEIT model to use both information (ploidy and family), but it has not been implemented yet. Step1: Input data checks As a first step, SHAPEIT checks for haploid heterozygous genotypes in males. A haploid heterozygous is a male genotype that is heterozygous, which is likely an error given the haploid nature of the male X chromosome. Usually, it is due to incorrect sex assignment. SHAPEIT will throw warnings when it detects SNPs or males with high haploid heterozygous proportion (> 1%). Note also that SHAPEIT haploid heterozygous as missing data. In addition, it generates two additional log files that help the user to spot any problems. If the main log file is gwas.phased.log, they are: gwas.phased.snp.hh reports the number of haploid heterozygous per variant site. gwas.phased.ind.hh reports the number of haploid heterozygous per sample. You can run these checks prior to phasing uing: shapeit -check \ -B gwas.chrX \ --output-log gwas.chrX.log \ --chrX An example of the file gwas.phased.snp.hh: id pos haploid_hets n_males rs16985845 10003130 0 20 rs5978365 10006452 1 20 rs5979244 10020945 0 20 rs7880983 10022070 0 20 ... The first columns gives the site ID, the second the physical position, the third the number of haploid heterozygotes found and the fourth the total number of males with genotypes at that site. An example of the file gwas.phased.ind.hh: id male haploid_hets n_snp INDIV_01 0 0 194 INDIV_02 1 2 194 INDIV_03 0 0 194 INDIV_04 1 0 194 ... The columns are individual ID, sex (1 for male), number of haploid heterozygotes and total number of non-missing sites for this sample. Step2.1: Phasing the data without reference panel To set up SHAPEIT in X chromsome phasing mode, just add the --chrX flag to the command line: shapeit -B gwas.chrX \ -M genetic_map.chrX.txt \ -O gwas.chrX.phased \ --chrX It only phases female samples and imputes missing data in male samples. Step2.2: Phasing the data with a reference panel When using a reference panels for X chromsomes, you need some adjustments in the SAMPLE file. Specifically, you need to make sure there is a column sex (1=male / 2=female) in the reference sample file: sample population group sex HG00096 GBR EUR 1 HG00097 GBR EUR 2 HG00099 GBR EUR 2 HG00100 GBR EUR 2 HG00101 GBR EUR 1 HG00102 GBR EUR 2 ... Then, you can phase the data: shapeit -B gwas.chrX \ -M genetic_map.chrX.txt \ -O gwas.chrX.phased \ --chrX \ --input-ref reference.chrX.haplotypes.gz reference.chrX.legend.gz reference.chrX.sample Note that SHAPEIT systematically ignores the second haplotype for males in the reference panel. Note also that it duplicates the male haplotypes in the output for consistency of the HAPS/SAMPLE file format. Read aware phasing We recently published the following paper which presented an extension of the SHAPEIT model that takes advantage of the phase information contained in sequencing reads: O. Delaneau, B. Howie, A. Cox, J-F. Zagury, J. Marchini (2013) Haplotype estimation using sequence reads. American Journal of Human Genetics 93 (4) 787-696 We showed that this extended model leads to improvements in the quality of the resulting haplotypes, especially at rare variants. To incorporate the phase information contained in sequencing into the phasing process, we need to proceed in two steps. Step1: Extract phase informative reads A phase informative read (PIR) is a sequencing read that span at least 2 heterozyguous sites. In the following, we require first that the sequence data is stored in BAM files and second that the genotype data to be phased is in VCF format. We developed a small tool, extractPIRs, to extract the PIRs from BAM files that relies on the samtools API for efficiency. You can download it from here: Linux (x86_64) Static Executable extractPIRs.v1.r68.x86_64.tgz Source code extractPIRs.tgz To illustrate how this process works, lets apply it on the two following example files provided with the SHAPEIT package: sequence.multiple.vcf.gz is the VCF containing the genotype data to be phased from one chromosome (1,022 variants x 322 samples). NA12891.bam and NA12892.bam contain the sequence data for two particular samples in the VCF file. The goal here is to improve the phasing quality of the two sample NA12891 and NA12892 using the available PIRs contained in their sequence data. To extract the PIRs contained in the BAM files, you need first to build a text file that lists all the available BAM files together with some informations that extractPIRs needs. Specifically, each row in this text file corresponds to a particular BAM file to be considered and contains three columns that are either TAB or SPACE delimited: The first column gives the sample ID corresponding to the BAM file. This sample ID has to match one of the sample ID in the VCF in order for extractPIRs to properly map sequencing reads to samples. The second gives the absolute path of the BAM file. This basically tells to extractPIRs where the BAM file is physically located on the computer. The third gives the "Reference sequence" (field @SQ in BAM header) to be considered by extractPIRs. In practice, it tells to extractPIRs to only consider sequencing reads that were mapped to a given chromosome. This is required since phasing works one chromosome at a time, and thus only requires the PIRs for one chromsome at a time. For example, we have included the file in the example\ directory bamlist: NA12891 NA12891.bam 20 NA12892 NA12892.bam 20 Once this is done, you can then run extractPIRs as follows: extractPIRs --bam bamlist \ --vcf sequence.multiple.vcf.gz \ --out myPIRsList NOTE : the VCF file should contain SNPs from only one chromosome. It will produce a text file myPIRsList that contains all the PIRs found in the two BAM files i.e. that span at least two heterozyguous sites in the VCF sequence.multiple.vcf.gz. This file has a specific format designed to feed SHAPEIT in a second processing step. It looks like this: MAP 1022 10000107 T C 10000117 C T ... 10248574 G A 10249534 A G NA12891 88 910 G 32 912 C 26 841 A 30 842 A 31 ... 325 A 30 326 C 27 341 A 28 342 G 30 NA12892 357 450 T 27 452 A 28 453 T 31 207 C 28 208 G 28 ... 394 T 15 395 A 23 396 C 27 12 G 24 13 T 27 The first section (in blue) gives the map of sites onto which PIRs were mapped. It says that there are 1,022 sites in the map and each line that follows gives the physical position and the REF/ALT alleles at that site. The next sections give the PIRs found for each sample. NA12891 (in purple) and NA12892 (in red) have 88 and 357 PIRs, respectively. Each PIR is described as a line in this file. And each heterozyguous site covered is described by a triplet (index of the het in the map, allele observed, base quality). For instance, the last PIR shown in this file covers two hets: the 12th and the 13th sites in the map with alleles G and T respectively, that were sequenced with base qualities 24 and 27. Here, four important points need to be highlighted: Only PIRs at Single Nucleotide Polymorphisms (SNP) are considedered by extractPIRs and SHAPEIT. PIRs at short bi-allelic indels are just ignored. When a sequencing read contains a nucleotide at a heterozyguous site that does not correspond to any of the REF and ALT alleles specified in the VCF file, this nucleotide is just ignored. In other words, only nucleotides in reads that match the REF/ALT alleles in the VCF are considered. You can tell to the program to only consider sequencing read with a mapping quality higher than a given value using the option --read-quality. By default, the minimal mapping quality of a read in order to be considered by extractPIR is 10. You can increase it to 20 by using --read-quality 20 for instance. You can tell to the program to only consider bases with a base quality higher than a given value specified using the option --base-quality. By default, the minimal quality of a base in order to be considered by extractPIRs is 13. You can increase it to 20 by using --base-quality 20 for instance. Here is an example of a custom run of extractPIRs: extractPIRs --bam bamlist \ --vcf sequence.multiple.vcf.gz \ --out myPIRsList \ --base-quality 20 \ --read-quality 20 Step2: phase the VCF using the extracted PIRs Now that you have extracted the PIRs from your BAM files, let use them to phase the VCF as shown below: shapeit -assemble \ --input-vcf sequence.multiple.vcf.gz \ --input-pir myPIRsList \ -O myHaplotypeData This phases the genotype data using the standard MCMC iterative procedure together with the read aware phasing module. Note the use of the -assemble flag to tell to SHAPEIT to use the read aware model and the --input-pir option to specify the set of PIRs to be used (and previously extracted by extractPIRs). This simple example uses all paramaters by default, but you can also tune the MCMC run and the HMM using options described here and here. Compared to a standard run of SHAPEIT, not much changes on the screen apart from this section that describes how SHAPEIT used the PIRs provided: Assembling reads on graphs [382/382] * 445 / 445 PIRs used * 2 / 382 samples covered by PIRs * 63 / 442 heterozygotes covered by PIRs * 107 / 146 graph segments covered by PIRs In brief, it tells you that: 445 PIRs out of the 445 provided are used in the phasing process. 2 samples out of 382 have PIRs. 63 heterozygous genotypes out of 442 possible ones are covered by PIRs. The number 442 corresponds to the total number of hets found in the 2 covered samples. 107 out of 146 graph segments are covered bt PIRs. The graphs are the compact representations of the space of possible haplotypes used by SHAPEIT. In the particular case of just a few samples to be phased, which usually happens in high coverage sequencing studies, you cannot really use the MCMC procedure and need to use an external data that captures the LD patterns used for inference. To do so, you can use the 1000 Genomes haplotypes. Running SHAPEIT in this context is really similar to what is described here. We provided a small example of this with the software package: sequence.single.vcf.gz. This file contains genotype data on three individuals, labelled NA12878, NA12891 and NA12892. So, in the above example, we extracted the PIRs for two of these individuals (NA12891 and NA12892) and stored them in the file myPIRsList. In a first step, you need to check that this VCF and the reference panel align well: shapeit -check \ --input-vcf sequence.single.vcf.gz \ -R reference.haplotypes.gz reference.legend.gz reference.sample \ --output-log myAlignmentChecks.log SNPs that do not align well are written to a file, in this example it is called myAlignmentChecks.snp.strand.exclude This file is used in the following command to remove those SNPs from the analysis. Then, you can run the actual phasing stage: shapeit -assemble \ --input-vcf sequence.single.vcf.gz \ --input-pir myPIRsList \ -R reference.haplotypes.gz reference.legend.gz reference.sample \ -O myHaplotypeData \ --exclude-snp myAlignmentChecks.snp.strand.exclude --no-mcmc Using phase uncertainty In this section, we propose two ways of capturing phase uncertainty. Both of them start from the graph structures that SHAPEIT can produce as output. So the very first step is to produce this graph file gwas.phased.graph as shown here: shapeit -B gwas \ -M genetic_map.txt \ --output-graph gwas.phased.graph Method1 A simple approach to propagate uncertainty in downstream analysis is to sample haplotypes R times from the graphs and then to average your downstream analysis over the R generated haplotype sets. To sample 100 haplotype sets for example, you can use the following command: for r in $(seq 1 100); do shapeit -convert \ --input-graph gwas.phased.graph \ --output-sample gwas.phased.S$r \ --seed $r; done Don't forget to change the seed to make sure that two samples are generated using different random number sequences. You can also output the haplotypes only for a subset of SNPs/individuals you are interested in with the following options: --output-from, --output-to, --exclude-ind, --include-ind, --exclude-snp and --include-snp. For example, to generate haplotypes only for the individuals in gwas.subset.inds at sites in gwas.subset.inds, use this command instead of the previous one: shapeit -convert \ --input-graph gwas.phased.graph \ -O gwas.phased.subset.S$r \ --exclude-ind gwas.subset.inds \ --exclude-snp gwas.subset.site Both files gwas.subset.inds and gwas.subset.site are formatted as described in this section. Note that the graph file format used in SHAPEIT version 2 follows new specifications and requires now just one single binary file to completely store the graphs. In SHAPEIT version 1, the graph file format is different and requires three files. SHAPEIT version 2 can still read this deprecated file format using this command line: shapeit -convert \ --input-graph myData.graph.bin myData.graph.bim myData.graph.fam \ -O myData.phased Old version of the graph file format is implicitly recognised by SHAPEIT when you specify 3 arguments to --input-graph. Method2 A bit more sophisticated method relies on internal resampling. The idea is to extract the most likely pairs of haplotypes with their respective probability for each sample and defined on subsets of variants provied by the user. In practice, it is much faster than the previous method and prevents the need to write scripts that parse the haplotype sets. In this method, the first step is to produce a file that describes the different combinations of variant sites for which the most likely haplotypes have to be extracted. Herefater an example (gwas.sets): /home/olivier/gwas.set1 9000779 9009406 /home/olivier/gwas.set2 9009163 9009905 9011322 /home/olivier/gwas.set3 9009163 9011047 This file contains a variable number of columns (min = 3). The first column gives the filename where to write the results of the procedure and every other value corresponds to the physical position of a variant site we wish to consider. For example, the second line in red tells to SHAPEIT to extract the most likely pair of haplotypes at the three variants with positions 9009163, 9009905 and 9011322. Then to do the job, use the following command: shapeit -convert \ --input-graph gwas.phased.graph \ --input-sets gwas.sets This samples a 1,000 times haplotypes in the graphs, subset them at the provided variant site combinations, report both: For each sample, all the most likely haplotype pair are reported with their respective probability.The reported probabilities correspond to frequencies of occurence in the 1,000 samples. These are written in files with *.pairs.gz extension. The frequency of each distinct haplotype in the dataset. It is derived from the haplotype pair probabilities described above. These are written in files with *.freqs.gz extension. Note that you can change the number of sampling stages with the --replicate option as shown below: shapeit -convert \ --input-graph gwas.phased.graph \ --input-sets gwas.sets \ --replicate 500 In this example, 6 files are generated: set1.freqs.gz, set1.pairs.gz, set2.freqs.gz, set2.pairs.gz, set3.freqs.gz and set3.pairs.gz all in the folder /home/olivier. The content of set1.freqs.gz is as follows: 0 11 0.885337 1 01 0.0349776 2 10 0.0546184 3 00 0.0250673 There are 3 columns in this file: The haplotype index, The haplotype content. A combination of 0s and 1s defining the allelic content of the haplotype. The frequency in the dataset. They sum to 1. The content of the mate file, set1.pairs.gz is as follows: 0 4 HG01881 HG01879 HG01880 0 0.96000 0 1 2 1 0 4 HG01881 HG01879 HG01880 1 0.04000 1 0 3 0 1 4 HG01934 HG01932 HG01933 0 1.00000 1 0 0 0 ... 155 3 HG01261 0 HG01260 0 1.00000 0 0 0 156 2 HG02471 0 0 0 1.00000 0 0 157 2 HG02470 0 0 0 1.00000 0 0 158 2 NA18485 0 0 0 1.00000 0 0 159 2 NA18497 0 0 0 1.00000 0 0 160 2 NA18500 0 0 0 1.00000 0 0 161 2 NA18503 0 0 0 1.00000 0 2 162 2 NA18506 0 0 0 0.47000 1 2 162 2 NA18506 0 0 1 0.53000 0 3 ... In purple, it basically tells that NA18506 has two possible haplotype pairs with indexes (1, 2) and (0, 3) in set1.freqs.gz, and with probabilities 0.47 and 0.53. The first column gives the family index, the second the type of family (2=unrelated, 3=duo and 4=trios), the fourth and fifth the parents when relevant, and the sixth the haplotype pair index. In red and blue, two examples of what you get in case of trio and duo, respectively. Here, founder haplotype triplets or quadruplets are reported instead of pairs. Genotype calling from low coverage sequencing (1000G pipeline) We recently published the following paper which presents an extension of the SHAPEIT model that allows to call genotype from low coverage sequencing data when a chip-derived haplotype scaffold is available for the same set of samples: O. Delaneau, J. Marchini & The 1000 Genomes Consortium (2014) Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel. Nature Communications. We showed that this method is able to produce an improved panel of haplotypes for the 1000 Genomes project (both for Phase1 and Phase3) that lead to increased downstream imputation quality. Below we illustrate the use of this new functionnality on a small example (100kb) extracted from the 1000G project phase 1 data. The parameter values we use in this example are the same as used in 1000 Genomes project, and the values we recommend for general use. Note that the current pipeline only works for bi-allelic variants. The pipeline involves first running Beagle to get a quick initial set of genotype calls. SHAPEIT then takes these initial calls, together with the phased haplotype scaffold, and produces a highly accurate set of calls genotypes and haplotypes on the sequencing data. Example files The data files used throughout this example can be downloaded from here. This includes the following files The Genotype Likelihoods (GLs). A GL is the probability of observing the sequencing reads at a particular variant site given the unknown underlying genotype. To calculate them from sequence data contained in BAM files, you can use software such as samtools, snptools, freebayes or GATK. After having used these software, you should get a VCF file conatining a GL field for each site similarly to the example file GLs.vcf.gz. The genetic map. This map is similar the one described here and in the example file map.txt.gz. The haplotype scaffold derived from SNP array data. This data set needs to contain some overlap with the set of sequence data in term of both samples and variant sites. The larger the overlap, the better. This data set also needs to be phased as accurately as possible, so it is important to take advantage of any useful properties of the data, such as large sample sizes or family relationships. The scaffold corresponding to the example data set can be found in scaffold.haps.gz and scaffold.haps.sample. Step 1: Define chromosome chunks for Beagle4 First, download the source code of makeBGLCHUNKS from here. Then, compile by typing make in the terminal. Note that compilation needs a recent version of boost libraries. This piece of software builds a file from a given VCF that contains the chunk coordinates required to run Beagle 4 in each chunk. Feed makeBGLCHUNKS with GLs.vcf.gz and two parameters: The chunk size (--window) in number of variant sites, The overlap size (--overlap) in number of variant sites. A chunk therefore contains L = W + 2 x O variant sites. The window and overlap sizes determine the amount of RAM needed by a single Beagle4 job. Make sure that a chunk spans several hundreds kilobases (i.e. site density x L). Here is an example command line to run makeBGLCHUNKS on GLs.vcf.gz: makeBGLCHUNKS --vcf GLs.vcf.gz --window 700 --overlap 200 --output chunk.coordinates This produces a file chunk.coordinates with the following content: 22 20000085 20064615 22 20037290 20099942 This says that the first chunk of chr22 starts at position 20000085 and ends at position 20064615. Step 2: Run Beagle4 First download Beagle4 from here. Then run it on all the chunks from step2 as follows: while read line; do chr=$(echo $line | awk '{ print $1 }') from=$(echo $line | awk '{ print $2 }') to=$(echo $line | awk '{ print $3 }') ivcf=GLs.vcf.gz ovcf=output.beagle4.$chr\.$from\.$to java -Xmx4g -jar b4.r1196.jar gl=$ivcf out=$ovcf chrom=$chr\:$from\-$to done < chunk.coordinates Note that it seems there is an incompatibility between zlib libraries used in Beagle4 and in BOOST on some platforms. This involves either the last line of the file being skipped or a segfault. To fix this issue, you can just recompress the Beagle4 output files using gzip as follows: zcat output.beagle4.20.20000085.20064615.vcf.gz | gzip -c > tmp.vcf.gz mv tmp.vcf.gz output.beagle4.20.20000085.20064615.vcf.gz zcat output.beagle4.20.20037290.20099942.vcf.gz | gzip -c > tmp.vcf.gz mv tmp.vcf.gz output.beagle4.20.20037290.20099942.vcf.gz Step 3: Convert Beagle4 output in SHAPEIT input You need another C++ program: prepareGenFromBeagle4. Download it from here. And compile it using make. This program merges all the VCFs produced by Beagle4 in step 3 and produces proper whole chromosome input files for SHAPEIT. More specifically, you get: A gen/sample pair of files containing a mixture of GLs and fixed genotypes. All genotypes with beagle4 posteriors above a speficied threshold are fixed. And all genotypes that are not still have GLs as input data onto which SHAPEIT will work. A hap/sample pair of files containing the Beagle4 haplotypes in Impute2 format. These haplotypes are used as a starting point for the MCMC used in SHAPEIT. To convert the 2 VCFs you get out of the Beagle4 runs, proceed as follows: prepareGenFromBeagle4 --likelihoods GLs.vcf.gz\ --posteriors output.beagle4.22.*.vcf.gz\ --threshold 0.995\ --output input.shapeit.22 Note All Beagle4 output files for a given chromosome are specified using output.beagle4.20.*.vcf.gz. Moreover, we used 0.995 as the threshold meaning that all genotypes with a posterior above 0.995 are directly fixed and will only need phasing in the SHAPEIT step. Conversely, all genotypes with a posterior below 0.995 are not fixed and will need both calling and phasing. In output, you'll get the 4 files described above with input.shapeit.20 as prefix. Step 5: Run SHAPEIT Here is what you need to run SHAPEIT: The genetic maps for the chromosome you're phasing The chunk boundaries for each chromosome. We advice to use 1Mb chunks for SHAPEIT with 200kb overlap with the flanking chunks. In total, your chunks should be 1.4Mb long. The haplotype scaffold derived from SNP arrays on the same set of samples that has been sequenced (see the Nature communication paper for more details). The scaffold must be whole chromosome phased in a previous run of SHAPEIT for example (See sections of the documentation about phasing SNP array data) taking advantage of any family data when possible. The genotype/GL file preparated in step 4. The haplotypes for MCMC initialization prepared in step 4. Then, run it as follows: shapeit -call\ --input-gen input.shapeit.20.gen.gz input.shapeit.20.gen.sample\ --input-init input.shapeit.20.hap.gz input.shapeit.20.hap.sample\ --input-map map.txt.gz\ --input-scaffold scaffold.haps.gz scaffold.haps.sample\ --input-thr 1.0\ --thread 8\ --window 0.1\ --states 400\ --states-random 200\ --burn 0\ --run 12\ --prune 4\ --main 20\ --output-max output.shapeit.20.haps.gz output.shapeit.20.haps.sample\ --output-log output.shapeit.20.log\ --input-from 20000000\ --input-to 20100000 Some brief descriptions of the arguments: -call tells to SHAPEIT to use the genotype calling functionnality --input-gen specifies the genotype/GL input data that you get out of the Beagle4 step. --input-init specifies the haplotypes that you get out of the Beagle4 step. --input-map specifies the genetic map used in the estimation. --input-scaffold gives the SNP-array derived haplotype scaffold used by SHAPEIT. It has to be in Impute2 format. --thread tells the number of threads to be used in the estimation. --window 0.1 says that window of 0.1Mb have to be used for the SHAPEIT2 model. It's much shorter than what we use for SNP array data but gives better accuracy in practice. For sequence data, do not go above 0.5Mb. --states 400 says that 400 conditionning haplotypes are used in the estimations. These are chosen to be closely related to the individual being estimated. --states-random 200 says that 200 additional conditionnal haplotypes are used in the estimations. Those are randonly chosen which has been shown to improve MCMC mixing. --burn 0 says that no burn-in iterations is used since we already start from a good first estimation provided by beagle4. --run 12 says that 12 pruning stages are used to make the haplotype graphs more parsimonious. --prune 4 says that only 4 iterations are used in each pruning stage. --main 20 says that final estimates are obtained by averaging over 20 iterations. --output-max is to specify the output haplotypes/genotypes estimated by SHAPEIT. --input-from and --input-to specify the region to be phased. Step 6: Ligate SHAPEIT chunks To ligate together all haplotype chunks produced by SHAPEIT, we provide a C++ program: ligateHAPLOTYPES. Download it from here, and compile it using make. You need to run it as follows: ligateHAPLOTYPES --vcf GLs.vcf.gz\ --scaffold scaffolded_samples.txt\ --chunks s2.chunk1.hap.gz s2.chunk1.hap.gz s2.chunk1.hap.gz\ --output s2.ligated.hap.gz This command will basically ligate the haplotype chunks in s2.chunk1.hap.gz, s2.chunk1.hap.gz and s2.chunk1.hap.gz into a unique haplotype file s2.ligated.hap.gz. Phase between successive segments is automatically determined by aligning the heterozyguous genotypes in the overlaps for all samples that are not listed in the file scaffolded_samples.txt. Haplotypes for all samples specified in scaffolded_samples.txt are ligated using simple concatenation. Options SHAPEIT can be run in 4 different modes: -phase (default) This mode can be used to phase genotype data with or without a reference panel. -convert This modes can be used to convert and manipulate the output of SHAPEIT. -check This mode can be used to check the validity of the input data. -assemble This mode can be used to phase genotype data using phase informative reads. These different modes are triggeredas follows: shapeit or shapeit -phase //phasing mode shapeit -convert //convertion mode shapeit -check //check data mode shapeit -assemble //phasing mode with PIRs Basic --help Print help about the -phase mode on screen. --version Print the version of SHAPEIT. --source Print the version of each SHAPEIT module --licence Print on screen the licence In the following tables of options we have indicated which options work with each of the 4 modes. PH = -phase, CH = -check, CO = -convert, AS = -assemble Input P H C H C O A S --input-bed file.bed file.bim file.fam --input-bed file -B file.bed file.bim file.fam -B file x x Unphased genotypes in Plink Binary BED/BIM/FAM format: * file.bed: binary genotype file * file.bim: SNP map file * file.fam: individual information file Brief description here. Detailed description here. --input-ped file.ped file.map --input-ped file -P file.ped file.map -P file x x Unphased genotypes in Plink Text PED/MAP format: * file.ped: text genotype file * file.map: SNP map file Brief description here. Detailed description here. --missing-code [string] x x Specifies the character used to encode missing data in PED/MAP format. The default value is 0. For example, missing data set as "N" as in GTOOL can be specified with --missing-code N More details here. --input-gen file.gen file.sample --input-gen file -G file.gen file.sample -G file x x Unphased genotypes in GEN/SAMPLE format: * file.gen: text genotype file * file.sample: individual information file Brief description here. Detailed description here. --input-thr [float] x x Threshold for hard calling genotypes when GEN/SAMPLE files are given as input. For each individual at each SNP, the program will assume that the most likely genotype is true if its probability exceeds the threshold. Otherwise, the genotype will be considered as missing. The default value is 0.9. More details here. Note that this option has changed between SHAPEIT1 and SHAPEIT2 and is no more called --input-gen-threshold. --input-vcf file.vcf -V file.vcf x x x Unphased genotypes in VCF format v4.1. Detailled description of the format is here. --input-map file.gmap -M file.gmap x x x x Input genetic map file in Text format with a header line and 3 fields for each SNP: 1. A physical position in bp 2. A rate in cM/Mb (could be any value) 3. A genetic position in cM Details on the file format can be found here. And details on the usage here. --input-ref ref.haps ref.legend ref.sample -R ref.haps ref.legend ref.sample x x x x Reference panel of haplotypes in the Impute2 file format. 1. reference.haps: the reference haplotypes 2. reference.legend: the snp map 3. reference.sample: the individual infos Details on the file format can be found here. Reference panels can be downloaded from here. --chrX x x Specifies to SHAPEIT that the genotypes to be phased come from the non pseudo autosomal region of the X chromosome. SHAPEIT looks therefore at the sex of each individual to determine the ploidy model to use. More details here. --noped x x Discard all pedigree/family information in the input files. SHAPEIT treats all samples as unrelated when this flag is specified. --input-pir file.pirs x Sets of Phase Informative reads produced by extractPIRs. More details here. --input-graph file.graph x Read haplotype graph file generated by a SHAPEIT run with the option --output-graph. More details here and here. --input-haps file.haps file.sample --input-haps file x Read haplotypes estimated by a SHAPEIT run with the option --output-max file.haps files.sample. More details here. --input-sets file.sets x List of combinations of variant sites for which pairs of haplotypes have to be produced. More details here. --replicate INT x Number of haplotype pairs sampled per individual. Default is 1,000. More details here. Filtering P H C H C O A S --input-from [int] x x x x Only consider genotypes with physical position above the specified value in the phasing process. Details here. --input-to [int] x x x x Only consider genotypes with physical position below the specified value in the phasing process. Details here. --output-from [int] x x x x Write haplotypes only for SNPs with position above the specified value. Details here. --output-to [int] x x x x Write haplotypes only for SNPs with position below the specified value. Details here. --exclude-ind file.exc x x x x Read genotypes for individuals NOT included in the list of IDs file.exc. The provided file contains a line for each individual to exclude (by ID). Details here. --include-ind file.inc x x x x Read genotypes only for individuals included in the list of IDs file.inc. The provided file contains a line for each individual to include (by ID). Details here. --exclude-snp file.exc x x x x Read genotypes for SNPs NOT included in the list of positions file.exc. The provided file contains a line for each SNP to exclude (by physical position). Details here. --include-snp file.inc x x x x Read genotypes for SNPs included in the list of positions file.inc. The provided file contains a line for each SNP to include (by physical position). Details here. --exclude-grp file.exc x x x Read haplotypes in the reference panel specified with --input-ref only for individuals with both group or population IDs NOT included in the list of IDs file.exc. The provided file contains a line for each group or population to exclude (by ID). Details here. --include-grp file.inc x x x Read haplotypes in the reference panel specified with --input-ref only for individuals with either group or population IDs included in the list of IDs file.inc. The provided file contains a line for each group or population to exclude (by ID). Details here. Model P H C H C O A S --states [int] -S [int] x x To specify the number of conditioning haplotypes to be used during the phasing process. The running time of the algorithm increases only linearly with this number. The default value is 100. More details here. --window [float] -W [float] x x Mean size of the windows in Mb in which conditioning haplotypes are defined. This parameter has impact on the accuracy. The default value is 2 which is ok for most GWAS application. We advice a 0.5 value when dealing with genotypes derived from sequencing. More details here. --model-version1 x x Use the graphical model to represent the conditioning haplotypes (SHAPEIT v1). More details here. --effective-size [int] x x The effective size of the population you want to phase. Use the following values: * For European: 11418 * For African: 17469 * For Asian: 14269 * For Mixed sample: between 11418 and 17469, depending on the proportion of each population in the dataset. More details here. --rho [float] x x Constant recombination rate value used when no genetic map has been specified using --input-map. More details here. MCMC P H C H C O A S --burn [int] x x The number of burn-in iterations used by the algorithm to reach a good starting point. The default is to use 7 burn-in iterations. More details here. --prune [int] x x The number of pruning iterations used by the algorithm to find a parsimonious graph for each individual. The default is to use 8 pruning iterations. More details here. --main [int] x x The number of iterations used by the algorithm to compute transition probabilities in the haplotype graphs. The default is to use 20 pruning iterations. More details here. --thread [int] -T [int] x x x The number of individuals whose phase is updated in parallel using different threads. If you have 4 cores in your computer, you can use --thread 4 to divide the running times by almost 4. The dafult value is 1. More details here. --seed [int] x x x x The seed of the random number generator. Default is the value given by stdlib function time(NULL). More details here. --no-mcmc x x Disables the MCMC iterations. This needs a reference panel to be specified. More details here. Output P H C H C O A S --output-max file.haps file.sample --output-max file -O file.haps file.sample -O file x x Provides the most likely pair of haplotypes for each individual in Impute2 format. All extra informations (phenotypes/sex/coveriates) present in the input file are added to the SAMPLE file. More details on the file format specifications can be found here. And more details about this options here. --output-graph file.hgraph x Provides the haplotype graph for each individual in the sample once the phasing process done. Can be read with SHAPEIT when using the -convert mode. More details about this options here and here. --output-log file.log -L file.log x x x x To specify the log file where the screen output is copied. It also gives the prefix of all the files generated by SHAPEIT when checking input data. By default, the file is called shapeit_[date]_[time]_[UUID].log. More details here. --output-haps file.haps file.sample x Haplotypes in Impute2 format produced in the convert mode of SHAPEIT. More details here. --output-vcf file.vcf x Haplotypes in VCF format produced in the convert mode of SHAPEIT. More details here. --output-ref file.haps file.leg file.sample x Haplotypes in Impute2 reference panel format produced in the convert mode of SHAPEIT. More details here. File formats SHAPEIT reads and outputs various different files with different formats. Click the links below for details of each format Unphased genotypes (PED/MAP) Unphased genotypes (BED/BIM/FAM) Unphased genotypes (GEN/SAMPLE) Genetic map (GMAP) Phased haplotypes (HAPS/SAMPLE) Reference panel (HAP/LEG/SAMPLE) Plink PED / MAP file format for Unrelateds & Nuclear families This is a default text file format used by Plink. This format is useful if you want to edit the files in a text editor. However, it requires a substantial HDD usage for large datasets and thus longer running times to parse it. A detailed description of the ped/map file format can be found here. To describe it here briefly, consider the following individuals: 2 Unrelateds (UNR1 & UNR2). 1 Trio (father=TRIOF, mother=TRIOM & child=TRIOC). 1 Duo (parent=DUOP & child=DUOC). All the individuals were typed on 3 SNPs (SNP1, SNP2 & SNP3) giving the following genetic data: SNP1 SNP2 SNP3 UNR1 A/A T/C ?/? UNR2 A/G T/C A/T TRIOF A/G T/C A/T TRIOM A/G T/C A/T TRIOC A/A C/T A/T DUOP G/A T/C A/A DUOC A/A T/C A/A PED file The PED file describes the individuals and the genetic data. The PED file corresponding to the example dataset is: FAM1 IND1 0 0 1 0 A A T T 0 0 FAM2 IND2 0 0 1 0 A G T C T A FAM3 TRIOF 0 0 1 0 A G T C A T FAM4 TRIOM 0 0 2 0 A G T C A T FAM5 TRIOC TRIOF TRIOM 1 0 A A C T A T FAM6 DUOP 0 0 2 0 G A T C A A FAM7 DUOC DUOP 0 2 0 A A T C A A This file can be SPACE or TAB delimited. Each line corresponds to a single individual. The first 6 columns are: Family ID [string] Individual ID [string] Father ID [string] Mother ID [string] Sex [integer] Phenotype [float] Columns 7 & 8 code for the observed alleles at SNP1, columns 9 & 10 code for the observed alleles at SNP2, and so on. Missing data are coded as "0 0" as for example for SNP3 of IND1. This file should have N lines and 2L+6 columns, where N and L are the numbers of individuals and SNPs contained in the dataset respectively. Each individual must have an unique ID containing only alphanumeric characters. MAP file The MAP file describes the SNPs. The MAP file corresponding to the example dataset is: 7 SNP1 0 123 7 SNP2 0 456 7 SNP3 0 789 This file can be SPACE or TAB delimited. Each line corresponds to a SNP. The 4 columns are: Chromosome number [integer] SNP ID [string] SNP genetic position (cM) [float] SNP physical position (bp) [integer] This file should have L lines and 4 columns, where L is the number of SNPs contained in the dataset.Each SNP must have a unique physical position. All the SNPs must be ordered by physical position. PED/MAP to BED/BIM/FAM conversion To convert myPlinkTextData.ped and myPlinkTextData.map in Plink binary format, use Plink as follows: plink --file myPlinkTextData --make-bed --out myPlinkBinaryData PED/MAP to GEN/SAMPLE conversion To convert myPlinkTextData.ped and myPlinkTextData.map in GEN/SAMPLE format, use the software package GTOOL that you can find here as follows: gtool -P --ped myPlinkTextData.ped --map myPlinkTextData.map --og myGtoolTextData.gen --os myGtoolTextData.sample Note that GTOOL has more options to tweak the conversion between formats. Plink BED / BIM / FAM file format for Unrelateds & Nuclear families This is a default binary file format used by Plink. This format is very convenient for large datasets since it requires low HDD memory usage. You can find a precise description of the Plink binary format here. If you want to use this format, we advice to first format your data in PED/MAP format and then use Plink to convert them in bed/bim/fam To describe it here briefly, consider the following individuals: 2 Unrelateds (UNR1 & UNR2). 1 Trio (father=TRIOF, mother=TRIOM & child=TRIOC). 1 Duo (parent=DUOP & child=DUOC). All the individuals were typed on 3 SNPs (SNP1, SNP2 & SNP3) giving the following genetic data: SNP1 SNP2 SNP3 UNR1 A/A T/C ?/? UNR2 A/G T/C A/T TRIOF A/G T/C A/T TRIOM A/G T/C A/T TRIOC A/A C/T A/T DUOP G/A T/C A/A DUOC A/A T/C A/A BIM file The BIM file describes the SNPs. The BIM files corresponding to the example dataset is: 7 SNP1 0 123 A G 7 SNP2 0 456 T C 7 SNP3 0 789 A T This file is SPACE or TAB delimited. Each line corresponds to a single SNP. The first six columns are: Chromosome number [integer] SNP ID [string] SNP genetic position (cM) [float] SNP physical position (bp) [integer] First allele [string] Second allele [string] This file should have L lines where L is the number of SNPs in the dataset. Each SNP must have a unique physical position. All the SNPs must be ordered by physical position. FAM file The FAM file describes the individuals. The FAM files corresponding to the example dataset is: FAM1 IND1 0 0 1 0 FAM2 IND2 0 0 1 0 FAM3 TRIOF 0 0 1 0 FAM4 TRIOM 0 0 2 0 FAM5 TRIOC TRIOF TRIOM 1 0 FAM6 DUOP 0 0 2 0 FAM7 DUOC DUOP 0 2 0 This file is basically the 6 first columns of the corresponding PED file. This file is SPACE or TAB delimited. Each line corresponds to a single individual. The first six columns are: Family ID [string] Individual ID [string] Father ID [string] Mother ID [string] Sex [integer] Phenotype [float] This file should have N lines and 6 columns where N is the number of individuals.Each individual must have an unique ID containing only alphanumeric characters. BED file A detailed description of the BED format can be found here. BED/BIM/FAM to PED/MAP To convert back myPlinkBinaryData.bed, myPlinkBinaryData.bim and myPlinkBinaryData.fam in Plink text format, use Plink as follows: plink --bfile myPlinkBinaryData --recode --out myPlinkTextData Oxford GEN / SAMPLE file format for Unrelateds This is a default text file format used in the analysis pipeline of the WTCCC genetic studies. This format is now widely used by a variety of tools. A detailed description of the gen/sample file format can be found here. To describe it here briefly, consider 4 unrelateds (UNR1, UNR2, UNR3 & UNR4) that were typed on 3 SNPs (SNP1, SNP2 & SNP3): SNP1 SNP2 SNP3 UNR1 A/A T/C ?/? UNR2 A/G C/C A/T UNR3 A/G T/T A/T UNR4 A/G T/C T/T GEN file The GEN file contains the genetic data and describes the SNPs. The GEN file corresponding to the example dataset is: 7 SNP1 123 A G 1 0 0 0 1 0 0 1 0 0 1 0 7 SNP2 456 T C 0 1 0 0 0 1 1 0 0 0 1 0 7 SNP3 789 A T 0 0 0 0 1 0 0 1 0 0 0 1 This file is SPACE delimited. Each line corresponds to a single SNP. The first 5 columns are: Chromosome number [integer] SNP ID [string] SNP physical position (bp) [integer] First allele [string] Second allele [string] Then the successive column triplets (6, 7, 8), (9, 10, 11), (12, 13, 14) and (15, 16, 17) correspond to the genotypes of the individuals at the first, second, third and fourth SNP. A genotype is coded as a triplet (AA, AB, BB). For example "1 0 0" means that the genotype is A/A, and "0 1 0" that the genotype is A/B. Missing data is coded as "0 0 0" as for IND1 x SNP3. For a single SNP, the genotypes are given in the same order than in the SAMPLE file (see below). This file should have L lines and 3N+5 columns where L and N are the numbers of SNPs and individuals respectively.Each SNP must have a unique physical position. All the SNPs must be ordered by physical position. SAMPLE file The SAMPLE file describes the individuals. The minimal SAMPLE file corresponding to the example dataset is: ID_1 ID_2 missing 0 0 0 UNR1 UNR1 0 UNR2 UNR2 0 UNR3 UNR3 0 UNR4 UNR4 0 It is SPACE delimited. The first two lines are header lines that describe the content of the file. Then, each line corresponds to a single individual. The first three columns are: 1. First individual ID 2. Second individual ID 3. Missing data proportion Additional columns usually describe other information for the individuals as some phenotypes for example or covariates. This file should have N+2 lines and at least 3 columns where N is the number of individuals. Each individual must have unique IDs containing only alphanumeric characters. GEN/SAMPLE to PED/MAP To convert myGtoolTextData.gen and myGtoolTextData.sample in Plink PED/MAP format, use GTOOL as follows: gtool -G --g myGtoolTextData.gen --s myGtoolTextData.sample --ped myPlinkTextData.ped --map myPlinkTextData.map Note that genotypes with a probability below the --threshold value are set as missing. GTOOL codes missing genotypes using "N" characters. You can use Plink to recode them using "0" as follows: plink --file myPlinkTextData --missing-genotype N --output-missing-genotype 0 --recode --out myPlinkTextData2 The genetic map You can find genetic maps for Human here: HapMap phase II b36 HapMap phase II b37 The genetic map must have a header line and the three following SPACE delimited columns: The physical position (bp) [integer] The recombination rate (cM/Mb) [float] The genetic position (cM) [float] The resulting genetic map file should look like that: pposition rrate gposition 72765 0.12455 0.00000 94172 0.12458 0.00266 94426 0.12461 0.00269 95949 0.12461 0.00288 98087 0.12460 0.00315 98481 0.12468 0.00320 111955 0.12475 0.00488 113224 0.12476 0.00504 113934 0.12481 0.00513 ... ... ... 135443104 6.25739 182.52442 135447971 6.29614 182.55506 135499163 10.89782 183.11294 HAPS / SAMPLE file format for haplotypes This is a default haplotype text file format used by the analysis pipeline of the WTCCC genetic studies. This format is used for example by IMPUTE2. To describe it briefly, consider 4 Unrelateds (IND1, IND2, IND3 & IND4) that were phased on 3 SNPs (SNP1, SNP2 & SNP3): SNP1 SNP2 SNP3 IND1 hap1 A T A IND1 hap2 A C T IND2 hap1 G C T IND2 hap2 A T A IND3 hap1 A T T IND3 hap2 A C T IND4 hap1 G T T IND4 hap2 G C T SAMPLE file The SAMPLE file describes the individuals. The minimal SAMPLE file corresponding to the example dataset is: ID_1 ID_2 missing 0 0 0 UNR1 UNR1 0 UNR2 UNR2 0 UNR3 UNR3 0 UNR4 UNR4 0 It is SPACE delimited. The first two lines are header lines that describe the content of the file. Then, each line corresponds to a single individual. The first three columns are: First individual ID Second individual ID Missing data proportion Additional columns usually describe other information for the individuals as some phenotypes for example or covariates. This file should have N+2 lines and at least 3 columns where N is the number of individuals. Each individual must have unique IDs containing only alphanumeric characters. HAPS file The HAPS file describes the SNPs and contains the haplotypes. The HAPS file corresponding to the example dataset is: 7 SNP1 123 A G 0 0 1 0 0 0 1 1 7 SNP2 456 T C 0 1 1 0 0 1 0 1 7 SNP3 789 A T 0 1 1 0 1 1 1 1 This file is SPACE delimited. Each line corresponds to a single SNP. The first five columns are: Chromosome number [integer] SNP ID [string] SNP Position [integer] First allele [string] Second allele [string] Then the successive column pair (6, 7), (8, 9), (10, 11) and (12, 13) corresponds to the alleles carried at the 4 SNPs by each haplotype of a single individual. For example a pair "1 0" means that the first haplotype carries the B allele while the second carries the A allele. The haplotypes are given in the same order than in the SAMPLE file. This file should have L lines and 2N+5 columns, where L and N are the numbers of SNPs and individuals respectively. HAP / LEGEND / SAMPLE file format for haplotype reference panels This is a default haplotype text file format used by IMPUTE2 to encode the haplotypes of the 1,000 Genomes Project. Precise details about the file format can be found here. To describe it briefly, consider 4 unrelated individuals (Two white American CEU1 & CEU2, an English GBR1 & a Nigerian YRI1) for which haplotypes are available at 3 SNPs (SNP1, SNP2 & SNP3): SNP1 SNP2 SNP3 CEU1 hap1 A T A CEU1 hap2 A C T CEU2 hap1 G C T CEU2 hap2 A T A GBR1 hap1 A T T GBR1 hap2 A C T YRI1 hap1 G T T YRI1 hap2 G C T SAMPLE file The SAMPLE file describes the individuals. The minimal SAMPLE file corresponding to the example dataset is: sample population group sex CEU1 CEU EUR 1 CEU2 CEU EUR 2 GBR1 GBR EUR 2 YRI1 YRI AFR 2 It is SPACE delimited. The first line is a header line that describe the content of the file. Then, each line corresponds to a single individual. The first four columns are: Individual ID Group1 ID Group2 ID Sex (1 for male, 2 for female) SHAPEIT requires at least 4 columns in the SAMPLE file. Additional columns can be added but SHAPEIT will ignore them. This file should have N+1 lines and at least 4 columns where N is the number of individuals in the reference panel.Each individual must have unique IDs containing only alphanumeric characters.Each individual can have 2 associated group IDs for subsetting purpose. LEGEND file The LEGEND file describes the SNPs. The minimal LEGEND file corresponding to the example dataset is: id position a0 a1 SNP1 123 A G SNP2 456 T C SNP3 789 A T This file is SPACE delimited. The first line is a header line that describe the content of the file. Each line corresponds to a single SNP. The first four columns are: SNP ID [string] SNP Position [integer] First allele [string] Second allele [string] SHAPEIT requires at least 4 columns in the SAMPLE file. Additional columns can be added but SHAPEIT will ignore them. This file should have L+1 lines and at least 4 columns where L is the number of SNPs in the reference panel. HAP file The HAP file contains the haplotypes. The HAP file corresponding to the example dataset is: 0 0 1 0 0 0 1 1 0 1 1 0 0 1 0 1 0 1 1 0 1 1 1 1 This file is SPACE delimited. Each line corresponds to a single SNP. Each successive column pair (0, 1), (2, 3), (4, 5) and (6, 7) corresponds to the alleles carried at the 4 SNPs by each haplotype of a single individual. For example a pair "1 0" means that the first haplotype carries the B allele while the second carries the A allele as specified in the LEGEND file. The haplotypes are given in the same order than in the SAMPLE file. This file should have L lines and 2N columns, where L and N are the numbers of SNPs and individuals respectively. Version history v2.r900 (17 January 2018) We have adjusted the algorithm to (a) ensure that when missing genotypes in duos/trios are imputed as part of the phasing they are Mendel consistent (b) improve the accuracy of imputation of missing genotypes. v2.837 (17 August 2015) Fixes some bugs when running -assemble mode v2.790 (16 June 2014) New functionality for calling genotypes and phasing from sequencing data (1000GP pipeline) We have published the following paper on using SHAPEIT2 for calling genotypes and phasing from sequencing data. This method is being used within the 1000 Genomes Project. O. Delaneau, J. Marchini, The 1000 Genomes Project Consortium (2014) Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel. Nature Communications 5 3934 [LINK] The new functionality, and the pipeline used for the 1000 Genomes project, is available here. Bug fixes We have fixed the following bugs The -assemble mode will now work when some individuals in the PIRs file are not in the genotype file. The -convert mode will now work when converting data to the hap/legend/sample format by giving the --output-ref option 3 arguments. Monomorphic sites in PLINK BED format files don't now have the alternate allele coded as 0, which implies some incompatibility with Impute2. v2.778 (24 Mar 2014) The website for SHAPEIT has been re-designed (hope you like it) and is now hosted at https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html The new version of SHAPEIT (v2 778) contains two new major features Functionality for using phase information in sequencing reads to help phasing. This method is primarily focused on phasing high-coverage sequencing datasets, from which genotypes have already been called. Our approach extracts phase-informative reads from a set of BAM files, and uses this information to aid the phasing, within the SHAPEIT approach. This approach helps especially with phasing low-frequency variants. In fact, the method can phase singleton sites, that are covered by phase-informative reads, with high accuracy. The details of the method have been published here. O. Delaneau, B. Howie, A. Cox, J-F. Zagury, J. Marchini (2013) Haplotype estimation using sequence reads. American Journal of Human Genetics 93 (4) 787-696 [LINK] See here for more detail instruction on these features. Functionality for phasing in general pedigrees. In the following paper we investigated the use of SHAPEIT for phasing related samples. J. O'Connell, D. Gurdasani, O. Delaneau, et al. (2014) A general approach for haplotype phasing across the full spectrum of relatedness. PLoS Genetics [LINK] The conclusions of this paper are that SHAPEIT can be an accurate method of phasing across a whole spectrum of relatedness, from explicitely related families or pedigrees, through closely related samples such as those found in isolated populations, to unrelated samples. Our new approach has two steps. Firstly, all the samples are phased ignoring the pedigree information. This phasing is very accurate despite not using the pedigree information. The reason is that the sharing of long-range haplotypes between related samples actually helps the phasing within the MCMC algorithm that SHAPEIT implements. Secondly, we have developed a new method (called duoHMM) that takes the estimated haplotypes from the first step, and combines them with the known pedigree information, to further improve the phasing. This method will ensure that the haplotypes are consistent with the pedigree structure, leading to increased accuracy. There is a single, simple option (--duohmm) used to active this method. See here for more detailed instructions. A side product of the duoHMM approach is that it can be used to detect recombination events and non-mendelian genotyping errors. For these features we provide standalone software. v2.644 (4 Dec 2012) The main improvement in SHAPEIT version 2 resides in the new model used for the conditioning haplotypes. It is a generalisation of the "surrogate family" phasing idea that can now be applied at the whole chromosome scale. It results in improved accuracy and speed compared to version 1. This new model is used by default and can be controlled via two parameters, the average window size (W) and the number of haplotypes to be considered per window (K). Default values for these parameters can be changed using the options --window and --states respectively. The graph representation of the conditioning haplotypes (the underlying model of version 1) can still be used if you add --version1 flag to the SHAPEIT command line. * The SHAPEIT2 model is now used by default with --states 100 and --window 2. * The SHAPEIT1 model can still be used if the --version1 flag is added to the command line. * Option --chrX to phase X chromosomes has been improved: better compatibility with Impute2, more checks for haploid heterozygous, possibility of using a reference panel. * Option --input-ref to phase using a reference panel has been improved: SNP alignment between the study genotypes and the reference panels is much easier. * Subsets of the reference panel can be ignored using --exclude-grp and --include-grp options. It makes things easier if you want to consider only a subset of the reference haplotypes. * MCMC iterations can be discarded when using a reference panel with --no-mcmc flag. Each individual is phased in turn using only the reference panel of haplotypes. * Default number of iterations has been halved from 70 [10B+10P+50M] to 35 [7B+8P+20M]. * Haplotype graphs now requires a single file to be stored, instead of 3 as in version 1. * Verbose on the screen has been clarified. * --version issue has been fixed * To reduce command line length; (a) many available options have now a short name (ex: --input-bed can be written as -B), and (b) file set with common prefix can be specified only with the prefix (ex: --input-bed file.bed file.bim file.fam can be written as --input-bed file) * A new mode has been added: shapeit -check [options]. It contains all functions related to data verification. It reads all input files and spot any problems related to high call rates, Mendel error rates or haploid heterozygous. Summary statistics are given in several log files automatically generated. * A new mode has been added: shapeit -convert [options]. It contains useful functions to manipulate SHAPEIT output files; (a) haplotype graph files generated using --output-graph and (b) haplotypes generated using --output-max. v1.ESHG (23 June 2012) * Major: New option --chrX to phase X chromosomes. * Major: New option --input-ref to phase using a reference panel of haplotypes (1KGP). * Major: More complete log files with many useful statistics that can be direclty input in R about missingness per snp/individual, allele frequencies, mendel errors, heterozygous haploids, etc ... * Minor: All input files can be given as Gzipped or Bzipped as soon as the correct file extension is given (.gz and .bz2). They will be internally decompressed. * Minor: All output files can be internally compressed using Gzip or Bzip2 as soon as the correct file extension is given (.gz and .bz2). * Minor: Any information in the input sample file are propagated into the output sample file as the phenotypes for example. Idem for PED and FAM files. * Minor: Verbose on the screen was reduced. * Minor: Unphased genotype files can be given in a more parsimonious way. For example --input-bed file.bed file.bim file.fam can be now written as --input-bed file. v1.r532: * Minor: bugfix in trio/duo phasing. v1.r416: * Minor: When running on ped/map files at monomorphic SNPs the output files listed the two alleles as the observed allele and 0.This has now been changed so that the observed allele is listed twice.This provides better compatibility with IMPUTE2. v1.r415: * Major: Unrelateds are ordered in the same way in input and output files. * Minor: Covariates and phenotypes in input sample file are given in output sample file (still don't work when the option --output-graph is used). * Minor: When output window coordinates are not consistent with input window coordinates, they are automatically fixed v1.r378: * Major: SNPs are not uniquely identified by their ID anymore, but rather by their physical position. * Minor: Compiled on older Linux version for better compatibility. * Minor: Improved initial data checking (calling rate / Mendel error rate). * Minor: New data sub-setting options ( --include-ind --exclude-ind --include-snp --exclude-snp --output-from --output-to ) * Minor: New testing option to check that data reading is OK ( --test-reading )