Genome-wide association studies (GWAS) serve as a powerful tool to identify statistically significant, disease-associated variants. However, for many of the disease-associated variants reported by GWAS, the biological mechanisms underlying their association remain unclear. Thus, it is necessary to identify genes potentially affected by these reported variants as possible sources of the disease-association signals. Identifying the candidate disease genes to associate with a variant can be difficult, as many variants lie in non-coding (intronic or intergenic) regions of the genome and therefore share no obvious connection with any gene.
Currently, candidate disease genes are typically determined to be those genes in closest proximity to the variants in question—a method that does not effectively handle variants found in gene deserts, or in close proximity to multiple genes. Furthermore, this method overlooks the possibility that reported variants may be contained in regulatory elements and therefore affect even distant genes.
Here, we present a Java and Perl-based application with a command line interface for post-GWAS analysis. The application incorporates gene regulatory information in producing a list of scored candidate disease genes from a set of GWAS-reported variants, where candidate scores indicate the strength of a gene’s relationship to the disease in question. Thus, our program can identify risk genes both proximal and distal to GWAS-reported variants and determine those risk genes most likely to be the sources of disease-association signals.
PGA: post-GWAS analysis for disease gene identification. Lin JR, Jaroslawicz D, Cai Y, Zhang Q, Wang Z, Zhang ZD (2017) Bioinformatics (In press)
» The stand-alone executable file
» System requirements
» Running PGA
» Input files from the user
» Output files
chr4 182101688 182101689 rs100182880-based and 1-based indexing are both accepted. The file for the case study of Alzheimer's disease can be downloaded here).
chr2 127852020 127852021 rs10207628
chr12 106108718 106108719 rs10219670
chr7 146897402 146897403 rs10273775
chr18 56752053 56752054 rs1037757
chr14 92926951 92926952 rs10498633
java -Xmx4g -cp src edu.yu.einstein.zdzlab.pga.GWASAnalyzer ALZ EUR
Our post-GWAS analysis application performs two distinct operations: it identifies candidate disease genes given a set of GWAS-reported variants, and it scores these candidate genes to prioritize those most likely to be the sources of the disease-association signals.
» Identifying risk regions
As GWAS only detect statistical associations from among a pre-select subset of variants, it is necessary to identify all unexamined variants that are in strong linkage disequilibrium (LD) with the GWAS-reported SNPs/indels as potential alternative sources of the disease-association signal. Using VCFtools (Danecek, et al., 2011) and an 1,000 Genomes Project (1KG) reference panel (Genomes Project, et al., 2012), we calculate the LD between each GWAS-reported variant and every 1KG variant within a 400-kb range. An LD block is then formed from all within-range SNPs with r2 > 0.5 and is indexed by the corresponding GWAS-reported variant. We can then use these LD blocks to identify both proximal and distal risk genes.
» Identifying proximal risk genes
We merge all overlapping or adjacent (within 250 kb) LD blocks to form genomic risk regions. Proximal risk genes are identified as those genes that—after extending their ranges by 20 kb on each end—overlap with these genomic risk regions.
» Identifying distal risk genes
In order to incorporate gene regulatory information, we collected lists of expression quantitative trait loci (eQTL) and transcriptional regulatory elements (TREs) with their target genes from ENCODE (Thurman, et al., 2012), FANTOM5 (Andersson, et al., 2014), and GTEx (Consortium, 2013) data. Distal risk genes are determined to be those genes affected by an eQTL or TRE that contains any of the variants found in an LD block (including the GWAS-reported variant itself).
» Scoring risk genes
Analyzing GWAS-reported variants can produce a large number of candidate disease genes—particularly when gene regulatory information is incorporated as well—and it is therefore useful to identify the most likely candidates among these risk genes. Our application employs a statistical method to score the disease-relatedness of risk gene candidates with predictive features extracted from gene networks and annotation based on a set of training genes—genes known to play a role in the etiology of the disease in question.
Given this set of known disease genes D and the set of known genes G (from GENCODE v19), we obtain the set of background genes B = G – D. Then, from the set of known disease genes D we extract our predictive features—the frequent combinations of the neighbors of genes in D in the gene network we employ (need updated FLN), or of the Gene Ontology (GO) terms associated with the genes in D. Our application uses the FP-growth algorithm for frequent itemset mining (HAN et al. 2000) with a support value of ⌈(|D|)/10⌉. Additionally, GO terms of genes in D include both annotated GO terms and their ancestor GO terms along the path of the 'is a' relationship in the GO hierarchy structure. The predictive features generated are limited to 3-itemsets to avoid redundancy and intensive computation.
After extracting our predictive feature set from D, we assign a score to each feature f in the set based on the frequency of its association with genes in D and B:
in which FD is the frequency with which f occurs in D and ND is the number of genes in D. FB and NB have similar meanings. Next, for each candidate disease gene, we identify all the predictive features with which it is associated and assign it the highest score of these features. In the event that a risk gene candidate is a training gene as well, the score Sf of each predictive feature it contains must be adjusted:
As network and annotation scores are treated separately, each gene now has two different scores that must be combined to produce a final gene score:
in which Sf(n) and Sf(a) are the network and annotation-based scores of the gene in question, respectively, and α is a coefficient controlling the amount of influence these two scores have, relative to each other, on the final gene score. α set at 0.4 will yield the best predictive power according to the result of our evaluation (the data is not shown). Every candidate gene is scored using gene sets B and D excluding the candidate gene itself to avoid biased scoring.
Notably, our gene scoring method is limited by information available about known disease genes and is based on the hypothesis that novel disease genes will be involved in the same pathways and mechanisms as known disease genes.
» Evaluating scores
Our application produces a high scoring cutoff threshold in order to give the scores generated some context. The threshold is the value that achieves a prediction precision ≥ 0.8.
» Application output and efficiency
Our post-GWAS analysis application produces a list of candidate disease genes based on user-input, GWAS-reported variants, a list of scored candidate disease genes based on the scoring method outlined above, and intermediate files such as lists of network and annotation-based predictive features that can be reused with the same training gene set in order to avoid redundant computation. The first operation performed by the application, generating risk gene candidates, requires between 15 and 90 seconds per input variant, dependent on the size of the 1KG reference panel in use. The next operation, scoring these candidate genes, can take less than one to several hours, dependent on the size of the network/annotation data in use and the relationship between this data and the training gene set used to generate predictive features. As the frequent itemset mining required to generate predictive features takes a majority of the time in this operation, using pre-generated predictive feature sets significantly reduces the time necessary to score candidate genes.
Andersson, R., et al. An atlas of active enhancers across human cell types and tissues. Nature 2014;507(7493):455-461.
Consortium, G.T. The Genotype-Tissue Expression (GTEx) project. Nature genetics 2013;45(6):580-585.
Danecek, P., et al. The variant call format and VCFtools. Bioinformatics 2011;27(15):2156-2158.
Genomes Project, C., et al. An integrated map of genetic variation from 1,092 human genomes. Nature 2012;491(7422):56-65.
Jiawei, H., et al. Mining frequent patterns without candidate generation. In, SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data. Dallas, Texas, USA.
Thurman, R.E., et al. The accessible chromatin landscape of the human genome. Nature 2012;489(7414):75-82.