PICARA: In the quest for the wisdom of crowds

The speed, cost, and accessibility of DNA sequencing have been transformed in recent years by new technologies, opening up exciting opportunities for disease diagnosis, therapeutic intervention, and the study of complex trait variation. Chief among these are genome-wide association studies, frequently referred to as GWAS, in which researchers look for single nucleotide polymorphisms (SNPs) that give rise to phenotypic variation or are in linkage disequilibrium with the causative genetic variants. To further annotate the effect of these associations on phenotypes, researchers often search the literature, public resources, and databases for relevant information, seeking supporting evidence that underpins the peaks of these significant associations.
To automate this process, the Gramene Diversity team at Cornell University has created a new analytical pipeline, built on the curated Gramene database, that functionally annotates a priori candidate genes based on the co-localization of enriched GWAS signals with integrated knowledge pertaining to the same biological phenomenon. This newly published enrichment pipeline, PICARA (named after a fictional female adventurer), is built upon a Bayesian framework and provides a probabilistic inference that expresses a degree of belief in the functional implication of each a priori candidate. Functional characteristics of a priori candidates with strong statistical support are examined using a phylogeny-based gene homology search. PICARA addresses a limitation of previous approaches, which used fixed window sizes, by estimating the extent of linkage disequilibrium dynamically around a priori candidates according to their local SNP distributions. The resulting linkage blocks are then used to delineate the target haplotypes containing genetic variants and potential a priori candidate genes of interest.
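To make the contrast with a fixed window concrete, here is a minimal Python sketch of one simple way a linkage block could be grown around a focal SNP from the local data. It is not PICARA's published procedure, which this article does not spell out. The sketch assumes genotypes coded as 0/1/2 allele counts, uses pairwise r² as the linkage disequilibrium measure, and extends the block in each direction until the first neighboring SNP whose r² with the focal SNP falls below a cutoff. The function names, the 0.2 cutoff, and the toy data are all hypothetical.

```python
import numpy as np


def r_squared(a, b):
    """Squared Pearson correlation between two genotype vectors (0/1/2 coded)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.std() == 0 or b.std() == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return float(np.corrcoef(a, b)[0, 1] ** 2)


def ld_block(genotypes, positions, focal, threshold=0.2):
    """Return (start, end) positions of a block around the SNP at column `focal`.

    genotypes: array of shape (individuals, SNPs), SNPs ordered by position.
    The block grows outward one SNP at a time and stops in each direction at
    the first SNP whose r^2 with the focal SNP is below `threshold`
    (an assumed, illustrative cutoff).
    """
    n_snps = genotypes.shape[1]
    focal_col = genotypes[:, focal]

    left = focal
    while left > 0 and r_squared(focal_col, genotypes[:, left - 1]) >= threshold:
        left -= 1

    right = focal
    while right < n_snps - 1 and r_squared(focal_col, genotypes[:, right + 1]) >= threshold:
        right += 1

    return positions[left], positions[right]


# Toy data: 6 individuals, 5 SNPs (each inner list is one SNP, then transposed).
example_genotypes = np.array([
    [0, 2, 1, 1, 2, 0],  # SNP 0: uncorrelated with the focal SNP
    [0, 0, 1, 1, 2, 2],  # SNP 1: identical to the focal SNP
    [0, 0, 1, 1, 2, 2],  # SNP 2: focal SNP
    [2, 2, 1, 1, 0, 0],  # SNP 3: perfectly anti-correlated, so r^2 = 1
    [1, 1, 0, 2, 1, 1],  # SNP 4: uncorrelated with the focal SNP
]).T
example_positions = [100, 250, 400, 900, 5000]

print(ld_block(example_genotypes, example_positions, focal=2))  # (250, 900)
```

In this toy case the block spans positions 250 to 900, so its width is set by the local correlation structure rather than by a preset distance on either side of the focal SNP.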
In a recent publication, we demonstrated the performance of PICARA using genome-wide association data for flowering time variation in maize, a key trait for the geographical and seasonal adaptation of plants. Drawing on the wealth of knowledge accumulated in the research community, we curated a database of flowering time related genes that have been experimentally characterized in reference systems (i.e. Arabidopsis, in the case of plants). From a total of 1,536 flowering time related maize homologs identified through a multi-species phylogenetic comparison, PICARA’s enrichment function was used to pinpoint putative maize genes orthologous to key regulators in the Arabidopsis flowering time pathway, such as FT (Flowering Locus T), LHY (Late Elongated Hypocotyl) and GI (GIGANTEA). In addition to these important regulators, PICARA’s enrichment capability enabled the discovery of a regulatory feature that fine-tunes flowering time variation among maize nested association mapping (NAM) populations, as a result of a mode of genetic adaptation close to Fisher’s model, with many minor genetic variants.
With this publication, the Gramene Diversity team highlights the value of the information that lies in the wisdom of crowds. We propose that, as more genomic sequences and functional data from both economically and phylogenetically important species become available, the functional characterization of a priori candidates can be further accelerated by effectively and systematically integrating heterogeneous biological data into testable hypotheses.
The PICARA publication and software are available from PLoS One. Gramene plans to host PICARA’s web interface. In the meantime, a test server for PICARA is available at Cornell University.

Author: Charles Chen