Computational Methods for Learning Population History
from Large Scale Genetic Variation Datasets
Population Genetics, Minimum Description Length, Population History, Markov Chain Monte Carlo, Coalescent Theory, Genome Wide Association Study
Understanding how species have arisen, dispersed, and intermixed over time is a fundamental question in population genetics with numerous implications for basic and applied research. Only by studying diversity within humans and across species can we understand what makes individuals differ from one another and what differentiates us from other species. More importantly, such analysis could give us insights into applied biomedical questions such as why some people are at greater risk for diseases and why people respond differently to pharmaceutical treatments. While a number of methods are available for the analysis of population history, most state-of-the-art algorithms examine only certain aspects of that history. For example, phylogenetic approaches typically look only at non-admixed data in a small region of a chromosome, while other alternatives examine only specific details of admixture events or their influence on the genome.
We first describe a basic model of learning population history under the assumption that there was no mixing of individuals from different populations. The work presents the first model that jointly identifies population substructures and the relationships between the substructures directly from genetic variation data. The model presents a novel approach to learning population trees from large genetic datasets that collectively converts the data into a set of small phylogenetic trees and learns the robust population features across the tree set to identify the population history.
We further develop a method to accurately infer quantitative parameters, such as the precise times of the evolutionary events of a population history, from genetic data. We first propose a basic coalescent-based MCMC model specifically for learning time and admixture parameters in scenarios with two parental populations and one admixed population. As a natural extension, we expand the method to identify population substructures and learn population models together with the specific time and admixture parameters pertaining to the population history for three or more populations. Analysis on simulated and real data shows the effectiveness of the approach in working toward unifying the learning of different aspects of population history into a single algorithm.
Finally, as a proof of concept, we propose a novel structured test statistic that uses the historical information learned by our prior method to improve demographic control in association testing. The success of the structured association test demonstrates the practical value of population histories learned from genetic data for applied biomedical research.