Faculty Research Pages
Below is a sampling of active Computational Biology Department research projects. Additional research projects are described on individual faculty homepages (see links below).
Active learning of cell organization
A team led by Bob Murphy, Head of the Computational Biology Department and a faculty member in the Machine Learning Department, is combining image-derived modeling methods with active learning to build a continuously updating, comprehensive model of protein localization. The organization of eukaryotic cells into distinct organelles and structures is critical to cell health and function, and alterations in protein localization have been associated with many diseases, including cancer. Obtaining a complete picture of the localization of all proteins in cells, and of how it changes under various conditions, is therefore an important but daunting task: there are on the order of a hundred cell types in the human body, tens of thousands of proteins expressed in each cell type, and over a million conditions (including the presence of potential drugs or disease-causing mutations). Automated microscopy can help by allowing large numbers of images to be acquired rapidly, but even with automation it will not be possible to directly determine the localization of all proteins in all cell types under all conditions. As described by Dr. Murphy in a recent Commentary in Nature Chemical Biology, the alternative is to construct a predictive model of protein localization from all existing information and to use it iteratively to identify the experiments that are expected to maximally improve the model. This is an example of Active Learning, a powerful machine learning technique. The fundamental assumption, when applied to biological experiments, is that there are correlations between the results of experiments on different variables, and that learning these correlations can allow accurate prediction of results for experiments that have not been (and may never be!) done.
Preliminary results in test systems have been very encouraging. Using active learning on data from PubChem, a repository of results from high-throughput drug screens, nearly 60% of drug “hits” were found while performing only 3% of all possible experiments. Encouraging results have also been obtained in extensive computer simulations of learning the effects of many drugs on many proteins. Based on these results, active learning-driven automated imaging of cell lines expressing different fluorescently tagged proteins is currently being carried out in Dr. Murphy’s laboratory. The project draws on a range of cutting-edge methods, and collaborators and consultants include a number of faculty members from the Machine Learning and Computational Biology Departments, including Chris Langmead, Aarti Singh, Gustavo Rohde, Jaime Carbonell, and Jeff Schneider.
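The iterative loop described above can be sketched in a few lines. In this toy version (entirely hypothetical, not the laboratory's system), the "model" is a nearest-neighbor predictor whose confidence decays with distance, and each round queries the least confident candidate experiment:

```python
# Toy active-learning loop: repeatedly query the candidate experiment
# with the least confident prediction, run it, and refit the model.
# The model, pool, and oracle below are invented for illustration.

def fit(observed):
    """Nearest-neighbor 'model': predict the result of the closest
    observed experiment, with confidence decaying with distance."""
    def predict(x):
        if not observed:
            return 0, 0.0
        nearest = min(observed, key=lambda o: abs(o - x))
        return observed[nearest], 1.0 / (1.0 + abs(nearest - x))
    return predict

def active_learning(pool, oracle, budget):
    observed = {}
    for _ in range(budget):
        model = fit(observed)
        # Query the unmeasured candidate with the lowest confidence.
        query = min((x for x in pool if x not in observed),
                    key=lambda x: model(x)[1])
        observed[query] = oracle(query)   # "run" the experiment
    return observed

# Toy usage: 20 candidate experiments, hidden outcome = parity of index.
pool = list(range(20))
oracle = lambda x: x % 2
results = active_learning(pool, oracle, budget=5)
print(sorted(results))  # queries spread across the pool: [0, 4, 9, 14, 19]
```

Uncertainty-driven querying here naturally spreads the measurements across the candidate space instead of sampling it exhaustively, which is the intuition behind doing only a small fraction of all possible experiments.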
Algorithms for Large-scale Genomics
Carl Kingsford’s group works on developing novel algorithms for various computational problems that arise when trying to understand how the genome encodes information and how it affects phenotypes. We focus on ways to achieve extreme speed while improving the accuracy of the results. For example, we have developed techniques for estimating isoform expression levels from RNA-seq (our Sailfish software), fast ways to search thousands of short-read sequencing experiments (our Sequence Bloom Trees data structure), accurate approaches for estimating properties of the spatial structure of chromatin (our Armatus software), and methods for tracking the movement of ribosomes along a transcript (e.g., Ribomap), among many other tools. A few representative examples of our ongoing projects include:
Human leukocyte antigen (HLA) typing via whole-genome sequencing (Project lead: Heewook Lee, Ph.D.)
HLA genes are crucial to the regulation of the immune system, as they encode the major histocompatibility complex (MHC) cell-surface proteins that control the adaptive immune response. For this reason, accurate HLA typing is important in clinical settings as well as in research. Currently, HLA typing is often done using labor-intensive and costly PCR-based targeted sequencing or serology-based methods. With the increasing availability of whole-genome sequencing (WGS) data from patients, the development of accurate computational methods would be valuable. The challenges in computational HLA typing are mainly driven by the hyper-polymorphic nature of the HLA regions. We formulated the HLA typing problem as that of finding two paths through the partial order alignment graph of multiple sequence alignments of known HLA alleles. On this graph, we assign weights to edges based on the read alignments covering them, so that the weights can be used to solve a flow problem that correctly infers the diploid alleles of an individual.
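As a simplified illustration of the graph formulation (not the published method, which infers two paths via a flow problem), the sketch below computes a single maximum-weight source-to-sink path in a small DAG by dynamic programming over a topological order; the nodes and weights are invented for the example:

```python
from collections import defaultdict, deque

def heaviest_path(edges, source, sink):
    """edges: {(u, v): weight} describing a DAG. Return (weight, path)."""
    adj, indeg = defaultdict(list), defaultdict(int)
    nodes = set()
    for (u, v), w in edges.items():
        adj[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))
    # Kahn's algorithm for a topological order.
    order, q = [], deque(n for n in nodes if indeg[n] == 0)
    while q:
        u = q.popleft()
        order.append(u)
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                q.append(v)
    # DP: best (weight, path) reaching each node from the source.
    best = {source: (0, [source])}
    for u in order:
        if u not in best:
            continue
        w_u, p_u = best[u]
        for v in adj[u]:
            cand = w_u + edges[(u, v)]
            if v not in best or cand > best[v][0]:
                best[v] = (cand, p_u + [v])
    return best[sink]

# Toy graph: edge weights stand in for read-alignment support.
edges = {("s", "a"): 5, ("s", "b"): 2, ("a", "t"): 3, ("b", "t"): 7}
print(heaviest_path(edges, "s", "t"))  # (9, ['s', 'b', 't'])
```

In the real problem the two paths must be inferred jointly (a diploid individual carries two alleles), which is why a flow formulation rather than two independent best-path searches is used.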
Transcriptome and metagenomics assembly (Project lead: Mingfu Shao, Ph.D.)
We are seeking to create better assembly algorithms for reconstructing sequences from short reads, including both reference-based and de novo methods. The core computational challenge for these problems is the same: solving the minimum path flow decomposition problem. We have devised a novel and efficient algorithm for this problem, based on essential structures and properties that we proved. We aim to release assemblers based on this algorithm, as well as to use them to analyze the vast datasets now available.
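To make the flow decomposition idea concrete, the sketch below greedily peels off widest paths from a toy flow network. Note that this greedy strategy is only a heuristic; minimizing the number of paths, the problem the group's algorithm targets, is NP-hard in general:

```python
# Greedy sketch of path flow decomposition (a heuristic, not the
# group's algorithm): repeatedly find the source-to-sink path with the
# largest bottleneck flow and subtract that flow, until none remains.

def widest_path(flow, source, sink):
    """Return (bottleneck, path) maximizing the minimum edge flow."""
    best = (0, None)
    def dfs(u, path, bottleneck):
        nonlocal best
        if u == sink:
            if bottleneck > best[0]:
                best = (bottleneck, path[:])
            return
        for (a, b), f in flow.items():
            if a == u and f > 0 and b not in path:
                path.append(b)
                dfs(b, path, min(bottleneck, f))
                path.pop()
    dfs(source, [source], float("inf"))
    return best

def decompose(flow, source, sink):
    """Peel widest paths off a flow network (mutates the input dict)."""
    paths = []
    while True:
        f, path = widest_path(flow, source, sink)
        if not path:
            break
        for u, v in zip(path, path[1:]):
            flow[(u, v)] -= f
        paths.append((f, path))
    return paths

# Toy flow network carrying 5 units: one path of 3 plus one path of 2.
flow = {("s", "a"): 3, ("a", "t"): 3, ("s", "b"): 2, ("b", "t"): 2}
paths = decompose(flow, "s", "t")
print(paths)  # [(3, ['s', 'a', 't']), (2, ['s', 'b', 't'])]
```

In assembly, the recovered paths correspond to candidate transcripts (or genomes in the metagenomic setting) and the flow values to their abundances.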
Identifying functional spatial relationships in chromatin
We are interested in finding genetic elements (enhancers, eQTLs, genes, histone marks, etc.) that have a role related to the spatial organization and packing of the genome. This organization is important for gene regulation, replication, and disease, and understanding its genetic basis is essential to understanding how the genome functions. We are developing computational methods to identify long-range functional interactions and compact regions of chromatin (topologically associated domains), and to associate those spatial relationships with gene expression, genomic variation, epigenetic modifications, and functional pathways.
Large-scale experiment discovery
There are hundreds of thousands of sequencing experiments that are publicly available from various organisms and in various conditions. However, finding relevant experiments within this huge collection is nearly impossible. We are working on solving this experiment discovery problem by constructing new, distributed search and indexing mechanisms that will allow world-wide search of raw sequencing experiments. Fast access to tens of thousands of sequencing experiments will dramatically change how biological hypotheses are constructed and how biological questions are answered. This work will unlock the potential of the huge and rapidly growing collection of sequencing experiments now available.
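As a small illustration of the kind of index involved, a Bloom filter can record the k-mers present in one sequencing experiment and answer membership queries probabilistically; Sequence Bloom Trees arrange many such filters hierarchically to search collections of experiments. The sketch below is a toy stand-alone filter with arbitrary parameters, not the group's implementation:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: false positives possible, false negatives never."""
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size)

    def _positions(self, item):
        # Derive several bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

def kmers(seq, k):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

# Index the k-mers of one toy "experiment", then query for a k-mer.
bf = BloomFilter()
for km in kmers("ACGTACGTGACCTG", k=5):
    bf.add(km)
print("ACGTA" in bf)  # True: an inserted k-mer is always found
```

A single filter answers "might this experiment contain the query sequence?"; arranging filters in a tree lets a search prune whole subtrees of experiments whose union filter lacks the query's k-mers.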
Algorithms in Nature
Recent advances in our ability to generate and analyze biological data, including new genomics and imaging techniques, allow us to design experiments that test in more detail how biological systems operate. In turn, we can use the inferred activity to develop and apply new algorithms for computational problems. Ziv Bar-Joseph’s group has been working on such bi-directional studies for the last six years. We demonstrated the usefulness of such approaches by studying how Drosophila (fruit fly) cells solve the Maximum Independent Set problem, how the structure of yeast molecular interaction networks leads to resilience against noise and attacks, and how the brain uses pruning algorithms during development to construct adaptive and efficient neural networks. Most recently, we have looked at how bacterial cells solve the Distributed Gradient Descent problem when searching for food in a constrained environment. Understanding the computation performed by these processes has led to novel algorithms for distributed computational systems, some of which have already been used by engineers, for example when designing wireless networks.
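The fly's solution to the Maximum Independent Set problem resembles randomized distributed algorithms in the style of Luby. The toy simulation below (an illustration, not the group's model) has each active node "fire" at random; a node that fires while none of its neighbors fire joins the independent set, and its neighbors withdraw:

```python
import random

def distributed_mis(adj, p=0.3, seed=0):
    """adj: {node: set(neighbors)}. Return a maximal independent set
    via a Luby-style randomized rounds simulation."""
    rng = random.Random(seed)
    active = set(adj)
    mis = set()
    while active:
        fired = {v for v in active if rng.random() < p}
        for v in fired:
            if not (fired & adj[v]):    # fired alone among its neighbors
                mis.add(v)
                active.discard(v)
                active -= adj[v]        # neighbors drop out of contention
    return mis

# Toy graph: a 5-cycle.
adj = {0: {1, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 0}}
mis = distributed_mis(adj)
print(sorted(mis))
```

Each node needs only local information (whether any neighbor fired), which is what makes this style of algorithm attractive both as a model of cellular decision-making and for engineered distributed systems such as wireless networks.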
Bringing machine learning to the clinic
Chris Langmead’s group is working to bring machine learning to clinical applications. Genomic medicine promises to revolutionize the way patients are treated, enabling quantification of risks that leads to proactive preventative care, and creating treatments precisely tailored to a particular person. Efforts are now underway to obtain not only the genome sequences of thousands of individuals, but also those of healthy cells and of cancer tumor cells from the same individual. As the cost of sequencing continues to drop, it will become routine to sequence patients in a medical setting. However, a number of computational and practical challenges still block the use of genomic sequencing for clinical decision making.
Due to a lack of good predictive models and of computational tools for connecting disease expertise with machine learning methods, the adoption of genetic predictions is currently slow, except in a few well-studied cases. In this project, we aim to increase the use of predictive systems based on machine learning techniques and genomics. First, in collaboration with UPMC, we are developing models for predicting outcomes in prostate cancer patients. Second, we are creating a user-friendly, automated computational pipeline that will make it easier for clinical researchers to build and refine predictive models for many diseases. This pipeline will also make it simple to share and apply these models at both large and small clinics.
Chris Langmead’s and Wei Wu’s groups are collaborating on the Computational Medicine project. Modern medicine increasingly relies on data science and predictive modeling to enhance our understanding of the basis of human disease and to guide treatment in a patient-specific fashion (personalized and precision medicine). The size, heterogeneity, and complexity of the data available to physicians pose many computational and modeling challenges in pursuit of the overarching goal of improving human health. The groups develop novel methods and models to address these challenges. Specific examples include: triage algorithms for anticipating complications in acute care settings (e.g., pancreatitis, sepsis); biomarker discovery algorithms for genomic, proteomic, and electronic health record data (e.g., pancreatic cancer, melanoma, lung cancer, prostate cancer); treatment optimization methods that avoid drug resistance; and cohort design methods that reduce the cost of acquiring training data. Algorithmically, the groups use a combination of machine learning (including probabilistic graphical models, active learning, and deep learning), game theory, and dynamical systems modeling.
How The Genome Evolved for Vocal Learning and Speech Production
The Neurogenomics laboratory led by Assistant Prof. Andreas Pfenning is studying how the genome evolved for vocal learning, an aspect of speech production that sets us apart from other primates.
Although nothing is more central to our identity as human beings than behavior, we still don’t know how the billions of nucleotides in our genome have evolved to give us complex thoughts or the ability to communicate those thoughts through speech. The component of speech that gives us a window into the genetic basis of behavior is vocal learning, the ability to reproduce complex vocalizations. Although speech itself is uniquely human, vocal learning is an example of convergent evolution, having evolved independently in multiple species, including songbirds, parrots, hummingbirds, bats, and elephants, as well as in humans relative to chimpanzees. This allows the Neurogenomics Laboratory to take a comparative genomic approach to understanding how vocal learning evolved: what features do the genes and genomes of vocal learning species have in common relative to those of species without the ability?
We approach this problem using a wide set of techniques. The Neurogenomics laboratory develops and adapts computational methods to identify regulatory elements in the genome associated with vocal learning behavior. We then use genomic techniques, such as high-throughput reporter assays and genome editing, to validate and refine the computational models. The results of these experiments will not only yield insights into how human speech evolved, but also provide a foundation for others who are aiming to connect genome sequence to organism traits using the wealth of new genomic data available.
Molecular Design and Simulation
The ability to design novel proteins and small molecules is essential to the pharmaceutical, energy, and other industries. Chris Langmead’s group has developed novel methods for revealing the physical factors that govern the properties of natural molecules, and for efficiently searching through combinatorial spaces to find optimal designs for a given task. Our methods encode the design space using graphical models (including graphical games) and then use message-passing methods to perform key tasks, such as free-energy calculations, resistance modeling, and protein design. We also use Molecular Dynamics simulations to further evaluate and refine our designs.
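As a generic illustration of searching a combinatorial design space (the group's actual methods use message passing on graphical models, which this is not), the sketch below runs simulated annealing over a toy sequence design problem with an invented scoring function:

```python
import math
import random

# Simulated-annealing search over a discrete design space. The "energy"
# here is a toy mismatch count against a hypothetical target; real
# molecular design would use a physical energy model.

def anneal(energy, alphabet, length, steps=5000, t0=2.0, seed=0):
    rng = random.Random(seed)
    state = [rng.choice(alphabet) for _ in range(length)]
    best, best_e = state[:], energy(state)
    for step in range(steps):
        t = t0 * (1 - step / steps) + 1e-3      # linear cooling schedule
        cand = state[:]
        cand[rng.randrange(length)] = rng.choice(alphabet)
        d = energy(cand) - energy(state)
        # Accept improvements always; accept worse moves with
        # probability exp(-d/t), which shrinks as the system cools.
        if d <= 0 or rng.random() < math.exp(-d / t):
            state = cand
            e = energy(state)
            if e < best_e:
                best, best_e = state[:], e
    return "".join(best), best_e

# Toy objective: number of mismatches to a hypothetical target sequence.
target = "ADAEK"
energy = lambda s: sum(a != b for a, b in zip(s, target))
seq, e = anneal(energy, alphabet="ADEKG", length=5)
print(seq, e)
```

Stochastic search of this kind explores the design space one mutation at a time; message-passing approaches instead exploit the graphical structure of the design problem to reason about many positions jointly.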
Statistical Machine Learning Methods for Genetical Genomics Analysis
The standard approach for identifying gene networks is based on experimental perturbations of gene regulatory systems, such as gene knock-down experiments, followed by genome-wide profiling of differential gene expression. However, this approach is significantly limited in that it is not possible to perturb more than one or two genes simultaneously to discover complex gene interactions or to distinguish between direct and indirect downstream regulation of the differentially expressed genes. As an alternative, genetical genomics has been proposed: it treats naturally occurring genetic variants as potential perturbations of the gene regulatory system and recovers gene networks via analysis of population gene-expression and genotype data. Seyoung Kim’s group is developing a statistical framework for learning gene networks that overcomes the limitations of experimental perturbation methods and addresses the computational challenges of genetical genomics analysis. Our computational approach is based on probabilistic graphical models, and we are developing efficient learning and inference algorithms that can be applied to datasets with tens of thousands of gene expression measurements and millions of genetic variants. We are applying our computational techniques to study gene regulatory networks and their genetic control mechanisms in yeast and humans.
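A minimal version of the variants-as-perturbations idea can be illustrated with sparse regression: regress a gene's expression on genotypes with an L1 penalty, so that only variants that truly affect the gene receive nonzero weights. The sketch below (synthetic data, coordinate descent; not the group's graphical-model framework) shows the selection effect:

```python
import random

# L1-penalized regression (lasso) by plain coordinate descent with
# soft-thresholding. All data below are synthetic and illustrative.

def lasso(X, y, lam, iters=100):
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        for j in range(p):
            # Residual leaving out feature j's own contribution.
            r = [y[i] - sum(beta[k] * X[i][k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n))
            z = sum(X[i][j] ** 2 for i in range(n)) or 1.0
            # Soft-thresholding update: small correlations are zeroed.
            if rho < -lam:
                beta[j] = (rho + lam) / z
            elif rho > lam:
                beta[j] = (rho - lam) / z
            else:
                beta[j] = 0.0
    return beta

# Synthetic genotypes (0/1/2 minor-allele copies) for 5 variants in 60
# individuals; only variant 0 truly drives the simulated expression.
rng = random.Random(1)
X = [[rng.choice([0, 1, 2]) for _ in range(5)] for _ in range(60)]
y = [2.0 * row[0] + rng.gauss(0, 0.1) for row in X]
# Center the columns and the response (lasso without an intercept).
means = [sum(row[j] for row in X) / len(X) for j in range(5)]
X = [[row[j] - means[j] for j in range(5)] for row in X]
ybar = sum(y) / len(y)
y = [v - ybar for v in y]
beta = lasso(X, y, lam=5.0)
print([round(b, 2) for b in beta])  # only variant 0 gets a sizable weight
```

Scaling this idea to tens of thousands of genes and millions of variants, and capturing the network structure among genes rather than one gene at a time, is where the probabilistic graphical models described above come in.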
The Immune Basis of Alzheimer’s Disease
The Neurogenomics laboratory led by Assistant Prof. Andreas Pfenning is studying how the immune system influences Alzheimer’s disease predisposition.
Alzheimer’s disease is a complex neurodegenerative disorder that has an enormous impact on society but is still poorly understood. Dr. Pfenning’s previous research has shown that sequence mutations associated with Alzheimer’s disease are more likely to be found in regions of the genome that are active in immune cells, rather than within the brain. The goal of this research is to uncover how the immune system influences Alzheimer’s disease predisposition and progression.
We are tackling this problem by building computational models that combine human clinical data, large-scale genetics, epigenomics, and mouse models of neurodegeneration. These experiments aim to uncover biomarkers for Alzheimer’s disease and provide the foundation for immune-based therapeutics.
Computational Biology faculty member Russell Schwartz is directing a research project to apply methods from phylogenetics (i.e., the inference of evolutionary trees) to identify patterns of evolution of cell lineages within cancers. Cancer is a class of diseases defined by the accumulation of successive mutations that disable normal controls on cell growth, allowing rapid cell proliferation that typically results in tumor growth and eventually metastasis. A single tumor is therefore a kind of evolving population, characterized by rapid mutation and strong selection for cells that can proliferate quickly and evade normal cellular controls on growth or anti-tumor therapies. This observation led to an approach called tumor phylogenetics (Desper et al., 1999) in which one applies phylogenetic algorithms to genetic profiles of tumors to infer evolutionary trees that describe the common features of progression across tumors. These trees then describe the sequences of mutations, also known as progression pathways, that characterize common tumor types. These tumor phylogenies can help researchers identify common features among seemingly different tumors, discover particularly important steps in evolution, and gain insight into the mechanistic reasons why particular patterns of progression seem to recur across many patients. It is hoped that such insights will help us identify new targets for anti-tumor drugs and develop better diagnostic tests to determine which patients are likely to benefit from which therapies.
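The basic tree-building step can be illustrated with a toy example: given binary mutation profiles for a few hypothetical tumor samples, agglomerative (UPGMA-style) clustering on Hamming distances produces a simple tree. Real tumor phylogeny methods are considerably more sophisticated, and the sample names and profiles here are invented:

```python
# Toy tumor "phylogeny": cluster binary mutation profiles by average
# Hamming distance, merging the closest pair until one tree remains.

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def upgma(profiles):
    # Each cluster maps its Newick-style name to its member leaves.
    clusters = {name: [name] for name in profiles}

    def dist(c1, c2):   # average pairwise Hamming distance
        pairs = [(a, b) for a in clusters[c1] for b in clusters[c2]]
        return (sum(hamming(profiles[a], profiles[b]) for a, b in pairs)
                / len(pairs))

    while len(clusters) > 1:
        c1, c2 = min(((a, b) for a in clusters for b in clusters if a < b),
                     key=lambda p: dist(*p))
        merged = f"({c1},{c2})"
        clusters[merged] = clusters.pop(c1) + clusters.pop(c2)
    return next(iter(clusters))

# Mutation presence/absence across four hypothetical tumor samples.
profiles = {
    "A": (1, 1, 0, 0),
    "B": (1, 1, 1, 0),
    "C": (0, 0, 1, 1),
    "D": (0, 0, 1, 0),
}
print(upgma(profiles))  # ((A,B),(C,D)): A groups with B, C with D
```

In actual tumor phylogenetics the leaves may be single cells or cell subpopulations rather than whole tumors, and the inference must model mutation processes (gains, losses, copy-number changes) rather than a simple distance.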
This work arose from a collaboration between Schwartz and Dr. Stanley Shackney and is now being carried out in collaboration with Alejandro Schäffer, Kerstin Heselmeyer-Haddad, and Thomas Ried of the National Institutes of Health. A particular focus of their work is the question of cell-to-cell heterogeneity in single tumors. The Schwartz lab has pursued this problem from several directions. These include the use of direct single-cell data, with a special focus on fluorescence in situ hybridization (FISH) data, as well as algorithmic work on single-cell sequence data sets. A second direction is the use of deconvolution (unmixing) methods to reconstruct the genomics of cell subpopulations from bulk genomic data. Their work spans modeling and algorithmic directions in better representing the process of tumor evolution and heterogeneity and improving algorithms to fit data to those models, machine learning methods for applying these reconstructions to predicting tumor progression and patient prognosis, and application of these methods with research collaborators.
A recent talk by Schwartz at the Simons Institute for the Theory of Computing on some of his directions in FISH-based tumor phylogenetics is shown below.