Sept 30, 2016
Sara Mostafavi, University of British Columbia
Inferring molecular and cellular networks to understand complex traits
The availability of varied and large-scale genomics data, including RNA-sequencing transcriptomic data, provides new opportunities for deriving coherent and context-specific molecular networks underlying complex cellular traits and phenotypes. However, deriving meaningful biological insights from these data requires addressing significant statistical and computational challenges, including the prevalence of systematic confounding factors and low statistical power. In this talk, I will present two projects focused on integrating heterogeneous data in order to derive molecular networks relevant for understanding complex human disease. In the first part of the talk, I’ll describe an approach for building co-expression networks across a large number of human tissues, which increases the statistical power and robustness of network inference. In the second part of the talk, I’ll describe a network-based approach for relating transcriptomic patterns from cortex to neuropathology and cognitive decline, while accounting for known and hidden confounding factors.
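At its simplest, a co-expression network of the kind described above is a graph whose edges connect genes with strongly correlated expression profiles. As a hedged illustration only (a minimal single-tissue sketch, not the speaker's actual multi-tissue method, which pools information across tissues; the function names and threshold are illustrative):

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def coexpression_edges(expr, threshold=0.8):
    """Return gene pairs whose expression profiles correlate above a cutoff.

    expr: dict mapping gene name -> list of expression values across samples.
    """
    genes = sorted(expr)
    edges = []
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            r = pearson(expr[g1], expr[g2])
            if abs(r) >= threshold:
                edges.append((g1, g2, r))
    return edges
```

In practice, multi-tissue methods gain power by sharing correlation estimates across tissues rather than thresholding each tissue independently as above.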
Sept 23, 2016
Miler Lee, University of Pittsburgh
Eggs, embryos, and pluripotency: gene regulation during early development
In 2006, Kazutoshi Takahashi and Shinya Yamanaka demonstrated that exogenous introduction of four transcription factors could induce adult mouse fibroblasts to de-differentiate to a pluripotent cellular identity, a finding that spawned a new subfield of biomedical research and earned Yamanaka a share of the 2012 Nobel Prize in Physiology or Medicine. In fact, these results build on research from decades earlier pioneered by John Gurdon, who shared the Nobel Prize for his work showing that an egg cytoplasm can cause a differentiated nucleus to revert to a pluripotent transcriptional program. These results highlight that potent mechanisms controlling cellular identity are contained within the maternally contributed contents of an egg. In this talk, I will describe my work to understand the molecular determinants of cellular identity and pluripotency, from the perspective of the maternal contribution in vertebrates. I will present results that demonstrate the role of maternal reprogramming factors in initiating de novo transcription in the zebrafish embryo, and describe my new efforts toward understanding how the maternal contribution has evolved across the diversity of vertebrates and their divergent embryogenesis strategies.
Sept 16, 2016
Tandy Warnow, UIUC
Grand Challenges in Phylogenomics
Estimating the Tree of Life will likely involve a two-step procedure, where in the first step trees are estimated on many genes, and then the gene trees are combined into a tree on all the taxa. However, the true gene trees may not agree with the species tree due to biological processes such as deep coalescence, gene duplication and loss, and horizontal gene transfer. Statistically consistent methods based on the multi-species coalescent model have been developed to estimate species trees in the presence of incomplete lineage sorting; however, the relative accuracy of these methods compared to the usual “concatenation” approach is a matter of substantial debate within the research community. In this talk I will present new state-of-the-art methods we have developed for estimating species trees in the presence of incomplete lineage sorting (ILS), and show how they can be used to estimate species trees from genome-scale datasets with high accuracy. I will also discuss tradeoffs between data quantity and quality, and the implications for big data genomic analysis.
Sept 9, 2016
Min Xu, Carnegie Mellon University
Molecular resolution structural pattern mining inside single cells
Cryo-electron tomography enables 3D visualization of cells in a near native state at molecular resolution. The produced cellular tomograms contain detailed information about all macromolecular complexes, their structures, their abundances, and their specific spatial locations and orientations inside the field of view. However, extracting this information is very challenging and current methods usually rely on templates of known structure. Here, we formulate a template-free structural analysis as a pattern mining problem and propose a new framework called “Multi Pattern Pursuit” for supporting de novo discovery of macromolecular complexes in cellular tomograms without using templates of known structures. Our tests on simulated and experimental tomograms show that our method is a promising tool.
May 17, 2016
Kaixuan (Kevin) Luo, Duke University
Modeling Nuclease Digestion Data to Predict the Dynamics of Genome-wide Transcription Factor Occupancy
Identifying and deciphering the complex regulatory information embedded in the genome is critical to our understanding of biology and the etiology of complex diseases. The regulation of gene expression is governed largely by the occupancy of transcription factors (TFs) at various cognate binding sites. Characterizing TF binding is particularly challenging since TF occupancy is not just complex but also dynamic. Current genome-wide surveys of TF binding sites typically use chromatin immunoprecipitation (ChIP), which is limited to measuring one TF at a time and is thus less scalable for profiling the dynamics of TF occupancy across cell types or conditions. This work develops novel computational frameworks to model sequencing data from DNase and/or MNase nuclease digestion assays, which allow multiple TFs to be surveyed in a single experiment. These frameworks serve as an innovative and cost-effective strategy that enables efficient profiling of TF occupancy landscapes across different cell types or dynamic conditions in a high-throughput manner.
May 19, 2016
Dan Deblasio, University of Arizona
Parameter Advising for Multiple Sequence Alignment
When performing multiple sequence alignment, each aligner has a multitude of parameters that must be set and that can greatly affect alignment quality. Most users rely on the default parameter settings, which are optimal on average but may produce a low-quality alignment for the given inputs. In this talk I describe an approach called parameter advising to find a parameter setting that produces a high-quality alignment for each input. To perform parameter advising I developed a new accuracy estimator called Facet (short for “feature-based accuracy estimator”) that computes an accuracy estimate as a linear combination of efficiently computable feature functions. I further applied parameter advising (i) to ensemble alignment, which uses the advising process to choose both the aligner and its parameter settings, and (ii) to adaptive local realignment, which chooses distinct parameter choices to conform to mutation rates as they vary across the lengths of the sequences. Using Facet for parameter advising boosts advising accuracy by almost 20% beyond using a single default parameter choice for the hardest-to-align benchmarks.
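The advising loop itself is simple to state: align under each candidate parameter setting, score each resulting alignment with the estimator, and keep the best. A minimal sketch of that loop (the toy linear estimator stands in for Facet's actual feature set, and the `align`/`featurize` callables are hypothetical placeholders):

```python
def facet_like_estimate(features, weights):
    """Accuracy estimate as a linear combination of feature values (Facet-style)."""
    return sum(w * f for w, f in zip(weights, features))

def advise(candidate_settings, align, featurize, weights):
    """Pick the parameter setting whose alignment scores highest under the estimator.

    candidate_settings: list of parameter dicts to try.
    align: function(settings) -> alignment (hypothetical aligner wrapper).
    featurize: function(alignment) -> list of feature values in [0, 1].
    Returns (best_settings, best_alignment).
    """
    best = None
    best_score = float("-inf")
    for settings in candidate_settings:
        alignment = align(settings)
        score = facet_like_estimate(featurize(alignment), weights)
        if score > best_score:
            best, best_score = (settings, alignment), score
    return best
```

Ensemble alignment extends the same loop so that the candidate set ranges over (aligner, settings) pairs rather than settings alone.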
April 1, 2016
Tim Hughes, University of Toronto
Decoding Gene Regulation
Understanding how regulatory sequence works is one of the greatest challenges facing molecular biology, and the next major hurdle in human genetics. There is now a wealth of data on individual genome sequences, chromatin profiles, and expression outputs, but much less is known about the mechanisms that specify and link them: the details of how cells identify regulatory sequences, or how their functions are exerted, are surprisingly difficult to decipher. I will describe three interlinked objectives of my research program: determination and compilation of motifs for transcription factors (TFs); mapping of their effector functions; and development of computational models of regulatory sequence identity and function.
March 30, 2016
Morgan Wirthlin, Oregon Health and Science University
Evolutionary Neurogenomics Provides Insight into the Basis of Learned Behavior
Every spring, young songbirds around the world learn to sing by imitating their parents – at first awkwardly, but later with astonishing accuracy – in a process remarkably similar to how babbling infants learn to speak. How do complex, learned behaviors such as these evolve? My research explores this question by approaching computational genomics from an evolutionary systematics perspective. In seeking to identify general principles that govern the evolution of complex behavior, vocal learning provides an ideal natural experiment: the behavior, along with the neural circuitry that supports it, has evolved independently in a handful of avian and mammalian lineages. Remarkably, we found that vocal learning songbirds, parrots, hummingbirds, and humans have convergently evolved shared gene expression patterns in functionally analogous brain circuits for vocal control. Genomic comparisons of song-learning birds with their vocal non-learning relatives have also revealed suites of lineage-specific novel genes, some of which are selectively expressed in neural structures devoted to learned vocal behavior. Finally, vocal learning circuits’ transcriptional networks are associated with specific gene regulatory elements that could serve to coordinate their expression, defining the molecular pathways and physiological properties that distinguish cell types within vocal control circuits. This ongoing work supports a model where the evolution of novel genes and gene regulatory regions alters transcriptional networks in the brain, giving rise to new anatomical structures and molecular pathways, ultimately resulting in adaptive changes in behavior. These evolutionarily informed findings have implications for human disease, where mutations in these critical genomic elements result in behavioral pathology.
March 29, 2016
Jun Ding, University of Central Florida
Computational Methods for Transcriptional and Post-transcriptional Gene Regulation
Regulation of gene expression includes a variety of mechanisms to increase or decrease specific gene products. Gene expression can be regulated at both the transcriptional and post-transcriptional stages, and such regulation is essential to almost all living organisms, as it increases versatility and adaptability by allowing the cell to express proteins as they are needed.
We comprehensively studied gene regulation from both transcriptional and post-transcriptional points of view. Transcriptional regulation is the process by which cells control transcription from DNA to RNA, thereby directing gene activity. Transcription factors (TFs) play a very important role in transcriptional regulation; they are proteins that bind to specific DNA sequences (regulatory elements) to regulate gene expression. Current studies of TF binding are still very limited, leaving much room for improvement in our understanding of the TF binding mechanism. To fill this gap, we proposed a variety of computational methods for predicting TF binding elements, which have proven more efficient and accurate than existing tools such as DREME and RSAT peak-motifs. On the other hand, studying only transcriptional gene regulation is not enough for a comprehensive understanding. Therefore, we also studied gene regulation at the post-transcriptional level. MicroRNAs (miRNAs) are believed to post-transcriptionally regulate the expression of thousands of target mRNAs, yet the miRNA binding mechanism is still not well understood. We explored both traditional and novel features of miRNA binding and proposed several computational models for miRNA target site prediction. The developed tools outperformed traditional microRNA target prediction methods (such as miRanda and TargetScan) in terms of prediction accuracy (precision and recall).
March 25, 2016
Andreas Pfenning, Carnegie Mellon University
The genetic basis of brain aging and Alzheimer’s disease
The process of aging is associated with broad changes in the brain at the cognitive, neural circuit, and cellular levels. Aging of the brain progresses at different rates in the human population, but the genetic basis of those differences has remained unclear. In this study, we searched for a genetic signature of aging by combining genotype data with frontal cortex post-mortem gene expression data from four cohorts: University of Pittsburgh, NIH Braincloud, GTEx, and the Religious Order Study/Memory and Aging Project based at Rush University. We found aging-associated genetic variation near synaptic genes, including one SNP that was nominally significant in all cohorts (p < 0.01) and genome-wide significant in a meta-analysis (p = 8.4E-9). A comparison of brain age to Alzheimer’s disease and its underlying genetics showed that brain aging and APOE status are independent predictors of Alzheimer’s disease. Our study provides a systematic framework to uncover the mechanisms that drive brain aging and how it relates to neurodegenerative disorders like Alzheimer’s disease.
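One standard way to combine per-cohort p-values into a single meta-analysis p-value, as for the SNP result above, is Fisher's method. This is a sketch of the general technique only; the study itself may well have used a different procedure (e.g., inverse-variance-weighted meta-analysis of effect sizes):

```python
import math

def fisher_combined_p(pvals):
    """Combine independent per-cohort p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null; for even degrees of freedom
    the survival function has a closed form, so no stats library is needed.
    """
    k = len(pvals)
    half = -sum(math.log(p) for p in pvals)  # X/2
    # chi-square sf with 2k df: exp(-X/2) * sum_{i<k} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total
```

For example, four cohort p-values that are each only nominally significant (around 0.01) combine to a far smaller joint p-value, which is why a meta-analysis can reach genome-wide significance when no single cohort does.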
March 18, 2016
Jian Ma, Carnegie Mellon University
New methods for understanding the complexity of cancer genomes
Recent advances in next-generation sequencing (NGS) technologies have provided us with an unprecedented opportunity to better characterize the molecular signatures of human cancers. One hallmark of cancer genomes is aneuploidy, which engenders abnormal copy numbers amongst broadly connected sets of alleles. Structural variations (SVs) further modify the aneuploid cancer genomes into a mixture of rearranged genomic segments with extensive somatic copy number alterations (CNAs). In this talk, I will introduce a new algorithm called Weaver to provide integrated quantification of SVs and CNAs in aneuploid cancer genomes. Such an integrated approach enables a greatly enhanced grasp of the complex genomic architectures inherent to many cancer genomes. Our evaluations demonstrated that Weaver is highly accurate and will greatly refine the structural analysis of complex cancer genomes.
February 26, 2016
Jennifer Listgarten, Microsoft Research
Correction for Confounding Factors in Genome and Epigenome-Wide Association Studies
Understanding the genetic underpinnings of disease is important for screening, treatment, drug development, and basic biological insight. Genome- and epigenome-wide associations, wherein individual or sets of (epi)genetic markers are systematically scanned for association with disease, are one window into disease processes. Naively, these associations can be found by use of a simple statistical test. However, a wide variety of confounders lie hidden in the data, leading to both spurious associations and missed associations if not properly addressed. These confounders include population structure, family relatedness, and cell type heterogeneity. I will discuss state-of-the-art statistical approaches, based on linear mixed models, for conducting these analyses. In these approaches, confounding factors are automatically deduced and then corrected for. Challenges include efficient computation and model optimization for increased power. Finally, I will discuss how insights from these areas can be leveraged to tackle the problem of uncovering latent sub-phenotypes—that is, uncovering hidden case clusters for imprecisely defined phenotypes such as depression and type 2 diabetes.
January 29, 2016
Arvind Ramanathan, Oak Ridge National Laboratory
Reverse engineering the function of an intrinsically disordered protein
The ability of intrinsically disordered proteins (IDPs) to adopt substrate-specific three-dimensional (3D) structures in response to specific stimuli such as other proteins, small molecules, environmental and chemical changes enables them to propagate and relay a variety of control signals that ultimately determine the fate of a cell, including growth, reproduction and death. Reverse engineering the structural details of how IDPs morph and function is, therefore, a critical step towards developing novel therapeutic approaches to target cancer, diabetes, neurodegenerative and cardiovascular diseases. In this talk, I will outline some strategies we are developing at Oak Ridge National Laboratory in integrating neutron scattering techniques, molecular dynamics simulations and Bayesian inference methodologies to provide mechanistic insights into IDP function/dysfunction. I will also take this opportunity to present my views on establishing a research career at government labs.
December 11, 2015
Sui Huang, Institute for Systems Biology
Critical State Transitions, Rebellious Cells, and Why it is so Hard to Eradicate Cancer Cells
Single-cell gene expression analysis affords a new level of resolution for studying cell state dynamics. Cell differentiation into various cell types, but also the development of malignant cells, are manifestations of cell state dynamics. The ability of a complex gene regulatory network to produce, without mutations, a vast diversity of robust, biologically distinct, inheritable cell states (“attractors”), as a manifestation of the principle of multi-stability in non-linear dynamical systems, has led to the idea that cancer cells are trapped in “abnormal attractors” that are not meant to represent physiological cell states. This adds a layer of complication to the standard model of cancer, in which Darwinian somatic selection of mutant cells that carry “driver mutations” drives tumor progression. This also means that “cancer without mutations” is in principle possible – as recently found.
We really need to overcome the orthodoxy of a rigid 1:1 mapping between genotype and phenotype, in which genetic mutations are the sole agent of permanent and progressing change, and embrace non-genetic phenotypic plasticity, notably inducible non-genetic state changes, in our thinking about tumorigenesis. But to do so properly we need to go beyond hand-waving models and adopt a formal framework.
In this talk I will present the theoretical framework and the experimental findings supporting this thinking. We have formalized non-genetic cell phenotype plasticity as a dynamical system governed by the gene regulatory network. In this framework the distinct, stable biological cell states are attractor states in the high-dimensional gene expression state space. Cancer cells occupy particular (“physiologically forbidden”) attractor states, failing to descend to the “normal attractors”, and therapy constitutes a perturbation that seeks to push cells out of these cancer attractors into those that represent the apoptotic cell fates. In this formalism a transition between stable attractor states is a symmetry-breaking bifurcation in which the current attractor is destabilized and other attractors become accessible into which the cell will descend. This constitutes a much-studied “critical state transition”, but in a high-dimensional space. Importantly, in a complex multi-stable system (“rugged epigenetic landscape”), destabilization of an attractor also opens up new access to many “hidden” attractor states never intended to be occupied by a cell and even more different from the physiological ones. Now, as the cancer cells exit the cancer attractor during treatment-induced destabilization of their state, not only will they, as desired, move to the target phenotype (the apoptotic state), but some cells may also “spill” into these newly accessible neighboring attractors, which may represent even more stem-like, hence more malignant states. These aberrant non-killed “rebellious cells” triggered by sub-lethal therapy stress may plant the seed for recurrence. It is in this sense that recurrence of tumors after treatment is not so much described by Darwinian “survival of the fittest” but perhaps more aptly by Nietzsche’s principle: “What does not kill me strengthens me”.
Because of the fundamental need for attractor destabilization in therapy, it is likely that the latter principle widely applies. It does, of course, not exclude Darwinian selection of genetic mutants; on the contrary, it facilitates it by enhancing the probability of cells surviving treatment. Because of the importance of attractor destabilization, we developed a tool to detect shifts of cell populations towards bifurcations in which attractors vanish. Indeed, in single-cell-resolution measurements of cells undergoing phenotype transitions, we observed signatures of the postulated critical state transition, as well as the rebellious cells predicted by theory that move into attractors in the opposite direction from that of the desired transition. From this follows the general postulate that there is an inherent limitation to any cancer drug, however selectively targeting, as long as it seeks to destabilize the cancerous state. Thus cancer therapy that seeks to kill tumor cells may be more akin to herding cats (than sheep): inherently very difficult.
November 13, 2015
Gaudenz Danuser, UT Southwestern Medical Center
Inferring causality in cellular pathways
One of the key questions in my lab deals with the proper identification of causal links between molecular processes in complex pathways. We define complexity as the product of high nonlinearity and high redundancy between processes. Intrinsic to such systems are adaptive responses to perturbations. Therefore, conventional molecular and genetic intervention studies often fail to provide information about the function of a targeted system component. In this overview talk I will take cell protrusion as a prime example of a cell functional outcome of a pathway system with such properties. I will introduce the mathematical, computational, and experimental concepts of image fluctuation analysis as a method for accurate delineation of the functional hierarchy between molecular and mechanical processes driving cell morphogenic events. I will motivate this fundamental systems biology problem with experiments that highlight the mechanisms of action of several oncogenes in driving metastatic cell migration.
October 30, 2015
Ben Raphael, Brown University
Computational Characterization of Mutational Heterogeneity in Cancer
Advances in DNA sequencing technology have enabled large-scale measurement of the molecular alterations that occur in cancer cells. Translating this information into deeper insights about processes that drive cancer development demands novel computational approaches.
In this talk, I will describe algorithms to address two key problems in cancer genomics. First, I will describe techniques to identify combinations of mutations that perturb cellular signaling and regulatory networks. One algorithm employs a heat diffusion process to identify subnetworks of a genome-scale interaction network that are recurrently altered across samples. A second algorithm finds combinations of mutations that optimize a statistical measure of mutual exclusivity. Next, I will discuss approaches to deconvolve DNA sequencing data from bulk tumor samples and to derive a phylogenetic tree that relates subpopulations of tumor cells within these samples. I will illustrate applications of these approaches to multiple cancer types in The Cancer Genome Atlas (TCGA), including a recent Pan-Cancer study of >3000 samples from 12 cancer types.
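The heat diffusion idea mentioned above can be sketched compactly: seed each gene with "heat" (e.g., its mutation frequency), let the heat spread over the interaction network with a restart probability, and look for subnetworks that stay hot across samples. A minimal iterative sketch on an unweighted network, in the spirit of the speaker's approach but not its actual implementation:

```python
def diffuse_heat(adj, initial_heat, beta=0.4, tol=1e-9, max_iter=1000):
    """Insulated heat diffusion on an interaction network (a HotNet-style sketch).

    adj: adjacency dict {node: set of neighbours}.
    initial_heat: dict {node: heat}, e.g. a mutation frequency per gene.
    beta: restart probability; higher beta keeps heat closer to its source.
    Iterates h <- beta * h0 + (1 - beta) * W h, where W is the
    degree-normalised walk matrix, until the heat vector converges.
    """
    nodes = sorted(adj)
    h0 = {n: initial_heat.get(n, 0.0) for n in nodes}
    h = dict(h0)
    for _ in range(max_iter):
        new = {}
        for n in nodes:
            # each neighbour m sends an equal share of its heat along its edges
            spread = sum(h[m] / len(adj[m]) for m in adj[n] if adj[m])
            new[n] = beta * h0[n] + (1 - beta) * spread
        if max(abs(new[n] - h[n]) for n in nodes) < tol:
            return new
        h = new
    return h
```

Thresholding the converged heat then yields candidate "hot" subnetworks; heat decays with network distance from the mutated seed, which is the property the subnetwork search exploits.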
April 23, 2015
Jian Ma, University of Illinois at Urbana-Champaign
New Algorithms for Genome Comparisons
It is known that the distinctive features of human biology are largely the result of evolutionary changes to our genome. But most of the exact connections between genomic change and phenotypic innovation remain unclear. Advances in new sequencing technologies have provided us with unprecedented opportunities to tackle this question using comparative genomics. The insights from such comparative analysis can in turn help understand the human genome function and identify key genetic variants related to diseases. In this talk, I will introduce new algorithms we developed recently to facilitate the evolutionary analysis of non-coding regions of the human genome, including methods for whole-genome sequence alignment and modeling lineage-specific cis-regulatory elements. Finally, I will briefly introduce our recent work in cancer genome comparisons to study complex genomic alterations. Collectively, we hope our methods will contribute to accelerating our understanding of the genomic changes and gene regulation variations that result in phenotypic diversity and abnormality.
April 21, 2015
Aaron Wise, Ph.D. Candidate
Final Public Oral Examination
Computational Methods for Time Series Gene Expression Analysis
Time series expression data presents an opportunity to watch (and analyze) gene regulatory programs as they unfold. Here we address three problems in the realm of modeling dynamic gene regulation. We develop a novel set of modeling algorithms, using an Input-Output Hidden Markov Model (IOHMM) framework to build models of regulatory activity.
The first problem we address is combinatorial regulation. Genes are often combinatorially regulated by multiple transcription factors (TFs). Such combinatorial regulation plays an important role in development and facilitates the ability of cells to respond to different stresses. We present a new method called cDREM, capable of reconstructing dynamic models of combinatorial regulation. cDREM integrates time series gene expression data with (static) protein interaction data. The method is based on a hidden Markov model and utilizes the sparse group Lasso to identify small subsets of combinatorially active TFs, their time of activation and the logical function they implement.
The second problem is the modeling of multiple dynamic regulatory networks from multiple time series expression experiments. It is now possible to measure a patient’s gene expression during the course of a treatment. We wish to identify groups of patients with similar regulatory activity, with the expectation that this will relate to disease progression and treatment outcome. We present here a method called SMARTS that can be used to cluster patients based on the similarity of the regulatory program they are expressing, and then identify TFs which may be differentially active between the groups.
In SMARTS each dynamic regulatory model we build is created from a set of individual time series. Our third aim is to extend this technique to use sets of single cell gene expression experiments as the input to a regulatory model. We present a novel technique, SCAREDY-CAT, which is able to create such models. We use this technique to analyze the differentiation of lung epithelial cells, and show that we can reconstruct the structure of lung epithelium differentiation in an unsupervised manner.
We tie these methods together with the release of a software package that allows interested (non-technical) users to use our methods. By developing methods for understanding the regulatory dynamics present in time series data, we enable the discovery of regulatory relationships that help us understand biological systems and mechanisms underlying disease.
April 3, 2015
Michael I. Miller, Johns Hopkins University
Bayesian Deformable Templates in Computational Anatomy: Application for Neurodegenerative Diseases and Brain Clouds
I will mention several statistical estimation problems arising in computational anatomy, associated with hypothesis testing of disease type and segmentation labels, as well as template estimation. This serves to motivate much of the focus of the talk on the development of an anatomically complex “random orbit model” for the subcortical brain.
I will show results from several of the neurodegenerative illnesses including the staging of neural network change in Alzheimer’s and Huntington’s disease. Time permitting I will discuss progress with Susumu Mori on building several Brain Clouds associated to young and aging subjects.
April 2, 2015
Marina Barsky, Ontario Institute for Cancer Research
Exploring the world with computer science tools
This talk is based on the author’s perception of computer science as a powerful set of tools for exploring real-world phenomena. Three important skills for computer science students to learn are (1) the ability to extract a model of the domain, (2) the ability to test models by encoding them in software, and (3) awareness of the available tools (algorithms, programming languages, data models) in order to come up with the best solution for the problem at hand. In this talk we explore how this view may influence the design and delivery of courses in computational biology. Examples and outcomes from past teaching are presented, as well as future teaching aspirations.
March 19, 2015
Phillip Compeau, University of California, San Diego
Life After MOOCs: Online Science Education Needs a New Revolution
The recent “MOOC” revolution has largely focused on making low-cost online equivalents of offline lectures delivered to hundreds of students. I share the concerns about the quality of most MOOCs in their current form, which have too often been hyped as an educational cure-all. At the same time, I feel that much of the criticism of MOOCs stems from the fact that truly disruptive online educational resources have not been developed yet. I believe that MOOCs in technical disciplines can be transformed into a more effective educational product called a Massive Adaptive Interactive Text (MAIT) that can not only help expand the horizons of online education but also improve how we teach offline courses, even at elite universities like CMU.
I will describe my own experience in online computational biology education, from co-founding the Rosalind platform for learning bioinformatics (http://rosalind.info), to co-developing the first MOOC in computational biology (http://coursera.org/course/bioinformatics), and I will discuss the steps that we are currently taking to transform our MOOCs into a MAIT.
March 5, 2015
Minli Xu, Ph.D. Candidate
Final Public Oral Examination
Comparative genomics reveals forces driving the evolution of Highly Iterated Palindrome-1 (HIP1) in cyanobacteria
Highly Iterated Palindrome-1 (HIP1) is a highly abundant octamer palindrome motif (5’-GCGATCGC-3’) found in a wide range of cyanobacterial genomes from various habitats. In the most extreme genome, HIP1 frequency is as high as one occurrence per 350 nucleotides. This is rather astonishing considering that at this frequency, on average, every gene will be associated with more than one HIP1 motif. This high abundance is particularly intriguing considering the important roles other repetitive motifs play in the regulation, maintenance, and evolution of prokaryotic genomes. However, although HIP1 was first identified in the early 1990s, its functional and molecular roles remain a mystery.
Here I present a comparative genomics investigation of the forces that maintain HIP1 abundance in 40 cyanobacterial genomes. My genome-scale survey of HIP1 enrichment, taking into account the background tri-nucleotide frequency in the genome, shows that HIP1 frequencies are up to 300 times higher than expected. Further analysis reveals that in alignments of divergent genomes, HIP1 motifs are more conserved than other octamer palindromes with the same GC content, used as a control. This conservation is not a byproduct of codon usage, since codons in HIP1 motifs are more conserved than the same codons found outside HIP1 motifs. HIP1 is also conserved on a broader scale. I predicted orthologs using the Notung software platform and compared enrichment of HIP1 motifs with control motifs across orthologous gene pairs. The similarity of HIP1 enrichment in orthologs is significantly higher than the control. Taken together, my results provide the first evidence for the mechanism driving HIP1 prevalence. The observed conservation is consistent with selection acting to maintain HIP1 prevalence and rejects the hypothesis that HIP1 abundance is due to a neutral process, such as DNA repair. The evidence of selection thus suggests a functional role for HIP1. My analysis of the genome-wide spatial distribution of HIP1 suggests that the motif lacks periodicity, arguing against a role in supercoiling. The spatial distribution of HIP1 motifs in mRNA transcript data from Synechococcus sp. PCC 7942 reveals a significant 3’ bias, which is suggestive of regulatory functions such as transcription termination and inhibition of exonucleolytic degradation. I conclude by discussing my findings in the context of cyanobacterial evolution and propose testable hypotheses for future work.
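The enrichment calculation described above — observed HIP1 count versus the count expected from the genome's own tri-nucleotide composition — can be sketched with an order-2 Markov background model. This is a hedged illustration of the general idea, not the dissertation's exact statistic:

```python
def markov2_enrichment(seq, motif="GCGATCGC"):
    """Observed/expected motif count under an order-2 Markov background model.

    The expected probability of the motif is built from the genome's own
    trinucleotide and dinucleotide counts via the chain rule:
        P(w) = P(w1 w2) * prod_i P(w_i | w_{i-2} w_{i-1}).
    """
    n2, n3 = {}, {}
    for i in range(len(seq) - 1):
        n2[seq[i:i+2]] = n2.get(seq[i:i+2], 0) + 1
    for i in range(len(seq) - 2):
        n3[seq[i:i+3]] = n3.get(seq[i:i+3], 0) + 1
    total2 = len(seq) - 1
    p = n2.get(motif[:2], 0) / total2
    for i in range(len(motif) - 2):
        tri, di = motif[i:i+3], motif[i:i+2]
        if n3.get(tri, 0) == 0:
            return 0.0  # motif contains a trimer never seen in the background
        p *= n3[tri] / n2[di]
    observed = sum(1 for i in range(len(seq) - len(motif) + 1)
                   if seq[i:i+len(motif)] == motif)
    expected = (len(seq) - len(motif) + 1) * p
    return observed / expected if expected > 0 else float("inf")
```

Conditioning on trinucleotide frequencies is what makes the reported "up to 300 times higher than expected" a statement about the motif itself rather than about the genome's GC or codon composition.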
March 4, 2015
Denis Tsygankov, Research Assistant Professor of Pharmacology, Computational and Systems Biology, University of North Carolina at Chapel Hill
An Integrated Computational Approach to Study Emergent Multi-cellular Behavior during Vascular Tube Formation
Cerebral Cavernous Malformations (CCMs) develop in about 0.5 percent of the population worldwide. This disease is caused by mutations in one of three genes (ccm-1, -2, or -3) that lead to enlarged, leaky blood vessels. People with CCM experience seizures, paralysis, cerebral hemorrhage, and loss of hearing or vision. To understand the defects that lead to loss of proper vascular tube formation in CCM patients, we developed a novel computational image analysis technique to quantify the dynamics of individual cells and parametrize a multi-cell model of the collective behavior of endothelial cells during tube formation. Our multi-cell model takes into account interactions of the cells with the extracellular matrix and with each other through the extension and retraction of protrusions. The model also allows for cell movement and changes in shape in response to forces exerted by neighboring cells. Our simulations not only reproduced experimentally observed patterns of tube formation in wild type and CCM knockdown cells, but also captured differences between the behavior of CCM1 and CCM3 deficient cells, providing mechanistic insight into the distinct roles of these proteins. Our model predictions have been confirmed by various experimental measurements including next-gen RNA sequencing and live cell imaging.
February 27, 2015
Ferhat Ay, University of Washington
Genome architecture in action: Gene regulation via 3D chromatin organization
The field of regulatory genomics has recently witnessed significantly increased interest in the three-dimensional structure of DNA in the nucleus, catalyzed by the development of chromosome conformation capture techniques (e.g., Hi-C) that profile genomic proximities on a genome-wide scale. Systematic analysis of these proximities is particularly important to identify targets of disease-associated genetic variants, more than 90% of which reside in noncoding regions with unknown gene targets. In this talk, I will start with an overview of the diverse uses of conformation capture data and then present two recent projects concerning the interplay between genome form and function. First, I will talk about our study on the dynamic nuclear organization of the deadliest human malaria parasite (Plasmodium falciparum). Our study revealed that the parasite has a complex genome architecture shaped around precisely regulating its virulence genes and that this architecture goes through holistic changes in correlation with the parasite’s overall transcriptional activity during its cell cycle in human blood. Next, I will present a novel statistical method, Fit-Hi-C, for assigning confidence estimates to chromosome conformation capture data. Applied to Hi-C data from various human and mouse cell lines, Fit-Hi-C identified significant interactions that preferentially link expressed gene promoters to active enhancers, confirmed previously validated, cell line-specific regulatory interactions, and revealed that genomic regions with similar replication times prefer to be closer in 3D.
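As a toy illustration of the confidence-assignment idea (a deliberate simplification, not the actual Fit-Hi-C model, which fits a spline to the contact-probability-versus-distance trend and corrects for multiple testing), one can ask how surprising an observed contact count is under a binomial null whose success probability reflects the expected, distance-driven contact frequency:

```python
from math import comb

def binom_sf(k, n, p):
    """Upper-tail probability P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))

def contact_pvalue(observed, total_contacts, expected_prob):
    """One-sided p-value that a locus pair interacts more often than
    its genomic distance alone would predict."""
    return binom_sf(observed, total_contacts, expected_prob)
```

For example, with 1000 total contacts and an expected contact probability of 0.001, observing 10 contacts for one locus pair is highly significant, whereas observing 1 is entirely unremarkable.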
February 23, 2015
Assaf Gottlieb, Stanford University
Personalized medicine using patient similarities
Clinical guidelines have vastly improved medical treatment. However, the increase in patient co-morbidities and polypharmacy and the rapid accumulation of genomic evidence call for a more personalized approach, one that tailors treatment to patient-specific characteristics. To date, the integration of the different factors affecting patient treatment remains a challenge.
In my talk, I will present a data-driven patient-similarity framework which integrates patient similarities spanning demographic, clinical and genomic data sources and describe the genomic and pharmacovigilance components developed for this framework. Using this framework, treatment choices are suggested based on the choices and outcomes of similar patients, allowing every patient to contribute to the care of all future patients.
February 16, 2015
Andreas Pfenning, Massachusetts Institute of Technology
The genetic mechanisms underlying learned behavior and neurodegeneration
The brain is an enormously complex organ, but understanding how it works can help us to answer fundamental questions: How has human behavior evolved? How do we combat devastating neurological disorders? I will present results from two projects that begin to answer these questions from a computational genomic and gene regulatory perspective.
Song-learning birds and humans share independently evolved similarities in brain pathways for vocal learning that are essential for song and speech and not found in most other species. Comparisons of brain transcriptomes of song-learning birds and humans relative to vocal non-learners identified convergent gene expression specializations in the motor production regions of song-learning birds and in the human laryngeal motor cortex. Thus, we find that convergent behavior and neural connectivity for a complex trait are associated with convergent specialized expression of multiple genes.
Alzheimer’s disease (AD) is a severe age-related neurodegenerative disorder characterized by cognitive decline as well as the accumulation of beta-amyloid and neurofibrillary tangles. We profiled epigenetic dynamics in the hippocampus of an inducible mouse model of AD-like neurodegeneration and found a coordinated downregulation of synaptic plasticity regulatory regions and an upregulation of immune regulatory regions. Surprisingly, the human regions orthologous to increasing-level enhancers were strongly enriched for genetic variants associated with AD. Our results reveal new insights into the mechanisms of neurodegeneration and establish the mouse as a useful model for functional studies of AD regulatory regions.
Feb. 6, 2015
Aparna Kumar, Ph.D. Candidate
Final Public Oral Examination
Automated analysis of protein subcellular location in immunohistochemistry images for cancer diagnosis
Protein subcellular location and compartmentalization play an important role in regulating cellular processes. Protein mislocalization alters cell signaling and is observed in diverse diseases (Hung and Link 2011). Drug resistance can occur when proteins are mislocalized to the cytoplasm and nucleus, suggesting that measuring protein location can help clinicians personalize therapies and diagnose disease. Here, two projects explore how automatically quantitating subcellular location from pathology images can be used in diagnostics and for understanding disease. 1) We developed an automated pipeline to compare the subcellular location of proteins between two sets of immunohistochemistry images. We used the pipeline to compare images of healthy and tumor tissue from the Human Protein Atlas, ranking hundreds of proteins in breast, liver, prostate and bladder based on how much their location was estimated to have changed. The performance of the system was evaluated by determining whether proteins previously known to change location in tumors were ranked highly. We present a number of new candidate location biomarkers for each tissue. Further, we identified biochemical pathways that are enriched in proteins that change location. We confirmed some previously implicated pathways and report new pathways, previously unassociated with cancer, whose proteins change location. 2) We extended the IHC pipeline to process full slide images. Using the pipeline, we explored how measuring changes in protein subcellular location can aid in identifying adult and pediatric liver lesions. Our results indicate that single-protein measurements are usually poor markers for these lesions. Next, we explored lesion-specific protein signatures for identifying diseases. In our dataset, we found a signature set of proteins that can identify liver lesions in adult and pediatric populations with perfect accuracy.
Finally we report two new proteins that aid in classifying the lesions when used as part of a signature protein set.
Feb. 6, 2015
Devin Sullivan, Ph.D. Candidate
Final Public Oral Examination
Image-derived generative modeling of complex cellular organization in both space and time
Understanding cellular organization is a major goal of systems biology. Cellular organization affects the behavior of cells, and many diseases and disorders in turn impact the spatial organization and morphology of cells. There are many current means of studying these systems and their effects. High-content imaging is one high-resolution way to study the location of proteins within cells. Advances in imaging technologies have allowed high quality data to be acquired from live cells in three dimensions over time. Historically, imaging data have been analyzed using image-feature based approaches to create models predicting cell state using classification or regression based machine learning. Generative modeling tools such as CellOrganizer offer an alternative approach to modeling cells and their subcellular structures. The added benefit of this class of approaches is that they describe the statistical distributions of cells and can be sampled from to create realistic in silico instances of cells and their subcellular organization. Despite our ability to model static subcellular organization, modeling the dynamic restructuring of cells and their components remains a major challenge in systems biology. These subcellular dynamics are strongly correlated with cell cycle and disease progression, and understanding them will aid in the development of treatments. Towards this goal, we trained generative models describing cellular morphology dynamics by using both time series and static-time cell image datasets. At a more granular level, cell function is dependent on the proteins within a cell and their interactions. Not only is the organization of cells correlated with cell response, but it may also be a driving force. To study the impact of cell shape and organization on these biochemical interactions, we developed a computational pipeline to perform high-throughput spatially resolved simulations using realistic cellular geometries generated with CellOrganizer.
In addition to exhibiting complex responses over time, some cells such as neurons are highly morphologically complex. As such, traditional generative modeling methods are ineffective or fail completely. We addressed this issue by expanding the capabilities of CellOrganizer to include models for neuronal shape. Together these works allow for the study of cellular and subcellular structure for realistic and complex cellular morphologies and their dynamic responses over time in high-throughput.
Feb. 5, 2015
Salim Chowdhury, Ph.D. Candidate
Final Public Oral Examination
Algorithms to Reconstruct Evolutionary Models of Tumor Progression
Cancer is one of the major causes of human mortality. Extensive genetic, epigenetic and physiological variations are observed within tumor cells, which complicate the diagnosis and treatment of the disease. Despite the extensive heterogeneity within single tumors, recurring features of their evolutionary processes are observed by comparing multiple regions or cells of a tumor. Recently, phylogenetic models have begun to see widespread use in cancer research to reconstruct processes of evolution in tumor progression. Mutations that drive development and progression of solid tumors typically include changes in the number of copies of genes or genomic regions. One particularly useful source of data for studying likely progression of individual tumors is fluorescence in situ hybridization (FISH), which allows one to count copy numbers of several genes in hundreds of single cells per tumor and is thus especially well suited to characterizing intratumor heterogeneity. This dissertation focuses primarily on phylogenetic characterization of single tumors at the cellular level from FISH data. We first develop phylogenetic methods using single gene duplication to infer likely models of tumor progression at the cellular level from FISH copy number data and apply these to a study of FISH data from two cancer types. We next extend our single gene models to include copy number changes at the scale of entire chromosomes and the whole genome. We develop new provably optimal methods for computing an edit distance between the copy number states of two cells given evolution by copy number changes of single probes, all probes on a chromosome, or all probes in the genome.
Our two proposed models for inferring phylogenies of single tumors by copy number evolution assume models of uniform rates of genomic gain and loss across different genomic sites and scales, a substantial oversimplification necessitated by a lack of algorithms and quantitative parameters for fitting to more realistic tumor evolution models. We propose a framework for inferring models of tumor progression including variable rates for different gain and loss events. Application of the phylogenies inferred by our algorithms to real cervical and breast cancer data identifies key genomic events in disease progression consistent with prior literature. Classification experiments on cervical and tongue cancer datasets lead to improved prediction accuracy for the metastasis of primary cervical cancers and for tongue cancer survival.
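The flavor of these copy number edit distances can be conveyed with two simplified variants, sketched below under the assumption of unbounded copy counts (these are illustrations, not the provably optimal algorithms of the dissertation): a single-probe model, where each event changes one probe by ±1, and an interval model, where one event shifts a contiguous block of probes, loosely approximating chromosome-scale gains and losses.

```python
def single_probe_distance(a, b):
    """Minimum number of +/-1 events on individual probes
    transforming copy number profile a into profile b."""
    return sum(abs(x - y) for x, y in zip(a, b))

def interval_event_distance(a, b):
    """Minimum number of +/-1 events over contiguous probe intervals
    transforming a into b (ignores the biological constraint that a
    probe deleted to 0 copies cannot be regained)."""
    ops = prev_gain = prev_loss = 0
    for x, y in zip(a, b):
        gain, loss = max(y - x, 0), max(x - y, 0)
        # a new interval event must start wherever the required shift grows
        ops += max(0, gain - prev_gain) + max(0, loss - prev_loss)
        prev_gain, prev_loss = gain, loss
    return ops
```

For a = [2, 2, 2, 2] and b = [3, 3, 2, 1], the single-probe model needs three events, while the interval model needs only two (one gain spanning the first two probes, one loss on the last), showing why allowing larger-scale events shrinks the inferred distance.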
Jan. 30, 2015
Herve Tettelin, University of Maryland
Comparative and functional genomics of Mycobacterium massiliense serial clinical isolates
The Mycobacterium abscessus group of bacteria is the most common cause of rapidly growing mycobacterial (RGM) pulmonary infections. The M. massiliense subspecies within this group is increasingly recognized as an emerging pathogen. We used bioinformatics and experimentation to characterize DNA changes in M. massiliense isolates that correlate with disease progression. We gathered clinical isolates from 3 patients with pulmonary M. massiliense infection: two with cystic fibrosis (CF) and one with idiopathic bronchiectasis (IB). Isolates were collected throughout the patients’ disease courses, from chronic stable to terminal stages. Both CF patients had rapid clinical declines after 4 to 6 years of relatively stable disease. The IB patient had a gradual decline over a decade prior to her death. Later isolates showed slower growth rates than earlier ones in all three patients. We also observed appearance of antibiotic resistance and a smooth to rough transition in colony morphology in the later isolates from the patients. We identified genomic changes that coincided with these changes and performed experiments to gain insights into how these changes affected gene function. Our approach provides insights into how M. massiliense adapts to the human lung over a prolonged course of infection. The multiple genomic mutations that accumulate over time likely reflect a combination of the M. massiliense response to host immunity, therapeutics, and/or interactions with diseased lung microbiota. Our results will identify new critical pathways for treatment of M. massiliense infections.
Jan. 23, 2015
Steven Salzberg, Johns Hopkins University
Transcriptome Assembly: Computational Challenges of Next-Generation Sequence Data
Next-generation sequencing technology allows us to peer inside the cell in exquisite detail, revealing new insights into biology, evolution, and disease that would have been impossible to discover just a few years ago. The enormous volumes of data produced by NGS experiments present many computational challenges that we are working to address. In recent years, my lab has developed multiple systems for sequence analysis, including the widely-used Bowtie, TopHat and Cufflinks programs for alignment and assembly of transcripts from RNA-seq data. In this talk, I will discuss two new systems: (1) the HISAT system for spliced alignment of NGS reads, a successor to TopHat; and (2) the StringTie program for assembly and quantitation of RNA-seq data, a successor to Cufflinks. This talk describes joint work with Daehwan Kim and Mihaela Pertea.
Jan 22, 2015
Judith Klein-Seetharman, University of Warwick
Molecular Motivators for Lifestyle Behavior Changes
Our lives are full of habits, good ones (for example, exercise) and bad ones (for example, eating unhealthy food). The imbalance between these habits is particularly evident in the worldwide prevalence of obesity. It is well established that many diseases, such as cancer, diabetes, heart disease, and depression, are strongly influenced by these habits. Shifting the balance from bad to good habits can therefore prevent disease and enhance well-being. Here, we propose to monitor urine insulin levels to provide people who intend to lose weight with molecular feedback on their metabolic state. The idea is to borrow the body’s own molecules used in internal communication to assist individuals externally in the conscious struggle to adopt healthy lifestyle changes. To this end, we have developed a mobile health platform available at https://agper.lnx.warwick.ac.uk/mobileHealth-web/. We currently provide the capabilities for a user to log five types of events (food, activity, weight, urine, ketostix). We have used the platform to conduct several experiments in collecting urine samples while varying food type (low-carb, normal and ketogenic diets), timing of food intake, and variation across and within individuals. Urine insulin values were measured using immunosandwich electrochemiluminescence detection. Comparison of the insulin data with the food intake and exercise information indicated that, unlike blood glucose, urine insulin levels are highly sensitive to changes in diet and activity. We observed urine insulin profiles characteristic of each diet. Therefore, such measurements could be useful to health-care professionals in monitoring adherence to recommended lifestyle changes and to individuals in obtaining feedback on their metabolic responses to food intake.
Junhyong Kim, University of Pennsylvania
Single Cell Variation and Cellular Phenotype
Recently, single cell RNA sequencing has revealed large variations in the molecular states of individual cells of seemingly the same type. We have been investigating single cell biology for the past five years and have collected over 1000 datasets from various organisms including human, mouse, rat, zebrafish, etc. Here, I will discuss some of the technical aspects of single cell RNA sequencing, our analysis of five different mouse cell types, and then noise and technical resolution problems with single cell transcriptome profiling. I will conclude with a discussion of the origins of single cell variation, suggesting that individual cells are more like individuals of an ecological community rather than uniform modular units.
Robert E. Kass, Carnegie Mellon University
Problems in the Analysis of Spiking Neuron Networks
Knowledge about the link between brain and behavior rests, in large part, on electrophysiological investigation of neural activity recorded from one or more electrodes that have been inserted into the brain of an animal. Technological advances have provided vastly improved data collection and storage capabilities, which present both opportunities and challenges. It is now common to record from dozens to hundreds of electrodes simultaneously, and it is also possible for these electrodes to maintain their position well enough to record the same neurons across hours or even days. Because many disorders, such as ADHD, autism, and schizophrenia, as well as stroke and various neurodegenerative diseases, are thought to involve dysfunction of network connectivity, a great hope has been that multi-electrode recording could reveal the way network activity evolves in healthy and diseased states, and thereby supply an important mechanistic description of pathophysiology. However, while the number of recording electrodes used in a single brain has been increasing exponentially fast, statistical methods for handling the complexity of multi-electrode data have lagged behind. In addition to the general problem of handling large-scale electrode recordings, a second major challenge comes from the striking observation that neural interactions occur at multiple timescales, including those involving oscillations and synchrony (the tendency of two or more neurons to fire at nearly the same time), which could provide an essential mechanism of neural network information flow and be a marker that distinguishes normal from diseased states.
Neurons communicate through rapid electrical discharges known as “spikes,” and sequences of spikes are known as “spike trains.” Because each spike occurs over the course of roughly 1 millisecond while behavior occurs over hundreds of milliseconds, it is reasonable to consider a spike train to be a stochastic sequence of isolated points in time, i.e., a point process. I will review the use of point processes to represent interactions of multiple neurons across different timescales. I will also go over a new method that is applicable to many network analyses: false discovery rate regression.
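The point-process view of a spike train can be made concrete with its simplest special case, a homogeneous Poisson process, in which inter-spike intervals are independent exponentials. This sketch is only an illustration of the representation: real spike-train models add refractoriness, history effects, and the cross-neuron coupling discussed above, and the rate and duration used here are arbitrary.

```python
import random

def poisson_spike_train(rate_hz, duration_s, seed=0):
    """Sample spike times from a homogeneous Poisson point process:
    successive inter-spike intervals are i.i.d. Exponential(rate)."""
    rng = random.Random(seed)
    t, spikes = 0.0, []
    while True:
        t += rng.expovariate(rate_hz)   # next inter-spike interval
        if t >= duration_s:
            return spikes
        spikes.append(t)
```

A 20 Hz train simulated for 10 seconds yields on the order of 200 spike times, each an isolated point on the time axis, which is exactly the abstraction that point-process models of multi-neuron interactions build upon.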
Laxmi Parida, IBM T.J. Watson Research Center
Population Genomics under the lens of Random-Graphs
The modeling of evolutionary dynamics of populations as random graphs offers a new direction of exploration. I will introduce the notion of a minimal descriptor of an Ancestral Recombination Graph (ARG), which can be used to measure both the redundancy and the extent of reconstructability of ARGs. I will discuss how we have used ARGs, constructed from extant samples (using a pipeline called IRiS), to address many fascinating questions ranging from human migration paths to genetic diversity in plant cultivars. The combinatorial viewpoint also paves the way for extremely fast, as well as accurate, ARG sampling algorithms (called SimRA). I will conclude with a discussion of our exploration of persistent homology on ARGs to study admixture in populations, both on SimRA samples and plant cultivars.
Hector Corrada Bravo, University of Maryland
Exploring tumor epigenetic heterogeneity by cell-specific methylation pattern reconstruction
DNA methylation aberrations are now known to, almost universally, accompany the initiation and progression of cancers. In particular, the colon cancer epigenome contains specific genomic regions that, along with differences in methylation levels with respect to normal colon tissue, also show increased epigenetic and gene expression heterogeneity at the population level, i.e., across tumor samples, in comparison to other regions in the genome. Tumors are highly heterogeneous at the clonal level as well, and the relationship between clonal and population heterogeneity is poorly understood. We present an approach that uses sequencing reads from high-throughput sequencing of bisulfite converted DNA to reconstruct heterogeneous cell populations by assembling cell type-specific methylation patterns. Our methodology is based on the solution of a specific class of minimum cost network flow problems. We use our methods to analyze the relationship between clonal heterogeneity and population heterogeneity in high-coverage data from multiple samples of colon tumor and matched normal tissues.
Ge Yang, Carnegie Mellon University
Image-based computational analysis of regulatory mechanisms of axonal transport
Neurons are structurally and functionally polarized cells. A hallmark of their polarized structure is the thin and long axon, which is only micrometers in diameter yet can extend up to a meter in humans. Active transport of materials such as proteins and organelles within the axon, a process referred to as axonal transport, is essential to the differentiation, survival, and function of neurons. Axonal transport defects have been strongly implicated in many human neurodegenerative diseases such as Alzheimer’s disease. In this presentation I will introduce recent work of my lab on integrating engineering, computational, biophysical, and cell biological methods to understand how axonal transport is regulated to ensure that the right cargo is delivered to the right destination at the right time. I will start with a brief overview of the image-based computational analysis methods we developed for characterizing spatiotemporal dynamics of axonal transport. I will then focus on presenting results of applying these methods to analyze the regulatory mechanisms of axonal transport. Lastly, I will briefly introduce some ongoing work on developing techniques for high-throughput analysis and active control of axonal transport.
Julia Zeitlinger, Stowers Inst. for Medical Research
Genome-wide approaches to understand gene regulation during development
Our long-term research goal is to understand and predict gene regulation based on DNA sequence information and genome-wide experimental data. Using Drosophila development as a model system, we discovered that paused RNA polymerase II is frequently found at developmental control genes and is recruited over developmental time to prepare genes for activation. Likewise, enhancers are often bound by transcription factors yet are not active in the respective tissue. Our genome-wide analyses show that this can be the result of repression during pattern formation and suggest a model for how enhancers function during development.
Arijit Chakravarty, Takeda Pharmaceuticals
An evolving view of cancer: How mathematical modeling helps effective drug development in the face of tumor evolution
The past few years have seen the clear beginnings of a fundamental paradigm shift in Oncology drug development. While Oncogene Addiction postulated a relatively uniform and disease-specific mutational spectrum, sequencing data have shone light on an entirely different reality. For many indications, tumors are heterogeneous across patients, within patients, and over time. The dazzling complexity and heterogeneity of tumor mutational profiles points to a different process at play: that of stochastic Darwinian evolution.
This new picture provides an explanatory framework for some of the high-profile failures in preclinical-to-clinical translation that have plagued Oncology drug development. Viewing cancer as a stochastic and evolving disease shifts the focus away from identifying patient populations, and towards optimizing the therapeutic window during drug development. The opportunities for computational and mathematical techniques to contribute to drug development are in fact greater, but the focus is different.
In my presentation I will discuss how we use Systems Pharmacology methodologies to translate preclinical data in the design of First-in-Man trials, to make drugs more developable, and to make drug discovery and development more efficient. I will also discuss the ways in which Evolutionary Systems Biology modeling can be used to provide a more realistic set of model systems and methodologies to study heterogeneity and evolution in populations of cancer cells.
11/15/2013 – 11am – 6115 GHC
Kathryn Roeder, Carnegie Mellon University
Statistics and Genetics Open a Window into Autism
Rare variants identified from DNA sequence, especially de novo loss-of-function (LoF) mutations, have implicated genes involved in risk for autism spectrum disorders (ASD). Multiple de novo LoF mutations in the same gene demonstrate that the gene affects risk. De novo mutations occur twofold more often in ASD probands than in their siblings, implying that half of the genes hit are risk genes. He et al. (2013) extract more information by using a statistical model, called TADA for Transmission And De novo Association, that integrates data from family and case-control studies to infer the likelihood that a gene affects risk. Still, given limited sequence data, can we garner yet more information? Progress has been made as part of a collaborative effort to develop systems biology approaches to understanding ASD pathophysiology. Using ASD risk genes as foci, we hypothesize that genes expressed in the same developmental period and brain region, and with highly correlated co-expression, are functionally interrelated and more likely to affect risk. To find these genes we model two kinds of data: gene co-expression in specific brain regions and periods of development, and the TADA results from published sequencing studies. We model the ensemble data as a Hidden Markov Random Field, in which the graph structure is determined by gene co-expression and the model combines these interrelationships with node-specific observations: gene identity, expression, genetic data, and whether the gene affects risk, which is estimated. This analysis identifies ≈100 genes that plausibly affect risk, many novel and others implicated despite relatively weak genetic evidence. We will describe how these results can be used to expand our understanding of the genetics of ASD (e.g., nominating genes for targeted sequencing in new samples) and ASD neurobiology.
Ziv Bar-Joseph, Carnegie Mellon University
Reconstructing dynamic regulatory networks in development and disease
Transcriptional gene regulation is a dynamic process, and its proper functioning is essential for all living organisms. By combining abundant static regulatory data with time series expression data using an Input-Output Hidden Markov Model (IOHMM), we were able to reconstruct dynamic representations of these networks in multiple species. The models lead to testable temporal hypotheses identifying both new regulators and their time of activation. We have recently extended these methods to allow the modeling of various aspects of post-transcriptional regulation, including temporal regulation by microRNAs and linking signaling and dynamic regulatory networks. The reconstructed networks link receptors and proteins that directly interact with the environment to the observed expression outcome. I will discuss the application and experimental validation of predictions made by our methods, focusing on stress response in yeast, lung development in mice, and human flu response. I will also mention a number of other extensions which we have used to study disease progression and the regulation of immune response.
Dr. Subra Suresh, President, Carnegie Mellon University
Crossing Boundaries, Transforming Lives: The Study of Human Diseases at the Crossroads of Engineering, Science, and Medicine
In the last decade, developments in life sciences, nanotechnology, genomics, imaging, computational biology, and micro-fabrication technology have created unprecedented opportunities to study human health and diseases at the cell and molecular levels. Dr. Subra Suresh will present a lecture that provides an overview of his cross-disciplinary research into infectious diseases, hereditary blood disorders and certain types of cancer.
Matthew Stephens, University of Chicago
Assessing association between genetic variants and multiple phenotypes
In many ongoing genome-wide association studies, multiple related phenotypes are available for testing for association with genetic variants. In most cases, however, these related phenotypes are analyzed independently from one another. For example, several studies have measured multiple lipid-related phenotypes, such as LDL-cholesterol, HDL-cholesterol, and triglycerides, but in most cases the primary analysis has been a simple univariate scan for each phenotype. This type of univariate analysis fails to make full use of potentially rich phenotypic data.
While this observation is in some sense obvious, much less obvious is the right way to go about examining associations with multiple phenotypes. Common existing approaches include methods such as MANOVA, canonical correlations, or Principal Components Analysis, which identify linear combinations of outcomes that are associated with genetic variants. However, when such methods give a significant result, the associations are not always easy to interpret. Indeed, the usual approach to explaining observed multivariate associations is to revert to univariate tests, which seems far from ideal.
In this work we outline an approach to dealing with multiple phenotypes based on Bayesian multivariate regression. The method attempts to identify which subset of phenotypes is associated with a given genotype. In this way it incorporates the null model (no phenotypes associated with genotype), the simple univariate alternative (only one phenotype associated with genotype), and the general alternative (all phenotypes associated with genotype) into a single unified framework. In particular, our approach both tests for and explains multivariate associations within a single model, avoiding the need to resort to univariate tests when explaining and interpreting significant multivariate findings. We illustrate the approach on examples, and show how, when combined with multiple phenotype data, the method can improve both the power and the interpretation of association analyses.
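The subset-enumeration idea can be sketched as follows (a hedged toy, not the authors' model: phenotypes are treated as independent and a BIC-style score stands in for a proper Bayes factor). For each subset S of phenotypes we score a model in which only the phenotypes in S are associated with the genotype, then normalize across all 2^K models.

```python
import itertools, math

def rss(y, x=None):
    """Residual sum of squares for y ~ 1 (x=None) or y ~ 1 + x (OLS)."""
    n = len(y)
    ybar = sum(y) / n
    if x is None:
        return sum((yi - ybar) ** 2 for yi in y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))

def subset_posteriors(genotype, phenotypes):
    """Approximate posterior over which subset of phenotypes is associated."""
    n = len(genotype)
    K = len(phenotypes)
    scores = {}
    for S in itertools.chain.from_iterable(
            itertools.combinations(range(K), k) for k in range(K + 1)):
        ll = 0.0
        for k in range(K):
            r = rss(phenotypes[k], genotype if k in S else None)
            ll += -0.5 * n * math.log(r / n)       # Gaussian profile log-lik.
        scores[S] = ll - 0.5 * (1 + len(S)) * math.log(n)  # BIC-style penalty
    m = max(scores.values())
    z = sum(math.exp(s - m) for s in scores.values())
    return {S: math.exp(s - m) / z for S, s in scores.items()}

# Toy data: phenotype 0 is associated with the genotype, phenotype 1 is not.
g = [0, 1] * 20
noise = [0.1, -0.1, -0.1, 0.1] * 10
y0 = [2 * gi + e for gi, e in zip(g, noise)]
y1 = [1 + e for e in noise]
post = subset_posteriors(g, [y0, y1])
```

Here the highest-posterior subset is (0,), reproducing in miniature how the unified framework both detects an association and says which phenotypes drive it.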
Wing Wong, Stanford University
Characterization of hESC transcriptome by hybrid sequencing
Although transcriptional and post-transcriptional events are detected in RNA-seq data from second-generation sequencing (SGS), full-length mRNA isoforms are not captured. On the other hand, third-generation sequencing (TGS), which yields much longer reads, currently suffers from lower raw accuracy and throughput. Here, we combine SGS and TGS with a custom-designed method for isoform identification and quantification to generate a high-confidence isoform data set for human embryonic stem cells (hESCs).
Pierre Baldi, University of California, Irvine
Deep Learning: Theory, Algorithms, and Biological Applications
Learning is essential for building intelligent systems, whether carbon-based or silicon-based. Moreover, these systems do not solve complex tasks in a single step but rather go through multiple processing stages. Hence the question of deep learning: how can efficient learning be implemented in deep architectures? This fundamental question not only impinges on problems of memory and intelligence in the brain, but is also at the forefront of current machine learning research. In the last year alone, new performance breakthroughs have been achieved by deep learning methods in application areas ranging from computer vision, to speech recognition, to natural language understanding, to bioinformatics. This talk will provide a brief overview of deep learning, from its biological origins to some of the latest theoretical, algorithmic, and application results. Particular emphasis will be given to the development of learning methods, in the form of recursive neural networks, for structured, variable-size data, and their applications to the problems of predicting the properties of small molecules and the structure of proteins.
Ivo F. Sbalzarini, Dresden International Graduate School for Biomedicine and Bioengineering
Computational Biology with Particle Methods
Understanding the function of biological systems from the interactions between their constituents requires predictive forward models of hypothetical mechanisms. Given the complexity of biological systems, such forward models are frequently computational, where numerical simulations are used to probe a model’s behavior in regimes where it cannot be solved analytically. We review the key differences between biological and engineering applications of numerical simulations and highlight the main challenges in computational data processing and simulation of biological systems. We propose to exploit the unifying algorithmic framework of particle methods to develop numerical simulations, image-processing, and optimization algorithms that meet the challenges of modern biology. We provide examples from our own work, highlighting how methodological advances in scientific computing have enabled new biological insight and progress in computer science alike. The examples include a self-organizing deterministic particle method for the simulation of multi-scale continuum models, a novel class of stochastic simulation algorithms with reduced time complexity, a domain-specific language for particle methods on heterogeneous parallel computer platforms, and a new class of particle-based image segmentation algorithms. This covers the workflow of image-based systems biology, illustrating several analogies and connections between the different fields involved.
Christine Vogel, New York University
The Ups and Downs of Human Protein Expression Regulation
While transcription regulation has been studied for many years, we now have mounting evidence that the regulation of protein translation and degradation is at least as important in determining protein expression levels. Under normal conditions, for example, transcription and mRNA degradation account for ~30% of gene expression regulation in mammalian cells, while translation and protein degradation account for another 30-40%. We have now extended these studies to systems under perturbation, i.e., cells responding to a stimulus. Using a variety of large-scale methods, we examine the behavior of the mammalian proteome and transcriptome in response to environmental stresses. We have quantified the expression of ~4,000 genes and proteins and are in the process of characterizing the different regulatory patterns that we observe. Again, transcription is only half the story.
Shayok Chakraborty, Arizona State University
Batch Mode Active Learning for Multimedia Pattern Recognition
The rapid escalation of technology and the widespread emergence of modern technological equipment have resulted in the generation of humongous amounts of digital data (in the form of images, videos, and text, among others). This has expanded the possibility of solving real-world problems using computational learning frameworks. However, while gathering a large amount of data is cheap and easy, annotating it with class labels is an expensive process in terms of time, labor, and human expertise. This has paved the way for research in the field of active learning. Such algorithms automatically select the salient and exemplar instances from large quantities of unlabeled data and are effective in reducing the human labeling effort needed to induce classification models. To utilize the possible presence of multiple labeling agents, there have been attempts toward a batch-mode form of active learning, where a batch of data instances is selected simultaneously for manual annotation. This talk will cover a basic background of batch-mode active learning, some related work, and my current research in this domain. Specifically, the following three contributions will be discussed in detail: (i) batch-mode active learning algorithms based on convex relaxations of an NP-hard integer quadratic programming (IQP) problem, with guaranteed bounds on the solution quality; (ii) an active matrix completion algorithm and its application to several variants of the active learning problem (transductive active learning, multi-label active learning, active feature acquisition, and active learning for regression); and (iii) a framework for dynamic batch-mode active learning, where the batch size and the specific data instances to be queried are selected adaptively through a single formulation, based on the complexity of the data stream in question.
These contributions are validated on the face recognition and facial expression recognition problems, which are commonly encountered in real world applications like robotics, security and assistive technology for the blind and the visually impaired.
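To make the batch-mode setting concrete, here is a hedged, minimal illustration (a greedy heuristic, not the convex-relaxation algorithms from the talk): the batch trades off classifier uncertainty against redundancy within the batch, so that the selected instances are both informative and diverse. The function and parameter names are illustrative.

```python
def select_batch(probs, features, batch_size, lam=1.0):
    """probs[i]: predicted P(class 1) for unlabeled point i;
    features[i]: feature vector; returns indices of the chosen batch."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    chosen = []
    while len(chosen) < batch_size:
        best, best_score = None, float("-inf")
        for i in range(len(probs)):
            if i in chosen:
                continue
            uncertainty = 1.0 - abs(2 * probs[i] - 1)      # high near p = 0.5
            diversity = min((dist(features[i], features[j]) for j in chosen),
                            default=0.0)
            score = uncertainty + lam * diversity
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen

# Points 0 and 1 are both uncertain but nearly identical; point 2 is
# confident but far away, so the batch picks 0 and 2 rather than 0 and 1.
probs = [0.50, 0.51, 0.90]
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
batch = select_batch(probs, features, batch_size=2)
```

A batch chosen by uncertainty alone would query both near-duplicates; the diversity term is what makes simultaneous selection different from repeated single-instance selection.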
Vineet Bafna, University of California, San Diego
The breakage fusion bridge and other exotic structural variations: combinatorics and cancer genomics
Cancer genomes are marked by genomic instability and massive rearrangements. Recently, many exotic mechanisms have been proposed as explanations for these rearrangements. For example, the breakage-fusion-bridge (BFB) mechanism, proposed over seven decades ago, has seen renewed interest as a source of genomic variability and gene amplification in cancer. Here, we formally model and analyze the BFB mechanism, providing the first rigorous formulation of the mechanism. Using this model, we show that BFB can achieve a surprisingly broad range of amplification patterns, and we describe efficient combinatorial algorithms to characterize patterns consistent with BFB. An extensive analysis of simulated, cell-line, and primary tumor data reveals the existence of BFB. Our results also suggest that BFB may be hard to detect under heterogeneity and polyploidy. Time permitting, we will also discuss other sources of variation (joint work with Shay Zakov and Marcus Kinsella).
Yongjin Park, Johns Hopkins University
Resolving the Structure and Dynamics of Large-scale Interactome
Community structures are embedded in real-world networks: a set of nodes or edges can be decomposed into fairly homogeneous subsets. In biological network analysis, community structures are considered functionally coherent modules. For instance, tightly connected sub-networks in a protein-protein interaction network generally correspond to protein complexes. Modules are easily identified in a network of hundreds of nodes by visual inspection or simple pattern searches. However, large-scale network datasets pose significant challenges, not only computationally, but also because their properties are completely different.
In this talk, I will describe our attempts to solve community-finding problems on genome-scale interactome datasets. I will explain how a probabilistic framework can help design simple yet powerful algorithms, for instance, avoiding “resolution limits,” and how this framework extends to dynamic network analysis. Next, I will talk about a newly designed inference algorithm applicable to ultra-large-scale hierarchical stochastic block models. We propose a nearly linear-time algorithm that can efficiently estimate the maximum a posteriori configuration of a deep hierarchical block structure. Moreover, I will show how we combined this hierarchical model with other sources of heterogeneous biological evidence, such as RNA-seq measurements and pathway annotations.
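As a hedged toy of the community-finding problem itself (a simple label-propagation heuristic, not the probabilistic or hierarchical block-model methods from the talk): each node repeatedly adopts the label most common among its neighbors, so densely connected groups converge to a shared label. The tie-breaking rule below is only to make the toy deterministic.

```python
def label_propagation(adj, n_rounds=10):
    """adj: node -> list of neighbors; returns node -> community label."""
    labels = {v: v for v in adj}
    for _ in range(n_rounds):
        for v in adj:                          # asynchronous, fixed order
            counts = {}
            for u in adj[v]:
                counts[labels[u]] = counts.get(labels[u], 0) + 1
            best = max(counts.values())
            # deterministic tie-break: largest label among the most common
            labels[v] = max(l for l, c in counts.items() if c == best)
    return labels

# Two 4-cliques joined by a single bridge edge (3-4): two modules.
adj = {
    0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4],
    4: [3, 5, 6, 7], 5: [4, 6, 7], 6: [4, 5, 7], 7: [4, 5, 6],
}
communities = label_propagation(adj)
```

Heuristics like this one suffer exactly the scaling and resolution issues the talk addresses; the block-model framework replaces the local voting rule with a principled generative model.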
Bo Li, University of Wisconsin at Madison
Computational Analysis of RNA-Seq Data in the Absence of A Sequenced Genome: From Transcript Quantification to De novo Transcriptome Assembly Evaluation
RNA-Seq technology has revolutionized the way we study transcriptomes. In particular, it has enabled us to investigate the transcriptomes of species that have not yet had their genomes sequenced. I will discuss our work on two computational tasks that are crucial to analyzing RNA-Seq data in the absence of a sequenced genome: transcript quantification and de novo transcriptome assembly evaluation. For transcript quantification, RNA-Seq is considered a more accurate replacement for microarrays. However, to achieve the highest accuracy, methods for analyzing RNA-Seq data must address the challenge of handling reads that map to multiple genes or isoforms. We present RSEM, a generative statistical model of the sequencing process and associated inference methods, which tackles this challenge in a principled manner.
Our results on both simulated and real data sets suggest that RSEM has superior or comparable performance to other quantification methods developed at the same time. Building off of RSEM, we have developed a novel probabilistic model based method, RSEM-EVAL, for evaluating de novo transcriptome assemblies from RNA-Seq data without the ground truth. Our results on both simulated and real data sets show that our RSEM-EVAL metric correlates well with the ground truth accuracies of the assemblies. Our metric has a broad range of potential applications, such as selecting assemblers, optimizing parameters for an assembler and guiding new assembler design.
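The multi-read allocation idea behind RSEM-style quantification can be sketched with a stripped-down EM loop (a hedged toy: equal transcript lengths, no error model, nothing of RSEM's actual generative detail). Each read lists the transcripts it maps to; EM alternates between fractionally assigning multi-mapped reads (E-step) and re-estimating abundances (M-step).

```python
def em_quantify(read_maps, n_transcripts, n_iter=100):
    """read_maps: for each read, the list of transcripts it maps to."""
    theta = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        counts = [0.0] * n_transcripts
        for ts in read_maps:                      # E-step: fractional counts
            z = sum(theta[t] for t in ts)
            for t in ts:
                counts[t] += theta[t] / z
        total = sum(counts)
        theta = [c / total for c in counts]       # M-step: new abundances
    return theta

# Toy data: transcript 0 has unique reads; transcripts 1 and 2 share reads,
# but one unique read for transcript 1 breaks the tie.
reads = [[0], [0], [0], [1, 2], [1, 2], [1]]
abundances = em_quantify(reads, 3)
```

Here EM converges to abundances of roughly 0.5, 0.5, and 0: the single unique read for transcript 1 pulls the shared reads entirely toward it, illustrating why principled handling of multi-mapped reads matters.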
A. Ercument Cicek, Case Western Reserve University
ADEMA: An Algorithm to Determine Expected Metabolite Level Alterations Using Mutual Information
Sitting at the top of the omics hierarchy, metabolomics is an important platform for understanding changes in physiological activity due to a condition. Despite advancements in analytical methodology and the increasing number of genome-scale metabolic networks of organisms, current techniques for making sense of metabolic profiles are quite limited. The objectives of this presentation are (1) to address the shortcomings of the current techniques used to analyze changes in metabolite levels, and (2) to describe ADEMA, a multivariate method that computes expected metabolite level changes using the metabolic network topology and mutual information. Results show that (1) ADEMA's prediction of alterations in the de novo lipogenesis pathway in a cystic fibrosis mouse model conforms to independently performed flux and gene expression analyses, and (2) ADEMA's classifier scheme outperforms other well-known classification algorithms.
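One ingredient of the approach can be shown directly: estimating mutual information between two discretized metabolite-level profiles. This is a generic plug-in MI estimator, not ADEMA itself, which additionally exploits the metabolic network topology.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in MI estimate (in bits) between two equal-length discrete
    sequences, e.g., metabolite levels binned into low/high."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

# Perfectly dependent binary profiles give 1 bit; independent ones give 0.
xs = [0, 1, 0, 1, 0, 1]
mi_self = mutual_information(xs, xs)
mi_ind = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

MI is attractive here precisely because, unlike correlation, it captures arbitrary (including non-monotonic) dependencies between metabolite levels.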
Zia Khan, University of Chicago
Quantitative Proteomics Provides a New Window into How Genetic Differences Impact Protein-Levels Between Species
Understanding how genetic differences affect phenotypic variation within and between species is a central goal of evolutionary and medical genetics. Genetic differences that impact the regulation of a gene are key contributors to trait differences. Yet, identifying genetic differences that impact gene regulation is challenging: not all genetic differences are functional, and gene regulation is the result of a complex network of interactions between genes. Measuring differing levels, or differential allele-specific expression, of gene products, RNA or protein, from two variants of a gene in the same individual provides direct evidence that a DNA sequence difference between these variants impacts their regulation. This measurement sets the stage for further studies to pinpoint functional genetic variation. While recent technological advances have made it possible to measure allele-specific RNA expression across many genes in high throughput, the same cannot be said for protein levels. As proteins carry out much of the work of the cell, the absence of a corresponding protein measurement leaves a gap in our understanding of the genetic basis of phenotypic variation. I present a quantitative, computational method for measuring differential expression of two protein variants in an individual. The computational method is based on a simple observation that overcomes a key limitation of a data-intensive, or “big data,” technology in the biological sciences called quantitative mass spectrometry. As a proof of concept, I use this computational method to study allele-specific protein levels in a hybrid between two distantly related species of yeast. This study demonstrates how the method provides a new window into how two classes of genetic differences have impacted protein levels between species.
Luisa Hiller, Carnegie Mellon University
Genomic Plasticity: To Be or Not to Be
The gram-positive bacterium Streptococcus pneumoniae colonizes humans as a nasopharyngeal commensal or a respiratory pathogen. This species displays extensive genomic diversity and a notable capacity to incorporate genes from neighboring cells into its genome, producing new genomic combinations. Yet, the majority of pandemic multi-drug-resistant strains belong to one of several lineages that display decreased genomic diversity. In this talk I will discuss the genomic diversity and plasticity in the population, as well as possible barriers to gene exchange that may be leading to the genomic isolation of clinically important lineages.
Joel McManus, Carnegie Mellon University
Evolution of post-transcriptional gene regulatory networks
Differences in gene expression are an important source of phenotypic variation and disease. Gene expression differences result from changes in gene regulatory networks, principally comprised of cis-acting sequences and trans-acting factors. These networks control numerous processes, including transcription, alternative splicing, and translation of mRNA into protein. Research over the past decade revealed that changes in trans-acting factors are responsible for most mRNA abundance differences within species, while changes in cis-regulatory sequences accumulate between species. In contrast, much less is known regarding how alternative splicing and mRNA translation regulatory networks evolve. We used high-throughput sequencing of cDNA libraries from multiple Drosophila species to investigate the evolution of alternative splicing. Our results suggest that regulation of alternative splicing diverges more rapidly in non-coding regions than in coding regions, and that frame-shifting alternative splicing events have more conserved regulation. We further investigated the contributions of cis- and trans-acting changes in splicing regulatory networks by comparing allele-specific splicing in F1 interspecific hybrids. In F1 nuclei, each allele is subjected to the same set of trans-acting factors; thus, differences in allele-specific splicing reflect changes in cis-regulatory element activity. Changes in cis-regulatory elements contribute more to species-specific differences in intron retention and alternative splice site usage, while changes in trans-acting factors contribute more to species-specific exon skipping differences. These results suggest important differences in the regulatory network architecture among classes of alternative splicing. We are also studying the evolution of mRNA translation using allele-specific ribosome profiling.
Our preliminary results suggest that translation regulatory networks may buffer species-specific mRNA abundance differences in budding yeast.
Eric Schadt, Mt. Sinai School of Medicine
Moving towards a better understanding of human disease in the era of big data
Common human diseases and drug response are complex traits that involve entire networks of changes at the molecular level driven by genetic and environmental perturbations. Changes at the molecular level can induce changes in biochemical processes or broader molecular networks that affect cell behavior, and changes in cell behavior can affect normal tissue or whole organ function, eventually leading to pathophysiological states at the organism level that we associate with disease. While the vast majority of previous efforts to elucidate disease and drug response traits have focused on single dimensions of the system, achieving a more comprehensive view of common human diseases requires examining living systems in multiple dimensions and at multiple scales. Studies focused on identifying changes in DNA that correlate with changes in disease or drug response traits, changes in gene expression that correlate with disease or drug response traits, or changes in other molecular traits (e.g., metabolite, methylation status, protein phosphorylation status, and so on) that correlate with disease or drug response are fairly routine and have met with great success in many cases. However, to further our understanding of the complex network of molecular and cellular changes that impact disease risk, disease progression, severity, and drug response, we can more formally integrate these different data dimensions. Here I present an approach for integrating a diversity of molecular and clinical trait data to uncover models that predict complex system behavior. By integrating diverse types of data on a large scale I demonstrate that some forms of common human diseases like diabetes are most likely the result of perturbations to specific gene networks that in turn cause changes in the states of other gene networks, both within and between tissues, that drive biological processes associated with disease.
These models elucidate not only the primary drivers of disease and drug response, but also provide a context within which to interpret biological function, beyond what could be achieved by looking at one dimension alone. That some forms of common human diseases are the result of complex interactions among networks has significant implications for drug discovery: it argues for designing drugs or drug combinations that impact entire network states, rather than drugs that target specific disease-associated genes.
Carl Kingsford, Carnegie Mellon University
Computational Challenges Comprehending Chromosome Conformation Capture Constraints
The physical shape and arrangement of chromosomes in the cell affects gene expression, long-range regulation of transcription, and genome evolution (particularly biasing which rearrangements occur), and it has been implicated in the development of several types of cancers. New high-throughput experimental techniques derived from “chromosome conformation capture” (3C) have produced measurements that hint at the spatial proximity of regions of the genome as it is arranged in the cell.
I will describe our work in three directions to make this 3C-like data more confidently useful for correlating structure with biological function. First, I will describe a new approach called metric filtering for discarding false-positive proximity measurements that selects edges to keep based on both their surprising observation counts and their metric consistency with other selected edges. We show this technique keeps more information and produces three-dimensional models that agree better with observations from light microscopy.
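A hedged toy version of the metric-filtering idea is shown below (the actual method combines metric consistency with the statistical surprise of the observation counts; here only the consistency check is sketched, and the names are illustrative). Candidate proximity edges are accepted greedily in order of confidence, rejecting any edge whose implied distance contradicts the shortest path through already-accepted edges.

```python
import heapq

def shortest_path(adj, src, dst):
    """Dijkstra over accepted edges; returns inf if dst is unreachable."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return float("inf")

def metric_filter(edges, slack=1.0):
    """edges: (confidence, u, v, distance) tuples, filtered greedily."""
    kept, adj = [], {}
    for conf, u, v, d in sorted(edges, reverse=True):
        if d <= slack * shortest_path(adj, u, v):   # metric-consistent?
            kept.append((u, v, d))
            adj.setdefault(u, []).append((v, d))
            adj.setdefault(v, []).append((u, d))
    return kept

candidates = [
    (0.9, "a", "b", 1.0),
    (0.8, "b", "c", 1.0),
    (0.5, "a", "c", 10.0),   # contradicts the 2-hop path a-b-c: rejected
]
kept = metric_filter(candidates)
```

The low-confidence a-c edge claims a distance of 10 while the accepted edges already imply a path of length 2, so it is discarded as a likely false positive.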
Second, I will discuss an approach based on rigidity theory to decide whether a 3C experiment has generated sufficient constraints to determine a structure. We find in fact that current experiments provide far more than enough constraints to determine a non-floppy structure for most of the genome in several organisms. As a byproduct, we produce a more practical algorithm for large-scale testing of rigidity.
Finally, I will discuss improved techniques for finding statistically significant correlations between genomic features and spatial proximity that avoid the computationally demanding and error-prone step of deriving a three-dimensional structure.
Various aspects of this research were done jointly with Geet Duggal, Hao Wang, Darya Filippova, Rob Patro, Emre Sefer, Sridhar Hannenhalli (UMD), and Michelle Girvan (UMD).
Curtis Huttenhower, Harvard University
Bug bytes: bioinformatics for meta’omics and microbial community analysis
Among many surprising insights, the genomic revolution has helped us to realize that we’re never alone and, in fact, barely human. For most of our lives, we share our bodies with some ten times as many microbes as human cells; these are resident in our gut and on nearly every body surface, and they are responsible for a tremendous diversity of metabolic activity, immunomodulation, and intercellular signaling.
These microbial communities have only recently become well-described using high-throughput sequencing, requiring analyses that simultaneously apply techniques from genomics, “big data” mining, and molecular epidemiology. I will discuss emerging end-to-end bioinformatics approaches for metagenomics and metatranscriptomics, including handling of sequence data for mixed microbial communities, their reconstruction into metabolic pathways, and biomarker discovery in disease. In particular, computational processing is key to identifying unique markers for microbial taxonomy and phylogeny, and to identifying genes and pathways significantly disrupted in inflammatory conditions such as Crohn’s disease and ulcerative colitis.
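One core operation mentioned above, identifying unique markers for microbial taxonomy, reduces to a set computation, sketched here over raw k-mers (a hedged toy: real pipelines operate on genes and proteins at far larger scale, and the function names are illustrative). A marker is a sequence present in every genome of the target taxon and absent from all others.

```python
def kmers(seq, k):
    """All k-length substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def unique_markers(target_genomes, other_genomes, k=4):
    """k-mers shared by all target genomes and absent from all others."""
    core = set.intersection(*(kmers(g, k) for g in target_genomes))
    background = set().union(*(kmers(g, k) for g in other_genomes))
    return core - background

markers = unique_markers(["ACGTACGT", "ACGTAAAA"], ["TTTTGGGG"], k=4)
```

Reads from an unknown community that hit such clade-specific markers can then be assigned to taxa without aligning against whole genomes.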
Christoph Wuelfing, UT Southwestern Medical Center
Spatiotemporal organization of lymphocyte signaling systems as a regulator of function
The subcellular organization of the T cell signaling system, similar to that of many other cell types, is highly diverse in time and space. Using systems-scale imaging of T cell signaling, we analyze such organization to gain unique insight into T cell function, with an emphasis on T cell actin regulation.
Tom Bartol, Salk Institute for Biological Studies
How to Build a Synapse from Molecules, Membranes, and Monte Carlo Methods
Mark Gerstein, Yale University
Analysis of Molecular Networks
My talk will be concerned with the analysis of networks and the use of networks as a “next-generation annotation” for interpreting personal genomes. I will initially describe current approaches to genome annotation in terms of one-dimensional browser tracks. Then I will describe various aspects of networks. In particular, I will touch on the following topics: (1) I will show how analyzing the structure of the regulatory network indicates that it has a hierarchical layout, with “middle managers” acting as information-flow bottlenecks and with more “influential” TFs on top. (2) I will show that most human variation occurs at the periphery of the network. (3) I will compare the topology and variation of the regulatory network to the call graph of a computer operating system, showing that they have different patterns of variation. (4) I will talk about web-based tools for the analysis of networks (TopNet and tYNA).
Comparing genomes to computer operating systems in terms of the topology and evolution of their regulatory control networks. KK Yan, G Fang, N Bhardwaj, RP Alexander, M Gerstein (2010). Proc Natl Acad Sci U S A 107:9186-91.
Analysis of diverse regulatory networks in a hierarchical context shows consistent tendencies for collaboration in the middle levels. N Bhardwaj, KK Yan, MB Gerstein (2010). Proc Natl Acad Sci U S A 107:6841-6.
Positive selection at the protein network periphery: evaluation in terms of structural constraints and cellular context. PM Kim, JO Korbel, MB Gerstein (2007). Proc Natl Acad Sci U S A 104:20274-9.
The tYNA platform for comparative interactomics: a web tool for managing, comparing and mining multiple networks. KY Yip, H Yu, PM Kim, M Schultz, M Gerstein (2006). Bioinformatics 22:2968-70.
Pavel Sumazin, Columbia Medical Center
RNA regulatory networks help propagate the effects of genetic alterations
Biomedical researchers profile DNA and chromatin of large patient cohorts in an attempt to identify common alterations that drive pathology and can point to diagnostic and therapeutic biomarkers. Increasingly, however, it is clear that genetic and epigenetic alterations can regulate pathology combinatorially, and that different combinations of alterations may generate the same phenotype. To make full use of molecular profiling data, we need to understand how alterations affect cellular programs.
I will describe two new types of computationally predicted post-transcriptional regulatory networks. Computational and experimental evidence suggest that interactions in these networks may alter the expression of known drivers of high-grade glioma. I will describe regulators of microRNA activity, which modify the activity of microRNAs without necessarily altering their expression. These regulators may channel the effects of genomic deletions to distally downregulate established tumor suppressors. Conversely, post-transcriptional regulators of microRNA biogenesis alter the expression of known drivers of gliomagenesis by regulating the abundance of the microRNAs that target them. Alterations to these regulators lead to widespread changes to the expression of microRNAs that target known drivers of glioma.
Taken together, our results suggest that post-transcriptional regulation in the cell is both extensive and complex. We present evidence that genetic and epigenetic alterations may be amplified and propagated by post-transcriptional interactions to affect both disease initiation and outcome. Our work provides some of the building blocks necessary for reverse engineering integrated regulatory networks that will help identify driver alterations and explain their effects on cellular programs and pathology.
Bio: Pavel Sumazin is a research scientist at Columbia Medical Center. He graduated from Stony Brook University with a PhD in computer science with a focus on design and analysis of algorithms. He taught computer science theory at Portland State University, was an NSF fellow in human genetics at Cold Spring Harbor Laboratory, and served as Associate Director for bioinformatics at Columbia University’s Genome Center.
Frank DiMaio, University of Washington
Protein structure determination with sparse and noisy data
Determining the structure of a protein, which involves finding the three-dimensional placement of each of a protein’s thousands of atoms, is an important problem in biochemistry, providing key insights into mechanisms as well as targets for drug design. However, many proteins of biomedical importance elude traditional structure determination methods. For these proteins, sparse data — either experimental or knowledge-based — may provide structural information, though not enough to uniquely determine a solution. The Rosetta structure prediction methodology uses an energy-based approach to explore physically feasible protein conformations. By combining this energy function with sparse data, I can quickly infer high-accuracy protein models. I will describe the effectiveness of this approach using data from four different sources. First, I will show how we may use cryo-electron microscopy density data, which provides a very coarse envelope function describing the protein shape, to infer models that accurately recapitulate high-resolution details. I will describe how a similar approach may be used to solve difficult molecular replacement problems. Here, sparse data is confounded with significant noise; nonetheless, my approach led to the solution of thirteen protein structures, previously unsolved in the hands of expert crystallographers. Similarly, using only low-resolution crystallographic data, my approach recapitulates high-resolution details that are not captured by current refinement methods. Finally, I will describe recent breakthroughs I have made in homology modeling, where the source of data is not from experiment, but instead from previously solved protein structures. I will additionally show how these methods are broadly applicable using both experimental and statistical sources of data, with implications for both protein structure determination and design.
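The core idea of combining an energy function with sparse restraints can be caricatured in one dimension (a hedged toy bearing no resemblance to Rosetta's actual energy terms or search: a "chain" of points with a bonded term, plus a penalty for violating sparse distance restraints, minimized by accept-if-better hill climbing).

```python
import random

def score(coords, restraints, bond=1.0):
    """Toy energy: bonded term plus sparse-restraint penalty."""
    e = sum((abs(coords[i + 1] - coords[i]) - bond) ** 2
            for i in range(len(coords) - 1))          # "physics" term
    e += sum((abs(coords[i] - coords[j]) - d) ** 2
             for i, j, d in restraints)               # sparse-data term
    return e

def refine(n, restraints, steps=20000, seed=0):
    """Random-restart-free hill climbing from a random configuration."""
    rng = random.Random(seed)
    coords = [rng.uniform(-1, 1) * n for _ in range(n)]
    best = score(coords, restraints)
    for _ in range(steps):
        i = rng.randrange(n)
        old = coords[i]
        coords[i] += rng.gauss(0, 0.1)
        s = score(coords, restraints)
        if s < best:
            best = s
        else:
            coords[i] = old       # reject moves that raise the energy
    return coords, best

# One sparse restraint (ends of a 4-point chain should be ~3 apart) is enough
# to select the extended conformation among all bond-satisfying ones.
coords, energy = refine(4, [(0, 3, 3.0)])
```

Without the restraint, folded and extended chains score identically; the single sparse measurement breaks that degeneracy, which is the essence of the approach at vastly smaller scale.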
Zhengqing Ouyang, Stanford University
Statistical modeling of next generation sequencing data for global gene regulation
Unraveling the global regulation of gene expression is essential for understanding embryonic development and human diseases. Gene expression is regulated at multiple levels, including transcription, RNA processing, and translation. At each level, regulators such as transcription factors, RNAs, and RNA-binding proteins form complex regulatory networks. Recent advances in high-throughput technologies, including next generation sequencing, provide unprecedented opportunities to profile multiple levels of gene regulatory information. In this talk, I will describe statistical methods for integrating next generation sequencing data to discover the principles of global gene regulation. At the transcriptional regulation level, a joint model of ChIP-Seq and RNA-Seq will be introduced. The model effectively quantifies transcription factor regulatory strength, reveals combinatorial regulation, and accurately predicts genome-wide expression levels of genes. At the post-transcriptional level, an integrative approach is proposed to reconstruct RNA secondary structures at the genome scale from deep sequencing data. I will demonstrate the advantages of our approach and the widespread impact of RNA secondary structure on gene regulation.
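A miniature of the transcription-level idea: regress gene expression (e.g., log RNA-Seq level) on per-gene TF-association strengths (e.g., summarized ChIP-Seq signal). This is plain multiple linear regression via the normal equations, a hedged stand-in for the joint model described in the talk; the data below are synthetic.

```python
def fit_linear(X, y):
    """Least squares for y ≈ X b (X includes an intercept column).
    Solves (X^T X) b = X^T y by Gaussian elimination with pivoting."""
    p = len(X[0])
    A = [[sum(X[r][i] * X[r][j] for r in range(len(X))) for j in range(p)]
         for i in range(p)]
    b = [sum(X[r][i] * y[r] for r in range(len(X))) for i in range(p)]
    for c in range(p):                           # forward elimination
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        b[c], b[piv] = b[piv], b[c]
        for r in range(c + 1, p):
            f = A[r][c] / A[c][c]
            for j in range(c, p):
                A[r][j] -= f * A[c][j]
            b[r] -= f * b[c]
    coef = [0.0] * p
    for c in reversed(range(p)):                 # back substitution
        coef[c] = (b[c] - sum(A[c][j] * coef[j]
                              for j in range(c + 1, p))) / A[c][c]
    return coef

# Rows: [intercept, TF1 signal, TF2 signal]; response generated as
# 0.5 + 2*TF1 - 1*TF2, so the fit recovers an activator and a repressor.
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1]]
y = [0.5, 2.5, -0.5, 1.5]
coef = fit_linear(X, y)
```

The signs of the fitted coefficients play the role of "regulatory strength": positive for activators, negative for repressors, which is the flavor of interpretation the joint model supports at genome scale.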
Jing Li, Case Western Reserve University
Rare variant discovery and calling by sequencing pooled samples with overlaps
For many complex traits and diseases, it is believed that rare variants account for the missing heritability that cannot be explained by common variants. Sequencing a large number of samples through DNA pooling is a cost-effective strategy to discover rare variants and to investigate their associations with phenotypes. Overlapping pool designs provide further benefit because such approaches can potentially identify variant carriers. However, existing algorithms for analyzing sequence data from overlapping pools are limited. We propose a complete data analysis framework for overlapping pool designs, with novelties in all three major steps: variant pool and variant locus identification, variant allele frequency estimation, and variant sample decoding. The framework can be utilized in combination with any design matrix. We have investigated its performance based on two different overlapping designs, and have compared it with two state-of-the-art methods, by simulating targeted sequencing. Results show that our algorithm makes significant improvements over existing ones.
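The decoding step can be illustrated with a toy group-testing example (heavily simplified: one carrier and error-free variant calls are assumed here, assumptions the actual framework does not need). With each sample assigned to several pools, a carrier is identified as the sample whose pool membership matches exactly the set of pools in which the variant was observed.

```python
def decode_carrier(design, positive_pools):
    """design: sample -> set of pools containing it; returns the samples
    whose pool membership matches the variant-positive pools exactly."""
    positive_pools = set(positive_pools)
    return [s for s, pools in design.items() if pools == positive_pools]

# 6 samples in 4 pools; each sample sits in a distinct pair of pools, so any
# single carrier produces a unique pattern of positive pools.
design = {
    "s1": {0, 1}, "s2": {0, 2}, "s3": {0, 3},
    "s4": {1, 2}, "s5": {1, 3}, "s6": {2, 3},
}
carriers = decode_carrier(design, {1, 2})
```

With 4 pools covering 6 samples, the overlap recovers the carrier with fewer sequencing lanes than one-sample-per-pool; handling sequencing noise, allele-frequency estimation, and multiple carriers is what the full framework adds.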
Ron Dror, D.E. Shaw Research
How drugs bind and control their targets: characterizing GPCR signaling using Anton, a special-purpose supercomputer for molecular dynamics simulations
Roughly one-third of all drugs act by binding to G-protein-coupled receptors (GPCRs) and either triggering or preventing receptor activation, but the process by which they do so has proven difficult to determine using either experimental or computational approaches. We recently completed a special-purpose machine, named Anton, that uses a combination of novel algorithms and application-specific hardware to accelerate molecular dynamics simulations by orders of magnitude, enabling all-atom protein simulations as long as a millisecond (Science 330:341-6, 2010). Anton has made possible simulations in which drugs spontaneously associate with GPCRs to achieve bound conformations that match crystal structures almost perfectly (PNAS 108:13118-23, 2011; Nature 482:552-6, 2012). Simulations on Anton have also captured transitions of a GPCR between its active and inactive states, allowing us to characterize the mechanism of receptor activation (Nature 469:236-40, 2011; PNAS 108:18684-9, 2011). Our results, together with complementary experimental data, suggest opportunities for the design of drugs that achieve greater specificity and control receptor signaling more precisely.
Hannah Carter, Johns Hopkins University
Identifying driver missense mutations in tumor sequencing data
Large-scale sequencing of cancer genomes is uncovering thousands of DNA alterations, but the functional relevance of the majority of these mutations to tumorigenesis is unknown. Identifying which of these mutations contribute to cancer is critical for understanding tumor biology, and for finding new diagnostic biomarkers and therapeutic targets. We have developed a computational method, called Cancer-specific High-throughput Annotation of Somatic Mutations (CHASM), to identify and prioritize the missense mutations most likely to generate functional changes in proteins that enhance tumor cell proliferation. CHASM uses a supervised machine learning technique called a random forest and more than 80 quantitative features describing amino acid changes to predict candidate driver mutations. The method has high sensitivity and specificity when discriminating between known driver missense mutations and randomly generated missense mutations, and performs well relative to other computational methods applied to this problem. CHASM has been applied to over 15 tumor sequencing studies to prioritize missense mutations for further study and initial results are promising; however, further experimental validation is needed to confirm CHASM predictions.
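The supervised setup the abstract describes (a random forest over ~80 quantitative features, trained to separate known drivers from synthetic passenger mutations) can be sketched generically. Everything below is a placeholder: the features and labels are simulated, and this is not CHASM's actual feature set or training data.

```python
# Generic sketch of a CHASM-style classifier: a random forest trained to
# separate known driver missense mutations (label 1) from synthetic
# passenger mutations (label 0). Features are random placeholders
# standing in for the ~80 amino-acid-change descriptors.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_features = 80
# Simulated training data: drivers drawn with a small mean shift so the
# two classes are separable enough to illustrate the pipeline.
drivers = rng.normal(0.5, 1.0, size=(200, n_features))
passengers = rng.normal(0.0, 1.0, size=(200, n_features))
X = np.vstack([drivers, passengers])
y = np.array([1] * 200 + [0] * 200)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print("cross-validated AUC: %.2f" % scores.mean())

# Rank new candidate mutations by predicted driver probability,
# mirroring how candidates are prioritized for follow-up study.
clf.fit(X, y)
candidates = rng.normal(0.25, 1.0, size=(5, n_features))
print(clf.predict_proba(candidates)[:, 1])
```

The ranking by predicted probability is the part that matters in practice: it turns a binary classifier into a prioritization tool for follow-up experiments.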
Jianyang (Michael) Zeng, Duke University
Automated Nuclear Magnetic Resonance Assignment and Protein Structure Determination
High-throughput protein structure determination based on solution nuclear magnetic resonance (NMR) spectroscopy plays an important role in structural genomics. Unfortunately, current NMR structure determination is still limited by the lengthy time required to process and analyze the experimental data. In this talk, I will describe our recent success stories about the applications of computational techniques in addressing several bottlenecks in NMR structure determination. First, I will talk about a novel high-resolution structure determination algorithm that starts with a global fold calculated from the exact and analytic solutions to the residual dipolar coupling (RDC) equations. Our high-resolution structure determination protocol has been applied to solve the NMR structure of the FF Domain 2 of human transcription elongation factor CA150 (RNA polymerase II C-terminal domain interacting protein), which has been deposited into the Protein Data Bank (PDB ID: 2KIQ). Second, I will present a Bayesian approach to determine protein side-chain rotamer conformations by integrating the likelihood function derived from unassigned NOE data with prior information (i.e., empirical molecular mechanics energies) about the protein structures. Third, I will describe an automated side-chain resonance assignment algorithm that does not require any explicit through-bond experiment. All our algorithms have been tested on real NMR data. The promising results demonstrate that our algorithms can be successfully applied to high-quality protein structure determination. Since our algorithms reduce the time required for NMR assignment, they can accelerate the protein structure determination process.
Roger Pique-Regi, University of Chicago
Understanding the impact of genetic variation on molecular mechanisms of transcriptional regulation
My research focuses on developing novel computational methods to identify regulatory sequences, and to model the molecular mechanisms of gene transcription control. The mapping of expression quantitative trait loci (eQTLs) has emerged as an important tool for linking genetic variation to changes in gene regulation. However, it remains difficult to identify the causal variants underlying eQTLs, and little is known about the regulatory mechanisms by which they act. We used DNase I sequencing to measure chromatin accessibility in 70 Yoruba lymphoblastoid cell lines, for which genome-wide genotypes and estimates of gene expression levels are also available. We obtained a total of 2.7 billion uniquely mapped DNase I-sequencing (DNase-seq) reads, which allowed us to infer transcription factor binding exploiting the specific DNase I cleavage footprint left on 827,000 sites corresponding to more than 100 factors. Across individuals, we identified 8,902 locations at which the DNase-seq read depth correlated significantly with genotype at a nearby locus (FDR = 10%). We call such genetic variants ‘DNase I sensitivity quantitative trait loci’ (dsQTLs). We found that dsQTLs are strongly enriched within inferred transcription factor binding sites and are frequently associated with allele-specific changes in transcription factor binding. A substantial number of dsQTLs are also associated with variation in the expression levels of nearby genes. Our observations indicate that dsQTLs are highly abundant in the human genome and are likely to be important contributors to phenotypic variation.
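At its core, the per-site dsQTL test asks whether DNase-seq read depth at a regulatory site correlates with genotype at a nearby SNP across individuals. A heavily simplified version of that test, on simulated data (not the study's actual pipeline, which covers millions of site-SNP pairs with FDR control), looks like this:

```python
# Toy version of a dsQTL test: across individuals, correlate DNase-seq
# read depth at a regulatory site with genotype (0/1/2 copies of the
# alternate allele) at a nearby SNP. All data below are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 70  # e.g. 70 lymphoblastoid cell lines

genotype = rng.integers(0, 3, size=n)  # 0, 1, or 2 alternate alleles
# Simulate an additive genotype effect on chromatin accessibility.
read_depth = 50 + 10 * genotype + rng.normal(0, 5, size=n)

r, p = stats.pearsonr(genotype, read_depth)
print(f"r = {r:.2f}, p = {p:.1e}")
# In the actual study, p-values from all site-SNP tests would then be
# corrected for multiple testing (FDR = 10%) to call dsQTLs.
```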
Carl Kingsford, University of Maryland
Computational Challenges in Reconstructing Evolutionary Histories
I will discuss our recent efforts to reveal important evolutionary events in two biological systems.
First, I will describe our work identifying reassortments, or mixing of genomic segments, in the influenza virus. Reassortment is the main process by which new pandemic strains arise and was the driving force behind the recent “swine flu” outbreak. We have developed an algorithm and software program called GIRAF that finds reassortment events among large collections of influenza genomes. GIRAF is the first fully automated computational approach to this problem, and it is based on the first quadratic-delay algorithm for enumerating high-weight maximal bicliques in bipartite graphs. It allows us to quickly scan thousands of influenza genomes for reassortments. Using our algorithm, we have discovered many novel reassortment events in collections of human, avian, and swine influenza strains.
Second, I will present our recent work on reconstructing ancient biological networks. We have developed several methods for recovering interactions between molecules that were present in ancestral species, starting with only the present-day networks that we are able to measure. We have shown that many properties of the evolution of extinct networks can be inferred using our approaches and that ancestral interactions can be inferred with high accuracy.
Various parts of this work were done jointly with Niranjan Nagarajan, Saket Navlakha, Rob Patro, Guillaume Marçais, Justin Malin, and Emre Sefer.
Meera Sitharam, University of Florida
EASAL: Entropy computation for Assembly Configuration spaces via Stratified Convex Parametrizations
Differences between the geometries of molecular assembly versus folding configuration spaces are illuminated by a new theory of convex configuration spaces developed by the speaker’s group. While assembly configurations of molecular complexes of up to 7 rigid monomers are already high dimensional and entropically challenging, they are far more tractable to explore, search and analyze than folding configuration spaces. This is because: (a) the assembly configuration space topology can be decomposed directly into a standard Thom-Whitney complex of active constraint regions, including boundaries of varying dimensions; (b) (the key point) these active constraint regions can be charted with convex parameterizations. We refer to the precisely roadmapped union of these charts as the atlas of the configuration space.
EASAL is the software implementation of various efficient algorithms with proven guarantees for atlasing and related search problems for such small molecular assemblies.
Atlasing the configuration spaces and assembly pathways of larger molecular assemblies is effected by recursive decomposition into, and recombination of, smaller molecular subassemblies (which can themselves be atlased using EASAL), making active use of the symmetry often present in larger assemblies.
We have recently used EASAL (a) to correctly predict crucial interactions for the assembly of a T=1 viral shell of AAV4 (confirmed by mutagenesis experiments in the McKenna lab at UF) and (b) to illuminate features and configurational entropy of a helix packing configuration space that cause standard Metropolis Monte Carlo sampling to be non-stochastic (helix and Monte Carlo trajectory data from the lab of Maria Kurnikova, a computational chemist at CMU).
Kevin White, University of Chicago
Integrating Genomic Networks to Identify Biomarkers and Drug Targets
Systems-level approaches to constructing abstract molecular networks can lead to predictions about genetic and biochemical functions in cells, organisms and disease states. We have used an integrated experimental and computational approach to construct large-scale functional networks in both model organisms and human cancer cells. Our network models are based on a combination of gene expression, transcription factor DNA binding site mapping, automated literature mining and protein-protein interaction mapping. We provide a strategy for reducing the dimensionality of the massive networks that result from such integrated whole-genome analyses. I will present examples from both Drosophila and human breast cancer cell lines that illustrate how one can translate systems biology-driven findings in model systems to useful tools for diagnosing human diseases. I will also discuss our use of large-scale genome sequence data in the context of systems approaches to developing prognostic signatures for breast cancer, and the use of cloud computing to manage and mine ‘omics data.
Ernest Fraenkel, Massachusetts Inst. of Technology
Integrating ‘Omic’ Data to Reveal Disease Mechanisms
Proteomic technologies, next-generation sequencing and RNAi screens are providing increasingly detailed descriptions of the molecular changes that occur in diseases. However, it is difficult to assemble these data into a coherent picture that could lead to new therapeutic insights for several reasons. Despite their power, each of these methods still only captures a small fraction of the cellular response. Moreover, when different assays are applied to the same problem, they often provide apparently conflicting answers. We have developed powerful new approaches to integrate these data to identify small, functionally coherent pathways that underlie cellular behavior. In this talk, I will discuss recent unpublished work from my laboratory showing that these methods suggest novel therapeutic strategies for glioblastoma multiforme.
Daphne Koller, Stanford University
Twelfth Morris H. DeGroot Memorial Lecture
Wendy Cornell, Merck
Comparison of 2D, 3D, and QSAR Methods for Virtual Screening
Using a set of 47 protein targets from the MDDR, we assess the performance of 2D similarity, 3D similarity, and QSAR methods at identifying active compounds for each target when starting with some number (1, 5, 10, 20, or 40) of actives. Two 2D similarity methods are tested – Toposim, which uses Dice similarity, and Lassi, which uses latent semantic structural indexing. Three QSAR methods are included – random forest, trendvector, and support vector machine (SVM). Each 2D similarity and QSAR method is used in combination with different descriptor sets, including atom pairs (AP), topological torsions (TT), binding property torsions (DT), extended connectivity fingerprints (ECFP4), and MACCS. We assess retrieval rates for single compounds as well as clusters. Among the descriptor sets, ECFP4 performed consistently the best. Although Toposim and Lassi found different hits, their retrieval rates for individual compounds were surprisingly similar. Among the QSAR methods, random forest and trendvector outperformed SVM. Combinations of methods are also explored to maximize both lead hopping and retrieval of close neighbors.
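The Dice similarity measure the abstract attributes to Toposim reduces to a few lines over binary fingerprints. The fingerprints and compound names below are toy placeholders; real descriptors would be atom pairs, topological torsions, ECFP4, and so on.

```python
# Dice similarity between two binary molecular fingerprints, the measure
# used for 2D similarity searching. Fingerprints here are toy bit sets.
def dice(fp_a: set, fp_b: set) -> float:
    """Dice coefficient: 2|A∩B| / (|A| + |B|)."""
    if not fp_a and not fp_b:
        return 0.0
    return 2.0 * len(fp_a & fp_b) / (len(fp_a) + len(fp_b))

query = {1, 4, 9, 16, 25}
library = {
    "cmpd_1": {1, 4, 9, 16, 25},   # identical to the query -> 1.0
    "cmpd_2": {1, 4, 9, 36, 49},   # 3 shared bits -> 2*3/(5+5) = 0.6
    "cmpd_3": {100, 200},          # no overlap -> 0.0
}
# Rank library compounds by similarity to the query, as in a
# virtual screen seeded with a known active.
ranked = sorted(library, key=lambda k: dice(query, library[k]), reverse=True)
print(ranked)  # ['cmpd_1', 'cmpd_2', 'cmpd_3']
```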
Wei Wu, University of Pittsburgh
Reverse Engineering Dynamic Gene Networks Underlying Breast Cancer Cell Lineages and Yeast Cell Cycles
Estimating gene regulatory networks over biological lineages or time series is central to a deeper understanding of how cells evolve during development and differentiation. One challenge in estimating such evolving networks is that their host cells are not only contiguously evolving, but also can branch over time. For example, a biologist may apply several different drugs to a malignant cancer cell to analyze the changes each drug has produced in the treated cells. Cells treated with one drug are not directly related to cells treated with another drug, but rather to the malignant cancer cells that they were derived from. Underlying these intriguing dynamic systems, one expects that the interactions between genes are not always constant over time but are often transient; in other words, gene-gene interactions that occur during one time interval may disappear and then reappear later in time. This challenging behavior renders existing network inference methods inapplicable.
We propose two novel approaches, Treegl and TV-DBN, which build on L1-penalized, time-dependent graphical logistic regression to effectively estimate multiple evolving gene networks corresponding to cell types related by a tree genealogy, or cell stages related by an evolving chain, based on only a few samples from each condition. Our methods take advantage of the similarity between related networks along the biological lineage, while at the same time exposing sharp differences between the networks. We explore applications to the analysis of breast cancer development and yeast cell cycle regulation. Based on only a few microarray measurements, our algorithms produce biologically valid results that provide insight into the progression and reversion of breast cancer, and into transient interactions among genes in the yeast cell cycle.
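The building block underlying this family of methods, estimating each gene's neighbors by sparse (L1-penalized) logistic regression of that gene on all the others, can be sketched on toy binary expression data. This sketch omits the tree and temporal smoothing penalties that distinguish Treegl and TV-DBN; the data and regularization strength are invented for illustration.

```python
# Neighborhood selection with L1-penalized logistic regression: each gene
# is regressed on all other genes, and nonzero coefficients define its
# network neighbors. Data are simulated binarized expression values.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_samples, n_genes = 100, 6
X = rng.integers(0, 2, size=(n_samples, n_genes)).astype(float)
# Make gene 0 depend on genes 1 and 2 so those edges should be recovered.
X[:, 0] = ((X[:, 1] + X[:, 2]) >= 1).astype(float)

def neighbors(X, gene, C=0.5):
    """Indices of genes with nonzero L1-penalized coefficients when
    predicting `gene` from all the other genes."""
    others = [j for j in range(X.shape[1]) if j != gene]
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    model.fit(X[:, others], X[:, gene])
    return [others[k] for k, w in enumerate(model.coef_[0]) if abs(w) > 1e-6]

print(neighbors(X, 0))  # should include genes 1 and 2
```

Running this per gene yields one sparse neighborhood per node; the published methods additionally couple the regressions across time points or lineage branches so that related networks stay similar.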
Ioannis Tsamardinos, Vanderbilt University
Towards Integrative Causal Analysis of Heterogeneous Datasets and Prior Knowledge
Modern data analysis methods, for the most part, concern the analysis of a single dataset. The conclusions of an analysis are published in the scientific literature and their synthesis is left up to a human expert. Integrative Causal Analysis (INCA) aims at automating this process as much as possible. It is a new, causal-based paradigm for inducing models in the context of prior knowledge and by co-analyzing datasets that are heterogeneous in terms of measured variables, experimental conditions, or sampling methodologies. INCA is related to, but is fundamentally different from, statistical meta-analysis, multi-task learning, and transfer learning.
In this talk, we illustrate the enabling INCA ideas, present INCA algorithms, and give proof-of-concept empirical results. Among others, we show that the algorithms are able to predict the existence of conditional and unconditional dependencies (correlations), as well as the strength of the dependence, between two variables Y and Z never measured on the same samples, solely based on prior studies (datasets) measuring either Y or Z, but not both. The algorithms accurately predict thousands of dependencies in a wide range of domains, demonstrating the universality of the INCA idea. The novel inferences are entailed by assumptions inspired by causal and graphical modeling theories, such as the Faithfulness Condition. The results provide ample evidence that these assumptions often hold in many real systems. The long term goal of INCA is to enable the automated large-scale integration of available data and knowledge to construct causal models involving a significant part of human concepts.
Li-San Wang, University of Pennsylvania
Gene expression in aging and aging-associated disorders
Aging is a highly complex phenomenon that affects virtually all aspects of biology. In medicine, age is a primary risk factor for cancer, neurodegeneration, and many other diseases. Thus, understanding how aging proceeds and contributes to these diseases is key to finding causes and means of intervention. This presentation will cover some of our work toward understanding the connection between aging and age-associated diseases, by investigating gene expression through bioinformatic means.
The first half of my talk focuses on G-quadruplexes (Gquads). Gquads are genomic motifs consisting of four runs of guanines that can form highly stable 3D structures in vivo and occur frequently in telomeres. Analysis of yeast and human genomic distributions suggests that Gquads are associated with differentially expressed genes in a yeast senescence model and in human fibroblasts from patients with Werner syndrome, a genetic disorder that exhibits premature aging phenotypes.
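Gquad motifs are commonly located with a simple regular-expression scan for four runs of three or more guanines separated by short loops. The scanner below uses the widely cited consensus pattern (G3+, loops of 1-7 nt); it is a generic illustration, not the authors' specific pipeline.

```python
# Generic scan for the canonical G-quadruplex motif: four runs of three
# or more guanines separated by loops of 1-7 nucleotides.
import re

GQUAD = re.compile(r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}")

def find_gquads(seq):
    """Return (start, end, motif) for each non-overlapping Gquad hit."""
    return [(m.start(), m.end(), m.group())
            for m in GQUAD.finditer(seq.upper())]

# The human telomeric repeat (TTAGGG)n is a classic Gquad-forming
# sequence, consistent with the high telomeric occurrence noted above.
telomere = "TTAGGG" * 6
print(find_gquads(telomere))
```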
The second half of my talk concerns gene expression changes in human brain aging and Alzheimer’s disease. We developed algorithms that can estimate the age of an individual using gene expression profiles. Using these algorithms, we found that brains with Alzheimer’s disease or frontotemporal dementia show trends of accelerated aging in gene expression change.
Tamer Kahveci, University of Florida
Computational strategies for understanding how biological networks function.
Biological networks of an organism show how different biochemical entities, such as enzymes or genes, interact with each other to perform functions vital to that organism. Each subnetwork within a network can perform functions that it cannot perform without interacting with other entities in the network. Understanding the functions of entire networks, as well as of individual subnetworks, is a prime goal in explaining how organisms work.
Dr. Kahveci’s lab focuses on developing computational methods that help in understanding the functions of large-scale biological networks. This talk will focus on comparative analysis of biological networks, considered in two parts. The first part, which constitutes the majority of the talk, covers comparative analysis of a pair of networks; the second discusses scalability issues in performing this analysis over a large database of networks. The first part proceeds step by step from a simplified model to a more realistic one. The first step limits comparison to pairs of entities and explains how networks can be compared when the biological process involves different types of biological entities. The second step removes this limitation and describes a computational approach for when the same biological process can be performed in a different number of steps. The last step challenges the existing definition of similarity and introduces a new measure, functional similarity, which explains function in terms of the steady states of biological networks, and describes how the steady states of large regulatory networks can be computed. The second part of the talk discusses a probabilistic strategy for finding, in a database containing a large number of networks, those that are highly similar to a query network.
Gerald Quon, University of Toronto
De-mixing heterogeneous gene expression profiles into their constituent components, and applications to personalized medicine
One of the primary goals of gene expression profiling experiments is to identify key genes and pathways associated with a particular condition or disease. However, biological samples are often composed of multiple distinct cell populations, of which only a few are of interest. We have developed ISOLATE, a computational model for separating heterogeneous mixtures of cell populations into their individual components, given only the expression profiles of heterogeneous samples and some of the homogeneous populations. We demonstrate the accuracy and value of computational purification in three problem domains: identifying prognostic signatures for cancer, linking changes in gene expression to patient outcome in juvenile arthritis, and monitoring cell population dynamics in hematopoietic stem cell systems.
Xin He, University of California, San Francisco
Understanding genetics of complex diseases through systems biology and regulatory genomics
Genome-wide association studies (GWAS) have identified many candidate loci for a number of complex traits. In most cases, however, there is little functional evidence for these loci, and the mechanisms by which they influence complex traits are not clear. A very promising strategy is to link genotypes and phenotypes through molecular-level traits, such as gene expression level. The first part of my talk will be focused on a new strategy we developed recently to incorporate expression QTL (eQTL) data in the analysis of GWAS. We developed a Bayesian statistical method that integrates the information of the SNPs underlying a gene expression trait, with appropriate weighting, to test whether the expression of this gene contributes to the complex disease of interest. In particular, our statistical test allows us to exploit information in a large number of weak SNPs, which are often ignored but represent a collectively important part of the genetics of any complex trait.
To ultimately understand how genotypic variation influences phenotypes, we need a detailed understanding of how DNA sequences encode their immediate molecular functions. The second part of my talk will focus on the study of regulatory sequences, which harbor a large fraction of the SNPs discovered by GWAS and are believed to be important for many complex diseases. To recognize these sequences in a genome, we developed a comparative genomic method based on the assumption that functional transcription factor binding sites (TFBSs) tend to be conserved across species. The method utilizes a probabilistic model of regulatory sequence evolution that captures substitutions, insertions/deletions, and selection or turnover of TFBSs. A more difficult question is how regulatory sequences generate spatio-temporal expression patterns. For this purpose, we developed a quantitative model based on statistical thermodynamics and an efficient dynamic programming algorithm. This model incorporates a number of features of regulatory sequences, including the importance of weak TFBSs and cooperative interactions among TF molecules. We demonstrated the predictive power of our model, and by applying it to an early developmental system in Drosophila, we were able to gain understanding of the quantitative rules of gene regulation.
Richard H. Lathrop, University of California, Irvine
Predicting Positive p53 Cancer Rescue Regions Using Most Informative Positive (MIP) Active Learning
Many protein engineering problems involve finding mutations that produce proteins with a particular function. Most Informative Positive (MIP) active learning is tailored to biological problems because it seeks novel and informative positive results. We applied MIP to discover mutations in the tumor suppressor protein p53 that reactivate mutated p53 found in human cancers. MIP found Positive (cancer rescue) p53 mutants in silico using 33% fewer experiments than traditional non-MIP active learning. MIP was used to select a Positive Region predicted to be enriched for p53 cancer rescue mutants. In vivo assays showed that the predicted Positive Region: (1) had significantly more (p<0.01) new strong cancer rescue mutants than control regions (Negative, and non-MIP active learning); (2) had slightly more new strong cancer rescue mutants than an Expert region selected by a human expert for purely biological considerations; and (3) rescued for the first time the previously unrescuable p53 cancer mutant P152L.
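The MIP selection rule, querying the unlabeled candidate the current model most confidently predicts to be positive (rather than the most uncertain one, as in conventional active learning), can be shown in a minimal loop. The data, classifier, and feature dimensions below are synthetic stand-ins, not the authors' p53 features or their exact criterion.

```python
# Minimal MIP-flavored active learning loop on synthetic data: at each
# round, query the unlabeled example the model is most confident is
# positive, then add its (hidden) true label to the training set.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 10))          # candidate mutants, 10 features each
true_w = rng.normal(size=10)
y = (X @ true_w > 0).astype(int)        # 1 = "rescue mutant" (hidden truth)

labeled = list(range(20))               # initial batch of "experiments"
unlabeled = list(range(20, 300))

for _ in range(5):                      # five rounds of MIP-style queries
    clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    probs = clf.predict_proba(X[unlabeled])[:, 1]
    pick = unlabeled[int(np.argmax(probs))]   # most confidently positive
    labeled.append(pick)                # "run the experiment" on this mutant
    unlabeled.remove(pick)

print("positives among the 5 queried mutants:", int(y[labeled[20:]].sum()))
```

The point of preferring confident positives over uncertain examples is that, in discovery settings like this one, a positive experimental result is itself the payoff, not just information for the model.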
Anne E. Carpenter, Broad Inst. of Harvard and MIT
Extracting quantitative information from biological images to tackle world health problems
Microscopy images contain rich information about the state of cells and organisms and are an important part of experiments to address a multitude of basic biological questions and world health problems. Our laboratory works on image analysis and data mining, primarily for high-throughput screening experiments. These experiments test thousands of chemical or genetic perturbations in order to identify the causes and potential cures of disease. Machine-learning approaches, guided by a biologist’s intuition, have been particularly successful for measuring subtle and complex phenotypes in these experiments.
The biological systems being tested in high-throughput experiments are becoming increasingly physiologically relevant. For example, co-cultures of two particular cell types can better replicate certain tissue and organ systems and preserve normal cellular functions, such as liver function and hematopoiesis. Whole organisms like C. elegans and zebrafish can be screened for complex phenomena like behavior, infection, and metabolism. These more complex systems present new challenges in image analysis.
We are also exploring the potential of extracting patterns of morphological perturbations (“signatures”) from cell images in order to identify the similarities between various chemical or genetic treatments, in experiments to identify distinctions between human isoforms of cancer-relevant proteins, mechanisms of hepatotoxicity, and diagnostics for bipolar disorder and schizophrenia.
The methods we develop are freely available through the biologist-friendly open-source software, CellProfiler, for both small- and large-scale experiments.
Jinbo Xu, Toyota Technological Institute at Chicago
Probabilistic Graphical Model for Protein Structure Prediction
If we know the primary sequence of a protein, can we predict its three-dimensional structure by computational methods? This is one of the most important and difficult problems in computational molecular biology and has tremendous implications for protein functional study and drug discovery.
Existing computational methods for protein structure prediction can be broadly classified into two categories: template-based modeling (i.e., protein threading/homology modeling) and template-free modeling (i.e., ab initio folding). Template-based modeling predicts structure of a protein using experimental structures in the Protein Data Bank (PDB) as templates while template-free modeling predicts protein structure without depending on a template.
This talk will present new probabilistic graphical models for knowledge-based protein structure prediction. In particular, it will present a regression-tree-based Conditional Random Fields (CRF) method for template-based modeling and a Conditional Random Fields/Conditional Neural Fields (CRF/CNF) method for template-free modeling. Experimental results indicate that our template-based method performs extremely well, especially on hard template-based modeling targets, and that our template-free method is also very promising for mainly-alpha proteins.
Russell Malmberg, University of Georgia
Computational Searches for Non-Coding RNA; Ecological Genetics of Pitcher Plants
Two quite different research topics will be presented.
(1) The importance of RNAs that do not code for proteins, but that function directly as RNAs, has been recognized over the last 30 years in a series of dramatic discoveries. Estimates of the number of non-coding RNAs in eukaryotes vary considerably but are plausibly in the range of 0.5x to 2x the number of protein-coding RNAs. Computational identification of ncRNA genes in genomes is rendered difficult by the lack of sequence similarity among many related ncRNAs; however, some ncRNAs have structures that are more conserved than their primary sequence. We have developed algorithms that search genomes for ncRNAs on the basis of their structure, using tree decompositions of conformational graphs to greatly speed up the process. We have studied the nature of evolutionary variability in RNA secondary structure and are using these results to improve the genomic search methods.
(2) Pitcher plants (Sarracenia species) eat insects. They appeal to the inner 10-year-old in us. Different Sarracenia species have varying pitcher morphologies and varying means of digesting insects. Some species actively digest insects by secreting proteases and similar enzymes; other species support a microbial food web that digests the insects. We are analyzing the genetic basis of the differences between the species in pitcher morphology, insect digestion strategy, and the degree to which individual plant genotypes can support the associated microbial community.
Kris Dahl, Carnegie Mellon University
Computational approaches to determine multiscale structural changes in the nucleus associated with aging
There are numerous premature aging disorders associated with altered nuclear structure. We are primarily interested in Hutchinson-Gilford progeria syndrome (HGPS), which is caused by a mutation in nuclear lamin A. (1) Using integrated experimental and computational studies, we examine how the mutation associated with HGPS alters the structure of the protein, even though the mutation is in an inherently disordered region. (2) We have also simulated the structural filament network in the nucleus as a reductionist model to examine the cause of morphological changes in the nucleus associated with HGPS. (3) We examine how the HGPS mutation alters nuclear response to force in situ, using computational methods to analyze complex mechanical character from live-cell experiments. (4) At the micro level, we also use computational image analysis of a variety of premature aging diseases to understand the role of nuclear morphology in disease progression. In sum, we have used a combination of computation and experiment to examine structural proteins in the nucleus at many length scales and how they impact the etiology of HGPS.
Klaus Palme, University of Freiburg (Germany)
The magic role of auxin and beyond
Unlike most animal cells, plant cells can easily regenerate new tissues from cells derived from different tissues. These cells first dedifferentiate but can later be reprogrammed to form a wide variety of organs when properly cultured. We investigate the signalling components and molecular mechanisms that give plant cells the capacity to regenerate organs de novo. Plant hormones like auxin (indole-3-acetic acid) play a fundamental role in plant cell proliferation, differentiation and organ formation. Auxin levels are controlled by biosynthesis, transport and degradation. Since its first description in the 19th century, the directional movement of auxin through the plant has attracted much attention. An overview will be given of the current status of studies aiming to understand the physiology of auxin transport and the structure-function characterization of the PIN interactome. Components of PIN nano-domains play crucial roles in determining the instructive auxin gradients that direct plant development. Since systems biology demands quantitative, comprehensive data that need to be mapped onto the three-dimensional landscape of cells, tissues and organs, we developed tools for establishing a robust three-dimensional (3D) digital atlas of cellular components in Arabidopsis roots. Such an atlas may have important implications by providing previously unavailable knowledge of cellular characteristics. The intrinsic Root Coordinate System provides a reference model for the root apical meristem, allowing cells to be annotated according to their location, type, and division status. This enables direct quantitative comparison between roots at single-cell resolution. Applications and innovative opportunities arising from this technological advance will be discussed.
John Shon, Director of Disease and Translational Information, Hoffmann-La Roche Pharmaceuticals
Drugs to Glide from Research to the Bedside: Opportunities for Software and IT in the Life Sciences
Ever wonder what opportunities exist in life-science IT and software? These companies require the exchange of critical information that is rich and complex. The drug development process, for example, is greatly enhanced when valuable “nuggets” are passed between professionals focused on the start of the process and those focused on its end. This has not been easy to accomplish. There are a number of applications that will help accelerate and enable the creation and production of better therapeutics. Dr. Shon will present his vision for software-enabled solutions to the multiple challenges facing large pharma.
Michael Gilson, University of California, San Diego
Modeling molecular recognition: Free energy, entropy and mechanical stress
Better computer models of molecular recognition are needed to speed the design of new therapeutics and of host-guest systems with a range of applications. I will discuss concepts and software we are developing for these purposes, as well as some unexpected insights into changes in entropy and mechanical stress on binding that have emerged from this work. In particular, changes in configurational entropy on binding appear to be as quantitatively important as more commonly recognized free energy contributions, such as hydrogen bonding, and I will discuss recent developments in the characterization of entropy changes through the mutual information expansion of the entropy. In addition, we have begun to explore the application of ideas of mechanical stress at the molecular level as a potential basis for understanding the long-range transmission of information and other molecular mechanisms.
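The mutual information expansion mentioned above can be illustrated numerically. The sketch below is not Gilson's software; it is a generic second-order MIE estimate of configurational entropy from histogrammed samples of internal coordinates, and the function names and bin count are illustrative choices:

```python
import numpy as np

def marginal_entropy(x, bins=12):
    """First-order (marginal) entropy of one discretized coordinate, in nats."""
    p, _ = np.histogram(x, bins=bins)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log(p))

def pairwise_mi(x, y, bins=12):
    """Mutual information of two coordinates from their joint histogram."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz]))

def mie_entropy(samples, bins=12):
    """Second-order mutual information expansion:
       S ~ sum_i S_i - sum_{i<j} I(i;j)."""
    n = samples.shape[1]
    s1 = sum(marginal_entropy(samples[:, i], bins) for i in range(n))
    s2 = sum(pairwise_mi(samples[:, i], samples[:, j], bins)
             for i in range(n) for j in range(i + 1, n))
    return s1 - s2
```

Because the plug-in mutual information estimate is never negative, the second-order correction can only lower the entropy estimate relative to the sum of marginals, capturing correlation between coordinates.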
Jean-Christophe Olivo-Marin, Institut Pasteur
Quantitative biological imaging: from cells to numbers.
This talk will present specific methods and algorithms for the analysis of 2- and 3-D+t image sequences in biological microscopy and their use in the study of host-pathogen interactions. Our goal is to automate the quantification and analysis of dynamic parameters and the characterization of phenotypic and morphological changes occurring as a consequence of the interaction between microbes and target cells. The availability of this information and its thorough analysis is indeed of key importance to help decipher the underlying molecular mechanisms of infectious diseases. We will demonstrate algorithms for multi-particle tracking and active-contour models for cell shape and deformation analysis, and illustrate their application in projects related to understanding the invasion of cells and tissues by viruses, bacteria and parasites.
Chris Bakal, Institute of Cancer Research
Signaling Networks that Regulate Morphological Noise and Promote Exploratory Behavior
Cell shape is not encoded by genomes. Rather, genes encode the signaling networks that allow cells to explore morphological space through random variations in cell shape, which we term morphological noise. Stochastic and deterministic amplification of these small variations in shape can lead to the phenotypic diversity necessary for cells to adapt to unpredictable fluctuations in their environment. Morphological noise thus creates an ensemble of cell shapes that somatic variation can act upon and that can ultimately be stabilized via genetic evolution. To provide insight into how signaling networks regulate the exploration of shape space, we perform quantitative measurements of single-cell morphology in the context of genome-scale RNAi screens. I will discuss the identification of noise-enhancing local networks that regulate diverse cellular processes, whose inhibition leads to canalized phenotypes, and that facilitate stochastic exploratory behavior. Furthermore, we have identified a number of other genes that act as morphological noise suppressors. Through computational integration of noise signatures with orthogonal datasets, we derive a dynamic model that describes the information flow on a systems level.
Paul Boutros, Ontario Institute for Cancer Research
Prognostic Markers for Non-Small Cell Lung Cancer
Lung cancer is a disease with dismal prognosis; only 15% of newly-diagnosed patients survive for five years. Our understanding of how to diagnose, stage, and treat it is based largely on macroscopic or cellular phenomena. A molecular understanding of the disease may provide improved clinical management and new therapeutic options.
My group focuses on predicting the survival of lung cancer patients. In particular, we develop algorithms to exploit microarray datasets to develop biomarkers of survival, called prognostic markers. In this talk I will describe three recent results: an algorithm, a database, and an empirical finding.
First, I describe a new feature-selection algorithm, called modified steepest descent (mSD). This algorithm couples gradient descent with unsupervised machine learning. Through greedy forward selection, it generates a six-gene prognostic marker for lung cancer that is validated in over 500 patient samples.
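The abstract gives few algorithmic details of mSD itself, but the greedy forward-selection step it names follows a standard skeleton, sketched below. Everything here (the `score` callback, the least-squares R² scorer) is an illustrative stand-in, not the published method:

```python
import numpy as np

def forward_select(X, y, k, score):
    """Greedy forward selection: grow a feature set one gene at a time,
    keeping whichever addition most improves score(X[:, subset], y)."""
    chosen, remaining = [], list(range(X.shape[1]))
    while len(chosen) < k and remaining:
        best, best_s = None, -np.inf
        for j in remaining:
            s = score(X[:, chosen + [j]], y)
            if s > best_s:
                best, best_s = j, s
        chosen.append(best)
        remaining.remove(best)
    return chosen

def r2_score(Xs, y):
    """Score a candidate subset by ordinary least-squares fit quality (R^2)."""
    Xd = np.column_stack([np.ones(len(y)), Xs])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return 1.0 - resid.var() / y.var()
```

In a real marker-discovery setting the score would be a cross-validated survival statistic rather than in-sample R², to avoid the overfitting that greedy selection invites.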
Second, I describe a meta-analytic database that compiles the data from nine transcriptomic studies of lung cancer. These studies were integrated using a novel normalization approach and then subjected to meta-analysis. For each gene present in the analysis (16,391 in total), the univariate prognostic capacity was calculated. I show that this database increases our statistical power sufficiently to allow separate analysis of different histological subtypes of lung cancer.
Third, I describe an analysis of biomarker plurality. From an empirical study of biomarker-space we found that the number of effective markers is very large. The inter-relationship amongst these markers contains information about gene-gene interactions, and may provide an avenue for understanding the specific pathways dysregulated in lung cancer.
Lung cancer incidence remains high and survival remains low. The development of prognostic markers may improve this situation by allowing personalized therapy. The computational approaches described here may be applicable beyond this one disease, and may provide insight into the types of methodologies that will work well for other problem-domains.
Mona Singh, Princeton University
Predicting and Analyzing Cellular Networks
Proteins accomplish virtually all of their cellular functions via interactions with other molecules. As a result, a broad array of computational methods has been developed to predict protein interactions, whether with DNA, other proteins, or small molecules. In combination with high-throughput experimental technologies, we are now able to build large-scale biological networks across the evolutionary spectrum. Global analyses of these networks provide new opportunities for revealing protein functions and pathways and for uncovering principles of cellular organization.
In my talk I will discuss computational approaches that my group has developed for the complementary problems of predicting interactions and analyzing interaction networks. In the first part of the talk, I will describe sequence- and structure-based approaches for predicting sites in protein sequences that interact with small molecules. In the second part, I will describe algorithms for analyzing protein function and functional modules, and will present a framework for explicitly incorporating known attributes of individual proteins into the analysis of biological networks, thereby allowing us to discover recurring network patterns underlying a range of biological processes.
Nancy Amato, Texas A&M University
Using Motion Planning to Study Molecular Motions
Protein motions, ranging from molecular flexibility to large-scale conformational change, play an essential role in many biochemical processes. For example, some devastating diseases such as Alzheimer’s and bovine spongiform encephalopathy (Mad Cow) are associated with the misfolding of proteins. Despite the explosion in our knowledge of structural and functional data, our understanding of protein movement is still very limited because it is difficult to measure experimentally and computationally expensive to simulate.
In this talk we describe a method we have developed for modeling protein motions that is based on probabilistic roadmap methods (PRM) for motion planning. Our technique yields an approximate map of a protein’s potential energy landscape and can be used to generate transitional motions of a protein to the native state from unstructured conformations, or between specified conformations. We describe a method based on rigidity theory that allows us to sample conformation space more efficiently than our initial sampling strategy and enables us to study a broader range of motions for larger proteins, as well as new analysis tools that enable us to extract kinetics information, such as folding rates. For example, we show how our map-based tools for modeling and analyzing folding landscapes can capture subtle folding differences between protein G and its mutants, NuG1 and NuG2. In recent work, we have applied our techniques to identify and study the folding core. More information regarding our work, including an archive of protein motions generated with our technique, is available from our protein folding server: http://parasol.tamu.edu/foldingserver/
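A PRM-style roadmap over an energy landscape can be sketched in a few lines. This is a generic illustration, not the authors' code: the energy-cutoff acceptance rule (standing in for a robot-planning collision check) and the max-energy edge weight are simplifying assumptions:

```python
import math, random

def build_roadmap(sample, energy, n, k, e_max):
    """Energy-landscape roadmap sketch (PRM-style): sample conformations,
    keep those below an energy cutoff, and connect each kept node to its
    k nearest accepted neighbors."""
    nodes = []
    while len(nodes) < n:
        q = sample()
        if energy(q) < e_max:          # analogue of a collision check
            nodes.append(q)
    edges = []
    for i, q in enumerate(nodes):
        near = sorted((j for j in range(len(nodes)) if j != i),
                      key=lambda j: math.dist(q, nodes[j]))[:k]
        for j in near:
            # crude edge weight: the higher-energy endpoint of the edge
            edges.append((i, j, max(energy(q), energy(nodes[j]))))
    return nodes, edges
```

Paths through the weighted roadmap (e.g. minimizing the maximum energy along a path) then serve as candidate transition routes between an unstructured conformation and the native state.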
Seyoung Kim, Carnegie Mellon University
Understanding the Genetic Basis of Complex Diseases via Genome-Phenome Association
Genome-wide association studies have recently become popular as a tool for identifying the genetic loci responsible for increased disease susceptibility by examining genetic and phenotypic variation across a large number of individuals. The cause of many complex disease syndromes involves the complex interplay of a large number of genomic variations that perturb disease-related genes in the context of a regulatory network. As patient cohorts are routinely surveyed for a large number of traits, such as hundreds of clinical phenotypes and genome-wide profiles of thousands of gene-expression levels, new computational challenges arise in identifying genetic variations associated simultaneously with multiple correlated traits. In this talk, I will present algorithms that go beyond the traditional approach of examining the correlation between a single genetic marker and a single trait. Our algorithms build on sparse regression methods from statistics, and are able to discover genetic variants that perturb modules of correlated molecular and clinical phenotypes during genome-phenome association mapping. Our approach is significantly better at detecting associations when genetic markers synergistically influence a group of traits.
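One common formalization of sparse regression over multiple correlated traits is the multi-task (L1/L2-penalized) regression sketched below. The abstract does not specify the exact penalty used, so this proximal-gradient sketch is only a plausible instance of the general idea, not the speaker's algorithm:

```python
import numpy as np

def multitask_lasso(X, Y, lam, n_iter=300):
    """Proximal gradient for  min 0.5*||Y - XB||^2 + lam * sum_j ||B[j,:]||_2,
    an L1/L2 penalty that zeroes a marker's effect on ALL traits jointly."""
    p, t = X.shape[1], Y.shape[1]
    lr = 1.0 / np.linalg.norm(X, 2) ** 2        # step size from spectral norm
    B = np.zeros((p, t))
    for _ in range(n_iter):
        G = X.T @ (X @ B - Y)                   # gradient of the squared loss
        B = B - lr * G
        norms = np.linalg.norm(B, axis=1, keepdims=True)
        shrink = np.maximum(1 - lr * lam / np.maximum(norms, 1e-12), 0)
        B = B * shrink                          # row-wise soft-thresholding
    return B
```

The row-wise shrinkage is what pools evidence across traits: a marker with weak but consistent effects on many phenotypes in a module survives the penalty, while markers with scattered noise effects are zeroed out.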
Alexander Schoenhuth, University of California, Berkeley
Classifying cancer tissue by inferring systemic markers
It has recently been shown that protein-protein interaction (PPI) subnetworks which exhibit synergistic differential gene expression in tumorigenic phenotypes are more accurate than single gene markers when it comes to classifying such phenotypes. Here we compute markers as connected subnetworks in confidence-scored PPI networks which achieve high overall confidence scores and are dysregulated in a sufficient number of patients. We do this by employing a novel, exhaustive search technique which, for the first time, renders the inherent search problem on weighted-edge networks tractable. We compute p-values for the resulting subnetworks and use the most significant candidates for classification purposes. Thereby we obtain sets of systemic markers which are superior in terms of gene ontology (GO) term enrichment. As a result, we outperform all prior approaches when classifying colon cancer versus healthy tissue.
Can Alkan, University of Washington and Howard Hughes Medical Institute
Discovery and Characterization of Copy-Number Variants with Next-Gen Sequencing Technologies
Structural variation, in the broadest sense, is defined as the genomic changes among individuals that are not single nucleotide variants. These include insertions, deletions, duplications, inversions and translocations that were demonstrated to be common and ubiquitous among individuals. A variety of diseases have been associated (both causative and protective) with copy-number variants (CNVs) such as schizophrenia, mental retardation, and HIV susceptibility/resistance. However, CNVs, especially duplicated regions, have remained largely intractable due to difficulties in accurately resolving their structure, copy number and sequence content using hybridization based methods. Consequently, a significant fraction of the duplicated genomic content has not been assayed by standard genetic and molecular analyses.
The advent of new ultra-high-throughput sequencing platforms such as Roche/454, Illumina/Solexa and ABI/SOLiD now makes it feasible to detect the full spectrum of genomic variation among many individual genomes, including those of cancer patients and others suffering from diseases of genomic origin. Recently I have developed a set of computational methods to comprehensively detect and characterize structural variation and segmental duplications using next-gen sequencing. My algorithms are based on two different approaches: (i) read-depth analysis to characterize segmental duplications and predict absolute copy numbers (mrFAST), and (ii) read-pair analysis to discover structural variation including inversions (VariationHunter). I have applied my algorithms to genomes sequenced with Illumina and 454 technologies. I will first examine the genomes of three humans and the experimental validation of copy-number differences in the organization of these genomes, and then describe the application of my methods to the genomes of >160 individuals sequenced as part of the 1000 Genomes Project.
Tandy Warnow, University of Texas at Austin
Simultaneous Alignment and Phylogenetic Tree Estimation
Molecular sequences evolve under processes that include substitutions, insertions, and deletions (jointly called “indels”), as well as other mechanisms (e.g., duplications and rearrangements). The inference of the evolutionary history of these sequences has thus been performed in two stages: the first estimates the alignment on the sequences, and the second estimates the tree given that alignment. While such methods seem to work well on relatively small datasets, these two-stage approaches can produce highly incorrect trees and alignments when applied to large datasets, or ones that evolve with many indels. In this talk, I will present a new method, SATe, that my lab has been developing that uses maximum likelihood to estimate the alignment and tree at the same time, and that can be used to analyze datasets with up to 1000 sequences on a desktop in 24 hours. Our study, using both real and simulated data, shows that this method produces much more accurate trees than the current best methods. Joint work with Kevin Liu, Sindhu Raghavan, Serita Nelesen, and Randy Linder.
Cheemeng Tan, Duke University
Emergent bistability in bacteria and implications for effective antibiotic treatment
A synthetic gene circuit is often engineered by considering the host cell as an invariable “chassis”. Circuit activation, however, may modulate host physiology, which in turn can drastically impact circuit behavior. In this talk, I will first discuss the engineering of a simple circuit consisting of a mutant T7 RNA polymerase (T7 RNAP*) that activates its own expression in the bacterium Escherichia coli (1). Although activation by the T7 RNAP* is noncooperative, the circuit caused bistable gene expression. This counterintuitive observation can be explained by growth retardation caused by circuit activation, which resulted in nonlinear dilution of T7 RNAP* in individual bacteria. Predictions made by models accounting for such effects were verified by further experimental measurements. Our results reveal a novel mechanism of generating bistability and underscore the need to account for host physiology modulation when engineering gene circuits.
Interestingly, bistability can also arise from interactions between bacterial physiology and antibiotics. We find that certain antibiotics, when applied at moderate concentrations, can cause ‘phenotypic bifurcation’ in bacterial growth: for the same concentration of antibiotic, a bacterial population survives only if its initial density is sufficiently high. We further show that the phenotypic bifurcation has profound implications for periodic treatment of bacteria by antibiotics. In the absence of phenotypic bifurcation, the efficacy of treatment increases with increasing frequency of antibiotic administration; otherwise, however, the efficacy of treatment can be drastically diminished at an intermediate frequency. Our results have implications for the optimal design of antibiotic treatment.
(1) C. Tan, P. Marguet, and L. You. Emergent bistability by a growth-modulating positive feedback circuit. Nature Chemical Biology, 5, 842-848, 2009.
Highlighted in “News and Views”: Slow growth leads to a switch, Nature Chemical Biology, 5, 784-785, 2009.
Junming Yin, Univ. of California, Berkeley
A new statistical model for studying gene conversions
Together with crossover recombination, gene conversion is a major evolutionary mechanism responsible for shaping observed genetic variation in a population. Although crossovers and gene conversions have different effects on the evolutionary history of chromosomes and therefore leave behind different footprints in the genome, it is a challenging task to tease apart their relative contributions to the observed genetic variation. In fact, the methods employed in recent studies of recombination rate variation in the human genome actually capture combined effects of crossovers and gene conversions.
Studying gene conversion is very important, for it has been argued that ignoring gene conversion may cause problems in association studies. By explicitly incorporating overlapping gene conversion events, we propose a new statistical model that can jointly estimate the crossover rate, the gene conversion rate and the mean tract length, which is widely regarded as a very difficult problem. Our simulated results show that modeling overlapping gene conversions is crucial for improving the accuracy of the joint estimation of the aforementioned three fundamental parameters. Our analysis of real data from the telomere of the X chromosome of Drosophila melanogaster suggests that the ratio of the gene conversion rate to the crossover rate for the region may not be nearly as high as previously claimed.
Joint work with Michael I. Jordan and Yun S. Song.
Marcel Schulz, Max Planck Institute for Molecular Genetics
From RNA-Seq to Ontology Graphs: Application of probabilistic models
In the first part of my talk I am going to present methods that deal with the inference of alternative splicing events from high-throughput sequencing of mRNAs (RNA-Seq). Starting from millions of paired-end RNA-Seq reads, we attempt to reconstruct the original mRNA sequences without using a genomic reference sequence. We take a de Bruijn graph approach to the problem and show that many different types of alternative splicing events can be decoded from the topology of the simplified de Bruijn graph. The approach is implemented in the software Oases. Remarkably, application to data from the RGASP competition demonstrates its usefulness for organisms of various degrees of complexity. A statistical method was developed that subsequently allows the expression levels of the reconstructed mRNAs to be inferred.
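The core de Bruijn construction behind this kind of reference-free assembly is easy to sketch. The snippet below is a toy illustration, not Oases itself: nodes are (k-1)-mers, directed edges are k-mers, and collapsing maximal unbranched paths yields the simplified graph whose remaining forks are the candidates for events such as alternative splicing:

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """de Bruijn graph: nodes are (k-1)-mers, directed edges are k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def contigs(graph):
    """Collapse maximal unbranched paths into contigs (graph simplification)."""
    indeg = defaultdict(int)
    for u, vs in graph.items():
        for v in vs:
            indeg[v] += 1
    out = []
    for s in graph:
        if indeg[s] == 1 and len(graph[s]) == 1:
            continue                   # interior node of an unbranched path
        for v in graph[s]:
            path = s + v[-1]
            while indeg[v] == 1 and len(graph.get(v, ())) == 1:
                v = next(iter(graph[v]))
                path += v[-1]
            out.append(path)
    return out
```

On two overlapping error-free reads from a repeat-free sequence, simplification recovers the original sequence as a single contig; real assemblers additionally handle sequencing errors, repeats, coverage, and paired-end constraints.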
The second part of the talk will be about a new statistical method for semantic similarity searches in Ontology Graphs. The method is different from previous approaches because it incorporates the probability of random similarity scores and assigns p-values to them. An efficient algorithm has been developed that allows exact p-values to be computed. The use of the new method is illustrated with the Phenomizer webserver that assists medical geneticists in the differential diagnostic process using features of the Human Phenotype Ontology annotated to OMIM diseases.
Quaid Morris, University of Toronto
Predicting the targets of mRNA-binding proteins
RNA-binding domains are among the most common domains in eukaryotic genomes and RNA-binding proteins (RBPs) play critical roles in post-transcriptional regulation (PTR) of gene expression by regulating mRNA processing, mRNA translation, mRNA export and mRNA stability. However, despite their importance, little is known about how RBPs identify their target sites.
As a first step towards building quantitative models of PTR, we are mapping out mRNA and RBP interactions using a combined biochemical and computational strategy. Our strategy is based on a microarray-based assay, called RNAcompete, that measures the binding affinity of a recombinant RBP for hundreds of thousands of short RNA sequences.
These sequences are designed to comprehensively query the space of possible binding preferences. We use a new RNA motif finding algorithm, RNAcontext, to infer the sequence and structural binding preferences of RBPs from the RNAcompete data. However, using these motif models to find RBP binding sites on mRNAs requires estimating mRNA secondary structure computationally. Some of our recent work suggests that estimating this structure is easier than expected.
Hsiao-Mei Lu, Univ. of Illinois at Chicago
Dynamics of Biological Systems: Allosteric Signal Transmission and Epigenetic Circuits
The dynamics of biological networks is critically important for the conduct of cellular functions, yet it is often challenging to study because of the size and complexity of the networks. Building on our successful work characterizing protein folding dynamics across a large conformational space and over long time scales, we propose the same method to study the dynamics and time evolution of allosteric signal transmission and epigenetic circuits.
Large macromolecular assemblies are often important for biological processes in the cell. Allosteric communication between different parts of these molecular machines plays critical roles in cellular signaling. Although studies of the topology and fluctuation dynamics of coarse-grained residue networks can yield important insight, they do not characterize the time-dependent dynamic behavior of these macromolecular assemblies. Here we develop a novel approach, the Perturbation-based Markovian Transmission (PMT) model, to globally study the dynamic responses of macromolecular assemblies. By monitoring the simultaneous responses of all residues (>8,000) across many (>6) decades of time, from the initial perturbation until the system reaches steady state, we show that this approach can yield rich information. With criteria based on quantitative measurements of relaxation half-time, flow amplitude change, and oscillation dynamics, this approach can identify pivot residues that are important for macromolecular movement, messenger residues that are key to mediating signals, and anchor residues important for binding interactions. Based on a detailed analysis of the GroEL-GroES chaperone system, we found that our predictions have an accuracy of 71-84% as judged by independent experimental studies reported in the literature. I propose that this computational method can detect allosteric signal transmission pathways, characterize the roles of functionally important residues, and make novel predictions about the importance of additional, previously uncharacterized amino acid residues, which can be further tested in experimental studies. This approach is general and can be applied to other large macromolecular machines such as virus capsids and ribosomal complexes.
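The Markovian-propagation idea behind PMT can be illustrated in miniature: spread a unit perturbation over a residue contact network using a row-normalized transition matrix, then read off per-residue relaxation half-times. This is a schematic of the general approach on a toy graph, not the published PMT implementation:

```python
import numpy as np

def propagate(adj, source, steps):
    """Markovian spread of a unit perturbation over a contact network:
    T is the row-normalized adjacency matrix (a random-walk kernel)."""
    A = np.asarray(adj, float)
    T = A / A.sum(axis=1, keepdims=True)
    p = np.zeros(len(A))
    p[source] = 1.0
    traj = [p.copy()]
    for _ in range(steps):
        p = p @ T
        traj.append(p.copy())
    return np.array(traj)

def half_time(traj, node):
    """First step at which a node reaches half of its final response level."""
    target = 0.5 * traj[-1, node]
    hits = np.nonzero(traj[:, node] >= target)[0]
    return int(hits[0]) if hits.size else None
```

On a chain-like network, residues near the perturbation source reach half of their final response before distant ones, which is the kind of relaxation-time signature PMT uses to rank residues.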
Models based on the chemical master equation can accurately describe the interactions involved in biomolecular networks. An epigenetic circuit, the phage lambda switch in E. coli cells, is modeled by the chemical master equation with full stochasticity. Based on this model, the specific cooperative binding of the CI dimer to OR1 and OR2 is found to be the only interaction crucial to maintaining a stable and robust phage lambda switch. An explicit computational study of mutations affecting the binding of the CI and Cro dimers to OR3 shows that the Cro dimer is necessary for efficient phage lambda induction. DNA looping, double-positive and double-negative regulation, and other biochemical mutations will be studied. Algorithms are also proposed to solve larger systems efficiently.
Hagit Shatkay, Queen’s University
Life by the Book: Pragmatically Using Text in Large Scale -Omics.
The genomic era, in which we live since the sequencing of the human genome, is characterized by tremendous amounts of biomedical data, accompanied by a significant increase in the number of related scientific publications.
Much biomedical knowledge is hidden within the abundant literature. The ability to rapidly and effectively survey the literature can support numerous applications, including multiple stages in the design and the interpretation of large-scale experiments.
A variety of methods are being applied to the biomedical literature in an attempt to meet these goals, mostly through careful mining of text for gene/protein names and interactions, using natural language processing methods. However, the idea of general “biomedical text mining” remains elusive.
Rather than view biomedical text mining as one monolithic (and not very well defined) task, we attend to specific biological goals that may benefit from the use of text. The talk will focus on several biological applications and problems involving text, and discuss some non-traditional, coarse-grain methods that we use to address them.
Emma Lundberg, Royal Inst. of Technology, Stockholm
A Human Protein Atlas
Information on protein localization and expression at the tissue, cell and organelle level is important for mapping and characterizing the human proteome, as well as for better understanding the cellular functions of proteins and for finding biomarkers. In the Human Protein Atlas program, the human proteome is systematically analyzed using an antibody-based approach. Through the generation and thorough validation of antibodies, protein localization and expression in human tissues and cells can be analyzed using immunohistochemistry and fluorescence confocal microscopy. The results are publicly available in the Human Protein Atlas web portal (www.proteinatlas.org), which currently contains results from more than 8,800 validated antibodies corresponding to one third of all human genes. The portal contains more than 7 million high-resolution images, each of which has been manually annotated and curated by a certified pathologist or a cell biologist, providing a knowledge base for functional studies and allowing searches and queries about protein profiles in normal and disease tissue as well as at the cell and subcellular level. Advanced queries can be performed, including searches by chromosome location, protein class and/or tissue specificity (including the 20 most common forms of human cancer), facilitating, for instance, biomarker discovery. Our results suggest that it should be possible to extend the protein atlas to cover the majority of all human proteins, thus providing a valuable tool for biological and medical research.
Nicholas Buchler, Rockefeller University
Bait and switch: How protein sequestration generates a flexible ultrasensitive response
Regulatory networks in cells exhibit important dynamical behaviors, such as bistability (e.g. epigenetic switch) and oscillation (e.g. clocks, cell cycle). Ultrasensitive or ‘all-or-none’ gene expression is a necessary feature for the emergence of such dynamics in gene networks. In biology, many regulatory molecules are sequestered by an inhibitor into an inactive complex. Using an experimental approach in budding yeast, I will demonstrate how protein sequestration generates tunable, all-or-none thresholds in gene expression. A simple quantitative model for this genetic network shows that both the threshold and the degree of ultrasensitivity depend upon the abundance of the inhibitor, exactly as observed experimentally. The abundance of the inhibitor can be altered by simple mutation; thus ultrasensitive responses mediated by protein sequestration are easily tunable. Gene duplication of regulatory homodimers and loss-of-function mutations can create dominant-negatives that sequester the original duplicate into an inactive complex. These results suggest a mechanism for the rapid evolution of bistable switches and oscillators in regulatory networks.
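The sequestration mechanism has a standard quantitative core: the free activator concentration is the exact root of the binding quadratic, and the response threshold sits near the inhibitor abundance. The sketch below illustrates that titration effect; it is a generic model of molecular sequestration with illustrative parameter names, not the speaker's specific circuit model:

```python
import math

def free_activator(a_tot, b_tot, kd):
    """Free [A] when inhibitor B titrates activator A into an inactive AB
    complex (dissociation constant kd); exact root of the binding quadratic."""
    c = a_tot - b_tot - kd
    return (c + math.sqrt(c * c + 4.0 * kd * a_tot)) / 2.0

def response(a_tot, b_tot, kd, k_half):
    """Downstream output as a simple saturating function of free activator."""
    a = free_activator(a_tot, b_tot, kd)
    return a / (k_half + a)
```

With tight binding (kd small), free activator stays near zero until total activator exceeds total inhibitor, then rises almost linearly: the output jumps from near-off to near-on across a threshold set by the inhibitor abundance, which is why changing inhibitor levels by mutation retunes the switch.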
Andrew Grimson, Massachusetts Inst. of Technology
Animal microRNAs: their ancient origin and contemporary targets
Hundreds of microRNAs (miRNAs) collectively regulate a substantial fraction of the animal transcriptome. Because virtually all aspects of biology are likely impinged upon by miRNAs, the identification of the mRNAs targeted by each miRNA remains a fundamental question. Specific ~7-nt recognition sequences, located primarily in 3′ UTRs, are important for target recognition. These sites are complementary to the 5′ end, or seed region, of the miRNA. However, seed matches are not sufficient for repression, indicating that other characteristics help specify miRNA targeting. By combining computational and experimental approaches, we discovered five features of site context that govern site efficacy. We developed a model that combines these context determinants to quantitatively predict site performance, thereby indicating which of the thousands of potential miRNA-target relationships are functional. The predictions are made without recourse to site conservation, and are therefore effective at predicting a wide variety of target interactions, including nonconserved sites and siRNA off-target effects.
The scale of transcriptome regulation by miRNAs, together with the extent of miRNA conservation between bilaterians (e.g., humans, flies, and worms), is evidence for the importance of miRNA biology during animal evolution. In addition to miRNAs, other bilaterian small RNAs, known as Piwi-interacting RNAs (piRNAs), protect the genome from transposons. Neither miRNAs nor piRNAs were known to exist in the simplest, pre-bilaterian, animal phyla, raising the question of whether a rich small-RNA biology is characteristic of more complex animals, or whether these small RNAs might have emerged earlier in metazoan evolution. To gain perspective on the evolution of miRNAs and piRNAs, we used high-throughput sequencing to identify small RNAs from several basal animal lineages that diverged prior to the emergence of the Bilateria. We found that the cnidarian Nematostella vectensis, a relatively close relative of bilaterians, possesses an extensive repertoire of miRNA genes, two classes of piRNAs, and a complement of proteins specific to small-RNA biology comparable to that of humans. Similarly, the sponge Amphimedon queenslandica, amongst the simplest of animals and a distant relative of bilaterians, also possesses miRNAs, piRNAs and a full complement of small-RNA machinery. These data indicate that both miRNAs and piRNAs have existed from the earliest stages of metazoan evolution and have been available to shape gene expression throughout the evolution and radiation of animal phyla.
Eric Deeds, Harvard Medical School
Dynamic individuality in protein-protein interaction networks
Protein-protein interactions play a crucial role in all cellular processes, from the regulation of gene expression to the transduction and processing of extracellular signals. Over the past decade, high-throughput techniques such as Yeast 2-Hybrid (Y2H) and Tandem Affinity Purification (TAP-tagging) have provided a global picture of what the entire protein-protein interaction (PPI) network in certain organisms might look like. While these methods are often quite noisy (with potentially high rates of false positives and false negatives), they have nonetheless served as the substrate for a large body of work aimed at characterizing or explaining the general topological structure of these networks. Such purely topological studies are limited, however, by the fact that they consider a static description of an inherently dynamical system. A full characterization and understanding of the behavior of PPI networks clearly requires that one be able to describe and understand the dynamics of hundreds to thousands of objects physically interacting with one another. In this work we employ recently developed rule-based modeling techniques to perform the first large-scale stochastic simulations of the PPI network found in the cytoplasm of yeast cells. These simulations reveal that cells prepared in identical initial conditions will, at steady state, differ considerably from one another in terms of the identities of the large protein complexes found in each. Our results indicate that such dynamic individuality may arise in many complex interaction and signaling networks.
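Rule-based models of the kind described above are typically executed with a stochastic simulation algorithm, whose run-to-run variability is exactly what gives rise to cell-to-cell individuality. The minimal Gillespie sketch below (not the authors' rule-based engine) conveys the idea on a single binding reaction:

```python
import random

def gillespie(rates, stoich, x0, t_end, seed=None):
    """Minimal Gillespie SSA: rates(x) returns reaction propensities,
    stoich[i] the integer state change applied when reaction i fires."""
    rng = random.Random(seed)
    t, x = 0.0, list(x0)
    traj = [(0.0, tuple(x))]
    while t < t_end:
        a = rates(x)
        a0 = sum(a)
        if a0 == 0:
            break                       # no reaction can fire
        t += rng.expovariate(a0)        # exponential waiting time
        r = rng.random() * a0           # pick a reaction by propensity
        i = 0
        while r > a[i]:
            r -= a[i]
            i += 1
        x = [xi + si for xi, si in zip(x, stoich[i])]
        traj.append((t, tuple(x)))
    return traj
```

Running many such trajectories from identical initial conditions and comparing the resulting complex compositions at steady state is, in spirit, how simulations of thousands of interacting proteins reveal dynamic individuality.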
Su-In Lee, Carnegie Mellon University
Individual Genetic Variation and Gene Regulation: From Networks to Mechanisms
Gene expression data of genetically diverse individuals (eQTL data) provide a unique perspective on the effect of genetic variation on cellular pathways, and help identify sequence variations with phenotypic effect. However, the large number of possible regulatory interactions, combined with the challenges of linkage disequilibrium (LD), makes it difficult to correctly identify causal polymorphisms. To resolve this problem, researchers traditionally apply heuristics for selecting among plausible hypotheses, favoring polymorphisms that are more conserved, that lead to significant amino acid change, or that reside in genes whose function is related to that of the targets. We can construct a list of properties (called regulatory features) that indicate how likely a polymorphism with a given property is to change the gene regulatory network. But how do we know how much weight to attribute to different regulatory features? This talk describes a novel method, called Lirnet (linear regulation network), for identifying regulatory networks from eQTL data. Lirnet automatically learns from eQTL data how to weight regulatory features and induce a regulatory potential for candidate sequence variations. Lirnet estimates these weights simultaneously with learning a regulatory network, finding weights that lead to a more predictive network. This ability to learn the importance of regulatory features automatically makes Lirnet especially advantageous for mammalian systems, where many forms of prior knowledge used in simple model organisms are incomplete or unavailable.
We apply Lirnet to eQTL data in yeast, mouse and human (Phase II HapMap data), and provide statistical and biological results demonstrating that Lirnet produces significantly better regulatory programs than other recent approaches. We demonstrate in the yeast data that Lirnet can correctly suggest a specific causal sequence variation within a large, linked chromosomal region. In yeast, Lirnet uncovered a novel, experimentally validated connection between Puf3, a sequence-specific RNA binding protein, and P-bodies, cytoplasmic structures that regulate translation and RNA stability, as well as the particular causative polymorphism, a SNP in Mkt1, that induces the variation in the pathway.
Derek Ruths, Rice University
Execution Strategies for Executable Biological Models
Progress in advancing our understanding of biological systems is limited by their sheer complexity, the cost of laboratory materials and equipment, and limitations of current laboratory technology. Computational and mathematical modeling provides ways to address these limitations through hypothesis generation and testing without experimentation – allowing researchers to analyze system structure and dynamics in silico and, then, design lab experiments that yield desired information about phenomena of interest. These models, however, are only as accurate and complete as the data used to build them. Currently most models are constructed from quantitative experimental data. However, since accurate quantitative measurements are hard to obtain and difficult to adapt from literature and online databases, new sources of data for building models need to be explored. In my research, I design methods for building and executing computational models of cellular networks based on qualitative experimental data, which is more abundant, easier to obtain, and reliably reproducible. Such executable models allow for in silico perturbation, simulation, and exploration of biological systems. In this talk, I will present two general strategies for building and executing Petri net-based models of biochemical networks. Both have been successfully used to model and predict the dynamics of signaling networks in normal and cancer cell lines, rivaling the accuracy of existing methods trained on quantitative data.
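A Petri net models a biochemical network as places (molecular species) holding tokens, connected by transitions (reactions) that consume and produce tokens. The sketch below is a deliberately simplified, hypothetical illustration of one execution strategy (fire the first enabled transition until none is enabled), not the specific strategies presented in the talk; the ligand/receptor example names are invented for illustration.

```python
def fire(marking, transition):
    """Fire a Petri net transition if enabled; return the new marking, else None.

    A transition is a (consumed, produced) pair of {place: tokens} dicts;
    a marking is a {place: tokens} dict.
    """
    consumed, produced = transition
    if any(marking.get(p, 0) < n for p, n in consumed.items()):
        return None  # not enabled: some input place lacks tokens
    new = dict(marking)
    for p, n in consumed.items():
        new[p] -= n
    for p, n in produced.items():
        new[p] = new.get(p, 0) + n
    return new

def run(marking, transitions, steps):
    """One simple execution strategy: repeatedly fire the first enabled
    transition, stopping at a deadlock or after `steps` firings."""
    for _ in range(steps):
        for t in transitions:
            new = fire(marking, t)
            if new is not None:
                marking = new
                break
        else:
            break  # deadlock: no transition enabled
    return marking
```

For example, a single binding reaction ligand + receptor -> complex fires once and then deadlocks, leaving one token on the complex place.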
This work is done in collaboration with Luay Nakhleh (Rice University) and Prahlad T. Ram (MD Anderson Cancer Center).
Phil Hyoun Lee, Queen’s University
Selecting single nucleotide polymorphisms for effective genetic association study
Genetic variation analysis holds much promise as a basis for understanding disease-gene association. In particular, single nucleotide polymorphisms (SNPs) are at the forefront of such studies, as they are the most common form of DNA variation in the genome. However, due to the tremendous number of candidate SNPs, there is a clear need to expedite genotyping and analysis by selecting and considering only a subset of all SNPs.
In this talk, I will present three machine learning applications that successfully address the problem of SNP selection and improve on the current state of the art. The first, a tag SNP selection approach, aims to choose a subset of SNPs whose allele information best represents the allele information of the unselected SNPs. Using the formalism of Bayesian networks, it selects a subset of independent and highly predictive SNPs, without limiting the number or the location of predictive tag SNPs. The second method is based on the functionality of SNPs: it aims to directly select a subset of SNPs that are likely to be disease-causing. Within a probabilistic framework, our integrative scoring system combines the functional assessments from a variety of bioinformatics tools and prioritizes SNPs according to their potential deleterious effects on major biological functions. Lastly, I will describe a new multi-objective optimization framework for identifying SNPs that are both informative as tags and functionally significant.
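The tag SNP selection problem can be made concrete with a much-simplified greedy variant based on pairwise r^2 (linkage disequilibrium), shown below. This is an assumed baseline for illustration only, not the Bayesian-network formulation described in the talk, which predicts unselected SNPs from combinations of tags rather than from a single best proxy.

```python
def greedy_tag_snps(genotypes, r2_threshold=0.8):
    """Greedy tag-SNP selection: repeatedly pick the SNP that tags
    (pairwise r^2 >= threshold) the most still-untagged SNPs.

    genotypes: {snp_name: list of 0/1 alleles across individuals}.
    Returns the list of chosen tag SNPs.
    """
    def r2(x, y):
        # Squared Pearson correlation between two allele vectors.
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
        vx = sum((a - mx) ** 2 for a in x) / n
        vy = sum((b - my) ** 2 for b in y) / n
        if vx == 0 or vy == 0:
            return 1.0 if vx == vy else 0.0  # monomorphic edge case
        return cov * cov / (vx * vy)

    untagged = set(genotypes)
    tags = []
    while untagged:
        # Pick the SNP covering the most untagged SNPs (it always covers itself).
        best = max(untagged, key=lambda s: sum(
            r2(genotypes[s], genotypes[o]) >= r2_threshold for o in untagged))
        tags.append(best)
        untagged -= {o for o in untagged
                     if r2(genotypes[best], genotypes[o]) >= r2_threshold}
    return tags
```

Two perfectly correlated SNPs collapse to one tag, while an independent SNP must be selected separately.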
Xin Gao, University of Waterloo
Zeroing in on fully automated NMR protein structure determination
High-throughput structural genomics requires parallelizable technologies for high-resolution protein structure determination. Nuclear Magnetic Resonance (NMR) would be such a technology if its tedious and lengthy process could be fully automated. In this talk, I will describe our efforts toward a fully automated protocol for NMR protein structure determination. We have developed a singular value decomposition-based peak picking method, PICKY, which achieves an average of 88% recall and 74% precision over 32 raw spectra extracted from eight proteins. Existing resonance assignment methods, however, do not work well on incomplete and imperfect peak lists. Consequently, we have designed an integer linear programming-based assignment method, which significantly outperforms other existing programs on both perfect and noisy peak lists. To work from these partial resonance assignments, we developed FALCON-NMR, a hidden Markov model-based torsion angle sampling method. The whole system, AMR, has been successfully tested on four proteins with molecular weights of approximately 15 kDa.
William Noble, University of Washington
Machine learning analysis of shotgun proteomics data
Mass spectrometry has become the most widely used tool for the characterization of proteins within complex mixtures. In this talk, I will describe several successful applications of machine learning to improve the rate at which we can correctly assign peptide sequences to observed tandem mass spectra. We use supervised and semi-supervised discriminative learning methods to train a classifier that discriminates between correctly and incorrectly annotated spectra. Unlike previous methods, the classifier can be trained dynamically on each given data set, thereby adjusting to particular characteristics of the sample preparation protocol, machine platform, calibration and chromatography conditions. We have also trained a dynamic Bayesian network to model the process of peptide fragmentation within the mass spectrometer. The resulting model yields useful insights into fragmentation biochemistry as well as significantly improved peptide identification performance.
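Semi-supervised rescoring of peptide-spectrum matches (PSMs) rests on the target-decoy strategy: matches against a reversed or shuffled "decoy" database calibrate how many high-scoring target matches are expected by chance. The sketch below shows only that standard q-value computation, which supplies the confident positives and known negatives for the classifier; it is an assumed, simplified building block, not the discriminative learning method itself.

```python
def target_decoy_qvalues(scores, is_decoy):
    """q-values for target PSMs via target-decoy competition.

    At each score cutoff, FDR ~= (#decoys >= cutoff) / (#targets >= cutoff);
    a PSM's q-value is the minimum FDR over all cutoffs that admit it.
    Returns {original index of target PSM: q-value}.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    targets, decoys = 0, 0
    fdr = {}
    for i in order:  # walk down from the best score
        if is_decoy[i]:
            decoys += 1
        else:
            targets += 1
            fdr[i] = decoys / targets
    # Convert FDR estimates to monotone q-values (running minimum
    # taken from the worst-scoring target upward).
    qvals = {}
    running_min = 1.0
    for i in reversed([j for j in order if not is_decoy[j]]):
        running_min = min(running_min, fdr[i])
        qvals[i] = running_min
    return qvals
```

Targets ranked above every decoy get a q-value of zero; a target ranked below one decoy out of three accepted targets gets q = 1/3.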
Gad Kimmel, University of California, Berkeley
Computational Problems in Human Genetics
The question of how genetic variation and personal health are linked is one of the compelling puzzles facing scientists today. The ultimate goal is to exploit human variability to find genetic causes for multi-factorial diseases such as cancer and coronary heart disease. Recent technological improvements enable the typing of millions of single nucleotide polymorphisms (SNPs) for a large number of individuals. Consequently, there is a great need for efficient and accurate computational tools for rigorous and powerful analysis of these data. In my talk I will concentrate on two computational problems that are essential steps in analyzing the data obtained with this technology: accurate and efficient significance testing with a correction for population stratification, and estimating local ancestries in admixed populations.
Itamar Simon, Hebrew University
A high resolution map of mouse genome replication timing suggests a role in gene regulation
Although it is known that genomes are divided into distinct replication time zones, a more detailed understanding of their organization is limited. Taking advantage of a novel synchronization method and of genomic DNA microarrays we have mapped replication times of the entire mouse genome at a high temporal resolution. The measurement results have allowed us to assign distinct replication times to 91% of the genome, define asynchronously replicating regions and identify very large replicons. Analysis of the association between replication and transcriptional features has revealed a correlation between replication and transcription potential as well as evolutionary conservation of replication timing. Finally, analysis of large replicons, and in particular of regions at which the time of replication differs from the time of replication of a distant origin, reveals that transcription is correlated with the actual time of replication and not with the time of origin activation. Overall, these findings suggest that early replication plays a causal role in potentiating gene transcription.
Olivier Elemento, Princeton University
Decoding the regulatory genome
Deciphering the non-coding regulatory genome has proved a formidable challenge. Despite the wealth of available gene expression data, there currently exists no broadly applicable method for characterizing the regulatory elements that shape the rich underlying dynamics. I will present a general framework for detecting such regulatory DNA and RNA motifs that relies on directly assessing the mutual information between sequence and gene expression measurements. Our approach makes minimal assumptions about the background sequence model and the mechanisms by which elements affect gene expression. This provides a versatile motif discovery framework, across all data types and genomes, with exceptional sensitivity and near-zero false-positive rates. Applications from yeast to human uncover novel putative and established transcription-factor binding and miRNA target sites, revealing rich diversity in their spatial configurations, pervasive co-occurrences of DNA and RNA motifs, context-dependent selection for motif avoidance, and the strong impact of post-transcriptional processes on eukaryotic transcriptomes. This approach complements our previous and ongoing work using comparative genomics, and represents a major contribution to our ongoing effort to systematically characterize eukaryotic regulatory elements and understand their role in complex processes such as development, aging and disease.
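The core quantity in such a framework is the mutual information between a candidate motif's presence across genes and those genes' (discretized) expression profiles: an informative motif's presence pattern is predictive of expression, with no assumption about the direction or mechanism of the effect. The sketch below computes this quantity for one motif; the full method described above additionally involves motif optimization and randomization-based significance testing, which are omitted here.

```python
from collections import Counter
from math import log2

def mutual_information(motif_present, expr_cluster):
    """Mutual information (in bits) between a binary motif-presence vector
    and discretized expression cluster labels, one entry per gene.

    I(M; E) = sum over (m, e) of p(m, e) * log2(p(m, e) / (p(m) * p(e))).
    """
    n = len(motif_present)
    joint = Counter(zip(motif_present, expr_cluster))  # joint counts
    pm = Counter(motif_present)                        # marginal over presence
    pe = Counter(expr_cluster)                         # marginal over clusters
    mi = 0.0
    for (m, e), c in joint.items():
        p_joint = c / n
        mi += p_joint * log2(p_joint / ((pm[m] / n) * (pe[e] / n)))
    return mi
```

A motif present in exactly the up-regulated genes yields 1 bit of information; a motif whose presence is independent of expression yields 0.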
Philip Kim, Yale University
Jumping scales: How 3D structures and molecular genetics meet in protein networks
Protein interaction networks form the central layer of a systems-level description of the cell. While most studies of protein networks operate on a high level of abstraction, neglecting structural and chemical aspects of each interaction, I will describe our approach of characterizing interactions by using atomic-resolution information from three-dimensional protein structures. We find that some previously recognized relationships between network topology and genomic features (e.g., hubs tending to be essential proteins) are actually more reflective of a structural quantity, the number of distinct binding interfaces. Subdividing hubs with respect to this quantity provides insight into their evolutionary rate and indicates that additional mechanisms of network growth are active in evolution.
Furthermore, I will provide an overview of a major international collaborative effort that aims to resolve interactions involved in signaling pathways. These tend to involve intrinsically disordered regions and are hence complementary to the structured interactions studied by the above approach. Our approach combines modern experimental screening techniques with a novel integrated analysis pipeline. The screens measure binding specificities with hitherto unachievable accuracy, and the analysis pipeline maximizes prediction accuracy by integrating a variety of genomic and proteomic features.
Lastly, I will present a study that examined the relationship between genetic signatures of adaptive evolution and proteomic properties, such as the location of sites in protein networks and structures. Due to recent advances in genotyping and sequencing technology, human genetic variation and adaptive evolution in the primate lineage have become a major research focus. We find a striking tendency of proteins that have been subject to adaptive evolution (as compared to the chimpanzee) to be located at the periphery of the interaction network. We also find that the fixation of large-scale copy number variants into segmental duplications preferentially occurs at the network periphery, bolstering our argument for selection at the periphery. This suggests that the observed preferential selection at the network periphery may be due to an increase of adaptive events on the cellular periphery responding to changing environments.
Han Liang, University of Chicago
System Structures and MicroRNA regulation in humans: a view of systems biology
MicroRNAs are ~22-nt non-coding RNAs that can post-transcriptionally repress the expression of many protein-coding genes in higher eukaryotes. Recently available functional genomic data enable us to examine the regulatory role of microRNAs at the system level. Integrating human protein-protein interaction and microRNA targeting data, I found a global correlation between protein connectivity and microRNA regulation complexity in the corresponding genes, and that microRNA regulation likely coordinates the behavior of interacting partners. To understand the evolution of microRNA-mediated regulation in humans, I evaluated the role of three types of nucleotide variation on microRNA targeting: variation between species, variation within populations, and epigenetic variation. While purifying selection appears to be a driving force maintaining the stability of microRNA regulation at the system level, a small number of variants may have significant functional effects. In particular, I found an appreciable level of polymorphism at microRNA target sites (including SNPs with a signature of positive selection or within important disease genes), which suggests that allele-specific microRNA regulation is an important source of phenotypic differences among individuals.
Ge Yang, Scripps Research Institute
Metaphase spindle architecture and molecular motor coordination revealed by model driven computer vision
The development of biology over the past half century makes it possible to identify the complete set of genes and proteins of an organism. A fundamental challenge remains, however, to understand the complex dynamics of and interactions between the many individual molecular components involved, in situ and in space and time. Of particular importance in addressing this challenge is to understand how force and motion are generated, transmitted, and controlled within dynamic cellular structures during basic cellular processes. In this presentation, I will focus on addressing this question in two such processes: cell division and intracellular transport. First, single-fluorophore imaging and biochemical perturbation are used to investigate the architecture of the metaphase microtubule cytoskeleton in cell division. This assay provides a model system to understand how cytoskeletal filament networks are dynamically organized to transmit force and to directly generate force. Second, fluorescence imaging and genetic manipulation are used to probe the interaction between molecular motors in the axonal transport machinery of neurons. This assay provides a sufficiently reduced yet extremely powerful model system to understand the interactions between molecular motors of the same and opposite polarities in force and motion generation. Shared by both studies is the use of computer vision techniques, driven by mechanistic models, to extract high-resolution quantitative measurements of the complex spatio-temporal dynamics visualized by powerful fluorescence live-cell imaging techniques. These studies reveal fundamental and exquisite connections between force and motion generation and the dynamic organization of the cytoskeleton in cellular life.
Kevin Chen, New York University
Macro- and micro-evolution of gene regulation mediated by microRNAs
Studying the evolution of cis-regulatory elements is important for three general reasons. First, mutations in these elements can cause phenotypes of medical importance; second, understanding cis-element evolution will help us design algorithms for predicting these elements; third, regulatory evolution is important for understanding phenotypic evolution. In this talk, I will focus on a class of cis-elements called “microRNA sites”. MicroRNAs are small, noncoding RNAs that post-transcriptionally regulate their target mRNAs by binding to these sites. They have been implicated in many biological processes, including cancer and viral defense.
I will discuss the evolution of animal microRNA sites at two different time scales. At the macro-evolutionary time scale, we show that while the microRNA genes are well-conserved, overall their targets have diverged rapidly. However, there exists a core of deeply-conserved regulatory relationships that may be an important component of animal developmental networks. At the micro-evolutionary time scale, we use human SNP genotype data to demonstrate significant selective constraint on microRNA sites, implying that polymorphisms in these sites are candidates for causal variants of human disease. Our approach also applies to human-specific microRNA sites and we use it to identify a set of these sites in genes co-expressed with the microRNA.
James Taylor, New York University
Making sense of genome-scale data
High-throughput data production technologies are revolutionizing modern biology. Translating these experimental data into discoveries of relevance to human health relies on sophisticated computational tools that can handle large-scale data (e.g. multiple genome alignments of dozens of species, or genome-wide association studies involving billions of genotypes).
This talk will first discuss a specific large-scale data analysis problem: using comparative genomics to identify and understand functional genomic regions, particularly cis-regulatory elements. Using data generated by the ENCODE project, we will demonstrate the power of genome comparisons to distinguish these elements from neutral DNA and the importance of looking for more than just signs of strong evolutionary constraint. We will then describe a machine learning approach that goes beyond sequence conservation and attempts to capture broader and more informative sequence and evolutionary patterns that better distinguish different classes of elements. This approach, denoted ESPERR, uses training examples to learn encodings of multispecies alignments into reduced forms tailored for the prediction of chosen classes of functional elements. ESPERR has proven successful for a variety of classification problems. In particular, the "Regulatory Potential Score" produced using ESPERR has been used to identify putative regulatory elements with high rates of experimental validation.
Second, we will consider the more general problem of making sophisticated computational methods more available to experimental biologists. Many powerful analysis tools exist or are currently being developed, along with many excellent data warehouses and browsers. However, for the average experimental biologist with limited computer expertise, making effective use of these tools and data sources is still out of reach, because many existing tools do not have easy-to-use interfaces, and different tools and data sources are not well integrated. We have developed a framework and application, called Galaxy, that solves this problem by providing an integrated web-based workspace that bridges the gap between different tools and data sources. Galaxy simultaneously targets two audiences. For tool developers, it eliminates the repetitive effort involved in creating high-quality user interfaces, while giving them the benefit of being able to provide their tools in an integrated environment. For experimental biologists, it allows running complex analyses on huge datasets with nothing more than a web browser, without needing to worry about the details of installing tools, allocating computing resources, or file format compatibility. Galaxy is not only easy to use, it is also easy to deploy: a developer or lab can create their own Galaxy instance and start integrating custom tools with only a few minutes' work.
Insuk Lee, University of Texas at Austin
Network biology approaches to study complex traits
The relationship between genotype and phenotype is a central issue in genetics, and approaches are needed that allow us to interpret the increasing collection of data on genotypic variation in terms of its effect on organismal phenotypes. Our understanding of these relationships came historically from forward-genetics approaches, which have proved remarkably powerful, but which are still difficult in complex animals, and the complete definition of pathways from forward-genetic data alone is hard. In contrast, reverse-genetics approaches allow unbiased tests across entire genomes for associations with traits of interest, e.g., by using systematic genome-wide knock-out or silencing. However, reverse-genetics is in general labor intensive and time consuming, requiring enormous numbers of assays in order to span a large number of genes in combination with multiple experimental conditions.
Ideally, we would like to be able to choose which genes to target for reverse-genetics analyses, prioritizing the most likely candidates for involvement in a trait of interest. Such an approach would allow highly focused reverse-genetics studies to be performed, increasing both the sensitivity and efficiency of genetic screens. Here, we present a method for predicting gene loss-of-function phenotypes that can be applied to extend genetic screens and prioritize candidate genes for focused testing, from the single-celled yeast to the multicellular animal model C. elegans (worm) to human.
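The simplest form of such candidate prioritization is guilt-by-association over a weighted functional gene network: score each unannotated gene by the summed weight of its edges to genes already linked to the trait. The sketch below is this assumed baseline only (gene and seed names are invented); the method described in the talk builds on a probabilistically integrated network and more sophisticated scoring.

```python
def rank_candidates(network, seed_genes):
    """Guilt-by-association ranking: score each non-seed gene by the summed
    weight of its network edges to known seed genes for the trait.

    network: {(gene_a, gene_b): weight} with undirected edges.
    Returns candidate genes sorted from highest to lowest score.
    """
    scores = {}
    for (a, b), w in network.items():
        if b in seed_genes and a not in seed_genes:
            scores[a] = scores.get(a, 0.0) + w
        if a in seed_genes and b not in seed_genes:
            scores[b] = scores.get(b, 0.0) + w
    return sorted(scores, key=scores.get, reverse=True)
```

A gene strongly linked to two seed genes outranks one weakly linked to a single seed, focusing follow-up reverse-genetics assays on the most promising candidates.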