Graduate Courses Offered
02-601 Programming for Scientists
Provides a practical introduction to programming for students with little or no prior programming experience who are interested in science. Fundamental scientific algorithms will be introduced, and extensive programming assignments will be based on analytical tasks that might be faced by scientists, such as parsing, simulation, and optimization. Principles of good software engineering will also be stressed, and students will have the opportunity to design their own programming project on a scientific topic of their choice. The course will introduce students to the Go programming language, an industry-supported, modern programming language, the syntax of which will be covered in depth. Other assignments will be given in other programming languages such as Python and Java to highlight the commonalities and differences between languages. No prior programming experience is assumed, and no biology background is needed. Analytical skills and mathematical maturity are required. Course not open to CS majors.
This course gives master’s students an opportunity to develop the professional skills necessary for a successful career in computational biology. It includes assistance with resume writing, interview preparation, presentation skills, and job search techniques, as well as opportunities to network with computational biology professionals and academic researchers. The course meets once per week and is pass/fail only. The grading scheme will be discussed on the first day of class.
How do we find potentially harmful mutations in your genome? How can we reconstruct the Tree of Life? How do we compare similar genes from different species? These are just three of the many central questions of modern biology that can only be answered using computational approaches. This 12-unit course will delve into some of the fundamental computational ideas used in biology and let students apply existing resources that are used in practice every day by thousands of biologists. The course offers an opportunity for students who possess an introductory programming background to become more experienced coders within a biological setting. As such, it presents a natural next course for students who have completed 02-601.
This course gives MS in Automated Science students an opportunity to develop professional skills necessary for a successful career in computational biology. This course will include assistance with resume writing, interview preparation, presentation skills, and job search techniques. This course will also include opportunities to network with computational biology professionals and academic researchers.
The objective of this course is to study general computational problems, with a focus on the principles used to design algorithms that solve them. Efficient data structures will be discussed to support these algorithmic concepts. Topics include run-time analysis, divide-and-conquer algorithms, dynamic programming algorithms, network flow algorithms, linear and integer programming, large-scale search algorithms and heuristics, efficient data storage and query, and NP-completeness. Although this course will have several programming assignments, it is not primarily a programming course. Instead, it will focus on the design and analysis of algorithms for general classes of problems. This course is not open to CS graduate students, who should consider taking 15-651 instead.
02-614 String Algorithms
Provides an in-depth look at modern algorithms used to process string data, particularly those relevant to genomics. The course will cover the design and analysis of efficient algorithms for processing enormous collections of strings. Topics will include string search; inexact matching; string compression; string data structures such as suffix trees, suffix arrays, and searchable compressed indices; and the Burrows-Wheeler transform. Applications of these techniques in biology will be presented, including genome assembly, transcript assembly, whole-genome alignment, gene expression quantification, read mapping, and search of large sequence databases. No knowledge of biology is assumed, and the topics covered will be of use in other fields involving large collections of strings. Programming proficiency is required.
With advances in scientific instruments and high-throughput technology, scientific discoveries are increasingly made from analyzing large-scale data generated from experiments or collected from observational studies. Machine learning methods that have been widely used to extract complex patterns from large speech, text, and image data are now being routinely applied to answer scientific questions. The course will select scientific questions that arise in genomics, population genetics, and medicine and discuss how to address these questions using machine learning techniques. It will cover disease-related genetic variant discovery with regression methods; clinical decision making for patients with classification methods; pathway discovery with clustering algorithms; learning gene regulatory networks with probabilistic graphical models; genome sequence analysis with hidden Markov models; making functional annotations of genomes with deep learning methods; and selecting an appropriate machine learning technique for the given scientific problem using learning theories. This course is intended for graduate students interested in learning machine learning methods for scientific data analysis and modeling. Programming skills and basic knowledge of linear algebra, probability, and statistics are assumed. Homework assignments will consist of written problems and analyses of genetic and genomic data drawn from the literature in biology. The course grade will be computed as the result of homework assignments, midterm tests, and class participation.
This course rigorously introduces fundamental topics in mathematics and statistics to first-year master’s students as preparation for more advanced computational coursework. Topics are sampled from information theory, graph theory, proof techniques, phylogenetics, combinatorics, set theory, linear algebra, neural networks, probability distributions and densities, multivariate probability distributions, maximum likelihood estimation, statistical inference, hypothesis testing, Bayesian inference, and stochastic processes.
Students completing this course will obtain a broad skillset of mathematical techniques and statistical inference as well as a deep understanding of mathematical proof. They will have the quantitative foundation to immediately step into an introductory master’s level machine learning or automation course. This background will also serve students well in advanced courses that apply concepts in machine learning to scientific datasets, such as 02-710 (Computational Genomics) or 02-750 (Automation of Biological Research). The course grade will be computed as the result of homework assignments, midterm tests, and class participation.
02-700 M.S. Research
This course is for M.S. students who wish to do supervised research for academic credit with a Computational Biology faculty member. Interested students should first contact the Professor with whom they would like to work. If there is mutual interest, the Professor will direct you to the Academic Programs Coordinator, who will enroll you in the course.
The course consists of weekly presentations by students and faculty on current topics in computational biology.
This course consists of weekly invited presentations on current computational biology research topics by leading scientists.
02-703 Special Topics in Bioinformatics and Computational Biology
This is a mini Special Topics course taught on an occasional basis to cover different topics in computational biology.
02-710 Computational Genomics
Dramatic advances in experimental technology and computational analysis are fundamentally transforming the basic nature and goal of biological research. The emergence of new frontiers in biology, such as evolutionary genomics and systems biology, is demanding new methodologies that can confront quantitative issues of substantial computational and mathematical sophistication. In this course we will discuss classical approaches and the latest methodological advances in the context of the following biological problems: 1) sequence analysis, focusing on gene finding and motif detection; 2) analysis of high-throughput molecular data, such as gene expression data, including normalization, clustering, pattern recognition, and classification; 3) molecular and regulatory evolution, focusing on phylogenetic inference and regulatory network evolution; 4) population genetics, focusing on how genomes within a population evolve through recombination, mutation, and selection to create various structures in modern genomes; and 5) systems biology, concerning how to combine diverse data types to make mechanistic inferences about biological processes. From the computational side, this course focuses on modern machine learning methodologies for computational problems in molecular biology and genetics, including probabilistic modeling, inference and learning algorithms, data integration, time series analysis, and active learning.
02-711 Computational Molecular Biology and Genomics
An advanced introduction to computational molecular biology, using an applied algorithms approach. The first part of the course will cover established algorithmic methods, including pairwise sequence alignment and dynamic programming, multiple sequence alignment, fast database search heuristics, hidden Markov models for molecular motifs, and phylogeny reconstruction. The second part of the course will explore emerging computational problems driven by the newest genomic research. Course work includes four to six problem sets, one midterm, and a final exam.
This course covers a variety of computational methods important for modeling and simulation of biological systems. It is intended for graduate students and advanced undergraduates with either biological or computational backgrounds who are interested in developing computer models and simulations of biological systems. The course will emphasize practical algorithms and algorithm design methods drawn from various disciplines of computer science and applied mathematics that are useful in biological applications. The general topics covered will be models for optimization problems, simulation and sampling, and parameter tuning. Course work will include problem sets with significant programming components and independent or group final projects.
Research in biology and medicine is undergoing a revolution due to the availability of high-throughput technology for probing various aspects of a cell at a genome-wide scale. Next-generation sequencing technology allows researchers to inexpensively generate a large volume of genome sequence data. Combined with various other high-throughput techniques for profiling the epigenome, transcriptome, and proteome, these data offer unprecedented opportunities to answer fundamental questions in cell biology and to understand disease processes with the goal of finding treatments in medicine. The challenge in this new genomic era is to develop computational methods for integrating different data types and extracting complex patterns accurately and efficiently from a large volume of data. This course will discuss computational issues arising from high-throughput techniques recently introduced in biology, and cover very recent developments in computational genomics and population genetics, including genome structural variant discovery, association mapping, epigenome analysis, cancer genomics, and transcriptome analysis. The course material will be drawn from very recent literature. Grading will be based on weekly written critiques of the papers to be discussed in class, class participation, and a final project. The course assumes a basic knowledge of machine learning and computational genomics.
02-716 Cross-Species Systems Modeling
Model organisms have long played an important role in basic science studies and in the pharmaceutical industry. These organisms, ranging from yeast to worms to flies, share many processes that are similar to those active in humans, which has made these and other animals the focus of many lab studies. Similarly, almost all drugs are initially tested on mice, making cross-species studies a key issue in drug development. However, many of the drugs that work well for mice fail in late-stage human trials. Likewise, many interactions between highly conserved proteins in one species are not conserved, even between very close species. In this class we will discuss recent studies that try to compare and contrast genomics and functional genomics data across species with the goal of identifying the conserved and divergent processes that are active in each of the species being studied. The class will be divided into three parts. The first will focus on sequence analysis and comparative genomics, covering issues related to whole-genome sequence alignment, motif discovery using conservation data, and miRNA identification using sequence data from multiple species. The second will focus on comparisons of a single type of functional genomics data, including gene expression, protein interactions, and protein-DNA interactions. This part will rely on recent studies regarding the integration of expression data across species; combining, comparing, and aligning protein interaction networks in multiple species; and experimental studies that compare protein-DNA interactions across species and in hybrids. In the final part of the class we will discuss methods that attempt to combine multiple functional genomics datasets for a systems biology comparison of interactions across species. Students will be required to present one or two papers and to complete a class project in which they compare or contrast genomics data across species.
02-717 Algorithms in Nature
Computer systems and biological processes often rely on networks of interacting entities to reach joint decisions, coordinate and respond to inputs. There are many similarities in the goals and strategies of biological and computational systems which suggest that each can learn from the other. These include the distributed nature of the networks (in biology molecules, cells, or organisms often operate without central control), the ability to successfully handle failures and attacks on a subset of the nodes, modularity and the ability to reuse certain components or sub-networks in multiple applications and the use of stochasticity in biology and randomized algorithms in computer science.
These observations, some dating back to the 1960s, have inspired the development of several computational methods and more recently led to several bi-directional studies. These studies have demonstrated that thinking computationally about the settings, requirements, and goals of information processing in biological networks can both improve our understanding of the underlying biology and lead to the development of novel computational methods that provide solutions to decades-old problems.
In this course we will start by discussing classic biologically motivated algorithms including neural networks (inspired by the brain), genetic algorithms (sequence evolution), non-negative matrix factorization (signal processing in the brain), and search optimization (ant colony formation). We will then continue to discuss more recent bi-directional studies that have relied on biological processes to solve routing and synchronization problems, discover Maximal Independent Sets (MIS), and design robust and fault tolerant networks. In the second part of the class students will read and present new research in this area. Students will also work in groups on a final project in which they develop and test a new biologically inspired algorithm.
For examples of recent research in this area, see www.algorithmsinnature.org
Prerequisite: 15-210. No prior biological knowledge is required.
02-718 Computational Medicine
Modern medical research increasingly relies on the analysis of large patient datasets to enhance our understanding of human diseases. This course will focus on the computational problems that arise from studies of human diseases and the translation of research to the bedside to improve human health. The topics to be covered include computational strategies for advancing personalized medicine, pharmacogenomics for predicting individual drug responses, metagenomics for learning the role of the microbiome in human health, mining electronic medical records to identify disease phenotypes, and case studies in complex human diseases such as cancer and asthma. We will discuss how machine learning methodologies such as regression, classification, clustering, semi-supervised learning, probabilistic modeling, and time-series modeling are being used to analyze a variety of datasets collected by clinicians. Class sessions will consist of lectures, discussions of papers from the literature, and guest presentations by clinicians and other domain experts. Grading will be based on presentations, assignments, participation, and a project.
This course will provide an introduction to genomics, epigenetics, and their application to problems in neuroscience. The rapid advances in single cell sequencing and other genomic technologies are revolutionizing how neuroscience research is conducted, providing tools to study how different cell types in the brain produce behavior and contribute to neurological disorders. Analyzing these powerful new datasets requires a foundation in molecular neuroscience as well as key computational biology techniques. In this course, we will cover the biology of epigenetics, how proteins sitting on DNA orchestrate the regulation of genes. In parallel, programming assignments and a project focusing on the analysis of a primary genomic dataset will teach principles of computational biology and their applications to neuroscience. The course material will also serve to demonstrate important concepts in neuroscience, including the diversity of neural cell types, neural plasticity, the role that epigenetics plays in behavior, and how the brain is influenced by neurological and psychiatric disorders. Although the course focuses on neuroscience, the material is accessible and applicable to a wide range of topics in biology.
02-721 Algorithms for Computational Structural Biology
Some of the most interesting algorithmic challenges in Biology and Bioengineering arise from the modeling, simulation, and engineering of biological macromolecules at or near atomic resolution. This course covers a variety of algorithms used to study and engineer the structure, dynamics, and function of proteins, nucleic acids, and other molecules. It is intended for graduate students and advanced undergraduates who are interested in topics such as protein folding, protein interactions, and computer-aided design of drugs and proteins. Students should have some experience with programming as well as introductory coursework in the design and analysis of algorithms. The course begins with a review of the necessary Biology, Chemistry, and Physics for those who haven’t seen these topics since high school. The topics covered will include algorithms for solving optimization, inference, simulation, and sampling problems that arise in the fields of structural and synthetic biology. Coursework will include 4 to 5 problem sets and an independent or group final project. Open to students with backgrounds in computer science or the life sciences, or by permission of the instructor.
02-722 Advanced Algorithms for Computational Structural Biology
This is a seminar-style course on the current literature in computational structural biology. Topics will include algorithms for designing drugs and proteins, as well as protein structure prediction and simulation. Students will be expected to read and discuss papers and complete a project of their own design. Open to students with backgrounds in computer science and structural biology, or by permission of the instructor.
Proteomics and metabolomics are the large-scale study of proteins and metabolites, respectively. In contrast to genomes, proteomes and metabolomes vary with time and with the specific stress or conditions an organism is under. Applications of proteomics and metabolomics include determination of protein and metabolite functions (including in immunology and neurobiology) and discovery of biomarkers for disease. These applications require advanced computational methods to analyze experimental measurements, create models from them, and integrate them with information from diverse sources. This course specifically covers computational mass spectrometry, structural proteomics, proteogenomics, metabolomics, genome mining, and metagenomics.
02-730 Cell and Systems Modeling
This course will introduce students to the theory and practice of modeling biological systems from the molecular to the organism level with an emphasis on intracellular processes. Topics covered include kinetic and equilibrium descriptions of biological processes, systematic approaches to model building and parameter estimation, analysis of biochemical circuits modeled as differential equations, modeling the effects of noise using stochastic methods, modeling spatial effects, and modeling at higher levels of abstraction or scale using logical or agent-based approaches. A range of biological models and applications will be considered including gene regulatory networks, cell signaling, and cell cycle regulation. Weekly lab sessions will provide students with hands-on experience with methods and models presented in class. Course requirements include regular class participation, bi-weekly homework assignments, a take-home exam, and a final project. Prerequisites: The course is designed for graduate and upper-level undergraduate students with a wide variety of backgrounds. The course is intended to be self-contained but students may need to do some additional work to gain fluency in core concepts. Students should have a basic knowledge of calculus, differential equations, and chemistry as well as some previous exposure to molecular biology and biochemistry. Experience with programming and numerical computation is useful but not mandatory. Laboratory exercises will use MATLAB as the primary modeling and computational tool, augmented by additional software as needed.
02-731 Modeling Evolution
Some of the most serious public health problems we face today, from drug-resistant bacteria to cancer, arise from a fundamental property of living systems: their ability to evolve. Since Darwin’s theory of natural selection was first proposed, we have begun to understand how heritable differences in reproductive success drive the adaptation of living systems. This makes it intuitive and tempting to view evolution from an optimization perspective. However, genetic drift, phenotypic trade-offs, constraints, and changing environments are among the many factors that may limit the optimizing force of natural selection. This tug-of-war between selection and drift, between the forces that produce variation in a population and the forces suppressing this variation, makes evolutionary processes much more complex to model and understand than previously thought.
The aim of this class is to provide an introduction to the theoretical formalism necessary to understand how biological systems are shaped by the forces and constraints driving evolutionary dynamics. I will introduce population genetic theory as a lens for understanding and interpreting modern datasets, such as datasets of worldwide human genomic and epigenomic variation or tumor genomic heterogeneity. By the end of the course, you should have learned to build evolutionary models and to recognize the basic differences between idealized models and the data you might encounter in real life. The class is group-project based and you will work together to explore open questions in evolution.
02-740 Bioimage Informatics
With the rapid advance of bioimaging techniques and fast accumulation of bioimage data, computational bioimage analysis and modeling are playing an increasingly important role in understanding complex biological systems. The goal of this course is to provide students with the ability to understand a broad set of practical and cutting-edge computational techniques to extract knowledge from bioimages. Such techniques include image filtering, image feature detection, image classification, image segmentation, object detection, object tracking, image retrieval, image mining, and image modeling using both traditional and deep learning methods. Upon successful completion of this course, the student will be able to: explain the importance and understand the principles and uses of both geometrical and machine learning-based bioimage analysis techniques; understand how these techniques can be combined for various applications; develop code to implement basic techniques; and solve specific bioimage analysis tasks using image-processing libraries. Coursework will include homework, two in-class examinations, and an independent project on a practical bioimaging problem. Students are expected to have some experience with programming in Python.
Automated scientific instruments are used widely in research and engineering. Robots dramatically increase the reproducibility of scientific experiments, and are often cheaper and faster than humans, but are most often used to execute brute-force sweeps over experimental conditions. The result is that many experiments are “wasted” on conditions where the effect could have been predicted. Thus, there is a need for computational techniques capable of selecting the most informative experiments.
This course will introduce students to techniques from Artificial Intelligence and Machine Learning for automatically selecting experiments to accelerate the pace of discovery and to reduce the overall cost of research. Real-world applications from Biology, Bioengineering, and Medicine will be studied. Grading will be based on homework assignments and two exams. The course is intended to be self-contained, but students should have a basic knowledge of biology, programming, statistics, and machine learning.
Computational biologists frequently focus on analyzing and modeling large amounts of biological data, often from high-throughput assays or diverse sources. It is therefore critical that students training in computational biology be familiar with the paradigms and methods of experimentation and measurement that lead to the production of these data. This one-semester laboratory course has been developed to give students a deep appreciation of the principles and challenges of biological experimentation. Students will explore a range of topics, including structural biology, genomics, proteomics, and bioimaging. Each broad topic is covered over a period of 3-4 weeks. Many lectures and labs are hosted by faculty who are experts in the field. Students are required to keep a detailed laboratory notebook, summarizing the goals of the experiment, critical steps, and analysis of the resulting data. With an emphasis on instrumentation and high-throughput data collection, this course is appropriate for students who have never taken a traditional undergraduate biology lab course, as well as those who have. Grading: Letter grade based on class participation, take-home exams, and a final project.
This is a graduate-level, laboratory-based course designed to teach the technical and biological laboratory skills used to design and execute automated biological experiments. Students will learn the principles, experimental paradigms, and techniques for automating biological experimentation, with the goal of enabling its complete automation. Students will learn the biological principles underlying various automatable experimental methods, the design concepts for automated experiments, engineering elements enabling hardware for preparing samples and doing automated data collection, and software for controlling that hardware. These topics will be taught in lectures as well as through laboratory experience using multi-purpose laboratory robotics. Instruments used will include liquid handling robots, plate readers, and automated microscopes. Grading will be based mainly on satisfactory completion of assignments.
This laboratory course provides a continuation and extension of experiences in 02-761. Instruction will consist of lectures and laboratory experience using multi-purpose laboratory robotics. During weekly laboratory time, students will complete and integrate parts of two larger projects. The first project will be focused on the execution of a molecular biology experiment requiring nucleic acid extraction, library preparation for sequencing, and quality control. The second project will be focused on the implementation and execution of automated methods using active learning techniques to direct the learning of a predictive model for a large experimental space (such as learning the effects of many possible drugs on many possible targets). Grading will be based on lab and project completion and quality.
This course is for students participating in an internship or co-op.
02-900 Ph.D. Thesis Research
This course is for students enrolled in the Ph.D. program working on research.