Tools Needed for Automating Science: Formalizing the use of Active Machine Learning to Drive Experimentation
Robert F. Murphy, Joshua D. Kangas, Christopher J. Langmead
Computational Biology Department, School of Computer Science, Carnegie Mellon University
October 19, 2017
There is a need to develop and deploy advanced technologies for fully automating the execution of science and engineering projects. These technologies could dramatically decrease the costs of research and engineering, while increasing throughput and reproducibility. Existing platforms for automating research merely execute experiments selected by humans. What is needed are generalizable technologies (open source software and community standards) capable of closed-loop hypothesis generation from available data, experiment selection, and automated execution.
Many biological and chemical systems are too complex for humans to understand completely, due to their scale and their nonlinear and stochastic behaviors. Traditionally, scientists and engineers choose and perform experiments to test hypotheses or to optimize designs. As a system's complexity increases, the number of possible experiments that could be performed to study it rises exponentially. Since resource constraints limit the number of experiments that can be performed, we are faced with the need to select a maximally informative set of experiments from a combinatorial space of possibilities, subject to financial and other constraints.
Unfortunately, the human mind is not well suited to solving this type of optimization problem, due most often to our inability to form predictive models at the scales involved. The result is that, in practice, many human-selected experiments are “wasted” on conditions where no effect is observed or, more importantly, where the effect is predictable from other experiments, given computational assistance. This waste of resources ultimately limits what scientists and engineers can accomplish. This type of problem is the realm of Active Learning [1-6], a sub-domain of Machine Learning focused on algorithms for iteratively choosing experiments expected to optimally improve an underlying computational model (Figure 1).
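The economics of this selection problem can be illustrated even in the simplest possible setting. The following toy sketch (an illustration, not production code) locates an unknown response threshold in a pool of 1,000 candidate experiments; here uncertainty sampling reduces to binary search, so the model is learned from roughly ten experiments instead of 1,000.

```python
# Toy pool-based active learning: locate an unknown decision threshold.
# Exhaustive labeling would require querying every point; selecting the
# most uncertain point each round (binary search) needs only O(log n).

def oracle(x, true_threshold=0.618):
    """Run the 'experiment': return the observed label for point x."""
    return 1 if x >= true_threshold else 0

def active_learn_threshold(pool, query):
    """Iteratively query the most uncertain point in a sorted pool."""
    lo, hi = 0, len(pool) - 1
    n_queries = 0
    while lo < hi:
        mid = (lo + hi) // 2          # most uncertain point under current model
        n_queries += 1
        if query(pool[mid]) == 1:     # execute the selected experiment
            hi = mid                  # threshold is at or below pool[mid]
        else:
            lo = mid + 1              # threshold is above pool[mid]
    return pool[lo], n_queries

pool = [i / 1000 for i in range(1000)]   # 1,000 candidate experiments
estimate, n = active_learn_threshold(pool, oracle)
print(estimate, n)   # finds the boundary with ~10 queries, not 1,000
```

Real experimental spaces are of course high-dimensional and noisy, but the same principle applies: each experiment is chosen for its expected information gain under the current model.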
While active learning could provide benefits for essentially all large-scale screening and experimentation, such as drug development [7,8], there are significant barriers to its routine use. Perhaps the most significant is the absence of robust, readily available software that any group embarking on large-scale experimentation could adopt. We therefore suggest the need for the development of open source tools and open access standards to enable routine active learning-driven experimentation. We suggest tools are needed for four connected tasks: random access experimentation; experimental data analysis; predictive model construction; and active learning experiment selection (Figure 2).
The first component is the most involved, in that it may be highly specialized for particular types of experiments. The first step is for the experimenter to communicate to an automated system the specifics of how to perform an experiment and what experiments are allowed (e.g., which cell lines and drugs may be chosen from). The former is simply a protocol to be executed by, for example, liquid-handling robots and automated measurement systems, and open standards for specifying such protocols already exist. The latter is simply a definition of the source plates/libraries. However, most current systems can only run a protocol on entire rows, columns or plates. The key to using such systems for active learning is to allow a computer to specify an arbitrary set of experiments that does not conform to these limitations (e.g., cells 1a, 4c, 9f and 2e). We suggest the need for collaboration between software developers, instrument manufacturers and contract research organizations to implement such systems and to create open standards by which a computer can communicate a desired set of experiments to an automated system without human intervention.
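As a sketch of what such an open standard might look like (the schema and field names below are hypothetical, not an existing format), a request could simply be a machine-readable list of individually addressed experiments:

```python
import json

# Hypothetical experiment-request schema: the active learner names
# arbitrary well-level experiments rather than whole rows, columns or
# plates. All field names here are illustrative, not an existing standard.
request = {
    "protocol_id": "compound_screen_v1",        # pre-registered protocol
    "allowed_sources": ["plate_A", "plate_B"],  # source plates/libraries
    "experiments": [
        {"cell_line": "line_1", "compound_well": "A04", "dose_uM": 1.0},
        {"cell_line": "line_9", "compound_well": "F06", "dose_uM": 0.1},
        {"cell_line": "line_2", "compound_well": "E05", "dose_uM": 1.0},
    ],
}

message = json.dumps(request, indent=2)   # what the learner would transmit
decoded = json.loads(message)             # what the instrument would parse
print(len(decoded["experiments"]))        # three individually addressed experiments
```

The essential property is not the serialization format but that each experiment is addressed individually, so the learner's selections need not align with plate geometry.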
The second component, experimental data analysis, is also specialized for a particular type of data or problem, and would typically be paired with a particular protocol or instrument type. However, much work has been done on automated analysis and modeling pipelines for various data sources, and the interfaces to such pipelines can be standardized.
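One plausible way to standardize those interfaces is sketched below, under the assumption that every analysis pipeline can be reduced to "raw measurement in, feature vector out"; the class and method names are illustrative only.

```python
from abc import ABC, abstractmethod

# Sketch of a standardized analysis interface: every analyzer, however
# specialized internally, maps one raw measurement to a numeric feature
# vector that downstream modeling components can consume.
class MeasurementAnalyzer(ABC):
    @abstractmethod
    def analyze(self, raw_measurement) -> list:
        """Return a numeric feature vector for one experiment."""

class MeanIntensityAnalyzer(MeasurementAnalyzer):
    """Toy image analyzer: summarize a 2-D intensity array."""
    def analyze(self, raw_measurement) -> list:
        flat = [v for row in raw_measurement for v in row]
        mean = sum(flat) / len(flat)
        return [mean, max(flat) - min(flat)]   # mean and range of intensities

analyzer = MeanIntensityAnalyzer()
features = analyzer.analyze([[0.1, 0.3], [0.5, 0.1]])
print(features)
```

Because the model-construction component sees only feature vectors, analyzers for images, sequencing reads or mass spectra become interchangeable plug-ins.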
There has been significant work on the third component, predictive model construction, in the context of large experimental spaces. Much of this work concerns matrix completion methods that construct a predictive model for an entire space given data for some parts of that space. For example, such models have been built for drug-target interactions in the setting of drug discovery [9-11]. However, that work has focused on predicting which new drugs will interact with known targets given data on the interactions of known drugs with those targets. In most settings, this has meant providing complete data for the subset of known drugs in order to train the model (i.e., values for many complete columns of the drug-target matrix); the assumption was that one would do no new experiments but simply try to predict their outcomes from a large body of comprehensive data. Recent work has addressed the setting in which the training data are non-uniformly distributed over the drug-target matrix [12,13].
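To make the matrix completion idea concrete, the toy sketch below fits a rank-1 factorization to the observed entries of a small "drug-target" matrix and predicts a held-out entry. Published methods [9-11] use regularized, higher-rank and kernelized variants; this minimal gradient-descent version only illustrates the principle.

```python
import random

# Minimal matrix-completion sketch: fit a rank-1 model M ~ u * v^T to the
# observed entries of a drug-target matrix, then predict a missing entry.
random.seed(0)

# Ground truth is rank-1; hide entry (2, 2) and try to recover it.
u_true, v_true = [1.0, 2.0, 3.0], [1.0, 0.5, 2.0, 1.5]
observed = {(i, j): u_true[i] * v_true[j]
            for i in range(3) for j in range(4) if (i, j) != (2, 2)}

u = [random.random() for _ in range(3)]
v = [random.random() for _ in range(4)]
lr = 0.01
for _ in range(2000):                      # gradient descent on observed cells only
    for (i, j), m in observed.items():
        err = u[i] * v[j] - m
        u[i], v[j] = u[i] - lr * err * v[j], v[j] - lr * err * u[i]

prediction = u[2] * v[2]                   # the held-out drug-target pair
print(round(prediction, 2))                # close to the true value 6.0
```

In the active learning setting, the same model would also supply uncertainty estimates over the unobserved cells, which the experiment-selection engine then exploits.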
The fourth component is the active learning engine itself. Active learning has been studied in many contexts and for a number of different criteria for choosing experiments. However, the vast majority of this work has been retrospective: a large, complete dataset is ‘hidden’ from the active learner and individual data points are revealed upon request. This setting makes it possible to calculate the accuracy of the model at any point in the active learning process, because all of the data are actually available. It does not apply to real-world applications, in which the point is to avoid collecting all of the data. Additional work is therefore needed on approaches for estimating the accuracy of an actively learned model, so that we can know when the model is good enough to stop acquiring data.
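One candidate heuristic, offered only as a sketch since no standard approach exists, is to monitor how much the model's predictions over the remaining experimental space change between acquisition rounds, and stop once they have been stable for several rounds:

```python
# Hedged prospective stopping heuristic: without ground truth, use the
# stability of pool predictions across rounds as a proxy for convergence.

def fraction_changed(prev, curr):
    """Fraction of pool predictions that changed between two rounds."""
    return sum(p != c for p, c in zip(prev, curr)) / len(curr)

def should_stop(history, tol=0.01, patience=3):
    """Stop once predictions have been stable for `patience` consecutive rounds."""
    recent = history[-patience:]
    return len(recent) == patience and all(f <= tol for f in recent)

# Simulated pool predictions (one per candidate experiment) after each round.
rounds = [
    [0, 0, 0, 1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1, 1, 1, 0],
    [0, 1, 0, 1, 1, 1, 1, 0],
    [0, 1, 0, 1, 1, 1, 1, 0],
    [0, 1, 0, 1, 1, 1, 1, 0],
]
history = []
stop_round = None
for r, (prev, curr) in enumerate(zip(rounds, rounds[1:]), start=2):
    history.append(fraction_changed(prev, curr))
    if should_stop(history):
        stop_round = r
        break
print(stop_round)   # acquisition halts after round 6
```

Stability is only a proxy, of course; a model can be stably wrong, which is exactly why principled accuracy estimation in the prospective setting remains an open problem.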
Conclusion: Automation is the future of science and engineering. It will dramatically reduce the costs of discovery and development, while increasing throughput and reproducibility. More importantly, the use of automated model building and experiment selection via active learning will overcome the limits of the human mind when it comes to reasoning about complex systems and the data they produce.
Figure 1. The Active Learning Cycle. The key is to iteratively select and execute experiments based on the current predictive model. Note that this is not the traditional “systems biology” approach that focuses on constructing a predictive model using data from a very large set of experiments and then trying to “prove” the model by doing selected additional experiments to verify high-confidence predictions. Such approaches ignore the fact that it is impossible to prove empirical models, and that the most appropriate use of new data is to improve a model! Note also that this is different from trying to predict everything in silico – the active learning approach optimally combines computational prediction and experimental data acquisition.
Figure 2. Components to be developed by the STC. 1) Tools for the experimenter to communicate to an automated system the specifics of what experiments are allowed (e.g., which cell lines and compounds may be chosen from) and how to perform them. 2) Tools for processing measurements (e.g., image analysis). These are specific to each type of study. 3) Tools for converting processed data into predictive models. This uses traditional machine learning methods or system identification methods, depending on the study. 4) Active learning engines. Most past work has been retrospective: a large, complete dataset is ‘hidden’ from the active learner and individual data points are revealed upon request. The STC will demonstrate the utility in real-world, prospective settings.
Figure 3. Active Learning Examples. In a retrospective study of drug effects, 57% of active compounds were discovered with only 2.5% of possible experiments. In a prospective study, a 92% accurate model of complex phenotypes was obtained after only 28% of possible experiments.
[1] B. Settles (2010) Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison.
[2] M. Balcan, A. Beygelzimer and J. Langford (2009) Agnostic active learning. Journal of Computer and System Sciences 75(1):78-89.
[3] M. Balcan, S. Hanneke and J.W. Vaughan (2010) The true sample complexity of active learning. Machine Learning 80(2):111-139.
[4] G. Dasarathy, A. Singh, M. Balcan and J.H. Park (2016) Active Learning Algorithms for Graphical Model Selection. Proceedings of Machine Learning Research (AISTATS) 51:1356-1364.
[5] Y. Wang and A. Singh (2016) Noise-adaptive Margin-based Active Learning for Multi-dimensional Data and Lower Bounds under Tsybakov Noise Condition. AAAI Conference on Artificial Intelligence (AAAI 2016).
[6] P. Donmez and J.G. Carbonell (2008) Proactive Learning: Cost-Sensitive Active Learning with Multiple Imperfect Oracles. Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM ’08), pp. 619-628.
[7] R.F. Murphy (2011) An active role for machine learning in drug development. Nature Chemical Biology 7:327-330.
[8] D. Reker and G. Schneider (2015) Active-learning strategies in computer-assisted drug discovery. Drug Discovery Today 20:458-465.
[9] M. Gönen (2012) Predicting drug-target interactions from chemical and genomic kernels using Bayesian matrix factorization. Bioinformatics 28:2304-2310.
[10] X. Zheng, H. Ding, H. Mamitsuka and S. Zhu (2013) Collaborative matrix factorization with multiple similarities for predicting drug-target interactions. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1025-1033.
[11] Y. Yamanishi, M. Araki, A. Gutteridge, W. Honda and M. Kanehisa (2015) Collaborative matrix factorization with multiple similarities for predicting drug-target interactions. Methods 83:98-104.
[12] A.W. Naik, J.D. Kangas, C.J. Langmead and R.F. Murphy (2013) Efficient Modeling and Active Learning Discovery of Biological Responses. PLoS ONE 8:e83996.
[13] J.D. Kangas, A.W. Naik and R.F. Murphy (2014) Prediction of Biological Responses Using Protein and Compound Features and their Discovery using Active Learning. BMC Bioinformatics 15:143.
[14] A.W. Naik, J.D. Kangas, D.P. Sullivan and R.F. Murphy (2016) Active Machine Learning-driven Experimentation to Determine Compound Effects on Protein Patterns. eLife 5:e10047.