Abstract
Deep learning approaches have been increasingly applied to the discovery of novel chemical compounds. These predictive approaches can accurately model compounds and increase true discovery rates, but they are typically black box in nature and do not generate specific chemical insights. Explainable deep learning aims to ‘open up’ the black box by providing generalizable and human-understandable reasoning for model predictions. These explanations can augment molecular discovery by identifying structural classes of compounds with desired activity rather than lone compounds. Additionally, these explanations can guide hypothesis generation and make searching large chemical spaces more efficient. Here we present an explainable deep learning platform that enables vast chemical spaces to be mined and the chemical substructures underlying predicted activity to be identified. The platform relies on Chemprop, a software package implementing graph neural networks as a deep learning model architecture. Relative to similar approaches, graph neural networks have been shown to achieve state-of-the-art performance in molecular property prediction. Focusing on discovering structural classes of antibiotics, this protocol provides guidelines for experimental data generation, model implementation, and model explainability and evaluation. This protocol does not require coding proficiency or specialized hardware, and it can be executed in as little as 1–2 weeks, starting from data generation and ending in the testing of model predictions. The platform can be broadly applied to discover structural classes of other small molecules, including anticancer, antiviral and senolytic drugs, as well as to discover structural classes of inorganic molecules with desired physical and chemical properties.
Key points
- This protocol enables the computational discovery of chemical compounds using a deep learning architecture called graph neural networks, which, given the chemical structure of any compound, can predict whether the compound has a property of interest.
- The platform leverages explainable deep learning to facilitate the identification of structural classes of novel compounds. This approach guides hypothesis generation and makes searching large chemical spaces more efficient compared with previous approaches, which are typically black box in nature and do not generate specific chemical insights.
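To make the first key point concrete, the following is a minimal sketch of how a graph neural network can map a chemical structure to a predicted property. It is a toy illustration, not Chemprop's implementation: the atom features, the fixed readout weights and the function names (`message_passing`, `readout`) are assumptions chosen for this example, whereas Chemprop learns its weights from training data and passes messages along directed bonds.

```python
"""Toy message passing over a molecular graph (illustration only, not Chemprop)."""
import math

# Ethanol (CH3-CH2-OH), heavy atoms only: atom index -> element symbol.
ATOMS = {0: "C", 1: "C", 2: "O"}
# Undirected bonds as pairs of atom indices.
BONDS = [(0, 1), (1, 2)]

# Hypothetical initial atom features: [is_carbon, is_oxygen].
FEATURES = {"C": [1.0, 0.0], "O": [0.0, 1.0]}


def neighbors(bonds, n_atoms):
    """Build an adjacency list from the bond list."""
    adj = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    return adj


def message_passing(atoms, bonds, steps=2):
    """Update each atom state from its neighbors for a fixed number of steps."""
    adj = neighbors(bonds, len(atoms))
    hidden = {i: list(FEATURES[sym]) for i, sym in atoms.items()}
    for _ in range(steps):
        updated = {}
        for i in hidden:
            # Message: element-wise sum of the neighboring atom states.
            msg = [sum(hidden[j][k] for j in adj[i]) for k in range(2)]
            # Update: add the message to the atom's own state, then apply ReLU.
            updated[i] = [max(0.0, h + m) for h, m in zip(hidden[i], msg)]
        hidden = updated
    return hidden


def readout(hidden, weights=(0.5, -0.25), bias=0.0):
    """Sum-pool atom states into one molecule vector and squash to (0, 1)."""
    pooled = [sum(h[k] for h in hidden.values()) for k in range(2)]
    logit = sum(w * p for w, p in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    states = message_passing(ATOMS, BONDS, steps=2)
    print("atom states:", states)
    print("toy activity score:", readout(states))
```

After two steps, each atom's state reflects atoms up to two bonds away, which is why models of this kind can associate predictions with substructures rather than isolated atoms. The score printed here has no chemical meaning because the weights are arbitrary.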
Data availability
Example datasets are available as Supplementary Information. The main datasets used in this protocol are subsets of data from a previously published study (ref. 7) identifying a structural class of antibiotics using explainable DL.
Code availability
Chemprop is available at https://github.com/chemprop/chemprop. A working example of the files provided as inputs and created as outputs of this protocol is available at https://github.com/felixjwong/protocol. Additional code from a previously published study, which includes Chemprop checkpoints for models trained on larger datasets, is available at https://github.com/felixjwong/antibioticsai and https://zenodo.org/records/10095879 (ref. 78). The code in this protocol has been peer reviewed.
References
Wong, F. et al. Leveraging artificial intelligence in the fight against infectious diseases. Science 381, 164–170 (2023).
Wan, F., Wong, F., Collins, J. J. & de la Fuente-Nunez, C. Machine learning for antimicrobial peptide identification and design. Nat. Rev. Bioeng. 2, 392–407 (2024).
Silver, D. et al. Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
Wong, F. et al. Discovery of a structural class of antibiotics with explainable deep learning. Nature 626, 177–185 (2024).
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Bengio, Y., Lodi, A. & Prouvost, A. Machine learning for combinatorial optimization: a methodological tour d’horizon. Eur. J. Oper. Res. 290, 405–421 (2021).
Bohacek, R. S., McMartin, C. & Guida, W. C. The art and practice of structure-based drug design: a molecular modeling perspective. Med. Res. Rev. 16, 3–50 (1996).
Yang, X., Wang, Y., Byrne, R., Schneider, G. & Yang, S. Concepts of artificial intelligence for computer-assisted drug discovery. Chem. Rev. 119, 10520–10594 (2019).
Burbidge, R., Trotter, M., Buxton, B. & Holden, S. Drug design by machine learning: support vector machines for pharmaceutical data analysis. Comput. Chem. 26, 5–14 (2001).
Warmuth, M. K. et al. Active learning with support vector machines in the drug discovery process. J. Chem. Inf. Comput. Sci. 43, 667–673 (2003).
Zernov, V. V., Balakin, K. V., Ivaschenko, A. A., Savchuk, N. P. & Pletnev, I. V. Drug discovery using support vector machines. The case studies of drug-likeness, agrochemical-likeness, and enzyme inhibition predictions. J. Chem. Inf. Comput. Sci. 43, 2048–2056 (2003).
Sadybekov, A. V. & Katritch, V. Computational approaches streamlining drug discovery. Nature 616, 673–685 (2023).
Yang, K. et al. Analyzing learned molecular representations for property prediction. J. Chem. Inf. Model. 59, 3370–3388 (2019).
Stokes, J. M. et al. A deep learning approach to antibiotic discovery. Cell 180, 688–702 (2020).
Liu, G. et al. Deep learning-guided discovery of an antibiotic targeting Acinetobacter baumannii. Nat. Chem. Biol. 19, 1342–1350 (2023).
Zheng, E. J. et al. Discovery of antibiotics that selectively kill metabolically dormant bacteria. Cell Chem. Biol. 31, 712–728.e9 (2024).
Melo, M. C. R., Maasch, J. R. M. A. & de la Fuente-Nunez, C. Accelerating antibiotic discovery through artificial intelligence. Commun. Biol. 4, 1050 (2021).
Cesaro, A., Bagheri, M., Torres, M., Wan, F. & de la Fuente-Nunez, C. Deep learning tools to accelerate antibiotic discovery. Expert Opin. Drug Discov. 18, 1245–1257 (2023).
Krishnan, S. R. et al. De novo design of anti-tuberculosis agents using a structure-based deep learning method. J. Mol. Graph. Model. 118, 108361 (2023).
Wong, F. et al. Discovering small-molecule senolytics with deep neural networks. Nat. Aging 3, 734–750 (2023).
Jin, W. et al. Deep learning identifies synergistic drug combinations for treating COVID-19. Proc. Natl Acad. Sci. USA 118, e2105070118 (2021).
Preuer, K. et al. DeepSynergy: predicting anti-cancer drug synergy with deep learning. Bioinformatics 34, 1538–1546 (2018).
Wan, F., Kontogiorgos-Heintz, D. & de la Fuente-Nunez, C. Deep generative models for peptide design. Digit. Discov. 1, 195–208 (2022).
De Cao, N. & Kipf, T. MolGAN: an implicit generative model for small molecular graphs. In ICML 2018 Workshop on Theoretical Foundations and Applications of Deep Generative Models (2018).
Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268–276 (2018).
Jin, W., Barzilay, R. & Jaakkola, T. In Proc. 35th International Conference on Machine Learning 2323–2332 (2018).
Blaschke, T. et al. REINVENT 2.0: an AI tool for de novo drug design. J. Chem. Inf. Model. 60, 5918–5922 (2020).
Zhou, Z., Kearnes, S., Li, L., Zare, R. N. & Riley, P. Optimization of molecules via deep reinforcement learning. Sci. Rep. 9, 10752 (2019).
Zeng, X. et al. Deep generative molecular design reshapes drug discovery. Cell Rep. Med. 3, 100794 (2022).
Zhavoronkov, A. et al. Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nat. Biotechnol. 37, 1038–1040 (2019).
Ying, R., Bourgeois, D., You, J., Zitnik, M. & Leskovec, J. GNNExplainer: generating explanations for graph neural networks. Adv. Neural Inf. Process. Syst. 32, 9240–9251 (2019).
Jiménez-Luna, J., Grisoni, F. & Schneider, G. Drug discovery with explainable artificial intelligence. Nat. Mach. Intell. 2, 573–584 (2020).
Yuan, H., Yu, H., Gui, S. & Ji, S. Explainability in graph neural networks: a taxonomic survey. IEEE Trans. Pattern Anal. Mach. Intell. 45, 5782–5799 (2023).
Yuan, H., Yu, H., Wang, J., Li, K. & Ji, S. In Proc. 38th International Conference on Machine Learning 12241–12252 (2021).
Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 1, 206–215 (2019).
Gilmer, J. et al. In Proc. 34th International Conference on Machine Learning 1263–1272 (2017).
Wu, Z. et al. A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 32, 2162–2388 (2021).
Zhou, J. et al. Graph neural networks: a review of methods and applications. AI Open 1, 57–81 (2020).
Reiser, P. et al. Graph neural networks for materials science and chemistry. Commun. Mater. 3, 93 (2022).
Heid, E. & Green, W. H. Machine learning of reaction properties via learned representations of the condensed graph of reaction. J. Chem. Inf. Model. 62, 2101–2110 (2022).
Jin, W., Barzilay, R. & Jaakkola, T. In Proc. 37th International Conference on Machine Learning 4849–4859 (2020).
Heid, E. et al. Chemprop: a machine learning package for chemical property prediction. J. Chem. Inf. Model. 64, 9–17 (2024).
Coulom, R. Efficient selectivity and backup operators in Monte-Carlo Tree Search. In Computers and Games (CG 2006). Lecture Notes in Computer Science (eds van den Herik, H. J. et al.) 4630, 72–83 (Springer, 2007).
Tingle, B. I. et al. ZINC-22—a free multi-billion-scale database of tangible compounds for ligand discovery. J. Chem. Inf. Model. 63, 1166–1176 (2023).
Verheij, H. J. Leadlikeness and structural diversity of synthetic screening libraries. Mol. Divers. 10, 377–388 (2006).
Krier, M., Bret, G. & Rognan, D. Assessing the scaffold diversity of screening libraries. J. Chem. Inf. Model. 46, 512–524 (2006).
Swanson, K. et al. ADMET-AI: a machine learning ADMET platform for evaluation of large-scale chemical libraries. Bioinformatics 40, btae416 (2024).
McGill, C., Forsuelo, M., Guan, Y. & Green, W. H. Predicting infrared spectra with message passing neural networks. J. Chem. Inf. Model. 61, 2594–2609 (2021).
Swinney, D. C. & Anthony, J. How were new medicines discovered? Nat. Rev. Drug Discov. 10, 507–519 (2011).
Swinney, D. C. Phenotypic vs. target-based drug discovery for first-in-class medicines. Clin. Pharmacol. Ther. 93, 299–301 (2013).
Moffat, J. G., Vincent, F., Lee, J. A., Eder, J. & Prunotto, M. Opportunities and challenges in phenotypic drug discovery: an industry perspective. Nat. Rev. Drug Discov. 16, 531–543 (2017).
Muratov, E. N. et al. QSAR without borders. Chem. Soc. Rev. 49, 3525–3564 (2020).
Wong, F. et al. Benchmarking AlphaFold‐enabled molecular docking predictions for antibiotic discovery. Mol. Syst. Biol. 18, e11081 (2022).
Bender, B. J. et al. A practical guide to large-scale docking. Nat. Protoc. 16, 4799–4832 (2021).
Loyola-González, O. Black-box vs. white-box: understanding their advantages and weaknesses from a practical point of view. IEEE Access 7, 154096–154113 (2019).
Clinical and Laboratory Standards Institute. M100: Performance Standards for Antimicrobial Susceptibility Testing (2021).
Zhang, J. H., Chung, T. D. & Oldenburg, K. R. A simple statistical parameter for use in evaluation and validation of high throughput screening assays. J. Biomol. Screen. 4, 67–73 (1999).
Kim, S. et al. PubChem substance and compound databases. Nucleic Acids Res. 44, D1202–D1213 (2016).
Gaulton, A. et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 40, D1100–D1107 (2012).
Degtyarenko, K. et al. ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Res. 36, D344–D350 (2008).
Wishart, D. S. et al. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 34, D668–D672 (2006).
Williams, A. J. et al. The CompTox Chemistry Dashboard: a community data resource for environmental chemistry. J. Cheminform. 9, 61 (2017).
Bergerhoff, G., Hundt, R., Sievers, R. & Brown, I. D. The inorganic crystal structure data base. J. Chem. Inf. Comput. Sci. 23, 66–69 (1983).
Belsky, A., Hellenbrandt, M., Karen, V. L. & Luksch, P. New developments in the Inorganic Crystal Structure Database (ICSD): accessibility in support of materials research and design. Acta Cryst. 58, 364–369 (2002).
Kononova, O. et al. Text-mined dataset of inorganic materials synthesis recipes. Sci. Data 6, 203 (2019).
Shivanyuk, A., Ryabukhin, S. V., Bogolubsky, A. V. & Tolmachev, A. Enamine REAL database: making chemical diversity real. Chem. Today 25, 58–59 (2007).
Coley, C. W., Green, W. H. & Jensen, K. F. Machine learning in computer-aided synthesis planning. Acc. Chem. Res. 51, 1281–1289 (2018).
Fink, T., Bruggesser, H. & Reymond, J.-L. Virtual exploration of the small-molecule chemical universe below 160 Daltons. Angew. Chem. Int. Ed. 44, 1504–1508 (2005).
Fink, T. & Reymond, J.-L. Virtual exploration of the chemical universe up to 11 atoms of C, N, O, F: assembly of 26.4 million structures (110.9 million stereoisomers) and analysis for new ring systems, stereochemistry, physicochemical properties, compound classes, and drug discovery. J. Chem. Inf. Model. 47, 342–353 (2007).
Blum, L. C. & Reymond, J.-L. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. J. Am. Chem. Soc. 131, 8732–8733 (2009).
Ruddigkeit, L., van Deursen, R., Blum, L. C. & Reymond, J.-L. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. J. Chem. Inf. Model. 52, 2864–2875 (2012).
Baell, J. B. & Holloway, G. A. New substructure filters for removal of pan assay interference compounds (PAINS) from screening libraries and for their exclusion in bioassays. J. Med. Chem. 53, 2719–2740 (2010).
Brenk, R. et al. Lessons learnt from assembling screening libraries for drug discovery for neglected diseases. ChemMedChem 3, 435–444 (2008).
Lipinski, C. A., Lombardo, F., Dominy, B. W. & Feeney, P. J. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Deliv. Rev. 23, 3–25 (1997).
Wong, F. et al. Supporting code for: discovery of a structural class of antibiotics with explainable deep learning. Zenodo https://doi.org/10.5281/zenodo.10095879 (2023).
Samuel, A. L. Some studies in machine learning using the game of checkers. IBM J. 3, 211–229 (1959).
Rosenblatt, F. The perceptron: a probabilistic model for information storage and organization in the brain. Psychol. Rev. 65, 386–408 (1958).
Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning representations by back-propagating errors. Nature 323, 533–536 (1986).
Krizhevsky, A., Sutskever, I. & Hinton, G. E. In Advances in Neural Information Processing Systems 1106–1114 (2012).
Vaswani, A. et al. In Advances in Neural Information Processing Systems (2017).
Trinh, T. H., Wu, Y., Le, Q. V., He, H. & Luong, T. Solving olympiad geometry without human demonstrations. Nature 625, 476–482 (2024).
Lundberg, S. M. & Lee, S.-I. In Proc. 31st International Conference on Neural Information Processing Systems 4768–4777 (2017).
Ribeiro, M. T., Singh, S. & Guestrin, C. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 1135–1144 (2016).
Dai, H., Dai, B. & Song, L. In Proc. 33rd International Conference on Machine Learning 2702–2711 (2016).
Buterez, D., Janet, J. P., Kiddle, S. J., Oglic, D. & Lió, P. Transfer learning with graph neural networks for improved molecular property prediction in the multi-fidelity setting. Nat. Commun. 15, 1517 (2024).
Xie, T., France-Lanord, A., Wang, Y., Shao-Horn, Y. & Grossman, J. C. Graph dynamical networks for unsupervised learning of atomic scale dynamics in materials. Nat. Commun. 10, 2667 (2019).
Yun, S., Jeong, M., Kim, R., Kang, J. & Kim, H. J. In 33rd Conference on Neural Information Processing Systems 11983–11993 (2019).
Acknowledgements
F.W. was supported by the National Institute of Allergy and Infectious Diseases of the National Institutes of Health under award number K25AI168451. A.K. was supported by the Swiss National Science Foundation under grant number SNSF_203071. J.J.C. was supported by the Defense Threat Reduction Agency (grant numbers HDTRA12210032 and HDTRA12210010), the National Institutes of Health (grant number R01-AI146194) and the Broad Institute of MIT and Harvard. This work is part of the Antibiotics-AI Project, which is directed by J.J.C. and supported by the Audacious Project, Flu Lab, LLC, the Sea Grape Foundation, Rosamund Zander and Hansjorg Wyss for the Wyss Foundation and an anonymous donor.
Author information
Contributions
F.W. prepared the manuscript and supervised research. S.O., A.L., A.K., R.S.L., J.R. and M.Z.W. contributed to writing and validating the protocol steps. J.J.C. supervised research. All authors assisted with manuscript editing.
Ethics declarations
Competing interests
J.J.C. is an academic cofounder and Scientific Advisory Board chair of EnBiotix, an antibiotic drug discovery company, and Phare Bio, a nonprofit venture focused on antibiotic drug development. J.J.C. is also an academic cofounder and board member of Cellarity and the founding Scientific Advisory Board chair of Integrated Biosciences. F.W. and M.Z.W. are cofounders of Integrated Biosciences. S.O., A.L. and R.S.L. contributed to this work as employees of Integrated Biosciences, and S.O. and R.S.L. may have equity interest in Integrated Biosciences. The remaining authors declare no competing interests.
Peer review
Peer review information
Nature Protocols thanks Octavio Franco and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Key references
Wong, F. et al. Nature 626, 177–185 (2024): https://doi.org/10.1038/s41586-023-06887-8
Wong, F. et al. Nat. Aging 3, 734–750 (2023): https://doi.org/10.1038/s43587-023-00415-z
Liu, G. et al. Nat. Chem. Biol. 19, 1342–1350 (2023): https://doi.org/10.1038/s41589-023-01349-8
Stokes, J. M. et al. Cell 180, 688–702.e13 (2020): https://doi.org/10.1016/j.cell.2020.01.021
Supplementary information
Supplementary Data 1
Sample training data for antibacterial activity against S. aureus RN4220.
Supplementary Data 2
Sample training data for cytotoxicity against HepG2 cells.
Supplementary Data 3
Sample training data for cytotoxicity against HSkM cells.
Supplementary Data 4
Sample training data for cytotoxicity against IMR-90 cells.
Supplementary Data 5
Sample test data for 100,000 compounds from a Broad Institute database.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wong, F., Omori, S., Li, A. et al. An explainable deep learning platform for molecular discovery. Nat Protoc (2024). https://doi.org/10.1038/s41596-024-01084-x
DOI: https://doi.org/10.1038/s41596-024-01084-x