SELFIES and the future of molecular string representations

Mario Krenn (Max Planck Institute for the Science of Light), Qianxiang Ai (Fordham University), Senja Barthel (Vrije Universiteit Amsterdam), Nessa Carson (Syngenta (United Kingdom)), Angelo Frei (Imperial College London), Nathan C. Frey (Massachusetts Institute of Technology), Pascal Friederich (Karlsruhe Institute of Technology), Théophile Gaudin (University of Toronto), Alberto Alexander Gayle, Kevin Maik Jablonka (École Polytechnique Fédérale de Lausanne), Rafael F. Lameiro (Universidade de São Paulo), Dominik Lemm (University of Vienna), Alston Lo (University of Toronto), Seyed Mohamad Moosavi (Freie Universität Berlin), José Manuel Nápoles-Duarte (Autonomous University of Chihuahua), AkshatKumar Nigam (Stanford University), Robert Pollice (University of Toronto), Kohulan Rajan (Friedrich Schiller University Jena), Ulrich Schatzschneider (University of Würzburg), Philippe Schwaller (IBM Research - Zurich), Marta Skreta (University of Toronto), Berend Smit (École Polytechnique Fédérale de Lausanne), Felix Strieth‐Kalthoff (University of Toronto), Chong Sun (University of Toronto), Gary Tom (University of Toronto), Guido Falk von Rudorff (University of Vienna), Andrew Z. Wang (University of Toronto), Andrew Dickson White (University of Rochester), Adamo Young (University of Toronto), Rose Yu (University of California San Diego), Alán Aspuru‐Guzik (Canadian Institute for Advanced Research)
Patterns
October 1, 2022
Cited by 261 | Open Access

Abstract

Artificial intelligence (AI) and machine learning (ML) are increasingly applied to challenging tasks in chemistry and materials science. Examples include the prediction of properties, the discovery of new reaction pathways, and the design of new molecules. For each of these tasks, the machine needs to read and write fluently in a chemical language. Strings are a common tool for representing molecular graphs, and the most popular molecular string representation, SMILES, has powered cheminformatics since the late 1980s. In the context of AI and ML in chemistry, however, SMILES has several shortcomings. Most pertinently, most combinations of its symbols yield strings that have no valid chemical interpretation. To overcome this issue, a new language for molecules that guarantees 100% robustness was introduced in 2020: SELF-referencIng Embedded Strings (SELFIES). SELFIES has since simplified and enabled numerous new applications in chemistry. In this perspective, we look to the future and discuss molecular string representations, along with their respective opportunities and challenges. We propose 16 concrete future projects for robust molecular representations. These involve the extension toward new chemical domains, open questions at the interface of AI and robust languages, and interpretability for both humans and machines. We hope that these proposals will inspire follow-up work that exploits the full potential of molecular string representations for the future of AI in chemistry and materials science.
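The claim that most symbol combinations are invalid can be illustrated with a small sketch. It is not taken from the paper and does not use the SELFIES library: the eight-symbol alphabet, the function names, and the handful of syntax rules below are assumptions chosen for illustration. The check covers only a few necessary conditions (balanced branches, paired ring-closure digits, a leading atom) and ignores valence, so it overestimates how many random strings a real SMILES parser would accept.

```python
import random

# Toy alphabet (an assumption for illustration): a few atoms plus the
# SMILES symbols for branches, ring closures and double bonds.
ALPHABET = ["C", "N", "O", "(", ")", "1", "2", "="]
ATOMS = {"C", "N", "O"}


def passes_toy_syntax_check(tokens):
    """Return True if tokens satisfy a few necessary SMILES syntax rules.

    This is NOT a full SMILES parser, and it ignores valence entirely.
    """
    if not tokens or tokens[0] not in ATOMS:
        return False              # a string must start with an atom
    depth = 0                     # current branch nesting depth
    open_rings = set()            # ring digits still waiting for a partner
    prev = None
    for tok in tokens:
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
            if depth < 0 or prev == "(":
                return False      # unmatched ')' or empty branch '()'
        elif tok.isdigit():
            open_rings ^= {tok}   # toggle: first use opens, second closes
        prev = tok
    # All branches and rings must be closed; no dangling bond at the end.
    return depth == 0 and not open_rings and prev != "="


def fraction_passing(n_samples=10_000, length=10, seed=0):
    """Estimate the share of random token strings that pass the toy check."""
    rng = random.Random(seed)
    hits = sum(
        passes_toy_syntax_check(rng.choices(ALPHABET, k=length))
        for _ in range(n_samples)
    )
    return hits / n_samples


if __name__ == "__main__":
    print(passes_toy_syntax_check(list("C1=CC=CC=C1")))  # benzene
    print(passes_toy_syntax_check(list("C1CC(")))        # open ring and branch
    print(f"share of random strings passing: {fraction_passing():.3f}")
```

Under this toy alphabet, the leading-atom rule alone caps the expected pass rate at 3/8, and the branch and ring rules lower it further. A robust representation such as SELFIES is designed so that every string over its alphabet decodes to a valid molecule, which is the property the abstract refers to as 100% robustness.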

