"Geometric Conformation Enhanced AI Algorithm", Baidu's biocomputing research results published in the sub-journal of "Nature"

Released by Heart of Machine

Heart of Machine Editorial Office

Recently, Baidu published its latest AI + biocomputing research result, "Geometry Enhanced Molecular Representation Learning for Property Prediction", in Nature Machine Intelligence (impact score 16.65), a sub-journal of the top international journal Nature. The paper proposes a spatial structure-based compound representation learning method, the Geometry Enhanced Molecular representation learning (GEM) model, and describes a compound modeling method based on three-dimensional spatial structure information and its application in drug discovery.

Paper link: https://www.nature.com/articles/s42256-021-00438-4

According to public information, Nature Machine Intelligence is a top journal in the machine learning field under Nature, with an impact factor of more than 16 over the past two years. In this research, the PaddleHelix team of Baidu Propeller introduced the geometric structure information of compounds into self-supervised learning and molecular representation models for the first time, and achieved SOTA results on more than a dozen downstream attribute prediction tasks. This makes it another major publicly announced achievement by Baidu in the field of AI-enabled drug research and development.

Seeking changes in the field of pharmaceutical research, AI + biocomputing is the best choice

It is well known that drug development is costly, time-consuming and risky. According to a 2014 study by Tufts University, the average cost of bringing a new drug to market is about $2.6 billion, the average time from first synthesis to clinical trials is 31.2 months, and the time from Phase I clinical trials to market launch is as long as 96.8 months. On the other hand, as the world enters an aging society, the demand for new drugs is increasing year by year, and the total size of the global pharmaceutical market will exceed 11 trillion by 2024. In contrast, the number of new drugs brought to market for every $1 billion invested by pharmaceutical companies has been declining year by year. How to quickly find potential drug candidates through new technical means, and reduce the risk of failure after entering clinical trials, has become the most urgent problem in the field of drug research and development.

Before the advent of computational methods, drug research and development relied almost entirely on biological experiments to find drugs, which was expensive and time-consuming. With the development of computational chemistry and computational biology, traditional machine learning methods were also introduced to assist drug design, but these methods fall short to varying degrees in both effectiveness and efficiency. Taking small molecules as an example, the search space for finding a candidate drug is on the order of 10 to the 60th power, which traditional computational methods struggle to cover efficiently. On the other hand, with the development and popularization of AI technology, drug research and development has gradually entered the AI era. Deep learning, which is naturally good at processing big data, has become a focus of attention in recent years, with the aim of improving drug R&D efficiency, reducing the probability of late-stage failure, and lowering drug R&D costs.

The main purpose of compound property prediction is to identify compounds with substandard physicochemical properties early, so as to reduce the risk of candidate compounds failing after entering clinical trials and improve the success rate of drug development. Traditional compound property prediction generally relies on experimental methods, which are costly and time-consuming. There is also some work based on AI algorithms in the industry, but most of it uses only the two-dimensional information of compounds and does not incorporate their three-dimensional spatial structure. For the first time, Baidu proposed introducing the spatial structure information of compounds into compound pre-training. The method characterizes compound molecules through geometry-enhanced self-supervised learning, so that the learned characterization can infer spatial structure information on its own and then predict the properties of compound molecules. This assists drug research and development, improving efficiency and reducing costs.

It is worth mentioning that the research was independently completed by the PaddleHelix biocomputing team of Baidu Propeller, and has already been put into practice with partners in their early-stage drug research and development pipelines.

Baidu GEM model accelerates drug discovery process

Many studies have demonstrated the great potential of machine learning, especially deep learning, in compound property prediction. These approaches represent compounds as sequences (SMILES expressions) or as graphs (atoms are nodes and bonds are edges), and use sequence modeling or graph neural networks (GNNs) to predict the properties of compounds. Some studies treat each compound directly as a graph and use self-supervised learning methods based on graph topology for molecular characterization, such as masking and then restoring atoms, chemical bonds or substructures in compound graphs. However, these methods regard compounds only as topological graphs, and do not fully utilize the geometric structure information of compounds. The geometric structure of a compound, that is, its three-dimensional spatial structure, plays a key role in the physical, chemical, biological and other properties of the compound, and the spatial structures of two compounds with the same topology may be completely different.

On the other hand, because biological experiments are complex to run and expensive, labeled data for compounds is scarce and precious. Sparse data makes deep neural networks prone to overfitting, which makes it difficult for them to exert their modeling capabilities. How to learn high-quality compound characterizations from a large number of unlabeled compounds has become the key to compound modeling and property prediction.
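To make the point about identical topology concrete, the following is a minimal Python sketch, not taken from the paper. The example molecule (1,2-dichloroethene with hydrogens omitted), the idealised bond lengths and 120-degree angles, and the helper names `place` and `build` are all assumptions chosen for illustration. A graph that records only atoms and bonds, with no stereo annotations, is the same for the cis and trans isomers, yet the distance between the two chlorine atoms differs noticeably.

```python
import math

# Hypothetical, idealised planar geometry in angstroms. The bond lengths
# (C=C 1.33, C-Cl 1.73) and 120-degree angles are rough textbook values
# used only for illustration; hydrogens are omitted.
CC, CCL = 1.33, 1.73


def place(origin, length, angle_deg):
    """Return the 2D point `length` away from `origin` at `angle_deg`."""
    a = math.radians(angle_deg)
    return (origin[0] + length * math.cos(a), origin[1] + length * math.sin(a))


def build(isomer):
    """Return (atoms, bonds, coords) for the 'cis' or 'trans' isomer."""
    atoms = ["Cl", "C", "C", "Cl"]
    bonds = {(0, 1), (1, 2), (2, 3)}       # the same bond list for both isomers
    c1, c2 = (0.0, 0.0), (CC, 0.0)
    cl1 = place(c1, CCL, 120.0)            # above the C=C axis
    # cis: second Cl on the same side of the axis; trans: on the opposite side
    cl2 = place(c2, CCL, 60.0 if isomer == "cis" else -60.0)
    return atoms, bonds, [cl1, c1, c2, cl2]


cis_atoms, cis_bonds, cis_xy = build("cis")
trans_atoms, trans_bonds, trans_xy = build("trans")

# Same nodes and same edges, so a topology-only model sees one graph.
same_topology = cis_atoms == trans_atoms and cis_bonds == trans_bonds

# The geometry differs: compare the Cl...Cl distance in each isomer.
cl_cl_cis = math.dist(cis_xy[0], cis_xy[3])
cl_cl_trans = math.dist(trans_xy[0], trans_xy[3])
print(same_topology, round(cl_cl_cis, 2), round(cl_cl_trans, 2))
```

In practice a toolkit would supply the coordinates; the hand-built geometry here only keeps the sketch free of external dependencies.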

In view of this, Baidu proposes a new compound modeling method based on spatial structure, the geometric conformation enhanced AI algorithm GEM, and designs self-supervised learning strategies at multiple geometric levels for learning the spatial structure of compounds. This knowledge enables the characterization of compounds to infer spatial structure information on its own. The technique has achieved excellent results on more than ten benchmark compound attribute prediction datasets, and has been successfully applied to the ADMET druggability prediction task for candidate compounds with good results.

Interpretation of the geometric conformation enhanced AI algorithm GEM

The geometric conformation enhanced AI algorithm GEM consists of two main parts: a spatial structure-based graph neural network (a) and self-supervised learning tasks at multiple geometric levels (b).

Figure 1: Overall framework of GEM

    • Spatial structure-based graph neural network

    The spatial structure of a compound can be described by its atoms, chemical bonds and bond angles. GEM proposes a spatial structure-based graph network that models the atom, chemical bond and bond angle relationships at the same time. Each compound is represented by two graphs: the atom-bond graph G and the bond-angle graph H. As in previous work, the atom-bond graph G takes atoms as the nodes of the graph and bonds as the edges connecting atoms. The bond-angle graph H is introduced for the first time: chemical bonds are the nodes of the graph, and the bond angle formed by two chemical bonds is the edge. The graph neural network runs for multiple iterations, and in each iteration the chemical bonds act as a bridge between graph G and graph H to exchange information. The characterization from the last iteration is used for compound property prediction.
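The two-graph representation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the PaddleHelix implementation: the function names `build_graphs`, `bond_angle` and `polar`, the dictionary layout, and the idealised formaldehyde-like geometry (C=O 1.21 Å, C-H 1.10 Å, all angles 120 degrees) are hypothetical. It covers only the construction of G and H, not the message passing between them.

```python
import math
from itertools import combinations


def bond_angle(p, centre, q):
    """Angle p-centre-q in degrees."""
    u = [a - b for a, b in zip(p, centre)]
    v = [a - b for a, b in zip(q, centre)]
    cos = sum(a * b for a, b in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))  # clamp rounding error


def build_graphs(bonds, coords):
    """Return edge dictionaries for atom-bond graph G and bond-angle graph H."""
    # G: atoms are nodes, bonds are edges; store the bond length on each edge.
    g_edges = {(i, j): math.dist(coords[i], coords[j]) for i, j in bonds}

    # H: bonds are nodes (indexed by position in `bonds`); two bonds that share
    # exactly one atom are joined by an edge carrying their bond angle.
    h_edges = {}
    for (a, (i, j)), (b, (m, n)) in combinations(enumerate(bonds), 2):
        shared = {i, j} & {m, n}
        if len(shared) != 1:
            continue
        centre = shared.pop()
        p = i if j == centre else j      # far end of bond a
        q = m if n == centre else n      # far end of bond b
        h_edges[(a, b)] = bond_angle(coords[p], coords[centre], coords[q])
    return g_edges, h_edges


def polar(length, angle_deg):
    """Point in the z=0 plane at the given length and angle from the origin."""
    a = math.radians(angle_deg)
    return (length * math.cos(a), length * math.sin(a), 0.0)


# Hypothetical idealised geometry: atom 0 = C, 1 = O, 2 and 3 = H.
coords = [(0.0, 0.0, 0.0), polar(1.21, 0.0), polar(1.10, 120.0), polar(1.10, -120.0)]
bonds = [(0, 1), (0, 2), (0, 3)]

g_edges, h_edges = build_graphs(bonds, coords)
print(g_edges)
print(h_edges)
```

Each bond appears once as an edge in G and once as a node in H, which is the sense in which bonds can bridge the two graphs.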

    • Self-Supervised Learning Based on Spatial Structure

    To help the model learn chemical spatial knowledge, GEM not only uses geometric information as input, but also designs learning objectives based on geometric information: predicting the length of chemical bonds; predicting the bond angles formed by chemical bonds; and predicting the distance between two atoms. Among them, bond length and bond angle describe the local structure of the compound, while the distance between two atoms is more concerned with the global structure of the compound. The self-supervised learning task describing the local structure randomly selects a subgraph centered on an atom in a compound, masks it, and predicts the bond lengths and the bond angles formed between the chemical bonds in the masked subgraph. The self-supervised learning task describing the global structure estimates the elements of the atomic distance matrix. Through these self-supervised learning tasks based on spatial structure, graph neural networks can effectively infer the spatial information of compounds, which has a positive effect on the characterization of compounds.
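The prediction targets described above can be computed directly from atomic coordinates. The sketch below is an illustration under assumptions, not GEM's training code: the names `local_targets` and `distance_matrix`, the one-hop definition of the masked subgraph, and the idealised water geometry (O-H 0.96 Å, H-O-H 104.5 degrees) are hypothetical, and how GEM encodes these targets or computes its loss is not covered.

```python
import math
import random
from itertools import combinations


def bond_angle(p, centre, q):
    """Angle p-centre-q in degrees."""
    u = [a - b for a, b in zip(p, centre)]
    v = [a - b for a, b in zip(q, centre)]
    cos = sum(a * b for a, b in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))  # clamp rounding error


def local_targets(centre, bonds, coords):
    """Local targets: bond lengths and bond angles around a masked centre atom."""
    neighbours = [j if i == centre else i for i, j in bonds if centre in (i, j)]
    lengths = {n: math.dist(coords[centre], coords[n]) for n in neighbours}
    angles = {
        (p, q): bond_angle(coords[p], coords[centre], coords[q])
        for p, q in combinations(neighbours, 2)
    }
    return lengths, angles


def distance_matrix(coords):
    """Global targets: pairwise distances between all atoms."""
    return [[math.dist(p, q) for q in coords] for p in coords]


# Hypothetical idealised water geometry: atom 0 = O, atoms 1 and 2 = H.
theta = math.radians(104.5)
coords = [
    (0.0, 0.0, 0.0),
    (0.96, 0.0, 0.0),
    (0.96 * math.cos(theta), 0.96 * math.sin(theta), 0.0),
]
bonds = [(0, 1), (0, 2)]

# Pick the centre of the masked subgraph at random (seeded for repeatability).
rng = random.Random(0)
centre = rng.randrange(len(coords))
masked_lengths, masked_angles = local_targets(centre, bonds, coords)

# For comparison, the targets when the oxygen atom is the masked centre.
o_lengths, o_angles = local_targets(0, bonds, coords)
dist = distance_matrix(coords)
print(centre, masked_lengths, masked_angles)
```

The local targets depend only on the neighbourhood of the chosen atom, while every entry of the distance matrix involves a pair of atoms that may be far apart in the graph, which mirrors the local versus global split in the text.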

    • Experimental results

    GEM achieved the best performance on 14 benchmark compound attribute datasets, which are widely recognized in the academic community for compound attribute prediction. For example, on toxicity-related datasets (tox21, toxcast) and the HIV (AIDS) virus dataset, GEM's predictions far outperform the other baseline models. Overall, Baidu's GEM model outperforms existing methods by 8.8% on regression tasks such as ESOL and FreeSolv, and by 4.7% on classification tasks such as BACE, BBBP, and SIDER. In addition, ablation experiments on the self-supervised learning tasks demonstrate the effectiveness of self-supervised learning based on spatial structure.

    Deployed in ADMET druggability prediction and drug screening scenarios

    The geometric conformation enhanced AI algorithm GEM can learn the spatial structure of compounds well and infer spatial structure information on its own, so as to accurately predict the ADMET properties of candidate compounds: Absorption, Distribution, Metabolism, Excretion and Toxicity. This helps rapidly screen compounds with a higher potential success rate in the early stages of drug development. It is understood that Baidu's research has been applied in the field of drug research and development, and has been put into commercial use in the early drug screening pipelines of partners.

    In addition, the geometric conformation enhanced AI algorithm GEM also plays a key role in drug virtual screening and drug combination. Drug virtual screening is an important part of drug development, and aims to find candidate compounds with strong affinity to a target from a large-scale virtual compound library. Drug combination prediction estimates the synergistic effect of two drugs in different cell lines, to help find the combination with the best synergistic effect for a given drug in a given cell line. Two drugs with a synergistic effect can reduce the development of drug resistance while preserving the therapeutic effect, and can improve the safety of the drugs by reducing the dose used.

    About Baidu Propeller PaddleHelix

    Propeller PaddleHelix is a biocomputing platform based on the Baidu PaddlePaddle deep learning framework. It provides researchers in the biomedical field with comprehensive AI + biocomputing model tools and technical solutions. At present, the Propeller PaddleHelix platform has opened multiple models, covering molecular generation, virtual screening, ADMET prediction, protein/RNA structure prediction, mRNA sequence design, dual drug combination, etc.

    In addition, the Propeller PaddleHelix team has also carried out related work on protein-protein interactions (PPI), omics characterization and precision medicine, and has achieved good results in many international competitions. The relevant research results will also be opened one after another for everyone to try out. In the future, the Propeller PaddleHelix biocomputing platform will continue to uphold an open-source and open attitude, continue to work with partners to empower the biocomputing industry, and jointly build an AI + biocomputing ecosystem and services.

    The spatial structure-based compound characterization learning method GEM has been opened to the public through the Propeller PaddleHelix platform, and everyone is welcome to use it.

    • GitHub address: https://github.com/PaddlePaddle/PaddleHelix
    • Platform address: https://paddlehelix.baidu.com/
    • Cooperation negotiation: baidubio_cooperate@baidu.com