Editor | Luluo
The clinical efficacy and safety of a drug depend on its molecular characteristics and its targets in the human body. However, evaluating all compounds across the whole proteome in humans, or even in animal models, is challenging.
Recently, researchers from Hunan University developed an unsupervised pre-training deep learning framework called ImageMol. The framework is chemistry-aware and learns molecular structures from large-scale molecular images, providing a powerful pre-trained model for computational drug discovery.
Compared with state-of-the-art methods, ImageMol has two important improvements: (1) it uses molecular images as the feature representation of compounds, which gives high accuracy at low computational cost; (2) it uses an unsupervised pre-training framework to capture structural information from the molecular images of 10 million drug-like compounds with diverse biological activities across the human proteome.
The study, titled "Accurate prediction of molecular properties and drug targets using a self-supervised image representation learning framework", was published in Nature Machine Intelligence on November 21, 2022.
Paper link: https://www.nature.com/articles/s42256-022-00557-6
Despite progress in biomedical research and technology, drug discovery and development remains a challenging, multidimensional task that requires optimizing important properties of candidate compounds, including pharmacokinetics, efficacy and safety. Traditional experimental methods cannot feasibly evaluate the molecular targets of all candidate compounds on a proteome-wide basis in humans, or even in animal models. Computational methods are considered a promising solution that can greatly reduce cost and time throughout the drug discovery and development process.
Artificial intelligence is used in drug design and target identification. One of the basic challenges is how to learn molecular representations from chemical structures. Traditional molecular representation methods rely on extensive domain knowledge to extract molecular features.
With the rise of unsupervised learning in natural language processing, recent methods combine unsupervised learning with one-dimensional sequence strings, such as the simplified molecular-input line-entry system (SMILES) and the International Chemical Identifier (InChI), or with two-dimensional graphs. However, their accuracy in extracting informative vectors that describe molecular identity and biologically relevant features is limited. Recent advances in unsupervised learning in computer vision suggest that unsupervised image-based pre-trained models can be applied to computational drug discovery.
Here, the Hunan University research team proposed an unsupervised pre-training deep learning framework called ImageMol, which is pre-trained on 10 million unlabeled drug-like, bioactive molecules to predict the molecular targets of candidate compounds. The ImageMol framework is designed to pre-train chemical representations from unlabeled molecular images, based on the local and global structural characteristics of molecules learned from pixels.
Figure 1: ImageMol framework. (Source: Paper)
ImageMol framework
The researchers developed ImageMol, a pre-trained deep learning framework for accurate prediction of molecular targets. ImageMol was pre-trained on 9,999,918 images of drug-like, bioactive molecules from the PubChem database. The researchers assembled five pretext tasks to extract biologically relevant structural information. The framework has three components: (1) a molecular encoder designed to extract latent features from about 10 million molecular images; (2) five pre-training strategies that optimize the latent representation of the molecular encoder by considering chemical knowledge and structural information in the molecular images; and (3) fine-tuning of the pre-trained molecular encoder on downstream tasks to further improve model performance.
ImageMol Benchmark Evaluation
The researchers demonstrated the high performance of ImageMol in evaluating molecular properties (i.e., drug metabolism, brain penetration and toxicity) and molecular target profiles (i.e., β-secretase and kinases) across 51 benchmark datasets.
The researchers first evaluated ImageMol on eight types of drug discovery benchmark datasets. They then used three popular split strategies (scaffold split, balanced scaffold split and random scaffold split) to evaluate ImageMol on all benchmark datasets.
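To make the idea of a scaffold split concrete, here is a minimal sketch of the plain, deterministic variant. It is an illustration, not the paper's code: the function name `scaffold_split`, the 80/10/10 fractions and the assumption that each molecule already has a precomputed scaffold identifier (for example a Bemis-Murcko scaffold string) are all chosen for this example. The balanced and random variants mentioned above change the order in which scaffold groups are assigned and are not shown.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split molecule indices so that no scaffold appears in two subsets.

    `scaffolds` holds one scaffold identifier per molecule. The identifiers
    are assumed to be precomputed (hypothetical input for this sketch).
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest scaffold groups first, so rare scaffolds tend to end up in
    # the validation and test sets.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cutoff = frac_train * n
    valid_cutoff = (frac_train + frac_valid) * n

    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Toy example: four scaffolds "A".."D" over ten molecules.
toy_scaffolds = ["A"] * 6 + ["B"] * 2 + ["C"] + ["D"]
train_idx, valid_idx, test_idx = scaffold_split(toy_scaffolds)
```

Because whole scaffold groups are assigned together, the test set contains only scaffolds the model has not seen during training, which makes the evaluation harder than a purely random split.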
Figure 2: Performance evaluation of ImageMol using the benchmark dataset. (Source: Paper)
In the classification tasks, measured by the area under the receiver operating characteristic (ROC) curve (AUC), ImageMol achieves high AUC values (Figure 2a). In addition, the similarity of ImageMol's probability distributions on the BBBP and BACE datasets is greater than 95%, which indicates that ImageMol is consistent and stable during training.
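For readers unfamiliar with the metric, the AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The following minimal sketch computes it that way. The function name `roc_auc` and the toy labels and scores are illustrative only and are not taken from the paper.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ranked correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration; rank-based implementations are preferable for large datasets.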
Figure 2c shows that ImageMol also achieved higher AUC values (ranging from 0.799 to 0.893) than three state-of-the-art molecular image-based representation models in predicting inhibitors and non-inhibitors of five major drug-metabolizing enzymes.
The researchers further compared the performance of ImageMol with three types of state-of-the-art molecular representation models: (1) fingerprint-based models, (2) sequence-based models and (3) graph-based models. As shown in Figure 2d and 2e, ImageMol performs better than the fingerprint-, sequence- and graph-based models under a random scaffold split.
In the compound-protein binding prediction task, ImageMol achieved better performance on ten GPCRs (regression tasks) and ten kinases (classification tasks) compared with existing methods.
The researchers further used McNemar's test to evaluate the statistical significance of the performance differences between state-of-the-art models and ImageMol. ImageMol shows statistically significantly higher performance than existing methods on multiple datasets.
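As a rough illustration of how McNemar's test compares two classifiers on the same test set, the sketch below counts the examples on which exactly one model is correct and applies the chi-square approximation with continuity correction. The function name `mcnemar_test` and the toy predictions are made up for this example, and the article does not say which variant of the test (exact or approximate) the authors used.

```python
import math


def mcnemar_test(y_true, pred_a, pred_b):
    """McNemar's test (chi-square approximation with continuity correction).

    Returns (statistic, p_value) for the null hypothesis that models A and B
    have the same error rate on the paired examples.
    """
    # b: A correct and B wrong; c: A wrong and B correct.
    b = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a == t and o != t)
    c = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a != t and o == t)
    if b + c == 0:
        return 0.0, 1.0  # the models never disagree in correctness

    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Toy example: model A is right on 10 examples that B gets wrong,
# B is right on 2 examples that A gets wrong, and both are right on 8.
y_true = [1] * 20
pred_a = [1] * 10 + [0] * 2 + [1] * 8
pred_b = [0] * 10 + [1] * 2 + [1] * 8
stat, p_value = mcnemar_test(y_true, pred_a, pred_b)
```

Only the discordant pairs enter the statistic, which is why the test suits paired model comparisons: examples that both models classify correctly, or both misclassify, carry no information about which model is better.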
In summary, ImageMol achieves improved performance in various drug discovery tasks, outperforming state-of-the-art methods.
ImageMol also shows high accuracy in identifying anti-SARS-CoV-2 molecules across 13 high-throughput experimental datasets from the National Center for Advancing Translational Sciences (NCATS). Using ImageMol, the researchers identified candidate clinical 3C-like protease inhibitors for the potential treatment of COVID-19.
Biological interpretation of ImageMol
Next, molecular representations from different models were visualized using t-SNE to examine the biological interpretation of ImageMol. The researchers used the clusters identified by the multi-granularity chemical clusters classification (MG3C) task (see Methods in the paper) to partition molecular structures. The study found that ImageMol separates molecular structures well, and better than MACCS fingerprints and non-pretrained models. ImageMol can capture prior chemical knowledge from molecular image representations, including =O bonds, -OH bonds, -NH3 bonds and benzene rings. The researchers further used the Davies–Bouldin (DB) index to evaluate the clustering results quantitatively; a smaller DB index indicates better performance. ImageMol (DB index 1.92) outperforms MACCS fingerprints (DB index 2.93); in addition, pre-training greatly improves the molecular representation (the DB index of ImageMol without pre-training is 19.40).
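The Davies–Bouldin index averages, over all clusters, the worst-case ratio of within-cluster scatter to between-centroid distance, so compact and well-separated clusters score low. A minimal sketch of the standard definition follows. The function name `davies_bouldin` and the toy points are illustrative, and the sketch assumes Euclidean distance and distinct cluster centroids.

```python
import math
from collections import defaultdict


def davies_bouldin(points, labels):
    """Davies-Bouldin index for points (tuples of floats) and cluster labels."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two clusters")

    # Centroid and mean distance to the centroid (scatter) for each cluster.
    centroids = {
        label: [sum(coord) / len(pts) for coord in zip(*pts)]
        for label, pts in clusters.items()
    }
    scatter = {
        label: sum(math.dist(p, centroids[label]) for p in pts) / len(pts)
        for label, pts in clusters.items()
    }

    total = 0.0
    for i in clusters:
        # Worst (largest) similarity ratio between cluster i and any other one.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


# Toy example: the same four points under a good and a poor clustering.
pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = davies_bouldin(pts, [0, 0, 1, 1])
poor = davies_bouldin(pts, [0, 1, 0, 1])
```

In the toy example the good clustering yields a much smaller index than the poor one, which mirrors how the reported values (1.92 versus 2.93 and 19.40) should be read.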
Figure 3: Biological explanation of ImageMol. (Source: Paper)
Gradient-weighted Class Activation Mapping (Grad-CAM) is a commonly used visualization method for convolutional neural networks (CNNs). The researchers show Grad-CAM visualizations of ImageMol for 12 example molecules. ImageMol accurately attends to both global and local structural information at the same time. Its predictions are based on the molecular structure rather than on meaningless blank areas of the image.
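The core Grad-CAM computation is small: each feature map of a convolutional layer is weighted by the spatial average of the gradient of the class score with respect to that map, the weighted maps are summed, and negative values are clipped. The sketch below shows only this step on nested Python lists. The function name `grad_cam` and the toy inputs are illustrative, and in practice the activations and gradients would come from a deep learning framework rather than being written by hand.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heatmap from feature maps and their gradients.

    Both arguments are nested lists shaped [channels][height][width]:
    the activations of one convolutional layer and the gradients of the
    class score with respect to those activations.
    """
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel weight: global average of that channel's gradients.
    weights = [sum(map(sum, grad)) / (height * width) for grad in gradients]

    cam = [[0.0] * width for _ in range(height)]
    for weight, act in zip(weights, activations):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * act[y][x]

    # ReLU keeps only the regions that contribute positively to the class.
    return [[max(0.0, value) for value in row] for row in cam]


# Toy example: channel 0 supports the class, channel 1 argues against it.
acts = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 1.0], [0.0, 0.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(acts, grads)
```

The resulting heatmap is usually upsampled to the input image size and overlaid on it, which is how one can check whether the highlighted regions fall on the molecule or on blank background.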
The researchers then calculated coarse-grained and fine-grained hit rates. The coarse-grained hit rate shows that ImageMol uses the molecular structure for inference in 100% of the images, compared with 90.7% for the QSAR-CNN model. The fine-grained hit rate shows that ImageMol uses almost all of the structural information in a molecular image for inference, with a proportion above 99%, which reflects its ability to capture global information about molecules.
In short, ImageMol captures biologically relevant chemical information from molecular images, and does so better than existing state-of-the-art deep learning methods.
Potential directions for improvement
Several directions may further improve ImageMol: (1) integrating larger-scale biomedical data and higher-capacity molecular image models will inevitably be a focus of future work; (2) multi-view learning that combines images with other representations (such as SMILES and graphs) is an important research direction; (3) incorporating more chemical knowledge (such as atom properties, chemical properties and 3D structural information) into each image or pixel region is also a promising future direction.
In short, ImageMol is an active, self-supervised, image-processing-based strategy that provides a powerful toolbox for computational drug discovery across a variety of human diseases.