EagleC is a framework that combines deep learning and ensemble learning strategies to predict the full range of structural variations at high resolution. Structural variations, including deletions, inversions, duplications, and translocations, can directly lead to the development

2024/05/07 21:55:34

EagleC is a framework that combines deep learning and ensemble learning strategies to predict the full range of structural variations at high resolution.

Background

Structural variations (SVs), including deletions, inversions, duplications, and translocations, can directly lead to tumors and other diseases through a variety of mechanisms. Recent studies have shown that SVs can bring distal enhancers close to proto-oncogenes and upregulate oncogene expression through a mechanism known as enhancer hijacking. The discovery of recurrent SVs has greatly improved the understanding of tumorigenesis and supports effective targeted therapy.

Despite their importance, genome-wide detection of SVs remains a challenging problem. Traditionally, karyotyping has been the main clinical method for detecting various genetic diseases, but it is inherently low-throughput and low-resolution. Gene chips (microarrays) have also been used to identify gains and losses of genetic material, but they are limited in detecting copy-number-neutral events such as inversions and balanced translocations. In recent years, short-read whole-genome sequencing (WGS) has been widely used to identify various genomic variants because of its high resolution, high throughput, and simplicity. However, because of the mappability problem of short reads, it is difficult to detect SVs in repetitive regions using WGS.

Recently, the researchers and other groups discovered that Hi-C, a technology originally used to study three-dimensional genome structure, can also be used for systematic SV detection with genome coverage as low as 1×. So far, three methods have been proposed to predict SVs using Hi-C data, but each has limitations. HiCtrans and HiNT-TL cannot predict intrachromosomal SVs, while Hi-C breakfinder can only detect large intrachromosomal SVs (larger than 1 Mb).

Main content

EagleC, developed by Feng Yue and colleagues at Northwestern University, outperforms existing methods in both precision and recall and can uniquely capture a set of fusion genes missed by whole-genome sequencing or nanopore sequencing. In addition, EagleC can effectively capture SVs on other chromatin interaction platforms, such as HiChIP, chromatin interaction analysis with paired-end tag sequencing (ChIA-PET), and capture Hi-C. The researchers applied EagleC to more than 100 cancer cell lines and primary tumors and identified a valuable set of high-quality SVs. Finally, the researchers demonstrated that EagleC can be applied to single-cell Hi-C and used to study SV heterogeneity in primary tumors.

EagleC Framework Overview

Figure 1A describes the overall design of the EagleC framework. Positive training samples were defined as Hi-C contact matrices around a set of high-confidence SVs detected in 8 cancer cell lines (A549, Caki2, K562, LNCaP, NCI-H460, PANC-1, SK-N-MC, and T47D). In addition, to enable the model to distinguish true SV signals from false-positive signals induced by normal 3D genome features, the researchers sampled similar numbers of intrachromosomal and interchromosomal submatrices from the Hi-C map of the normal cell line GM12878 and labeled them as intra-negative and inter-negative, respectively. Matrices from cancer Hi-C data that are located in SV blocks but do not overlap with breakpoints were included as an additional negative data set.
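
As a rough illustration of how a training sample could be cut from a contact matrix, the sketch below extracts a square window centered on a pair of breakpoint bins. The function name `extract_window`, the window half-width, and the zero padding at matrix edges are assumptions made for this example and are not details taken from the paper.

```python
def extract_window(matrix, row, col, half_width=2):
    """Return the (2*half_width + 1)-square sub-matrix centered on (row, col).

    Cells that fall outside the matrix are padded with 0 (an assumption).
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if n_rows else 0
    window = []
    for i in range(row - half_width, row + half_width + 1):
        line = []
        for j in range(col - half_width, col + half_width + 1):
            inside = 0 <= i < n_rows and 0 <= j < n_cols
            line.append(matrix[i][j] if inside else 0)
        window.append(line)
    return window


# Toy 6x6 "contact matrix" where entry (i, j) holds 10*i + j
toy = [[10 * i + j for j in range(6)] for i in range(6)]
sample = extract_window(toy, row=1, col=4, half_width=1)
```

In this toy setup a positive sample would be a window around a known breakpoint, and a negative sample a window taken elsewhere, for example from a normal cell line.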

Figure 1. EagleC predicts the full range of high-resolution SVs based on chromatin interaction data. Image from Sci. Adv.

The researchers used downsampled versions of the training samples to train a series of EagleC models optimized for different sequencing depths. To evaluate EagleC, they predicted SVs in other cancer Hi-C datasets that were not used in training (the default resolution of all SVs in the article is 5 kb). EagleC successfully predicted different types of SVs, including short-range SVs with breakpoint distances of less than 1 Mb or even 100 kb (Figure 1B-D), large intrachromosomal SVs (Figure 1E), reciprocal interchromosomal translocations (Figure 1F), and nonreciprocal interchromosomal translocations (Figure 1G).
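
One simple way to imitate a lower sequencing depth is to thin the contact counts at random. The sketch below keeps each contact independently with a fixed probability (binomial thinning); this is an illustrative assumption, since the article does not say how the downsampling was done, and the names `downsample` and `keep_fraction` are hypothetical.

```python
import random


def downsample(matrix, keep_fraction, seed=0):
    """Keep every contact independently with probability keep_fraction."""
    rng = random.Random(seed)
    return [
        [sum(1 for _ in range(count) if rng.random() < keep_fraction)
         for count in row]
        for row in matrix
    ]


# Toy 3x3 matrix of contact counts
full = [[40, 12, 3], [12, 50, 9], [3, 9, 60]]
half = downsample(full, keep_fraction=0.5)
```

Because every cell is thinned independently, a symmetric input matrix will generally not stay symmetric; a real pipeline would more likely thin only the upper triangle and mirror it.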

EagleC outperforms existing methods in detecting SVs on Hi-C maps

The researchers first visually inspected the prediction results and found that almost all blocks with unusually high interaction frequencies were predicted as SVs, indicating that the framework has high sensitivity (Figure 2A). In many cases, although EagleC and Hi-C breakfinder predicted the same SV block, the exact coordinates of the predicted breakpoints differed, and the breakpoints predicted by EagleC were more likely to be verified by WGS (Figure 2A, regions "A", "C", "D", and "E"). Additionally, EagleC predicts breakpoints at 5 kb resolution, which is finer than Hi-C breakfinder, whose predictions are typically at 100 kb resolution.

Figure 2. EagleC’s superior performance in precision and recall. Image from Sci. Adv.

The researchers then performed a more in-depth comparison of EagleC and Hi-C breakfinder, currently the only two methods that can identify intrachromosomal SVs. Notably, the numbers of SVs (including interchromosomal translocations and intrachromosomal SVs) detected by EagleC in BT-474, HCC1954, and MCF7 were 2.4-fold (244/100), 2.6-fold (410/157), and 4.8-fold (244/51) those detected by Hi-C breakfinder, respectively (Fig. 2B). At the same time, EagleC achieved significantly higher accuracy than Hi-C breakfinder in these cell lines.

In BT-474, 24.2% of the SVs predicted by EagleC matched 59.0% of those predicted by Hi-C breakfinder. Of the 185 SVs unique to EagleC, 83.2% could be verified by WGS or nanopore sequencing, compared with 2.4% of the SVs unique to Hi-C breakfinder (Figure 2C).
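
The overlap percentages for BT-474 can be checked with simple arithmetic. The sketch below takes the totals from the fold-change figures quoted earlier (244 EagleC SVs versus 100 Hi-C breakfinder SVs) and assumes that matched SVs pair one-to-one between the two methods; the variable names are illustrative.

```python
eaglec_total = 244       # SVs predicted by EagleC in BT-474
breakfinder_total = 100  # SVs predicted by Hi-C breakfinder in BT-474
eaglec_unique = 185      # SVs found only by EagleC

# Assumption: each shared SV is counted once in both call sets
shared = eaglec_total - eaglec_unique
pct_of_eaglec = round(100 * shared / eaglec_total, 1)
pct_of_breakfinder = round(100 * shared / breakfinder_total, 1)
breakfinder_unique = breakfinder_total - shared

print(shared, pct_of_eaglec, pct_of_breakfinder, breakfinder_unique)
```

Under that assumption, 59 shared SVs account for both the 24.2% and the 59.0% figures, and 41 SVs remain unique to Hi-C breakfinder.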

Next, the researchers expanded the analysis to an additional 26 cancer cell lines or patient samples with Hi-C and WGS data. EagleC again achieved significantly higher recall and precision than Hi-C breakfinder across all 26 cancer samples (Figure 2D-F). Because of the limitations of its algorithm, Hi-C breakfinder can only detect intrachromosomal SVs larger than 1 Mb. However, as shown in Figure 2G, 39.5% of the intrachromosomal SVs predicted by EagleC were short-range SVs, with a minimum size of 35 kb. Surprisingly, although SVs in this range were thought to be difficult to distinguish from other Hi-C contact patterns, their prediction accuracy was even higher than that of long-range SVs and translocations (Figure 2H).
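
For readers less familiar with the two metrics, the sketch below computes precision and recall from a set of predicted SVs and a reference set (for example, SVs supported by WGS). The identifiers and the exact-match comparison are simplifying assumptions; a real comparison would match breakpoints within some distance tolerance.

```python
def precision_recall(predicted, reference):
    """Precision = validated predictions / all predictions.
    Recall = recovered reference SVs / all reference SVs."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical SV identifiers, for illustration only
calls = {"sv1", "sv2", "sv3", "sv4"}
validated = {"sv1", "sv2", "sv5"}
precision, recall = precision_recall(calls, validated)
```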

EagleC detects novel fusion genes in cancer

As shown in Figure 3A, EagleC detected breakpoints within the ATXN7 and BCAS3 genes in MCF7, and the arriba software also predicted the fusion of these two genes (Figure 3A, right). Two additional examples are shown in Figure 3, demonstrating that, because of its high resolution, EagleC can uniquely predict fusion genes missed by WGS and nanopore sequencing. In addition, genes involved in these fusion events were significantly overexpressed in cancer cells compared with non-malignant cell lines that did not carry the fusion (Fig. 3D).

Figure 3. Fusion genes detected only by EagleC are overexpressed in cancer cells. Image from Sci. Adv.

EagleC can accurately predict SVs using other 3C-based technologies

The researchers directly applied the EagleC model trained with Hi-C data to CTCF ChIA-PET and Pol2 ChIA-PET. Overall, EagleC predicted a similar number of SVs in Hi-C, CTCF ChIA-PET, and Pol2 ChIA-PET, and there was considerable overlap between the three data sets (Figure 4A-B). For example, EagleC predicted 226 SVs in CTCF ChIA-PET, 66.4% of which were also predicted in Hi-C. Likewise, 62.8% (123 of 196) of the SVs predicted in Pol2 ChIA-PET matched 50.4% (123 of 244) of those predicted in Hi-C. In terms of accuracy, EagleC achieved comparable accuracy in the two ChIA-PET datasets (CTCF, 65.5%; Pol2, 68.2%) and in Hi-C (73.8%) (Figure 4C). In addition, we observed that the recall and accuracy of SVs predicted by EagleC were significantly higher than those of Hi-C breakfinder in all 10 HiChIP/ChIA-PET datasets with matched WGS data (Figure 4D-F).

Figure 4. EagleC can accurately predict SVs on HiChIP and ChIA-PET contact maps. Image from Sci. Adv.

Detection of SVs in 105 tumor specimens

If multiple datasets are available for the same sample, combining their results yields a more comprehensive set of SV annotations. We predicted 5620 SVs across all samples, with the number in each sample ranging from 2 to 410 (Figure 5A). Across all samples, 30.9% of predicted SVs were short-range SVs (<1 Mb), 35.7% were long-range SVs, and 33.4% were interchromosomal translocations.
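
The three categories above can be expressed as a small classification rule based on the two breakpoints of an SV. The sketch below is a minimal illustration; the function name `classify_sv` and the example coordinates are hypothetical, and the 1 Mb cutoff is the short-range threshold mentioned in the text.

```python
SHORT_RANGE_LIMIT = 1_000_000  # 1 Mb, the short-range cutoff used in the text


def classify_sv(chrom1, pos1, chrom2, pos2):
    """Assign an SV to one of the three categories by its breakpoints."""
    if chrom1 != chrom2:
        return "interchromosomal translocation"
    if abs(pos2 - pos1) < SHORT_RANGE_LIMIT:
        return "short-range SV"
    return "long-range SV"


# Hypothetical breakpoint coordinates, for illustration only
examples = [
    classify_sv("chr9", 130_000_000, "chr22", 23_000_000),
    classify_sv("chr17", 5_000_000, "chr17", 5_400_000),
    classify_sv("chr1", 10_000_000, "chr1", 60_000_000),
]
```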

At the sub-megabase scale, mammalian genomes are organized into topologically associating domains (TADs). TAD boundaries, which are enriched in CTCF binding sites, provide an insulated environment for proper gene regulation. Compared with the expected distribution of randomly shuffled SVs, we found that SV breakpoints were significantly closer to TAD boundaries (Fig. 5C), which is consistent with previous studies showing that DNA topoisomerase II beta (TOP2B)-mediated DNA double-strand breaks are enriched at the anchors of chromatin loops. Overall, approximately 10% of SVs occurred between TAD boundaries, 37.5% occurred between a TAD boundary and an intra-TAD region, and 52.5% occurred between intra-TAD regions (Figure 5D). In addition, we found that transcription start sites (TSSs) of cancer-related genes were specifically enriched at breakpoint-related TAD boundaries (Figure 5E). This suggests that disruption of TAD boundaries by genome rearrangement may be an important mechanism for oncogene dysregulation and tumorigenesis.
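
The boundary analysis above relies on two simple operations: measuring how far a breakpoint is from the nearest TAD boundary, and grouping SVs by whether each breakpoint sits at a boundary. The sketch below shows both for a single chromosome with a sorted, non-empty list of boundary positions. The 5 kb tolerance, the function names, and the coordinates are assumptions for illustration, not parameters from the paper.

```python
from bisect import bisect_left


def nearest_boundary_distance(position, boundaries):
    """Distance in bp from position to the closest value in sorted boundaries."""
    idx = bisect_left(boundaries, position)
    candidates = boundaries[max(idx - 1, 0): idx + 1]
    return min(abs(position - b) for b in candidates)


def sv_category(pos1, pos2, boundaries, tolerance=5_000):
    """Group an SV by how many of its breakpoints lie at a TAD boundary."""
    hits = sum(
        nearest_boundary_distance(p, boundaries) <= tolerance
        for p in (pos1, pos2)
    )
    labels = ["intra-TAD to intra-TAD", "boundary to intra-TAD",
              "boundary to boundary"]
    return labels[hits]


# Hypothetical TAD boundaries on one chromosome
tad_boundaries = [100_000, 500_000, 900_000]
```

Comparing the observed distances against those obtained after shuffling breakpoint positions would give the kind of expected distribution the text refers to.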

Figure 5. Pan-cancer analysis of SVs in 105 cancer cell lines or patient samples. Image from Sci. Adv.

To further explore the value of our SV annotations, we identified genes that were recurrently affected by short-range SVs in different samples. Most of the deleted genes were tumor suppressor genes (Figure 5F), such as CDKN2A/2B, WWOX, CHFR, and MSH2. On the other hand, many genes within duplicated regions were oncogenes (Fig. 5G), such as MYC. The CD44 gene, a common biomarker for cancer stem cells, encodes a cell surface glycoprotein involved in tumor initiation and progression.

EagleC predicts known interchromosomal translocations in single cells

To make EagleC applicable to scHi-C, which has limited contact information per cell, we downsampled the contact maps of the same eight cancer cell lines and GM12878 cells to a comparable sequencing depth and retrained the model at 500 kb resolution. The researchers then tested EagleC on published scHi-C datasets from HAP1 and K562, both chronic myeloid leukemia cell lines. Chromosomes 9 and 22 are reciprocally translocated in HAP1 cells and non-reciprocally translocated in K562 cells. The HAP1 dataset contains 256 single cells with a median of 18,793 contacts per cell, whereas the K562 dataset contains 337 cells with a median of only 3,974 contacts per cell (Figure 6A). Remarkably, even with these extremely sparse contact matrices, EagleC was able to predict the known chr9-chr22 translocation in single cells (Figure 6B-C).

Figure 6. EagleC predicts known interchromosomal translocations in single cells. Image from Sci. Adv.

To systematically study the lower limit of the number of contacts needed to accurately predict SVs in a single cell, the researchers ranked all 256 HAP1 cells by sequencing depth and generated a series of contact matrices (contact pairs from 1,486,350,000 to 4,050,000) (Figure 6D). As expected, the number of predicted SVs decreased with increasing cell number (Figure 6E).
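
One possible reading of this step is that cells are ordered by depth and then merged progressively, so that each matrix in the series contains more contacts than the previous one. The sketch below computes the running contact totals for such a series; the merging strategy, the deepest-first ordering, and the per-cell counts are all assumptions for illustration.

```python
def cumulative_contacts(contacts_per_cell):
    """Rank cells by depth (deepest first) and return running contact totals."""
    ranked = sorted(contacts_per_cell, reverse=True)
    totals, running = [], 0
    for count in ranked:
        running += count
        totals.append(running)
    return totals


# Hypothetical per-cell contact counts
series = cumulative_contacts([18_000, 4_000, 25_000])
```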

Conclusion

This paper combines the strengths of CNNs in image recognition with ensemble learning to avoid over-fitting. EagleC can not only predict unique short-range SVs but also greatly improves overall prediction performance compared with existing methods. The paper also demonstrates the feasibility of using Hi-C to detect fusion genes. Although the current framework cannot achieve base-pair resolution, Hi-C has a unique ability, compared with RNA-seq, to detect fusion points within introns. Furthermore, EagleC can be used as a general model to predict SVs from other 3C-based contact maps, including ChIA-PET, HiChIP/PLAC-Seq, capture Hi-C, and even scHi-C.
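
As a generic illustration of why an ensemble can be more robust than a single model, the sketch below averages the probabilities produced by several models and calls an SV only when the mean passes a threshold. The article does not state how EagleC combines its models, so the averaging rule, the 0.5 threshold, and the scores are assumptions.

```python
from statistics import fmean


def ensemble_call(model_scores, threshold=0.5):
    """Average per-model probabilities and compare the mean to a threshold."""
    mean_score = fmean(model_scores)
    return mean_score, mean_score >= threshold


# Hypothetical scores from three models for two candidate SVs
score_a, call_a = ensemble_call([0.9, 0.8, 0.7])
score_b, call_b = ensemble_call([0.9, 0.1, 0.2])
```

In the second candidate a single over-confident model is outvoted by the other two, which is the intuition behind using an ensemble to limit over-fitting.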

References

Wang X, Luan Y, Yue F. EagleC: A deep-learning framework for detecting a full range of structural variations from bulk and single-cell contact maps. Science Advances, 2022, 8(24): eabn9215.
