Fast and Scalable Analysis of 5 and 6-Base Genomes

Poster

Last Updated: July 25, 2024

(+ more)

Published: July 22, 2024

Credit: iStock

DNA methylation affects how and when genes are expressed. Hence, abnormal methylation can be linked to the development and progression of various diseases.

However, analysis of methylation data can be challenging. Many traditional tools have complex workflows and lack scalability, meaning that standard analyses require scale computing infrastructure.

This poster highlights a bioinformatics workflow for the analysis of genomic and methylome data that provides access to unprecedented insights into current and future biological states.

Download this poster to discover a software package that:

Offers an extensive range of visualizations and analyses of methylation profiles.
Provides a user-friendly experience with a simple and intuitive interface.
Can analyze multiomic data.

Unlocking scalable and efficient multiomic analysis of 5- and 6- base genomes Analysing methylation data is challenging, many existing analysis tools are difficult to work with and do not scale well as the number of samples increases. This lack of scalability means that standard analyses, such as identifying differentially methylated regions (DMRs), or summarising methylation fractions over genomic regions, require substantial time and memory – typically necessitating large scale compute infrastructure (e.g. compute clusters, cloud). duet multiomics solution evoC is a new sequencing technology, that simultaneously derives all four genetic bases without ambiguity in C or T calls, alongside distinguishing 5-methylcytosine and 5 hydroxymethylcytosine (6-base data) in a single read from a single DNA molecule [1] (Figure 1). The technology consists of pre-sequencing library prep and post-sequencing analysis pipeline, providing single-base resolution of genetics and epigenetics at high accuracy. This expansion of biological signal that can be generated in a single sequencing experiment further increases the scale and complexity of downstream analysis, necessitating the development of more efficient analysis software. To address this challenge, we present modality, a fast and scalable array-based python package for the analysis of 5 and 6-base genomes (genetics, 5-mC and 5-hmC). 1. Introduction Nicholas Harding, Michael Wilson, Jean Teyssandier, David Currie, Casper Lumby, William Stark, Mark S. Hill, Páidí Creed DNA sample Pre-sequencing lab protocol NGS sequencing Bioinformatic pipeline Insight + + + = 1 - Strand synthesis - creates a single molecule with a direct copy of the original information tethered together with a hairpin. The copy strand is without cytosine modifications initially, but importantly, utilises a high fidelity methyltransferase to specifically copy 5mC from the original to the copy strand. 2 - Paired-end read sequencing - generates sequence information after protection of cytosine modifications followed by deamination of all remaining cytosines to uracils, read as thymine in SBS. 3 - Read resolution - aligns original and copy strands to correctly call all 4 canonical bases in addition to 5mC and 5hmC. 4 - Aligned (4 base) reads with 5mC & 5hmC are tagged (6-Base information) Figure 1 | duet multiomics solution evoC is a 6-base calling technology that reads all four canonical bases plus 5mC and 5hmC. 2. duet multiomics solution evoC 3. modality core concepts Parallel processing and memory efficiency modality leverages the parallel processing capabilities of dask for efficient computation. Multi-dimensional Data Interaction modality enables seemless interaction with datasets across multiple dimensions using xarray. zarr storage backend modality interacts with zarr, leveraging chunked and compressed sets of n- dimensional arrays, allowing analyses to efficiently scale to many samples (>100) even with very limited RAM. Figure 3 | Performance benchmarking of modality One common operation that provides useful insight into the relationship between samples, in terms of their methylation profiles, is to compute a pairwise Pearson correlation matrix (see Figure 4). This is a computationally expensive operation so we used this to benchmark against an existing tool - methylkit. We used a machine with 16Gb of RAM to match what we might expect to be available on a typical laptop and computed the matrices for varying numbers of samples (>=1<=110) using both modality and methylkit. We were not able to generate the matrices using methylkit for datasets >10 samples, this would cause an out-of- memory error and crash. In contrast, we were able to generate these matrices for 110 samples in modality, using <1Gb RAM and running in ~6 minutes. Figure 2 | a view of the modality ContigDataset The core data structure used by modality is the ContigDataset. This contains arrays that represent methylation counts as well as accompanying arrays which encode the coordinates. This object also provides a set of easyto-use and efficient methods for working with them. Each of the arrays are chunked Dask arrays allowing extremely efficient computation. 4. modality in action etc... modality is built around three core python packages: zarr, xarray, and dask. These are powerful modern data science packages which collectively allow modality to deal with larger-than-memory data arrays in an expressive and paralelised fashion. This foundation means that analyses that would previously require long run times and extensive compute infrastructure can now run quickly on one's laptop - speeding up iterative data analysis. 6. References odality As well as focussing on performance, modality was also built with the idea of providing a very user-friendly experience with a simple and intuitive API. For example, calculating pairwise Pearson correlations between samples, for a given variable, and plotting the matrix can be done with the following lines of code: dataset.plot_pearson_matrix( numerator="num_modc", denominator="num_total_c", min_coverage=10, ) See Figure 4 for result. Figure 4 | Pearson matrix plot for Genome in a bottle data The above plot is the output of the code block (left). It was produced using a duet +modC dataset of genome in a bottle (GIAB) samples and highlights the similarity between samples in terms of the fraction of modC calls relative to total C calls. This dataset is distributed with the modality package. Figure 5 | Extensive array of visualisations and analyses for 5- and 6-base genomes modality offers an extensive range of visualisations and analyses to help understand genome-wide methylation profiles. A, 6-base ternary plot showing density of data over bins of C:mC:hmC methylation fractions in mouse Ese14 cells. B, Methylation profile plots over genomic regions. C, Tile plot from differentially methylation regions (DMR) analysis, showing a genomic region with consistently different methylation fractions between two sets of samples. A B C 5. Conclusion[1] Simultaneous sequencing of genetic and epigenetic bases in DNA, Füllgrabe and Gosal et al., Nature Biotechnology (2023) (duet multiomics solution technology paper) To address the difficulties of analysing methylation data, we present modality, an efficient and scalable analysis package for 5- and 6-base genomes. The package is built on a core set of performant data science libraries and roots the user into a powerful ecosystem for data analysis in Python. modality has an intuitive API providing powerful analysis and visualisation methods. Moving forward, the underlying data structure used by modality is very extensible, allowing other data modalities to be incorporated and analysed alongside the methylation data modalities shown here. Efficient and rapid tooling for analysis Rich set of data visualisations

Brought to you by

Download the Poster for FREE Now!

Information you provide will be shared with the sponsors for this content. Technology Networks or its sponsors may contact you to offer you content or products based on your interest in this topic. You may opt-out at any time.