New AI Platform Elucidates Regulatory Activity in the Genome
Researchers at the University of California, San Diego (UCSD) have developed deep learning software, named EUGENe (genomic elements with neural nets), to aid in the study of gene regulatory mechanisms. Detailed in Nature Computational Science, EUGENe is designed to streamline three stages of work in regulatory genomics: extracting and transforming sequence data, training computational models, and interpreting the results.
Researchers investigating the complex gene regulatory mechanisms involved in healthy and disordered biological processes now have a new tool in their kit. A team at the University of California, San Diego (UCSD), and elsewhere has developed deep learning software that they claim can be adapted to work for various genomics projects. Details of the software, dubbed genomic elements with neural nets, or EUGENe, are provided in Nature Computational Science in a paper titled “Predictive analysis of regulatory sequences with EUGENe.”
According to the paper, EUGENe comprises various modules and subpackages for extracting and transforming sequence data, instantiating and training computational models, and evaluating and interpreting how the models behave after training. “The major goal of EUGENe is to streamline the end-to-end execution of these three stages to promote the effective design, implementation, validation, and interpretation of deep-learning solutions in regulatory genomics,” the scientists wrote.
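To make the "extracting and transforming" stage concrete, here is a minimal sketch of one common transformation: one-hot encoding, which turns a DNA string into a numeric matrix that a neural network can consume. It is written in plain Python for illustration only. The function name `one_hot_encode` and the choice to encode ambiguous bases as all-zero rows are assumptions made for this example, not EUGENe's actual API or behavior.

```python
# Hypothetical helper for illustration; NOT part of EUGENe's API.
ALPHABET = "ACGT"  # assumed column order: A, C, G, T


def one_hot_encode(sequence):
    """Return one 4-element row per base, with a 1 in the column of that base.

    Bases outside ACGT (for example N) become all-zero rows; this is an
    assumption of this sketch, and other conventions exist.
    """
    rows = []
    for base in sequence.upper():
        rows.append([1 if base == letter else 0 for letter in ALPHABET])
    return rows


if __name__ == "__main__":
    for base, row in zip("ACGTN", one_hot_encode("ACGTN")):
        print(base, row)
```

A matrix like this would typically be the input to the model training stage the paper describes.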
Deep learning is certainly not new to the genomics community. As an example, the technology has been used successfully to detect binding motifs for DNA- and RNA-binding proteins and to make predictions about chromatin states and transcriptional activity. But designing and deploying deep learning-based workflows for genomics studies has always been challenging, even for experienced researchers. That’s at least in part because “nuances specific to genomics data create an especially high learning curve for performing analyses in this space. On top of this, the heterogeneity in implementations of most code associated with publications greatly hinders extensibility and reproducibility,” the authors wrote.
Adam Klie, a PhD student at UCSD School of Medicine and the study’s first author, designed the software to mitigate those challenges, which he had also experienced in his own work. “A lot of existing platforms require many hours of coding and data wrangling to use,” he said. EUGENe is much simpler to operate. “[Y]ou give an algorithm a sequence of DNA and ask it to make predictions about anything you’d expect that DNA could predict, such as whether a particular DNA sequence is functional or whether it regulates a gene in a certain biological context.” Scientists can use the software to explore the various properties of the sequence in question and what happens when the sequence is modified.
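One common way to ask what happens when a sequence is modified is in silico mutagenesis: substitute each base in turn and record how the model's prediction changes. The sketch below illustrates the idea in plain Python. It is not EUGENe's implementation, and `toy_score` is a hypothetical stand-in for a trained model that simply counts occurrences of an assumed motif, TATA.

```python
# Illustrative sketch only; NOT EUGENe's API. `toy_score` is a hypothetical
# stand-in for a trained model's prediction function.
def toy_score(sequence, motif="TATA"):
    """Count (possibly overlapping) occurrences of the motif in the sequence."""
    width = len(motif)
    return sum(
        1
        for start in range(len(sequence) - width + 1)
        if sequence[start:start + width] == motif
    )


def in_silico_mutagenesis(sequence, score_fn):
    """Map (position, reference base, alternate base) to the change in score."""
    reference_score = score_fn(sequence)
    effects = {}
    for position, ref_base in enumerate(sequence):
        for alt_base in "ACGT":
            if alt_base == ref_base:
                continue  # skip the unmodified base
            mutant = sequence[:position] + alt_base + sequence[position + 1:]
            effects[(position, ref_base, alt_base)] = score_fn(mutant) - reference_score
    return effects


if __name__ == "__main__":
    effects = in_silico_mutagenesis("GGTATAGG", toy_score)
    # Positions where at least one substitution lowers the score.
    disruptive = sorted({pos for (pos, _, _), delta in effects.items() if delta < 0})
    print(disruptive)  # [2, 3, 4, 5], the four bases of the TATA motif
```

With a real trained model in place of `toy_score`, the same loop would highlight the positions the model considers most important for its prediction.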
The researchers put EUGENe through its paces by attempting to reproduce the results of three regulatory genomics studies that use different types of sequencing data. These datasets came from an assay of plant promoters, RNA binding protein specificity data, and ChIP-sequencing data from the ENCODE project. Analyzing different types of data would typically require mixing and matching multiple technology platforms. However, the scientists were able to successfully adapt EUGENe to each data type and reproduce the findings of each study.
The ability to do this type of reproducible analysis is critical in scientific research but can be challenging for studies that use deep learning, noted Hannah Carter, PhD, associate professor at UCSD School of Medicine and one of the authors of the paper. “EUGENe is already showing a lot of promise in how adaptable it is to different types of DNA sequencing data and supporting a lot of different deep learning models. We hope it will evolve into a platform that can support collaborative tool development by the research community and accelerate genomics research.”
At the moment, the solution works with DNA and RNA data but “does not have dedicated functions for handling protein sequence or multimodal inputs,” the researchers wrote. They plan to expand it to include new data types such as single-cell sequencing.
They will also make the solution available more broadly to the scientific community. “Deep learning can provide valuable insights into the biological machinery driving this variety, but it can be challenging to implement for researchers without extensive computer science expertise,” Carter said. “We wanted to create a platform that can help genomics researchers streamline their deep learning data analysis to make predictions from raw data.”