Maps of Biology – and Benchmarks – For All

November 7, 2024

Generation of large-scale perturbation datasets is ramping up across the biopharma industry – and with it a need for AI tools to make sense of that data, as well as benchmarks that allow those datasets and tools to be easily evaluated. In a new paper in PLOS Computational Biology, a team of scientists at Recursion and Genentech offers the first comprehensive guide for the broader research community on how to create Maps of Biology using their own datasets, along with key benchmarks for measuring their performance.

“The idea is to provide the community with a framework they can use to replicate what we are doing,” says Safiye Celik, Associate Director of Data Science at Recursion, and one of the paper’s lead authors.

Creating a Map of Biology begins with a reasonably large dataset – ideally covering several thousand genes if you are creating a genetic map, Celik says. These datasets can come from different perturbation types, such as CRISPR interference or CRISPR knockout, and may have different types of readouts, such as transcriptomic data or images. “As long as your dataset has a sufficient scale of genetic and/or chemical perturbations, you can transform readouts into biological relationships and apply all of these benchmarks to measure the quality of the resulting map,” she says.

The researchers have also made available a public codebase. This repository contains the code for map building and benchmarking, along with the benchmark annotations. The aim is to facilitate more comparable analysis and optimization of maps within the broader research community.

EFAAR: The Map-Building Pipeline

To turn perturbation datasets into maps, the researchers developed the “EFAAR” pipeline, which is explained in the paper.

“EFAAR” refers to:

Embedding assay data from each perturbation unit to generate a tractably-sized numeric representation;
Filtering perturbation units that do not pass quality criteria;
Aligning different batches of perturbation units;
Aggregating replicate units representing each targeted perturbation (e.g., a gene);
Relating different perturbations to each other with one or more numeric values.
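The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the public repository: the specific choices here (PCA via SVD for embedding, a user-supplied quality mask for filtering, per-batch centering and scaling on control units for alignment, mean aggregation, and cosine similarity for relating) are assumptions made for the example, and the function name efaar_sketch and all of its parameters are hypothetical. It also assumes every batch contains at least one control unit.

```python
import numpy as np

def efaar_sketch(features, genes, batches, is_control, passes_qc, n_components=2):
    """Toy EFAAR pass; every concrete method choice below is an assumption."""
    features = np.asarray(features, dtype=float)
    genes = np.asarray(genes)
    batches = np.asarray(batches)
    is_control = np.asarray(is_control, dtype=bool)
    keep = np.asarray(passes_qc, dtype=bool)

    # Embed: project centered features onto the top principal components.
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    emb = centered @ vt[:n_components].T

    # Filter: drop units flagged as failing quality criteria.
    emb, genes = emb[keep], genes[keep]
    batches, is_control = batches[keep], is_control[keep]

    # Align: center and scale each batch using that batch's control units.
    aligned = np.empty_like(emb)
    for b in np.unique(batches):
        in_batch = batches == b
        ctrl = emb[in_batch & is_control]
        mu, sd = ctrl.mean(axis=0), ctrl.std(axis=0)
        sd[sd == 0] = 1.0  # avoid division by zero for constant dimensions
        aligned[in_batch] = (emb[in_batch] - mu) / sd

    # Aggregate: average the replicate units of each non-control perturbation.
    labels = np.unique(genes[~is_control])
    agg = np.stack([aligned[(genes == g) & ~is_control].mean(axis=0) for g in labels])

    # Relate: pairwise cosine similarity between aggregated perturbations.
    norms = np.linalg.norm(agg, axis=1, keepdims=True)
    unit = agg / np.maximum(norms, 1e-12)
    return labels, unit @ unit.T
```

The function returns the sorted perturbation labels and a square similarity matrix whose rows and columns follow that order, which is the kind of relationship table the benchmarks in the paper are designed to score.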

Researchers at Recursion and Genentech recently used this framework when they collaborated on building the world’s first neuroscience phenomap, or “Neuromap”. To generate the data, researchers built specific cell manufacturing technologies that could derive neurons from human induced pluripotent stem cells (hiPSCs) at scale – ultimately producing over 1 trillion hiPSC-derived neuronal cells. The map generated by applying the EFAAR pipeline to this data is intended to facilitate the discovery of novel therapeutic candidates for neurodegenerative diseases.

Graphical abstract of the introduced framework.

To demonstrate how other researchers can use these same tools to make their own maps, the authors applied EFAAR techniques to four datasets, constructing 18 different maps. The datasets are Recursion’s public RxRx3 dataset, which contains deep neural network embeddings of phenomic images representing about 17,000 genes in primary HUVEC cells; GWPS (Genome-Wide Perturb-Seq), which perturbed 10,000 expressed genes in K562 cells; cpg0016 from the JUMP (Joint Undertaking of Morphological Profiling) consortium, comprising 8,000 druggable gene knockouts in the U2OS cell line; and cpg0021, containing data from 20,000 perturbed genes in HeLa cells. These datasets represent diverse perturbation types and readouts: single-cell transcriptomic profiling of CRISPR interference perturbations (GWPS), arrayed morphological screening with CRISPR-Cas9 knockout perturbations (RxRx3 and cpg0016), and pooled optical screening with CRISPR-Cas9 knockouts (cpg0021).

Following map building, the authors applied the proposed benchmarks, which assess each map’s perturbation signal rates and its ability to identify known biology based on five different annotation datasets. The underlying hypothesis for the latter is, as they write: “If a map can identify known relationships to a high degree, it is an indication that it demonstrates a strong representation of existing biology, and is therefore more likely to accurately represent and uncover novel biological relationships.”
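As a rough illustration of the known-biology idea, one way to score a map is to ask what fraction of annotated gene pairs fall in the extreme tails of the map’s pairwise similarity distribution. The sketch below is an assumption-laden example rather than the paper’s exact benchmark: the function name known_relationship_recall, the 5% default tail threshold, and the input format (a square similarity matrix plus a list of annotated pairs) are all hypothetical choices made for this example.

```python
import numpy as np

def known_relationship_recall(sim, labels, known_pairs, tail=0.05):
    """Fraction of annotated pairs whose similarity lies in either tail.

    sim: square similarity matrix; labels: row/column names;
    known_pairs: iterable of (gene_a, gene_b) from an annotation source.
    The tail threshold is an illustrative assumption.
    """
    sim = np.asarray(sim, dtype=float)
    index = {g: i for i, g in enumerate(labels)}

    # Null distribution: all distinct pairs in the map (upper triangle only).
    upper = np.triu_indices(len(index), k=1)
    low, high = np.quantile(sim[upper], [tail, 1.0 - tail])

    hits = total = 0
    for a, b in known_pairs:
        # Skip annotated pairs that the map does not cover.
        if a in index and b in index and a != b:
            total += 1
            s = sim[index[a], index[b]]
            hits += int(s <= low or s >= high)
    return hits / total if total else float("nan")
```

Both tails are counted because a strongly negative similarity can be as informative about a relationship as a strongly positive one; a map with no signal would be expected to recover roughly twice the tail fraction of annotated pairs by chance.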

Observations

The researchers found that maps from two datasets – RxRx3 and GWPS – consistently performed best across perturbation signal and biological relationship benchmarks, and say the results underscore how differences in assay design affect a map’s performance. For example, RxRx3 uses a unique negative control strategy and single-guide treatments in wells, which may provide more stable representations of gene perturbations than those from other imaging datasets. Sample size also matters, with increasing sample size leading to better biological insights and reduced error.

To broadly evaluate the utility of the constructed maps in biological discovery tasks, the researchers looked at which protein complexes the maps were able to identify. This analysis was based on CORUM, a database of manually annotated protein complexes from mammalian organisms. Twelve complexes were consistently identified by all three datasets with fully unblinded metadata, and all 12 involve fundamental cellular processes crucial for the proper functioning and regulation of a cell.

A split cosine similarity heatmap of the Integrator complex subunits from the RxRx3 and GWPS maps.

The researchers also found 50 complexes that were uniquely identified by the CRISPRi-based GWPS map with a transcriptional readout. They demonstrated how well the modular structure of one of these protein complexes – the Integrator – was captured by the maps constructed using other perturbation datasets. The substructures within the Integrator were accurately identified by RxRx3 and by GWPS, but not by cpg0021. (The cpg0016 map could not be assessed here since only one of the genes in the Integrator complex was screened in that dataset.)
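A split heatmap of the kind described above places one map’s similarities in the upper triangle and another map’s in the lower triangle, so the two can be compared gene pair by gene pair. The following is a minimal sketch of how such a matrix could be assembled; the function name split_similarity_matrix is hypothetical, which map goes in which triangle is an arbitrary choice here, and both inputs are assumed to share the same gene order.

```python
import numpy as np

def split_similarity_matrix(sim_upper, sim_lower):
    """Combine two same-ordered similarity matrices into one split matrix.

    Entries above the diagonal come from sim_upper, entries below it
    from sim_lower; the diagonal is set to 1 (self-similarity).
    """
    a = np.asarray(sim_upper, dtype=float)
    b = np.asarray(sim_lower, dtype=float)
    if a.ndim != 2 or a.shape[0] != a.shape[1] or a.shape != b.shape:
        raise ValueError("inputs must be square matrices of the same shape")

    # Keep the strict upper triangle of one map and strict lower of the other.
    out = np.triu(a, k=1) + np.tril(b, k=-1)
    np.fill_diagonal(out, 1.0)
    return out
```

The resulting array can be passed to any heatmap plotting routine; subunits that cluster in both maps show up as matching blocks mirrored across the diagonal.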

Finally, in a deeper analysis of the RxRx3 and GWPS maps, the researchers uncovered evidence for the roles of two poorly annotated genes, C18orf21 and C1orf131, corroborating recent studies on the suggested functions of these genes. This is a prime example of how the constructed maps can be used to elucidate the roles of lesser-known genes in biological research.

The researchers note that the framework they present can be applied to any large-scale biological map building and benchmarking effort, and can ultimately include a much broader range of perturbation types and assay variables.

“The next steps involve others utilizing the codebase and framework, applying it to their unique datasets, and sharing their findings,” says Celik. “Using the same codebase will make these results comparable and reproducible – and not only the research will advance, but the code will continuously improve and evolve as well.”

Read the full paper at PLOS Computational Biology: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1012463

Researchers include: Safiye Celik, Jan-Christian Hütter, Sandra Melo Carlos, Nathan Lazar, PhD, Rahul Mohan, Conor Tillinghast, Tommaso Biancalani, Marta Fay, Berton Earnshaw & Imran Haque