Recursion is releasing the first in a potential series of foundation models for external use, hosted on the NVIDIA BioNeMo platform. This series of models is called Phenom, which is a play on words hinting at both ‘phenomenal’ and ‘phenomics’ (defined below). This release follows the multi-year collaboration with NVIDIA we announced last July.
Images are a powerful tool for unraveling the complex, interconnected network of relationships across biological systems. Cell images provide an extraordinary amount of information about how that cell is functioning. Moreover, with the help of high-throughput screening, images can be standardized and scaled to capture the cell’s responses to millions of different chemical and genetic perturbations, such as a gene knockout or a potential drug.
At Recursion, we call this phenomics - the systematic study of a cell’s phenotype in response to many different chemical or genetic perturbations. It’s one of several layers of data that form the foundation of our maps of biology and chemistry and allow us to discover novel relationships that lead to new drug discovery programs.
And now, after investing nearly a billion dollars to build the Recursion OS, we are pleased to release one important component of our work, a phenomics foundation model we call Phenom-Beta. It flexibly processes cellular microscopy images into general-purpose embeddings at any scale, from small projects to billions of images. In other words, Phenom-Beta can turn a series of image inputs into meaningful representations that are foundational to analyzing and understanding the underlying biology. We are putting some of the power of Recursion’s approach into a form accessible to the scientific community, subject to commercial limitations (please see the license details).
To pick up on the subtle changes in cellular morphology that are often undetectable to the human eye, we use computer vision models like Phenom-Beta that can extract biologically meaningful features to create a digital representation. That allows us to systematically relate genetic and chemical perturbations to one another in a high-dimensional space, helping determine critical mechanistic pathways and identify potential targets and drugs.
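To make the idea of relating perturbations in a high-dimensional space concrete, here is a minimal sketch that is not Recursion's pipeline. It assumes each perturbation has a few embedding vectors (one per image or well), averages them into a profile, and ranks other perturbations by cosine similarity. The perturbation names, the 4-dimensional vectors, and the helper functions are all made up for illustration; real embeddings would have many more dimensions and would typically be normalized and batch-corrected first.

```python
import math

def mean_embedding(vectors):
    """Average several embedding vectors into one profile per perturbation."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_neighbors(query, profiles):
    """Return (name, similarity) pairs for all other perturbations, most similar first."""
    scores = [
        (name, cosine_similarity(profiles[query], vector))
        for name, vector in profiles.items()
        if name != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical embeddings: two replicate images per perturbation (values are invented).
embeddings = {
    "gene_A_knockout": [[0.9, 0.1, -0.4, 0.2], [1.1, -0.1, -0.6, 0.2]],
    "compound_X": [[0.8, 0.0, -0.5, 0.3], [1.0, 0.2, -0.5, 0.1]],
    "gene_B_knockout": [[-0.7, 0.9, 0.3, -0.2], [-0.5, 1.1, 0.1, 0.0]],
}

profiles = {name: mean_embedding(vectors) for name, vectors in embeddings.items()}
ranked = rank_neighbors("gene_A_knockout", profiles)

for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

In this toy data, the invented compound lands closest to the invented knockout, which is the kind of signal that could prompt a hypothesis that the two act on a related pathway.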
Phenom-Beta was trained using the RxRx3 dataset, a publicly available dataset we released last year containing approximately 2.2 million images of HUVEC cells across ~17,000 genetic knockouts and 1,674 known chemical entities. Although it was trained on a single imaging assay based on Cell Painting, the model can be applied across different assays and datasets, such as brightfield images and the JUMP-CP dataset, which uses the original Cell Painting assay.
As we were training this model, we shared progress on how effectively it could reconstruct partially masked images, which is the task the model is trained on. We were excited to discover that as we increased the size of the training data and the number of parameters, the model’s performance increased, demonstrating that the scaling hypothesis holds true in this domain of biology. This highlights the importance of having a data generation strategy in order to create high-quality datasets for machine learning training, along with ample compute to handle larger models. The largest model performed up to 28% better at recapitulating known biological relationships, which we shared in a NeurIPS workshop paper in November.
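For readers unfamiliar with masked-image training, the sketch below illustrates the general idea under stated assumptions; it is not Phenom-Beta's training code. It splits a single-channel image into square patches, hides a random subset, and scores a reconstruction only on the hidden pixels. The patch size, the 75% mask ratio, the mean-squared-error score, and the function names are illustrative choices, and the image height and width are assumed to be multiples of the patch size.

```python
import random

def mask_patches(image, patch_size, mask_ratio, seed=0):
    """Zero out a random subset of square patches; return the masked copy and the hidden patch ids."""
    rows = len(image) // patch_size
    cols = len(image[0]) // patch_size
    patch_ids = [(r, c) for r in range(rows) for c in range(cols)]
    rng = random.Random(seed)
    n_masked = int(round(mask_ratio * len(patch_ids)))
    masked = set(rng.sample(patch_ids, n_masked))
    out = [row[:] for row in image]  # copy so the original image is untouched
    for r, c in masked:
        for y in range(r * patch_size, (r + 1) * patch_size):
            for x in range(c * patch_size, (c + 1) * patch_size):
                out[y][x] = 0.0
    return out, masked

def masked_mse(original, reconstruction, masked, patch_size):
    """Mean squared error computed only over pixels inside the hidden patches."""
    total, count = 0.0, 0
    for r, c in masked:
        for y in range(r * patch_size, (r + 1) * patch_size):
            for x in range(c * patch_size, (c + 1) * patch_size):
                total += (original[y][x] - reconstruction[y][x]) ** 2
                count += 1
    return total / count

# Toy 8x8 "image" of ones, cut into 2x2 patches (16 patches), with 75% of them hidden.
image = [[1.0] * 8 for _ in range(8)]
masked_image, hidden = mask_patches(image, patch_size=2, mask_ratio=0.75)

# A model would predict the hidden pixels from the visible ones. Two stand-ins:
perfect_score = masked_mse(image, image, hidden, patch_size=2)       # perfect reconstruction
blank_score = masked_mse(image, masked_image, hidden, patch_size=2)  # predicting zeros everywhere
print(len(hidden), perfect_score, blank_score)
```

A lower score on the hidden patches means a better reconstruction, so a curve of this kind of score against model and dataset size is one way to track the scaling behavior described above.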
Researchers will be able to access Phenom-Beta models at supercomputing scale through an easy-to-use Cloud API, available through the NVIDIA BioNeMo platform. Our most advanced foundation model, known as Phenom-1, is currently in production for our internal teams and close partners.
To get access to Phenom-Beta, please apply for BioNeMo Beta and then sign the Recursion terms & conditions for non-commercial use. If you are interested in using Phenom-Beta commercially, please contact us.