A More Intelligent Filing System Improves Genome Data Management

Anyone who has tried to organize years of digital photographs knows the feeling: storage is full, and there is no easy way through the chaos. Geneticists face a version of this problem on a scale most people cannot imagine. The cost of sequencing a genome has fallen so quickly that researchers are now flooded with data they can barely store, let alone analyze. The tools built to compare genetic sequences were designed for dozens, perhaps hundreds, of genomes. They falter when asked to handle millions.

The field of pangenomics depends on keeping all that data intact. Analyzing complete collections of genomes from a single species captures the full range of evolutionary change, but the storage demands are relentless. Sacrifice detail to save space, and you risk missing the mutation that explains why one viral variant spreads faster than another. Keep everything, and your storage systems give out.

Engineers at the University of California, San Diego, believe they have found a solution. In research published this month in Nature Genetics, a group led by Yatish Turakhia introduces a data structure known as a Pangenome Mutation-Annotated Network, or PanMAN, which compresses genomic data by exploiting shared ancestry. Rather than keeping each genome as a separate file, the system stores a single ancestral sequence and records each mutation once, at the point on the family tree where it first appeared.

**Storing modifications, not replicas**

The innovation lies in treating genomes less as independent documents and more as drafts of the same manuscript. Closely related sequences share most of their lineage, so storing them separately duplicates vast quantities of identical data. PanMAN avoids this by recording what changed and when, preserving the evolutionary history rather than flattening it into static files.
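The idea of storing changes instead of copies can be sketched in a few lines of Python. This is a toy illustration, not the PanMAN format itself: the `Node` class, the `reconstruct` function, and the ten-base root sequence are all hypothetical, and only single-base substitutions are modeled.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One node in a toy mutation-annotated tree (hypothetical layout)."""
    name: str
    # Substitutions that first appear on the branch leading to this node,
    # stored as (0-based position, new base).
    mutations: List[Tuple[int, str]] = field(default_factory=list)
    parent: Optional["Node"] = None


def reconstruct(node: Node, root_sequence: str) -> str:
    """Rebuild a node's full sequence by replaying mutations from the root down."""
    path = []
    current: Optional[Node] = node
    while current is not None:
        path.append(current)
        current = current.parent

    seq = list(root_sequence)
    for ancestor in reversed(path):  # root first, requested node last
        for position, base in ancestor.mutations:
            seq[position] = base
    return "".join(seq)


# The ancestral sequence is stored once; every other node stores only changes.
ROOT_SEQ = "ACGTACGTAC"
root = Node("root")
clade = Node("clade", [(2, "T")], parent=root)        # shared by both samples
sample_a = Node("sample_a", [(7, "A")], parent=clade)  # one private mutation
sample_b = Node("sample_b", [], parent=clade)          # identical to its parent

print(reconstruct(sample_a, ROOT_SEQ))  # ACTTACGAAC
print(reconstruct(sample_b, ROOT_SEQ))  # ACTTACGTAC
```

The mutation shared by both samples is written down once, on their common ancestor, which is where the savings come from as the number of closely related genomes grows.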

To demonstrate the method, the researchers built the largest pangenome ever created for SARS-CoV-2, combining more than eight million publicly available viral sequences. A conventional alignment of that dataset would require an enormous amount of storage. The PanMAN version fit into 366 megabytes, a reduction of more than 3,000-fold.
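As a rough back-of-the-envelope check, the two figures quoted above imply a lower bound on the size of the conventional alignment. The sketch below assumes decimal units (1 TB = 1,000,000 MB) and treats 3,000 as a minimum factor; the function name is made up for illustration, and the result is an inference from the article's numbers, not a figure reported by the researchers.

```python
PANMAN_SIZE_MB = 366          # size quoted in the article
MIN_REDUCTION_FACTOR = 3_000  # "over 3,000 times", so a lower bound


def implied_baseline_tb(compressed_mb: float, factor: float) -> float:
    """Lower bound on the uncompressed size in terabytes (decimal units)."""
    return compressed_mb * factor / 1_000_000


baseline_tb = implied_baseline_tb(PANMAN_SIZE_MB, MIN_REDUCTION_FACTOR)
print(f"Implied baseline: more than about {baseline_tb:.1f} TB")
```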

“Our compression approach using PanMANs enables doing more with less, significantly enhancing the scale and scope of contemporary pangenomic analysis,” Turakhia stated.

The format also handles complex biology. Genes do not always pass neatly from parent to offspring; bacteria exchange DNA laterally, and viruses recombine. PanMAN uses network edges to capture these events, representing complex mutations that simpler tools might miss or discard. In tests across six microbial species, it outperformed existing formats, in some cases by a factor of more than 1,300. Some older software simply crashed when given the SARS-CoV-2 dataset.
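To see why a plain tree is not enough, consider a recombinant virus that inherits the start of its genome from one lineage and the rest from another. The sketch below shows one way such an event could be written as a single edge with two parents. It is an assumption-laden toy: `RecombinationEdge`, `apply_edge`, the lineage names, and the single crossover point are hypothetical, and the article does not describe how PanMAN actually encodes these edges.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RecombinationEdge:
    """Hypothetical network edge linking one child to two parent lineages."""
    left_parent: str   # supplies positions before the crossover
    right_parent: str  # supplies positions from the crossover onward
    crossover: int     # 0-based index of the first base taken from right_parent


def apply_edge(edge: RecombinationEdge, sequences: Dict[str, str]) -> str:
    """Build the recombinant sequence, assuming both parents are aligned and equal in length."""
    left = sequences[edge.left_parent]
    right = sequences[edge.right_parent]
    return left[:edge.crossover] + right[edge.crossover:]


sequences = {"lineage_x": "AAAAAAAAAA", "lineage_y": "CCCCCCCCCC"}
edge = RecombinationEdge("lineage_x", "lineage_y", crossover=4)
recombinant = apply_edge(edge, sequences)
print(recombinant)  # AAAACCCCCC
```

Recording the event as an edge means the recombinant does not have to be stored as a long list of coincidental mutations on a single branch.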

**The human genome is next**

Viruses and bacteria were the testing ground. The team is already targeting a much larger challenge: human genetic data, which dwarfs anything a coronavirus can produce. By embedding metadata such as collection dates and geographic locations directly in the network, researchers could track how a mutation spreads through a population in near real time.

For now, the file sits on a server: 366 megabytes holding eight million viral genomes, waiting for anyone who needs it.

[Source: Nature Genetics](https://doi.org/10.1038/s41588-025-02478-7)
