
q-bio.GN

Genomics

DNA sequencing and assembly; gene and motif finding; RNA editing and alternative splicing; genomic structure and processes (replication, transcription, methylation, etc); mutational processes.

q-bio.GN 2026-07-03

Pipeline annotates mechanisms for 19,293 human genes from papers

by Matteo Di Bernardo, Iain M. Cheeseman

Affinage: genome-scale mechanistic gene annotation from the published literature

Affinage pulls direct experimental evidence to fill gaps where UniProt entries are empty or minimal.

Understanding the mechanistic function of a gene is a critical starting point for biology. However, for much of the human proteome that knowledge is scattered across thousands of primary papers or remains poorly established, while the curated databases biologists rely on can lag years behind recent literature. Large language models can now read and synthesize that literature on demand, but doing so faithfully for many genes is an expensive, non-reproducible retrieval session that does not scale across users. Here, we present Affinage, an LLM pipeline that performs this retrieval and mechanistic reasoning once per gene, from the primary literature alone, and stores the result as a reusable, structured annotation. A biologist-designed reading pass extracts only direct experimental evidence, and a synthesis pass reasons over those findings alone. Applied across the genome, Affinage annotates 19,293 human protein-coding genes. This analysis provides mechanisms for thousands of genes whose UniProt function is empty or a stub, beating the curated reference on 99.1% of head-to-head genes as scored by a cross-family LLM judge. Affinage also delineates the 10% of the proteome that remains mechanistically uncharacterized and will serve as a continuously updated, literature-grounded census of gene function. All records are released openly at https://affinage.wi.mit.edu. More broadly, Affinage serves as an example of how domain experts can encode their expertise into scalable LLM pipelines to improve the publicly available data that guides biological hypotheses and experimentation.
cs.LG 2026-07-02

New model raises CNS tumor classification accuracy from 82% to 86%

by Paulo R. Ferreira Jr., Lucas Coutinho Freitas +5 more

A Novel Machine Learning Approach for Central Nervous System Tumor Classification from DNA Methylation

Sparse projection and logistic regression improve on prior reference on an independent set of 1,104 clinical samples at both class and famil

DNA methylation profiling has become a powerful approach for central nervous system (CNS) tumor classification, yet important challenges remain regarding cross-cohort transferability, methodological correctness, and robust multiclass evaluation. In this work, we propose a novel and methodologically rigorous machine-learning approach for methylation-based CNS tumor classification that combines Sparse Random Projection for dimensionality reduction with multinomial logistic regression for classification. We evaluate the proposed approach in the same general experimental setting established by a widely used reference classifier. On the 2,801-sample reference cohort, our method achieves a mean accuracy of 96% under stratified 3-fold cross-validation. On the independent 1,104-sample clinical evaluation cohort, it reaches 86% accuracy at the 91-class level and 93% when predictions are evaluated at the methylation class family level. These results improve upon the corresponding state-of-the-art reference figures of 82% class-level concordance and 88% family-level concordance, yielding absolute gains of approximately 4 and 5 percentage points, respectively. This improvement is clinically relevant: in a diagnostic setting, a 5-point increase in correct tumor classification can directly affect cancer subtype assignment and, in turn, influence treatment selection and downstream clinical decision-making. Our results show that the proposed model, grounded in stronger methodological practice in machine learning, consistently outperforms the previous state of the art across evaluation settings and can materially improve the reliability of CNS tumor classification.
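As a rough illustration of the dimensionality-reduction step named in the abstract, the sketch below builds a sparse random projection matrix in plain Python. The function names, the `density` parameter, and the toy sizes are assumptions for illustration; the paper's actual configuration and its logistic-regression stage are not reproduced here.

```python
import math
import random


def sparse_projection_matrix(n_features, n_components, density, seed=0):
    """Build a sparse random projection as a list of {column: weight} rows.

    Each entry is +scale or -scale with probability density/2 each, else 0,
    where scale = sqrt(1 / (density * n_components)). Hypothetical helper.
    """
    rng = random.Random(seed)
    scale = math.sqrt(1.0 / (density * n_components))
    matrix = []
    for _ in range(n_components):
        row = {}
        for j in range(n_features):
            u = rng.random()
            if u < density / 2:
                row[j] = scale
            elif u < density:
                row[j] = -scale
        matrix.append(row)
    return matrix


def project(sample, matrix):
    """Project one sample (a list of feature values) into the reduced space."""
    return [sum(w * sample[j] for j, w in row.items()) for row in matrix]
```

The projected vectors would then be passed to a classifier such as multinomial logistic regression; that stage is left out because the abstract gives no details about it.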
q-bio.GN 2026-06-30

Pretraining may not pay off for DNA transformers

by Romain Karpinsky, Julien Mozziconacci +1 more

DNA Language Models: An Assessment of Pre-Training for Fine-Tuning Tasks

Benchmarks compare transformers to convolutional models to quantify gains from pretraining and BPE on fine-tuning tasks

Recent breakthroughs in foundation models and Large Language Models (LLMs) have introduced new opportunities for studying and decoding genomic sequences. Several state-of-the-art approaches, such as DNABERT2, rely on transformer-based architectures, while others, such as ConvNova, still build upon more conventional convolutional models. However, systematic benchmark comparisons across these methods remain scarce. Given that transformer-based models require extensive and costly pretraining, it is crucial to evaluate whether their performance gains justify this overhead. Moreover, LLMs such as DNABERT2 typically rely on Byte Pair Encoding (BPE) tokenization, whose relevance for DNA sequence representation is still debated within the genomics community. In this work, we investigate three key questions: (i) do transformer-based models provide sufficient improvements on fine-tuning tasks upon heavy pretraining, (ii) what is the actual contribution of pretraining in this setting, and (iii) how does BPE tokenization impact performance on genomics-related tasks?
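To make the BPE question concrete, here is a minimal sketch of byte-pair-style merging applied to nucleotide strings. It is a toy, not DNABERT2's tokenizer: the function names, the tie-breaking rule, and the example sequences are all assumptions.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe_merges(sequences, num_merges):
    """Learn merges greedily, starting from single-nucleotide tokens."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken lexicographically (an assumption).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def tokenize(seq, merges):
    """Tokenize a sequence by replaying the learned merges in order."""
    tokens = list(seq)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens
```

With merges learned from `["ACGACGACG", "ACGTT"]`, the sequence `"ACGTACG"` is split into variable-length tokens rather than fixed k-mers, which is the property whose usefulness for DNA the abstract describes as debated.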
eess.IV 2026-06-30

Lightweight module turns H&E slides into molecular pathway predictors

by Dominik Winter, Dominik Vonficht +7 more

Data-Efficient Multimodal Alignment for Histopathology-based Molecular Prediction

Contrastive training on 1720 samples aligns frozen models for 25-fold better gene-set retrieval without new sequencing.

H&E-stained whole-slide images offer cohort-scale availability and rich spatial context but lack molecular specificity, whereas bulk RNA-seq provides transcriptome-wide resolution at high cost with limited archival availability. We show that training a lightweight alignment module atop frozen histopathology and RNA-seq foundation models enables open-vocabulary molecular prompting: querying H&E slides with gene-set signatures to predict pathway activity without sequencing or end-to-end retraining. Using contrastive learning on a multi-cancer cohort (N=1,720), we achieve a 25-fold improvement in retrieval over baseline methods. Systematic analysis reveals a graduated predictability spectrum: morphologically grounded programs (cell-cycle, immune-related) are most reliably predicted (R^2 > 0.5), while predicting pathways with no morphological footprint remains challenging, as expected. We validate clinical utility on the POSEIDON clinical trial: H&E-predicted squamous cell carcinoma scores recapitulate NSCLC subtype identity, and predicted IFN-gamma scores mirror PD-L1 tumor-cell expression groups. Furthermore, gene sets describing immune activation and fibrosis predict known tumor microenvironment archetypes from histology alone. We further validate generalization of our approach across unseen cohorts and demonstrate data-efficient domain adaptation, establishing a slide-native framework for molecular analysis on H&E images.
q-bio.GN 2026-06-29

Hybrid system turns spatial transcriptomics into reproducible SLURM bundles

by Myles Joshua Toledo Tan, Vasco Gerardo Hinostroza Fuentes +5 more

DiSTILL: A Hybrid Cloud-HPC Workflow System for Reproducible Spatial Transcriptomics Analysis

DiSTILL uses cloud registries and a pipeline generator to produce consistent HPC execution packages across restricted environments.

Spatial transcriptomics workflows increasingly combine large annotated data objects, notebook-based analyses, and resource-intensive statistical models that must be executed on high-performance computing (HPC) systems. In practice, these workflows are often difficult to reproduce because configuration, validation, stage execution, and artifact handling are fragmented across ad hoc scripts and manually edited notebooks. We present DiSTILL (Disease Diagnosis from Spatial Transcriptomics via Interpretable Latent Learning), a hybrid cloud-HPC workflow system for reproducible spatial transcriptomics (ST) analysis. DiSTILL combines an application programming interface (API) backend built with FastAPI, a web frontend, a dataset and preset registry, and a Python pipeline generator that materializes run-specific execution bundles and SLURM submission scripts. The system supports local, Secure Shell (SSH)-mediated, and pull-based poller execution modes, enabling HPC submission in environments where persistent API-initiated automation is restricted. We describe the system through the lens of an inflammatory bowel disease (IBD) ST workflow that operationalizes the analytical pipeline of Tan et al. into an auditable application layer. Accordingly, this paper makes a workflow-systems contribution centered on reproducible execution, queue-based orchestration, configuration semantics, and deployment across a split cloud-HPC architecture. The broader application goal of DiSTILL is to support user-supplied datasets that satisfy the schema assumptions of the wrapped analytical pipeline.
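The idea of a pipeline generator that materializes a run-specific bundle can be sketched as follows. This is not DiSTILL's code: the bundle layout, the `run_stage.py` entry point, the config keys, and both function names are hypothetical. Only the `#SBATCH` options and the `%j` job-ID pattern are standard SLURM.

```python
import json
from pathlib import Path


def render_slurm_script(run_id, stage, cpus, mem_gb, time_limit):
    """Render one sbatch script for a single pipeline stage."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={run_id}-{stage}",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={mem_gb}G",
        f"#SBATCH --time={time_limit}",
        f"#SBATCH --output=logs/{stage}-%j.out",
        "set -euo pipefail",
        # Hypothetical stage runner; the real entry point is not given in the abstract.
        f"python run_stage.py --config config.json --stage {stage}",
    ]
    return "\n".join(lines) + "\n"


def materialize_bundle(root, run_id, config, stages):
    """Write a frozen config plus one submission script per stage."""
    bundle = Path(root) / run_id
    (bundle / "logs").mkdir(parents=True, exist_ok=True)
    # Sorted keys keep the serialized config byte-stable across runs.
    (bundle / "config.json").write_text(json.dumps(config, indent=2, sort_keys=True))
    for stage in stages:
        script = render_slurm_script(
            run_id, stage, config["cpus"], config["mem_gb"], config["time_limit"]
        )
        (bundle / f"{stage}.sbatch").write_text(script)
    return bundle
```

Writing the resolved configuration next to the scripts means the bundle alone records what was run, which is one plausible reading of "auditable" here.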
cs.LG 2026-06-29

Recovered expressions rebuild cell graphs for better scRNA-seq clusters

by Jun Tang, Pengwei Hu +4 more

scKDGM: KAN-guided Dynamic Graph Masked Learning for Single-Cell RNA-seq Clustering

scKDGM updates topology from mask-guided recoveries and beats fixed-graph baselines on 12 datasets by NMI and ARI.

Single-cell RNA sequencing (scRNA-seq) clustering is essential for identifying cell types, but high dimensionality, sparsity, dropout, and technical noise hinder robust expression representation and cell graph construction. Existing masked autoencoders mainly use expression recovery for feature reconstruction, while graph clustering methods usually depend on fixed KNN graphs and do not feed recovered expression back into graph optimization. We propose scKDGM, a KAN-guided dynamic graph masked learning framework for scRNA-seq clustering. scKDGM uses graph-aware distribution preserving gene masking (GDP-Mask) to perturb cell identity, a KAN-based TAKGCN encoder to learn masked-view representations, mask-guided expression recovery to construct a dynamic graph, and cross-view contrastive learning to transfer recovery signals into topology updates. A ZINB loss models overdispersion and zero inflation. Experiments on 12 real scRNA-seq datasets show that scKDGM outperforms 10 baselines in average NMI and ARI.
cs.LG 2026-06-29

Two-stage tuning aligns proteins to target amino-acid mixes

by Violeta Basten-Romero, Rubén Muñoz-Tafalla +4 more

Two-Stage Fine-Tuning for Protein Sequence Generation with Targeted Amino-Acid Composition

Fine-tuning shifts average composition; reinforcement learning then enforces exact matches while sequence quality stays intact.

Protein language models are standard priors for biological sequence generation, but steering them toward explicit distributional design targets remains largely unexplored. We study a constrained protein generation problem in which sequences must match a desired amino-acid (AA) composition profile while preserving plausible sequence statistics and diversity. The motivating application is synthetic feed protein design, where the AA composition of dietary proteins directly determines their nutritional value. We propose a two-stage pipeline in which domain-adaptive fine-tuning (FT) on an in-domain protein dataset is followed by iterative reward-weighted FT via reinforcement learning (RL) anchored against the FT model as a frozen reference. We evaluate the pipeline on two AA compositions and find that FT brings the average composition close to the target, while the subsequent RL enforces specific sequence constraints that FT alone cannot satisfy. We additionally evaluate the design choices of the proposed composition reward term against two baselines and an ablated variant, isolate the contribution of each training stage, and verify that AA composition alignment is achieved without degrading sequence quality.
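A composition-matching reward of the kind described could look like the sketch below. The specific form, one minus the total variation distance between observed and target amino-acid frequencies, is an assumption chosen for simplicity; the paper's actual reward term is not specified in the abstract, and the function names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_composition(sequence):
    """Return the fraction of each standard amino acid in a sequence."""
    counts = Counter(sequence)
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {a: counts[a] / total for a in AMINO_ACIDS}


def composition_reward(sequence, target):
    """Score in [0, 1]; 1.0 means the composition matches the target exactly.

    `target` maps residues to desired fractions summing to 1; residues that
    are missing from it are treated as having a target fraction of 0.
    """
    comp = aa_composition(sequence)
    l1 = sum(abs(comp[a] - target.get(a, 0.0)) for a in AMINO_ACIDS)
    return 1.0 - 0.5 * l1  # 0.5 * L1 is the total variation distance
```

For a target of half lysine and half leucine, `"KLKL"` scores 1.0 and `"KKKL"` scores 0.75. A per-sequence score like this is what distinguishes matching each sequence from only matching the average composition.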
q-bio.GN 2026-06-29

scRNA-seq maps human fat-cell development to 15 states

by Weny S. M Sitinjak, Humasak Tommy Argo Simanjuntak

Reconstructing the Developmental Trajectory of Adipocytes in Human Adipose Tissue Using Single-Cell RNA Sequencing

Analysis of adipose tissue finds seven transitional states and names IGF and FGF pathways as the main signals active throughout differentiat

Obesity is a global health crisis associated with metabolic disorders such as type 2 diabetes and cardiovascular disease. This study employed single-cell RNA sequencing to reconstruct the developmental trajectory of human adipocytes from adipose tissue samples. Our analysis identified 15 transcriptionally distinct cell clusters, including 7 transitional states, revealing the dynamic process of adipocyte differentiation. We detected 16 functionally active signaling pathways mediating cellular communication between adipocytes and their progenitors. Among these, insulin-like growth factor (IGF) and fibroblast growth factor (FGF) pathways emerged as the most prominent networks, showing consistent activity across differentiation stages (p<0.05). The study revealed depot-specific differences, with visceral adipocytes undergoing additional extracellular matrix remodeling absent in subcutaneous differentiation. Spatial analysis further showed that IGF signaling was particularly active in perivascular niches, while FGF activity dominated in mature adipocyte zones. These results provide the first comprehensive map of human adipocyte development, highlighting IGF and FGF pathways as potential therapeutic targets. The identified signaling networks offer new insights for developing interventions to promote healthy adipose expansion or inhibit pathological fat accumulation. This work advances our fundamental understanding of adipose tissue biology while providing clinically relevant data for metabolic disorder treatments.
q-bio.GN 2026-06-26

GRAFT dataset links genes to traits in same Arabidopsis plants

by Manuel Serna-Aguilera, Vanshika Jindal +6 more

GRAFT: Biological Graph and Hypergraph Benchmarks for Linked Gene Expression and Phenotypic Trait Prediction in Arabidopsis thaliana

First resource pairs gene expression profiles with heterogeneous phenotypic data from identical specimens for genome-to-phenome mapping.

Understanding which genes control which traits in an organism remains one of the central challenges in biology. Despite significant advances in data collection technology, our ability to map genes to traits is still limited. This genome-to-phenome (G2P) challenge spans several problem domains, including plant breeding, and requires methods capable of reasoning over high-dimensional, heterogeneous, and biologically structured data. Current datasets and data repositories, however, are not well-equipped for this task. Current studies do not link gene expression and trait data, and most focus on very specific traits, limiting the breadth of possible correlations. To address this gap, we present the novel Gene-Graph Regression for Arabidopsis Functional Traits (GRAFT) dataset, a curated multi-modal dataset linking gene expression profiles with phenotypic trait measurements in Arabidopsis thaliana, a model organism in plant biology. GRAFT supports tasks such as phenotype prediction and interpretable graph learning. In addition, we benchmark conventional regression and explanatory baselines, including a biologically-informed hypergraph baseline, to validate gene-trait associations. To the best of our knowledge, this is the first dataset to provide multimodal gene information and heterogeneous trait or phenotype data for the same Arabidopsis thaliana specimens. With GRAFT, we aim to foster research to accurately understand the relationship between genotypes and phenotypes using gene information, higher-order gene pairings, and trait data from multiple sources.
q-bio.GN 2026-06-26

AI agents succeed on 25% of long single-cell biology tasks

by Ian Diks, Zhen Yang +3 more

scBench-Long: Verifiable Benchmarking of Long-Horizon Single-Cell Biology

Benchmark of 21 evaluations shows models rarely recover complex claims from raw sequencing data without prescribed methods.

Single-cell studies require analysts to convert raw measurements into specific biological claims through multi-step workflows and integration of metadata, assay context, and auxiliary evidence. Existing AI-biology benchmarks largely measure broad knowledge, executable workflows, or local analysis steps. We introduce scBench-Long, a benchmark for long-horizon single-cell biology in which agents must recover scientific conclusions from raw or near-raw data without prescribed methods. The benchmark contains 21 evaluations spanning melanoma CD8 T-cell reactivity, CD8 RNA+ATAC regulatory inference, human-monkey chimera development, KRAS-driven lung tumor aging, and lethal COVID-19 lung pathology. Tasks cover paired scRNA/TCR sequencing, RNA and chromatin profiling, cross-species transcriptomics, combinatorial scRNA-seq, single-nucleus RNA-seq, immune repertoires, ortholog maps, ligand-receptor resources, and validation evidence. Candidate claims are reproduced, reviewed, and converted into controlled answer vocabularies with deterministic grading and trajectory rubrics. Across 1,068 completed trajectories, the strongest model-harness pair passes 16/63 runs (25.4%). scBench-Long evaluates whether agents can move beyond local analysis steps and make complex scientific claims that are supported by single-cell data.
cs.AR 2026-06-26

GRAINS runs genome graphs inside SSDs for up to 47.8x speedup

by Nika Mansouri Ghiasi, Harun Mustafa +9 more

GRAINS: Storage-Aware Algorithm-Architecture Co-Design Enabling High-Performance and Low-Cost Graph-Based Genome Analysis

Storage-aware batching and repurposed flash scheduling cut data movement that dominates large genomic graph analysis.

Graph-based representations of genome sequences have emerged as a powerful approach for representing massive genomic databases in an expressive and efficient way. Despite their benefits, analysis on large-scale genome graphs incurs significant data movement overhead from the storage system due to accessing large amounts of low-reuse data. Processing data directly inside the storage device can be a fundamental solution for mitigating this overhead. However, none of the existing tools for graph-based genome analysis can be efficiently used inside the storage system due to the limited internal hardware resources in modern SSDs. At the same time, prior storage-centric systems developed for (i) traditional, linear non-graph-based genome analysis or (ii) conventional, non-genomic graph analysis are not suitable for the unique data structures and access patterns of graph-based genome analysis. We propose GRAINS, the first system for analysis with large-scale genome graphs in storage. Through our detailed examination of typical analysis pipelines that operate on genome graphs, we perform storage-aware algorithm-architecture co-design to (i) make these pipelines more storage-friendly and (ii) further improve performance, energy efficiency, and cost via in-storage and in-flash processing. GRAINS's co-design is based on three key aspects. First, we propose a new batching and execution flow, based on unique features of genome graphs. Second, via in-flash and in-storage processing, we avoid transferring low-reuse flash pages. Third, to leverage the full parallelism of flash dies, we design an effective, yet lightweight, scheduling technique, enabled by repurposing the existing SSD structures. GRAINS provides 2.7x-47.8x speedup (4.4x-31.6x energy reduction) over the state-of-the-art software baselines, and 1.5x-17.0x speedup (3.1x-20.7x energy reduction) over a hardware-accelerated baseline.
q-bio.GN 2026-06-24

Agentic workflow hits 66% top-1 recall on birth defect variants

by Shiyu Li, Ziqi Yan +8 more

DeepBD: A Grounded Agentic Workflow for Variant Prioritization and Diagnosis of Genetic Birth Defects

DeepBD layers rule evidence, mechanistic context and specialist modules to beat Exomiser on 18k-case internal benchmark

Birth defects are a major cause of fetal loss, neonatal morbidity and long-term disability. In the subset with suspected genetic etiologies, exome and genome sequencing have moved many cases from variant detection to post-sequencing interpretation: clinicians must rank patient-specific candidate variants under incomplete fetal or infant phenotypes and heterogeneous evidence from population genetics, variant-effect prediction, gene-disease validity, phenotype ontologies, cellular and pathway context, protein structure and clinical literature. We present DeepBD, a grounded agentic workflow for variant prioritization and diagnostic interpretation of genetic birth defects. DeepBD organizes the workflow into LLM-assisted case structuring, a pretrained evidence engine, specialist evidence modules and a grounded diagnostic review layer. The evidence engine learns patient-specific variant scores from structured rule evidence, sequence and variant-effect representations and phenotype-conditioned biological context, whereas specialist modules and the agentic layer provide tool-based refinement, candidate-pool review and diagnosis-oriented synthesis from ranked candidates. Developed using an in-house fetal and infant cohort comprising 18,622 cases, DeepBD achieved Recall@1/3/5/10 of 0.658/0.882/0.912/0.929 on an internal held-out solved-case benchmark, outperforming standalone Exomiser, DeepRare and prompted LLM reranking baselines evaluated on Exomiser-derived top-20 candidate variants. Ablation and overlap analyses show that rule evidence, mechanistic context, and specialist refinement provide complementary signals. These findings support a grounded agentic workflow that separates evidence integration, tool-based refinement, and LLM-assisted diagnostic review for retrospective variant prioritization in genetic birth defects.
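For readers unfamiliar with the Recall@k figures quoted above, the metric can be computed as in this short sketch. The data layout (dictionaries keyed by case ID) and the function name are assumptions for illustration, not DeepBD's evaluation code.

```python
def recall_at_k(ranked_candidates_per_case, causal_variant_per_case, k):
    """Fraction of solved cases whose causal variant is ranked in the top k.

    ranked_candidates_per_case: {case_id: [variant, ...]} in rank order.
    causal_variant_per_case:    {case_id: variant} for solved cases only.
    """
    if not causal_variant_per_case:
        raise ValueError("at least one solved case is required")
    hits = 0
    for case_id, causal in causal_variant_per_case.items():
        top_k = ranked_candidates_per_case.get(case_id, [])[:k]
        if causal in top_k:
            hits += 1
    return hits / len(causal_variant_per_case)
```

Under this definition, a Recall@1 of 0.658 means the causal variant was ranked first in about 66% of held-out solved cases.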
cs.DC 2026-06-24

cuSBF hits 9x throughput on genomic Bloom filters

by Tim Dortmann, Markus Vieth +1 more

cuSBF: A Minimizer-Aware Bloom Filter for Genomic Sequence Data on Modern GPUs

Minimizer-guided shards and warp reductions sustain 85 percent GPU utilization even for out-of-cache sequence filters.

Efficient genomic k-mer indexing depends on approximate membership query (AMQ) structures that must deliver high throughput, low false-positive rates (FPR), and modest memory footprints. The Super Bloom filter (SBF) is attractive for this scenario because minimizer-guided sharding and the Findere scheme exploit the redundancy of overlapping k-mers. However, those same features cause high per-k-mer compute cost, severe register pressure, and irregular memory accesses, which hinder an effective GPU implementation. We present cuSBF, an open-source, header-only CUDA library that implements SBF for sequence-native workloads. cuSBF's design merges sectorized shards, cooperative shared-memory tiling, warp-level shard sharing, and segmented warp reductions, turning super-k-mer locality into scalable GPU parallelism. Across real genomic workloads on RTX PRO 6000 Blackwell and GH200 systems, cuSBF achieves the highest throughput among all evaluated sequence-capable baselines. On the RTX PRO 6000, it surpasses the cuCollections blocked Bloom filter baseline by up to 9.1x for insertion and 7.7x for query, while reaching up to 92x and 234x speedups over the multi-threaded CPU Super Bloom reference implementation. It also outperforms GPU-based dynamic AMQs (Cuckoo, Two-Choice, Quotient filters) by 1.5-3400x depending on workload characteristics. A parameter sweep identifies (s = 28, m = 16, H = 4) as Pareto-optimal for k = 31, yielding significantly lower FPR than cuCollections at matched memory budgets. Crucially, cuSBF's architecture-aware design sustains 85% streaming multiprocessor utilization even for out-of-cache filters, proving that sequence locality, not raw bandwidth, is the key to GPU-accelerated genomic indexing.
q-bio.GN 2026-06-23

Biological context maps unseen genes into response basis

by Sajib Acharjee Dip, Liqing Zhang

Stable-Shift: Biologically Structured Prediction of Transcriptional Responses to Unseen Gene Perturbations

Low-rank basis from training perturbations plus graph convolution on interactions yields higher similarity than prior methods on K562 data

Predicting transcriptional responses to genetic perturbations could reduce the experimental burden of functional genomics, but extrapolation to genes that were never perturbed during training remains difficult. We present Stable-Shift, a structured method for estimating unseen-gene responses. Stable-Shift aggregates single-cell measurements into perturbation-level expression shifts, fits a low-rank response basis using training perturbations only, and predicts an unseen gene's coordinates in that basis from biological context. The context combines STRING interactions, network structure, control-cell expression statistics, and Gene Ontology annotations; the evaluated implementation uses graph convolution to integrate these inputs. On the supplied K562 Perturb-seq benchmark, Stable-Shift obtained 0.592 cosine similarity, compared with 0.569 for GEARS, together with higher Spearman correlation and top-gene precision among the evaluated methods. Its mean cosine similarity over five unseen-gene splits was 0.589 +/- 0.008. The same ordering was observed in the supplied graph-aware, residualized, gene-space, and Norman-dataset comparisons. These results support further study of biologically structured latent-response prediction, while the lower gene-space accuracy and sensitivity to sparse graph neighborhoods limit the scope of the present conclusions.
q-bio.GN 2026-06-23

Federated SVD merge recovers immune programs across institutions

by Axel Faes, Stephanie M. van den Berg +1 more

Privacy-preserving federated tensor decomposition of single-cell immune data: recovering multicellular programs across institutions

Global-mean centering produces results equivalent to centralized tensor decomposition on multi-site lupus and COVID atlases.

Tensor decomposition of donor × cell-type × gene single-cell data recovers multicellular programs: coordinated axes of inter-individual transcriptional variation that span cell types and stratify disease. Yet immune single-cell atlases are increasingly multi-institution, multi-ancestry, and governed, so patient cells often cannot be pooled. We present a federated estimator: each site computes a local program subspace, and a coordinator merges these by stacked SVD under federated global-mean centering, provably equivalent (up to truncation) to the centralised decomposition. This centering makes the merge robust to site-label confounding (program AUC 0.957 vs. 0.861 for naive per-site centering). Only program subspaces leave a site, and aggregation is compatible with secure aggregation. On a 261-donor systemic lupus erythematosus atlas it recovers the canonical interferon program (ISG enrichment AUC 0.998; case-control separation 0.958; bootstrap ΔAUC = -0.000, 95% CI [-0.004, +0.012] vs. centralised), across institution-scale and multi-ancestry partitions, and across three real COVID-19 sites (subspace correlation 0.989). It recovers the program when no site observes all cell types (correlation 1.000, exact by construction), which fixed-feature federated PCA cannot. On an interstitial-lung-disease atlas the recovered program predicts disease better than the best single cell type (AUC 0.96 vs. 0.91; gap 95% CI excludes zero) and the advantage survives federation; a liver cohort is consistent (p = 0.005). Membership-inference shows secure aggregation cuts attack AUC from 0.91 to 0.61. The method enables cross-institution, cross-ancestry recovery of multicellular immune programs without sharing cells.
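The centering step can be sketched without any linear-algebra library. In the version below, each site shares only a row count and per-feature sums, and a coordinator derives the global mean that every site then subtracts. This is one plausible way to realize global-mean centering, stated as an assumption; the stacked-SVD merge and secure aggregation are omitted, and the function names are hypothetical.

```python
def site_summary(rows):
    """Per-site aggregate: (number of rows, per-feature sums)."""
    return len(rows), [sum(col) for col in zip(*rows)]


def global_mean(summaries):
    """Coordinator step: combine site aggregates into one global mean."""
    total_n = sum(n for n, _ in summaries)
    dim = len(summaries[0][1])
    return [sum(sums[j] for _, sums in summaries) / total_n for j in range(dim)]


def center(rows, mean):
    """Site step: subtract the shared global mean from local rows."""
    return [[x - m for x, m in zip(row, mean)] for row in rows]
```

Centering each site on its own mean instead would remove genuine between-site differences along with site effects, which is consistent with the abstract's report that naive per-site centering performs worse.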
cs.CV 2026-06-22

Omics data forms testable hypothesis for WSI region retrieval

by Xiangyu Li, Ran Su

HERO: Hypothesis-Driven Evidence Retrieval from Omics for Multi-Task Breast Cancer Analysis

HERO converts methylation and miRNA into an intent vector that selects and verifies morphology regions, setting new SOTA on five tasks.

Matched multi-omics can improve WSI-based biomarker and prognosis prediction, but most existing pipelines use omics as a parallel feature stream or textual context rather than as an explicit retrieval constraint. HERO asks whether observed omics can be a testable morphology hypothesis: a sparse pathway-to-morphology prior maps DNA methylation and miRNA into a K-dimensional intent vector m (K=16), TF-IDF retrieval over structured 10 captions selects endpoint-relevant regions, and a cosine gate c = cos(m, v) triggers deterministic deficit-driven repair when c < τ_c. This closed-loop design bounds VLM calls, reduces reliance on embedding-based semantic matching, and makes every retrieval and verification step lexically auditable. On TCGA-BRCA (930 WSIs, patient-level 5-fold CV), HERO sets a new state of the art across ER, PR, HER2, subtype, and risk prediction, outperforming both multimodal fusion and VLM-based baselines.
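The cosine gate is simple enough to write out. In this sketch `m` is the intent vector and `v` the evidence vector from the retrieved regions, following the abstract's notation; the function names, the zero-vector convention, and any concrete threshold value are assumptions.

```python
import math


def cosine(m, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(m, v))
    norm = math.sqrt(sum(a * a for a in m)) * math.sqrt(sum(b * b for b in v))
    # Returning 0.0 for a zero vector is a convention chosen for this sketch.
    return dot / norm if norm else 0.0


def needs_repair(m, v, tau_c):
    """Gate: trigger the repair step when agreement falls below tau_c."""
    return cosine(m, v) < tau_c
```

Because the gate is a deterministic comparison against a fixed threshold, the decision to trigger repair can be logged and replayed exactly.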
q-bio.GN 2026-06-19

Confidential Beacon queries run on fully homomorphic EVM

by Christos Galanopoulos, Kimon Antonios Provatas +1 more

bioETH-Beacon: A Confidential On-Chain Genomic Beacon with Encrypted Counts, Filters, and Bounded Noise over a Fully Homomorphic EVM

Prototype executes encrypted variant counts and filters, releasing answers only to ACL-named requesters without trusted compute.

The Global Alliance for Genomics and Health (GA4GH) Beacon protocol lets researchers ask whether a genomic variant has been observed in a participating cohort and receive aggregate variant-level counts. As Beacon networks grow, two privacy risks remain: host institutions can see plaintext queries, and repeated rare-variant queries can support membership-inference attacks. We present bioETH-Beacon, a smart-contract prototype that runs the Beacon "aggregate count" query over encrypted data on a fully homomorphic Ethereum Virtual Machine (fhEVM). Hospitals upload encrypted marker-count entries, authorized researchers submit encrypted marker queries, and the contract returns an encrypted answer that is released, via an off-chain key-management service, only to the requester named in the contract's on-chain ACL. The design is organized as a 3x4 tier-by-query-family grid spanning genotype, sex, age, and phenotype queries, with tiers that trade stronger confidentiality for lower query cost. For genotype paths, the prototype can add bounded on-chain noise to mitigate probing attacks. Experiments on synthetic panels derived from a Polygenic Score (PGS) catalog show the expected scaling behavior and demonstrate that pre-aggregation can substantially reduce query gas when public marker presence is an acceptable trade-off. Overall, bioETH-Beacon provides a research prototype for confidential Beacon-style genomic querying without a trusted compute evaluator.
cs.LG 2026-06-18

Siamese graph transformer improves single-cell RNA clustering accuracy

by Jinke Wu, Yifan Wang +6 more

scGTN: Deep Siamese Graph Transformer Network for Single-cell RNA Sequencing Clustering

Dual augmented views and shortest-path distances in the network capture cell relationships missed by existing methods.

Single-cell RNA sequencing (scRNA-seq) serves a pivotal role in characterizing gene expression at the cellular level, enabling the identification of cell types and advancing the understanding of cellular heterogeneity. Despite the significant progress in scRNA-seq data clustering, we argue that current methods always ignore the sparsity and noise, as well as the complex intercellular structural information inherent in scRNA-seq data. Toward this end, in this paper, we propose a novel single-cell RNA-seq clustering framework via deep Siamese Graph Transformer Network (termed scGTN), which explicitly integrates gene expression profile and intercellular structural dependencies for cell clustering. In particular, we formulate scRNA-seq data as a graph and construct two augmented graph views that serve as dual views to capture complementary intercellular information. Then, a Siamese graph transformer network is employed to explicitly incorporate shortest-path information and node-wise distances for capturing richer structural relationships between cells. Finally, we employ an optimal transport strategy to guide the cell clustering in a self-supervised manner. Extensive experiments on multiple benchmark scRNA-seq datasets demonstrate that our scGTN consistently outperforms existing methods. Our code is available at https://github.com/W-RMSL/scGTN.
q-bio.GN 2026-06-17

Tool builds standard TSV matrix from ATAC-seq peaks

by Saroja Somasundaram, Nelson J. Johansen +2 more

PyPeakRankR: Reproducible Peak-Level Feature Extraction for Regulatory Element Ranking

PyPeakRankR extracts signal, GC, conservation and specificity features into one reproducible file that decouples extraction from ranking.

High-throughput chromatin accessibility assays such as ATAC-seq generate thousands of candidate regulatory elements (peaks), yet no standardized tool exists for assembling the diverse quantitative features needed to prioritize peaks for functional validation. Here we present PyPeakRankR, an open-source Python package that extracts peak-level features, namely BigWig signal summaries, GC content, PhyloP conservation scores, distribution moments (kurtosis, skewness, bimodality), and cell-type specificity rankings, into a single reproducible peak by feature matrix stored as a tab-separated values (TSV) file. PyPeakRankR separates deterministic feature extraction from downstream ranking, enabling transparent benchmarking of prioritization strategies on the same upstream data. The package provides both a command-line interface and a matching Python API, supports cross-assembly scoring via liftOver, and runs in minutes on thousands of peaks. PyPeakRankR was validated in the Brain Initiative Cell Census Network (BICCN) community challenge, where its predecessor PeakRankR ranked among the top 3 of 16 methods for cell-type specific enhancer prediction. In a recent basal ganglia study, PyPeakRankR was used within the Cross-species Enhancer Ranking Pipeline (CERP) to identify enhancer-AAV tools achieving greater than 70% on-target specificity across cell types. PyPeakRankR is freely available under the MIT license at https://github.com/AllenInstitute/PeakRankR/tree/python-package.
cs.AI 2026-06-12

Genomic profile sets Bayesian prior to separate nature from nurture

by Aruna Dey, Suraj Biswas

Is It You or Your Environment? A Bayesian Inference Framework for Genomically-Anchored Personalized Physiological Interpretation

Fixed genetic anchor distinguishes constitutional from environmental effects from the first measurement onward

Personalized health AI systems face a fundamental cold-start problem: machine learning models for physiological interpretation require weeks of individual behavioral data before they can distinguish constitutional variation from environmentally driven deviation. We propose a solution grounded in causal inference and Bayesian prior design. An individual's genomic profile serves as an exogenous genetic anchor -- a domain-informed, personalized prior that is fixed at conception, immune to reverse causation, and available before a single behavioral observation is collected. The anchor initializes a Bayesian belief state over an individual's physiological set point G-hat = mu + sum(beta_i * g_i), where beta_i are GWAS-derived effect sizes and g_i are risk-allele counts. Each incoming physiological measurement P produces a non-constitutional deviation delta = P - G-hat that separates the signal attributable to environment and state from the constitutionally fixed baseline. As behavioral data accrue, the prior decays according to G-hat_t = w(t)*G-hat_genomic + [1-w(t)]*P-bar_t, transitioning from genome-dominated to empirical-baseline-dominated inference. The same observed HRV of 55 ms generates a suppression hypothesis for a person whose prior predicts 80 ms, and an enhancement hypothesis for a person whose prior predicts 30 ms -- a reversal impossible without a personalized anchor. We develop this architecture across six physiological domains, grading genomic priors by evidence strength, distinguishing robustly replicated anchors (FTO, FADS1/2, FKBP5) from contested candidate genes (SLC6A4, MAOA, DRD2). We address the inference boundary between association, Mendelian randomization, and individual token causation, and define four constraints for deployment: evidence-graded priors, dynamic decay, ancestry-matched effect sizes, and attribution rather than deterministic output.
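The formulas in this abstract translate directly into a short sketch. The set point, deviation, and blending rules follow the abstract; the exponential form of w(t), the `half_life` parameter, and all function names are assumptions made for illustration, since the abstract does not specify how the weight decays.

```python
def genomic_set_point(mu, betas, genotypes):
    """G-hat = mu + sum(beta_i * g_i), with g_i the risk-allele counts."""
    return mu + sum(b * g for b, g in zip(betas, genotypes))


def deviation(p, g_hat):
    """delta = P - G-hat: the non-constitutional part of a measurement."""
    return p - g_hat


def decay_weight(t, half_life=30.0):
    """ASSUMED decay schedule: w(t) halves every `half_life` days."""
    return 0.5 ** (t / half_life)


def blended_set_point(t, g_hat_genomic, p_bar_t, half_life=30.0):
    """G-hat_t = w(t) * G-hat_genomic + (1 - w(t)) * P-bar_t."""
    w = decay_weight(t, half_life)
    return w * g_hat_genomic + (1.0 - w) * p_bar_t


# The HRV example from the abstract: the same 55 ms reading, two priors.
delta_high_prior = deviation(55.0, 80.0)  # negative: suppression hypothesis
delta_low_prior = deviation(55.0, 30.0)   # positive: enhancement hypothesis
```

With these definitions the sign of the deviation flips between the two individuals, which is the reversal the abstract describes. At t = 0 the blended set point equals the genomic prior, and it moves toward the empirical mean as t grows.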
q-bio.QM 2026-06-12

Vanilla Transformer reaches SOTA on cell perturbation prediction

by Danning Jiang, Zheming An +2 more

OCOO-T : A Simple and Scalable Virtual Cell Model for Transcriptional Perturbation Response Prediction

OCOO-T conditions a standard denoising model with layer normalization and tokens to handle long gene profiles without extra encoders.

Predicting single-cell transcriptional responses to genetic, chemical and cytokine perturbations is a fundamental challenge in computational biology and AI Virtual Cell (AIVC) modeling, with direct implications for drug discovery and the elucidation of gene regulatory networks. Existing approaches often rely on auxiliary cell-state encoders, hierarchical variational autoencoders, dedicated Transformer encoder-decoder modules, or gene-interaction priors to compress high-dimensional expression profiles into latent representations. While effective, these designs increase architectural complexity and may limit scalability and generalizability. This paper introduces OCOO-T, a minimalist flow-matching-based AIVC model for transcriptional perturbation response prediction. OCOO-T utilizes a vanilla Transformer stack that operates directly on continuous gene expression profiles and formulates perturbation response prediction as a continuous-time denoising process. Perturbation embeddings, dosage information, and cell-line/cell-type specificity are integrated through adaptive layer normalization and in-context tokens. Comprehensive evaluations on Tahoe100M, Replogle, and PBMC benchmarks demonstrate that OCOO-T achieves state-of-the-art performance across diverse perturbations and cell types while effectively scaling to long transcriptional profiles through patching and depatching of cellular contexts. By leveraging the simplicity of Transformer-based denoising for single-cell omics, OCOO-T provides an effective and scalable framework for in-silico cellular simulation.
q-bio.GN 2026-06-11

m6A-FORM predicts m6A sites at PR-AUC 0.635

by Tinghe Zhang, Sumin Jo +2 more

m6A-FORM: A Foundation Model for Decoding N6-methyladenosine Biology

Pretrained on 22 million peak-derived sequences, the transformer model also supports regulator binding prediction and identifies tissue-cons

N6-methyladenosine (m6A) is the most abundant internal modification in eukaryotic mRNA. However, most existing predictors use adenosine-centered formulations that are computationally inefficient and prone to false positives. Here we present m6A-FORM, a transformer-based foundation model for RNA methylation that uses MeRIP-seq peaks as methylation-enriched priors and is pretrained on approximately 22 million peak-derived sequences from 143 human MeRIP-seq studies. After fine-tuning with high-confidence single-nucleotide m6A annotations from m6A-Atlas v2.0 and GLORI, m6A-FORM-sites achieves state-of-the-art m6A site prediction performance, with a PR-AUC of 0.635 and ROC-AUC of 0.988, improving PR-AUC by at least 0.14 over existing methods while enabling substantially faster inference. Task-specific adaptation further supports prediction of binding sites for 19 m6A-associated regulators and identification of YTHDF2-bound m6A sites associated with mRNA degradation. Applying m6A-FORM across 67 datasets from 24 human tissues identifies 19,631 tissue-conserved sites with distinct localization, clustering, methylation, expression, RBP-interaction, and decay-associated signatures.
cs.LG 2026-06-10

Single-timepoint NGS fails to predict osimertinib resistance above chance

by Abhijoy Sarkar, Aarchi Singh Thakur

OncoTraj: a public benchmark for longitudinal resistance prediction in EGFR-mutant non-small-cell lung cancer on osimertinib

Benchmark of 813 patients shows every model class hits the same ceiling, pointing to serial ctDNA as the missing ingredient.

Resistance to first-line osimertinib in EGFR-mutant non-small-cell lung cancer (NSCLC) is the canonical example of predictable clonal evolution under therapeutic pressure, yet no public benchmark exists for training or evaluating computational models on the corresponding longitudinal patient trajectories. We introduce OncoTraj, a public benchmark of 813 EGFR-mutant NSCLC patients receiving first-line osimertinib, harmonized from three real-world clinical-genomic sources: MSK-CHORD (672 patients), AACR Project GENIE BPC NSCLC (34 patients), and the FLAURA molecular-resistance supplement (107 patients). OncoTraj defines three locked tasks: (A) binary classification of progression by a fixed 12-month landmark, (B) regression of time-to-first-progression in days, and (C) six-class classification of the dominant resistance mechanism. We release the harmonized dataset, patient-level train/validation/test splits with an audited no-leakage guarantee, an open-source evaluation harness, and six reference baselines spanning a majority-class predictor, logistic regression, random forest, XGBoost, an LSTM, and a multi-task transformer. With v1's single-timepoint snapshot features, no task clears chance on clean within-source evaluation: the uniformity of this ceiling across every model class localizes the limit to the input modality (single-snapshot tissue NGS rather than serial ctDNA), not the algorithm. The benchmark does recover a reproducible literature-consistent association: TP53 co-mutation raises the 12-month progression rate from 29% to 59% cohort-wide. OncoTraj establishes a reproducible, leakage-audited baseline and converts the modality limit into concrete design requirements for a serial-ctDNA-enriched v2.
q-bio.GN 2026-06-10

Motif-distance distributions rank centromere assemblies

by Luca Franco, Matteo Migliarini +7 more

A mathematical framework for centromere-aware evaluation of human genome assemblies

KL divergence on inter-motif distances benchmarks T2T genomes where sequence alignment fails, producing per-chromosome scores.

Accurate evaluation of genome assemblies within highly repetitive regions, such as centromeres, remains a major open challenge in genomics. Conventional benchmarking relies on sequence alignment, which becomes problematic in regions of high homogeneity and divergence. Here, we frame centromere assembly evaluation as a comparative distribution problem in a compact centeny representation by computing genomic distances between functional motifs, rather than relying on nucleotide sequence. Our distribution-based metric assesses agreement between a query and a target chromosome by comparing the distributions of their centromeric inter-motif distances using KL divergence. When applied genome-wide to currently available human telomere-to-telomere (T2T) genomes, this approach yields an accuracy ranking for the entire assembly and for each individual chromosome. Altogether, we present a rapid and robust scoring system, based on a numerical rendering of genomes as inter-motif distances, that provides a quantitative standard of assembly integrity in repetitive DNA regions and establishes a bona fide framework for chromosome-level genome-to-genome comparison.
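A minimal sketch of the idea of scoring two chromosomes by the KL divergence between their inter-motif distance distributions. The fixed-width binning, the pseudocount, the direction of the divergence, and all names are assumptions; the abstract does not describe how the authors build or compare the distributions.

```python
import math


def inter_motif_distances(positions):
    """Gaps between consecutive motif start positions on one chromosome."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def histogram(distances, bin_width, n_bins, pseudocount=1.0):
    """Normalized fixed-width histogram; the last bin collects the overflow.

    The pseudocount keeps every bin nonzero so the KL divergence is finite.
    """
    counts = [pseudocount] * n_bins
    for d in distances:
        idx = min(int(d // bin_width), n_bins - 1)
        counts[idx] += 1.0
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical motif positions for a target and a query chromosome.
target = histogram(inter_motif_distances([0, 171, 342, 513]), 100, 4)
query = histogram(inter_motif_distances([0, 171, 520, 860]), 100, 4)
score = kl_divergence(query, target)
```

A score of zero means the two binned distributions are identical, and larger values indicate greater disagreement. KL divergence is asymmetric, so which assembly plays the role of the reference matters.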
q-bio.GN 2026-06-09

Regulatory priors in attention lift scRNA-seq cell classification

by Mikele Milia, Louis Fabrice Tshimanga +3 more

Integrating gene regulatory priors into Transformer attention with scTransformer for interpretable scRNA-seq analysis

scTransformer shows higher accuracy and attention weights that match known gene regulations by limiting flow to established structures.

Motivation: Transformer-based models are increasingly applied to large-scale single-cell transcriptomics, showing strong performance through self-supervised learning on millions of cells. However, most existing approaches treat genes as independent features, and largely ignore prior biological knowledge, which limits interpretability and robustness. In this paper, we explore whether explicitly incorporating gene regulatory information can improve both model performance and biological insight. Results: We present scTransformer, the first Transformer-based approach that builds a priori knowledge of biological mechanisms into the model's attention patterns. By constraining information flow according to known regulatory structures, the model learns representations that are more biologically meaningful. We evaluate scTransformer on a disease-relevant single-nucleus RNA-seq dataset using supervised cell-type classification. Compared to standard Transformers, our approach improves classification accuracy, enhances separation of cell types in embedding space, and produces attention patterns consistent with known regulatory programs. Overall, our results demonstrate that embedding biological structure into Transformer models can enhance interpretability without sacrificing performance, offering a principled step toward biologically grounded foundation models for single-cell omics.
q-bio.GN 2026-06-08

Disentanglement separates cell state from neighbors for tissue counterfactuals

by Abdul Moeed, Stefan Schrod +5 more

Querying Counterfactuals on Tissue Graphs with Supervised Disentanglement

A framework predicts how expression changes when cell connections or neighbor profiles are altered, tested on 2.5 million cells in cancer an

Tissue graph counterfactuals ask how a cell's expression would change under altered spatial neighbor contexts. Such queries are central to predicting cell behavior in tissues, but lack a unified definition, with existing methods targeting specific intervention types or treating cells as i.i.d. In this work, we first formalize tissue graph counterfactuals as a class of spatial interventions that either rewire connections between cells (edge perturbation) or modify the expression of their neighbors (node perturbation). We then introduce Cellina (https://cellina.readthedocs.io) - a framework that uses supervised disentanglement to decompose a cell's intrinsic state from its spatial context, using the latter as a conditioning input for counterfactual predictions. Across benchmarks spanning over 2.5 million spatially-resolved cells in colorectal cancer and mouse brain, Cellina outperforms spatially-informed and non-spatial competitors in in-silico graph perturbations, disentanglement, and scalability. Additionally, we show that Cellina reveals biologically distinct cancer subdomains in an unsupervised manner and enables targeted neighbor perturbation simulations.
q-bio.GN 2026-06-08

Biological reasoning boosts LLM accuracy on regulatory DNA

by Yi Duan, Zhao Yang +4 more

Biological Reasoning-Informed Regression for Interpretable Regulatory DNA Activity Prediction

Two-stage training on structured sequences and mechanistic traces yields state-of-the-art enhancer predictions with explanations.

DNA cis-regulatory elements (CREs) such as enhancers control gene expression levels. Accurately predicting regulatory activity from DNA sequences is valuable but challenging, as it requires understanding complex biological regulatory processes. Existing methods typically regress activity scores from sequences in a black-box manner, limiting both interpretability and regression performance. Meanwhile, large language models (LLMs) benefit from explicit reasoning processes, yet directly applying LLMs to raw DNA sequences performs poorly. In this paper, we bridge this gap by introducing R3LM, a framework that teaches LLMs reasoning-informed regression on regulatory DNA through structured biological knowledge. Specifically, we design a biologically grounded data format that structures DNA's regulatory information for improved LLM understanding, and construct CRE-ReasonBench, the first dataset that associates DNA sequences and activity scores with mechanistic reasoning traces. Through two-stage training that first teaches LLMs reasoning over structured biological information and then performs regression, R3LM achieves state-of-the-art performance on enhancer prediction across three cell types, outperforming both LLMs with raw sequence input and specialized DNA models while providing interpretable mechanistic explanations. We expect R3LM to serve as an interpretable reward model that can effectively assist biologists in CRE design. Code is available at https://github.com/DuanYi516/R3LM.
q-bio.GN 2026-06-08

Neural networks classify palimpsests from mtGenome data in old parchment

by James B. Harr III, Madelin E. Blong +3 more

From Genomes to Algorithms: Neural Network Applications for Palimpsest Detection in Medieval Manuscripts

Sequencing shows similar coverage in single-use and reused folios, yet models achieve high precision on one 14th-century manuscript.

Biocodicology, the study of biological information preserved in manuscripts, offers new opportunities to examine parchment as both a textual and biological artefact. This study applies non-destructive sampling to isolate and sequence mitochondrial genomes (mtGenomes) from a 14th-century manuscript, Ms. Codex 1629, which contains both single-use and palimpsested folios. We sought to evaluate whether palimpsest preparation, including chemical washing, compromised DNA integrity and whether computational methods could aid in identifying reused parchment. DNA sequencing revealed that both single-use and palimpsested parchments retained sufficient mtGenomes for analysis, with no significant differences in genome coverage or depth. To assess the potential of computational biology in manuscript studies, we implemented machine learning classifiers, including logistic regression and neural networks, to distinguish palimpsests from single-use folios. Models achieved high precision but exhibited reduced recall for the minority palimpsest class, reflecting dataset imbalance. While additional ancient mtGenome samples from palimpsests are required and further testing is needed, this study demonstrates how integrating molecular biology and neural networks highlights new approaches for palimpsest detection and underscores the evolving role of data science in biocodicology.
cs.CL 2026-06-08

Diagnostic splits predictability from regulation in DNA models

by Chahat Baranwal, Aadtya Baranwal +1 more

The Dark Regulome: Disentangling Predictability from Regulation in Genomic Foundation Models

Zero overlap in top-100 lists across three models, a 10kb local horizon, and 3.3x eQTL enrichment survive all controls.

High-grade gliomas integrate into neural circuits through functional synapses with neurons, raising the question of which noncoding elements shape synaptogenic gene expression in tumor cells. The regulatory program written across the dark genome, what we call the $\textit{dark regulome}$, is the natural substrate to probe, and sequence foundation models offer a zero-shot route through in-silico mutagenesis (ISM); yet likelihood-based scoring is tautologically coupled to local sequence predictability, leaving the regulatory interpretation underdetermined. Across three architecturally distinct foundation models (Caduceus-Ph, HyenaDNA, Enformer) and 30,448 dark genome elements at 92 glioma-relevant loci, we introduce a residualization-and-permutation diagnostic that separates predictability-driven from regulation-driven RIS variance. A sharp 10kb proximal-regulatory horizon survives every control we apply, but the LM-derived element-class hierarchy does not: a six-feature linear baseline matches Caduceus top-decile membership at AUC $= 0.985$. Cross-architecture decomposition cleanly separates a sequence-predictability layer (the two language models co-rank long well-predicted transposable elements) from a regulatory-output layer (Enformer alone retains residual cCRE-discriminative signal), with literally zero overlap between the two top-100 lists. Conservation, brain cis-eQTL, and STRING-PPI cross-checks then anchor what biology survives: top-100 elements across all three models are $3.3\times$ enriched per model for matching brain eQTLs ($p_\mathrm{emp} < 5\times 10^{-3}$), while a tempting transposable-element regulatory layer and a striking NRXN1+NLGN1 protein-pair convergence both fail proper permutation tests once those tests are constructed. We deliver the diagnostic as a general methodological tool for any ISM-based regulatory study.
cs.PF 2026-06-08

Dataflow optimization parallelizes genome aligners across regions

by Shiv Sundram

Dependencies and Dataflow in Seed-Filter-Extend Pipelines

Synthesizing four prior aligners removes serial constraints so candidate regions run in parallel and local alignments move to GPUs without a

Comparing genomes is critical for discovering mutations, tracking evolutionary lineages, and advancing cross-species genomics. Fundamentally, this reduces to an O(n^2) string-matching dynamic programming (DP) problem, a challenge that has driven decades of performance research. However, executing a strict O(n^2) DP algorithm is computationally intractable for genomes spanning millions to billions of base pairs. Consequently, modern aligners rely on global heuristics to identify thousands of candidate similarity regions between species. Unfortunately, these methods are burdened by complex serial dependencies. Once candidate regions are identified, the pipeline executes localized DP alignments, which introduce their own non-trivial heuristics and irregular data dependencies. While parallelizing dense, two-dimensional DP is a well-studied problem, accelerating this end-to-end pipeline is significantly more challenging. Parallelizing across candidate regions and offloading irregular, heuristic-laden local alignments to modern hardware (such as GPUs) remains a major hurdle. In this work, we address the challenge of overcoming these serial bottlenecks by optimizing the global pipeline across regions. We take inspiration from four papers: LASTZ, SegAlign, Darwin-WGA, and SNAP, synthesizing findings across each to inform optimizations, which we either prototype or implement directly in LASTZ.
q-bio.GN 2026-06-05

Adversarial fine-tuning bridges unpaired single-cell modalities

by Joseph Boyd, Matthew Lyon +3 more

Single-Cell Cross-Modal Transfer by Adversarial Fine-Tuning of Foundation Models

A foundation model recovers spatial neighbourhood signals from scRNA-seq using ST references without paired samples.

Spatial transcriptomics (ST) is a powerful tool for exploring biological properties dependent on structure, proximity, and interaction in tissue. The methods underpinning ST are developing rapidly but are limited in their ability to profile many thousands of genes at a subcellular scale. Although dissociated from tissue, it is known that the whole-transcriptome readouts of cells in single-cell RNA sequencing (scRNA-seq) retain information about their former in situ neighbourhoods, motivating computational methods to recover it. While paired ST and scRNA-seq datasets are scarce, each modality in its own right is abundantly available. We therefore propose to perform cross-modal translation between unpaired ST and scRNA-seq data. In this work we show that a single-cell foundation model can perform this translation via adversarial fine-tuning. We demonstrate that our method performs favourably against methods built for multi-omics translation.
q-bio.QM 2026-06-05

Bi-filtration with p-adic distances lifts genomic accuracy on small datasets

by Tirtharaj Dash, Gunja Sachdeva

p-adic Bi-Filtrations for Topological Machine Learning on Genomic Sequences

pVR pairs hierarchical prefix structure and frequency content to gain up to 21 points over baselines on low-sample tasks

We introduce pVR, a topological machine learning framework for alignment-free genomic sequence classification that combines $p$-adic numbers with topological data analysis. Each DNA sequence is encoded along two complementary axes: a $p$-adic distance on $k$-mer prefixes, which captures hierarchical positional structure, and a compositional $L_1$ distance on $k$-mer frequencies, which captures local sequence content. The two distances jointly parameterise a bi-filtered Vietoris--Rips complex, and per-sequence topological summaries from this bi-filtration serve as features for standard machine learning classifiers. We establish theoretical guarantees for the construction: stability under metric perturbations and invariance to the choice of prime, alongside a result that explains why a single $p$-adic axis is topologically uninformative and why the bi-filtration recovers nontrivial homology. On twelve genomic benchmarks ($28$ to $500$ sequences, $3$ to $7$ classes), pVR outperforms four established alignment-free baselines on three of six low-sample datasets, with gains of up to $21$ percentage points; it underperforms only on a SARS-CoV-2 variant benchmark whose point-mutation divergence violates the hierarchical assumption, and all methods saturate in the large-sample regime. pVR also outperforms zero-shot frozen embeddings from the 500M-parameter Nucleotide Transformer v2 by $6.7$ to $11.4$ percentage points on three low-sample benchmarks. The pVR codebase is publicly available at https://github.com/MAHI-Group/pVR.
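The two distance axes named in this abstract can be illustrated with a rough sketch. It is not the pVR construction: the prefix distance below is a generic p-adic-style ultrametric (p raised to minus the shared-prefix length), and the choice p = 5, the way sequences are compared, and all function names are assumptions. The actual encoding is in the paper and the linked repository.

```python
from collections import Counter


def prefix_distance(a, b, p=5):
    """ASSUMED p-adic-style ultrametric: p ** (-length of shared prefix).

    Sequences that agree on a longer prefix are closer; identical ones are 0.
    """
    if a == b:
        return 0.0
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return float(p) ** (-shared)


def kmer_frequencies(seq, k):
    """Relative frequency of every k-mer occurring in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: c / total for kmer, c in counts.items()}


def l1_distance(f, g):
    """Compositional L1 distance between two k-mer frequency profiles."""
    keys = set(f) | set(g)
    return sum(abs(f.get(key, 0.0) - g.get(key, 0.0)) for key in keys)


def bi_distance(a, b, k=2, p=5):
    """The pair of distances that would jointly index a bi-filtration."""
    return (prefix_distance(a, b, p),
            l1_distance(kmer_frequencies(a, k), kmer_frequencies(b, k)))
```

The first coordinate depends only on where two sequences first differ, while the second depends only on their k-mer composition. Two sequences can therefore be close on one axis and far on the other, which is the complementarity the abstract points to.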
cs.CL 2026-06-04

Learned boundaries lift histone accuracy 14 points at matched compute

by Daria Ledneva, Denis Kuznetsov

LDARNet: DNA Adaptive Representation Network with Learnable Tokenization for Genomic Modeling

LDARNet's unsupervised routing also aligns token edges with promoter motifs and splice junctions on nucleotide inspection.

Genomic foundation models increasingly adopt large language model architectures, yet almost universally rely on fixed tokenization schemes such as $k$-mers, BPE, or single nucleotides, which impose arbitrary sequence boundaries that may obscure biologically relevant structure. We present LDARNet, a 120M-parameter hierarchical genomic foundation model that adapts H-Net-style dynamic chunking from autoregressive generation to masked language modeling, combining BiMamba-2 state-space layers with local attention, bidirectional routing, and a ratio-based regularizer to induce adaptive token boundaries without supervision. Fine-tuned on 27 tasks from the Nucleotide Transformer and Genomic Benchmarks suites, LDARNet achieves 11/18 wins among compact models ($<$300M parameters) and state-of-the-art results on 5 histone modification tasks, outperforming models up to 20$\times$ larger. A FLOPs-matched controlled experiment isolates learned routing as the source of these gains: learned boundaries beat fixed-grid boundaries by up to 14 percentage points on histone tasks at identical compute. Nucleotide-resolution analysis further shows that the learned boundaries align with canonical promoter motifs and splice junctions without supervision, providing a biological interpretation for adaptive tokenization in genomic foundation models.
cs.CL 2026-06-04

Genomic model rankings flip across task types

by Daria Ledneva, Mikhail Nuridinov +1 more

GENEB: Why Genomic Models Are Hard to Compare

Benchmark of 40 models on 100 tasks finds architecture and pretraining beat raw parameter count.

Progress in genomic foundation models is difficult to assess due to fragmented benchmarks, incompatible evaluation protocols, and task-specific reporting. As a result, claims of superiority or generality across models are often not directly comparable. We introduce GENEB, a large-scale diagnostic benchmark that evaluates frozen representations from 40 genomic foundation models across 100 tasks spanning 13 functional categories under a unified probing-based protocol, including few-shot regimes. GENEB enables controlled comparison across model scale, architecture, tokenization, and pretraining data while explicitly exposing task-level trade-offs. Our analysis shows that aggregate leaderboards are unstable: model rankings vary sharply across task categories, scale provides only modest and inconsistent gains, and architectural and pretraining alignment frequently outweigh parameter count. These results highlight limitations of current evaluation practices and position GENEB as a reference framework for principled comparison and category-aware model selection in genomic machine learning.
cs.LG 2026-06-01

Causal recovery from bulk gene data requires linear aggregation

by Gongxu Luo, Boyang Sun +1 more

On the Recoverability of Causal Relations from Bulk Gene Expression Data

Necessary and sufficient conditions are linear sums or means plus affine equations, but real data deviates from linearity.

Bulk gene expression profiling, which aggregates pooled RNA across cells within a biological sample, remains important in the single-cell era because it is typically less noisy, more sensitive, and more cost-effective than single-cell assays. Accordingly, a growing body of computational methods seeks to recover causal relations among genes from bulk expression data. However, aggregation is a lossy, non-invertible coarsening of the underlying cellular system, and it remains unclear whether and under what conditions causal relations are recoverable from aggregated bulk gene expression data. To answer this, we formalize recoverability under aggregation through two notions of consistency: functional-form consistency and conditional-independence consistency. We then derive necessary and sufficient conditions for recoverability, showing that these properties are preserved only under linear aggregations (e.g., sum/mean) coupled with affine structural equations. To assess the practical plausibility of these conditions, analyses of four bulk and four single-cell gene expression datasets further reveal that the estimated pairwise regulatory functions among genes deviate from linearity in both data types, providing limited empirical support for the linearity assumptions required for recoverability. Together, these results caution against recovering causal relations from aggregated bulk expression data without strong additional assumptions.
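A toy calculation shows why linearity matters here. It illustrates the intuition only and is not the paper's proof or its consistency definitions: the two regulatory functions, their coefficients, and the four "cells" are made up.

```python
def mean(values):
    """Linear aggregation: the bulk value is the average over cells."""
    return sum(values) / len(values)


def affine(x, a=2.0, b=1.0):
    """Hypothetical affine regulatory function y = a * x + b."""
    return a * x + b


def quadratic(x):
    """Hypothetical nonlinear regulatory function y = x ** 2."""
    return x * x


def bulk_gap(f, cells):
    """Difference between the bulk target level and f applied to bulk input.

    A gap of zero means the cell-level function also describes bulk data.
    """
    bulk_x = mean(cells)
    bulk_y = mean([f(x) for x in cells])
    return bulk_y - f(bulk_x)


cells = [0.0, 1.0, 2.0, 3.0]  # regulator expression in four cells
```

For these cells `bulk_gap(affine, cells)` is zero, because averaging commutes with an affine map. `bulk_gap(quadratic, cells)` is 1.25, the variance of the regulator across cells, so the cell-level function no longer holds between the bulk quantities.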
q-bio.GN 2026-06-01

LD-block sparse Bayesian model predicts more cis-eQTL genes

by Lei Huang, Hui Shen +10 more

Annotation-Informed Block-Sparse Bayesian Modeling for cis-Expression Prediction

bsBSLMM keeps more genes predictable than BSLMM or TIGAR methods and recovers extra TWAS signals in GEUVADIS and an independent cohort.

Genotype-based cis-expression prediction depends on accurately modeling local regulatory architecture. We present block-sparse Bayesian sparse linear mixed model (bsBSLMM), an extension of Bayesian sparse linear mixed model (BSLMM) that incorporates linkage disequilibrium (LD)-block spike-and-slab sparsity and a transcription start site (TSS)-informed SNP inclusion prior. Across 23,098 genes from GEUVADIS European-ancestry lymphoblastoid cell lines, bsBSLMM retained more predictable genes than BSLMM, LASSO, BLUP, TIGAR elastic net, and TIGAR Dirichlet-process regression under matched evaluation criteria. Compared with BSLMM, bsBSLMM improved held-out prediction performance for most shared genes, with gains driven primarily by LD-block sparsity and further enhanced by the TSS-informed prior. Variants selected by bsBSLMM showed stronger enrichment in GM12878 DNase and H3K27ac regulatory regions than variants selected by BSLMM. In transcriptome-wide association study (TWAS) analysis, bsBSLMM recovered established inflammatory bowel disease signals, including IL23R, and identified additional genome-wide significant genes not detected by BSLMM. Independent validation in the Louisiana Osteoporosis Study reproduced the increased prediction yield across ancestries and recovered biologically relevant bone mineral density pathways in downstream TWAS and gene set enrichment analyses. These results demonstrate that incorporating LD-block structure and biologically informed SNP priors improves cis-expression prediction and enhances downstream TWAS discovery.
cs.LG 2026-06-01

Harmonized perturbation data boosts compound embeddings

by Artur Szałata, Olga Novitskaia +4 more

Chem-PerturBridge: a harmonized compendium of small molecule perturbation transcriptomic effects

Pretraining on 1.25M samples from eight assays outperforms L1000-only and fingerprint baselines in held-out tests.

Large perturbation models require training data encompassing chemical, cellular, and assay diversity. Current transcriptomic resources for small-molecule modeling, however, are fragmented across technologies, metadata conventions, controls, doses, and preprocessing pipelines. We introduce Chem-PerturBridge, a harmonized multi-dataset resource comprising over 37k compounds, 136 cellular contexts, and 1.25M transcriptomic samples across eight assay types, with standardized identifiers, metadata, and replicate-aware condition-level effects. We use the resource to evaluate matched-condition agreement across datasets and replicate agreement within datasets. Matched same-compound conditions generally show weak agreement in fine-grained logFC rankings and magnitudes across most dataset pairs, often falling below same-context different-compound baselines. In contrast, logFC direction agreement is substantially more stable and usually exceeds these baselines. We further evaluate Chem-PerturBridge as a pretraining resource for compound representation learning. Under a compound-held-out OP3 evaluation split, embeddings pretrained on Chem-PerturBridge improve over L1000-only embeddings, Morgan fingerprints, and the descriptor-free OP3 baseline across metrics. An extensive molecule-holdout evaluation across 11 datasets further shows that models trained on Chem-PerturBridge outperform or match those that are not. Chem-PerturBridge therefore supports both diagnostic evaluation of cross-dataset signature agreement and model-oriented reuse of heterogeneous perturbation transcriptomic data.
cs.LG 2026-06-01

Genomic AI explanations contradict each other and miss known motifs

by Shasha Zhou, Mingyu Huang +1 more

Position: Genomic Model Research Must Move Beyond Anecdotal Evaluation of Interpretability Methods

Benchmark shows different interpretability methods disagree on the same predictions and fail to recover regulatory sequences.

Advances in machine learning and computational power have unlocked the predictive potential of the human genome, yet biologists now demand that these models also elucidate the underlying biological mechanisms. While interpretable machine learning (IML) techniques have been increasingly applied to bridge this gap, there has been a pervasive reliance on anecdotal validation: the vast majority of research relies on a single IML method and reports only isolated successful instances. Through a benchmarking study on transcription factor binding, we demonstrate the risks of current practices. We show that different IML methods can often (1) yield contradictory explanations for the same predictions, (2) fail to localize known regulatory motifs, and (3) fail to faithfully reflect the model's internal decision process. In light of this, we argue for a validation framework analogous to clinical trials: just as trials require rigorous design and adverse-event reporting, genomic interpretability must move beyond cherry-picked plausibility toward systematic assessment of consistency, faithfulness, and biological validity. To facilitate this, we propose a tiered framework to guide rigorous evaluation and reporting of genomic IML methods.
cs.LG 2026-05-29

Ligand-receptor costs improve cell trajectory inference

by Silas Ruhrberg Estévez, Nicolas Huynh +5 more

CellBRIDGE: Learning Cellular Trajectories via Interaction-Aware Alignment

Augmenting optimal transport with directed interaction terms yields better snapshot alignments and interpretable perturbation tests on cance

Inferring dynamics from population snapshots is a fundamental challenge in machine learning and biology. In scRNA-sequencing (scRNA-seq), destructive measurements preclude direct tracking of individual cells across time, making trajectory inference underdetermined. Optimal Transport (OT) provides a principled framework for snapshot alignment, but a long-standing modeling question is which cost functions yield biologically meaningful couplings. Standard OT approaches rely on gene-expression distances, implicitly treating cells as independent points and neglecting structured cell-cell communication mediated by ligand-receptor signaling. We introduce CellBRIDGE (Cell-Based Regularized Interaction-Driven Gene Expression), which augments feature-based OT with a directed, typed interaction cost derived from ligand-receptor activity. By explicitly modeling cell-cell communication, CellBRIDGE improves cross-snapshot couplings and downstream trajectory estimates across synthetic and real scRNA-seq datasets relative to feature-only baselines. Notably, CellBRIDGE enables mechanistically interpretable in silico perturbations: on lung cancer data, silencing specific ligand-receptor pairs induces trajectory shifts that recapitulate expected effects of targeted pathway inhibition.
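The idea of adding an interaction term to a feature-based OT cost can be sketched with a small entropic OT solver. This is a minimal illustration under stated assumptions, not the CellBRIDGE implementation: the function names, the weight `lam`, and the toy cost matrices are hypothetical, and plain Sinkhorn iterations stand in for whatever solver the authors use.

```python
import math

def sinkhorn(cost, a, b, eps=0.5, n_iter=500):
    """Entropic OT via Sinkhorn scaling; returns a coupling with marginals a (rows) and b (columns)."""
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        # Rescale rows, then columns, to match the target marginals
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def combined_cost(expr_cost, interaction_cost, lam):
    """Feature cost plus a weighted interaction cost (lam is a hypothetical trade-off weight)."""
    return [[e + lam * g for e, g in zip(e_row, g_row)]
            for e_row, g_row in zip(expr_cost, interaction_cost)]

# Toy example: two cells at time t, two cells at time t+1, uniform mass
expr_cost = [[0.0, 1.0], [1.0, 0.0]]          # expression distance favours the diagonal
interaction_cost = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical interaction cost favours the off-diagonal
a = [0.5, 0.5]
b = [0.5, 0.5]

P_feature = sinkhorn(expr_cost, a, b)
P_both = sinkhorn(combined_cost(expr_cost, interaction_cost, lam=3.0), a, b)
```

In this toy setting the feature-only coupling puts most mass on the diagonal, while a large enough interaction weight flips it to the off-diagonal, which shows how the extra term can change which cells are matched across snapshots.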
q-bio.GN 2026-05-29

Choroidal endothelial failure starts both AMD forms

by Kyle M. Veksler, Levi Dong +2 more

Meta-analysis of scRNA-seq data for choroidal endothelial cells in dry Age-related Macular Degeneration

Single-cell data show dry AMD arises from aborted vessel growth while wet AMD arises from excess growth in the same cell type.

The mechanisms that lead to dry Age-related Macular Degeneration (AMD) are largely unelucidated, which prevents the introduction of effective therapies. Experimental support exists in the literature for the hypothesis that choroidal endothelial cell (ChEC) dysfunction precedes the loss of macular retinal pigmented epithelial (RPE) cells, which may be only a secondary consequence of inadequate blood supply. If so, interventions at the level of ChECs could constitute an underinvestigated therapeutic strategy. Datasets regarding the transcriptional changes in early or intermediate dry AMD are publicly available, but for some of them the information about ChECs has not been analyzed, or has not been analyzed using the most powerful and recent software tools. We present here new data generated by our bioinformatics analysis of these datasets. The main new finding is that angiogenesis is initiated in dry AMD, as it is in wet AMD. However, contrary to wet AMD, in dry AMD angiogenesis fails to execute, and therefore the blood supply that supports the RPE cells becomes gradually insufficient, leading to their dysfunction and death. The data support a unitary hypothesis of the origin (etiology) of both dry and wet AMD, namely that both are initiated by ChEC dysfunction: either insufficient or abortive angiogenesis in dry AMD, or excessive angiogenesis in wet AMD. Pathway analysis also reveals perturbations in Notch and TNF signaling, endothelial-to-mesenchymal transition (EndoMT), mitochondria, "fluid shear stress", "osteoclast differentiation" and "calcification/osteoporosis". Overall, the new data provide a rationale for experimental studies to validate and further characterize these perturbations and to investigate strategies to correct them.
cs.LG 2026-05-28

Method reconstructs single-cell geometry from unpaired data

by Ehtesamul Azim, Muhtasim Noor Alif +3 more

Geometry-First Generative Spatial Single-Cell Reconstruction

GEARS aligns scRNA-seq with ST to generate local geometries then solves global distance problem, improving distance preservation over grid-b

Single-cell RNA sequencing (scRNA-seq) profiles large numbers of cells but loses spatial context, whereas spatial transcriptomics (ST) preserves partial spatial structure at lower resolution. Most existing integration methods either deconvolve spot mixtures or map cells onto a measured spot lattice, which ties reconstructions to a fixed grid and slide-specific coordinate systems, a limitation that is especially problematic in unpaired settings. We propose GEARS, a geometry-first framework that reconstructs an intrinsic single-cell spatial geometry guided by ST, without relying on cell-type labels, histological images, or cell-to-spot assignment. GEARS first learns a domain-invariant expression encoder that aligns ST spots and dissociated cells, and then trains a permutation-equivariant generator with a diffusion-based refiner with EDM-style preconditioning to generate local spatial geometries under pose-invariant supervision derived from ST coordinates. At inference, GEARS reconstructs geometry on many overlapping subsets of scRNA-seq cells, aggregates predicted pairwise distances across subsets, and solves a global distance-geometry problem to obtain canonical two-dimensional coordinates and a dense distance matrix. Extensive quantitative and qualitative experiments, including cross-section generalization, show that GEARS consistently improves global distance preservation, local neighborhood fidelity, and spatial distribution alignment compared to strong spatial mapping and deconvolution baselines.
q-bio.GN 2026-05-25

Promoter-protein contrastive training beats gLMs on regulatory tasks

by Cameron Dufault, Scott Xu +1 more

C3P: Contrastive promoter-protein pretraining yields representations capturing bacterial gene regulation

Aligning 88 million pairs gives multi-fold gains in annotation prediction and zero-shot co-regulation retrieval from genomes alone.

Despite the increasing scale of genome language models (gLMs), their ability to decode the function of regulatory sequences remains unclear. gLM pretraining relies on sequence reconstruction, which may struggle due to the noisy, rapidly evolving nature of regulatory DNA. Self-supervised contrastive approaches provide a promising alternative. Inspired by language-image architectures like CLIP, we introduce contrastive promoter-protein pretraining (C3P). By learning to align promoters to their corresponding proteins, we leverage the rich representations of proteins learned by protein language models as a supervisory signal for the learning of promoter representations. After training on 88 million bacterial promoter-protein pairs, we evaluate the predictive power of C3P-learned promoter representations for inference of curated regulatory annotations, finding multi-fold improvement over leading gLMs. We also introduce zero-shot co-regulated gene retrieval, the ability to find co-regulated genes in a genome using no experimental data. We find that compared to a randomly initialized baseline, C3P training consistently provides significant zero-shot performance gains, unlike gLMs. Scaling analysis reveals the potential for further improvement as well as the efficiency of C3P, which achieved strong performance at a fraction of the training cost of leading gLMs. In addition to demonstrating that C3P training is effective for learning representations of bacterial regulatory sequences, our strong zero-shot co-regulated gene retrieval performance suggests the possibility of decoding gene regulation for millions of bacteria from their genomes alone.
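The CLIP-style alignment objective mentioned above can be sketched as a symmetric contrastive (InfoNCE) loss over a batch of paired embeddings. This is a generic illustration, not the C3P training code: the function names, the temperature value, and the two-dimensional toy embeddings are assumptions.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def clip_style_loss(promoter_emb, protein_emb, temperature=0.1):
    """Symmetric InfoNCE loss; promoter i is assumed to pair with protein i."""
    p = [normalize(v) for v in promoter_emb]
    q = [normalize(v) for v in protein_emb]
    n = len(p)
    logits = [[sum(x * y for x, y in zip(p[i], q[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy_diag(rows):
        # Mean cross-entropy with the matching pair (diagonal entry) as the target
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]
    # Average the promoter-to-protein and protein-to-promoter directions
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(columns))

# Toy batch: correctly paired embeddings versus deliberately swapped ones
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is low when each promoter embedding is closest to its own protein embedding and high when the pairs are swapped, which is the signal that pulls matched pairs together during training.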
q-bio.GN 2026-05-25

303-feature XGBoost reaches MCC 0.94 on ClinVar missense variants

by Muhammad Muneeb, David B. Ascher

AnnotateMissense: a genome-wide annotation and benchmarking framework for missense pathogenicity prediction

Trained on 132k labelled examples, the model holds MCC 0.76 on new temporal entries before scoring 90 million hg38 variants.

Missense variant interpretation remains challenging because pathogenicity depends on heterogeneous evidence from population frequency, evolutionary conservation, transcript context, amino acid substitution severity, prior pathogenicity predictors and protein-language-model-derived features. We present AnnotateMissense, a scalable annotation, benchmarking and genome-wide prediction framework for missense variant interpretation. AnnotateMissense integrates hg38 missense variants derived from dbNSFP v5.1 with ANNOVAR annotations, dbNSFP transcript/protein descriptors, AlphaMissense scores, ESM-derived features, conservation metrics, population-frequency variables, established pathogenicity predictors and engineered amino acid/codon-context features. Using 132,714 ClinVar-labelled missense variants, we benchmarked machine-learning and deep-learning models under controlled feature configurations. The full 303-feature benchmark set achieved the strongest performance with XGBoost, reaching mean MCC = 0.9411 and ROC-AUC = 0.9950 across stratified five-fold cross-validation. Restricted naive and location-oriented feature sets achieved lower best MCC values of 0.4989 and 0.5113, respectively. Circularity-controlled ablations showed that removing prior-predictor, population-frequency and clinically overlapping evidence reduced performance, whereas excluding AlphaMissense and ESM-derived features alone had minimal effect. Temporal ClinVar validation on newly observed pathogenic/benign variants achieved MCC = 0.7613, accuracy = 0.8798 and F1-score = 0.8750. The final model was applied to 90,643,830 hg38 missense variants to generate AnnotateMissense pathogenicity scores and binary prediction labels. Code and outputs are available at https://github.com/MuhammadMuneeb007/CAGI7_Annotate_All_Missense and https://doi.org/10.5281/zenodo.19981867.
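For readers less familiar with the headline metric, the Matthews correlation coefficient (MCC) can be computed from the four confusion-matrix counts as follows. The labels below are toy values chosen for illustration and are unrelated to the ClinVar benchmark; the function name is an assumption.

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = pathogenic, 0 = benign)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Toy labels: 2 true positives, 2 true negatives, 1 false positive, 1 false negative
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
score = mcc(y_true, y_pred)  # (2*2 - 1*1) / sqrt(3*3*3*3) = 1/3
```

Unlike accuracy, MCC uses all four counts, so it stays informative when the pathogenic and benign classes are imbalanced.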
q-bio.GN 2026-05-25 Recognition

Genetic short-sleep risk ties to higher obesity and diabetes odds

by Jiheum Park, Stephanie Y. Shue +5 more

Population-Specific Genetic and Non-Genetic Influences on Sleep Traits and Health Outcomes

Ancestry patterns emerge, yet measured sleep duration explains 85-99% of cross-sectional links in the data.

Sleep traits are shaped by genetic and environmental factors and may influence many health conditions. The All of Us Research Program, which includes EHR, physical measurements, genomic data, and wearable data across ancestry groups, provides an opportunity to study genetic and non-genetic contributors to sleep-related health outcomes. We examined associations between genetic predispositions to chronotype, sleep duration, and short sleep and health outcomes across ancestries, as well as the role of measured sleep duration. We used All of Us genome-wide association study results, including ancestry-specific and meta-analyses for 3,414 phenotypes, to identify phenotypes associated with 455 sleep-related SNPs. Cross-sectional and longitudinal analyses (n = 212,529) evaluated associations between polygenic risk scores (PRS) and anthropometric and metabolic measures from EHR. A subgroup analysis (n = 7,655) assessed sleep duration using Fitbit data. Across six ancestry groups, SNP analysis identified 61 phenotypes linked to 29 sleep-trait-associated SNPs. The chronotype SNP rs1421085 in FTO showed the strongest associations with obesity, diabetes, and cardiovascular conditions, mainly in European, American, and African groups. PRS analysis showed that higher predisposition to shorter sleep duration was associated with increased risk of obesity and diabetes, with ancestry-specific variation. Measured sleep duration attenuated these associations, with relative contributions of 85.6%-99.9% in cross-sectional analyses and 7.1%-44.0% in longitudinal analyses compared with PRS. This study identified health conditions associated with genetic predispositions to sleep traits and suggests that actual sleep duration may play a prominent role in sleep-related health outcomes. Differences among meta-, pooled-, and ancestry-specific analyses highlight the importance of population-specific research.
q-bio.GN 2026-05-22

Neural net extracts motifs separating WT from KO ATAC-seq peaks

by Lopamudra Dey

WTKO-CNN: Deep Learning Reveals Sequence Motifs Distinguishing Wild-Type and Knockout ATAC-seq Peaks

Saliency maps from a CNN classifier on chromatin data yield sequence patterns validated against known transcription-factor sites.

Chromatin regulators can alter transcriptional programs by modifying the accessibility of regulatory DNA elements. Understanding how regulatory sequences differ between wild-type (WT) and knockout (KO) conditions is crucial for deciphering transcriptional control. Here, we applied WTKO-CNN, a convolutional neural network with an attention mechanism, to classify DNA sequences as WT or KO, achieving high predictive performance. To interpret the model, we generated saliency maps to identify nucleotide positions most influential for the classification decision. From these high-saliency regions, we extracted and clustered k-mers, enabling de novo motif discovery. Sequence logos and consensus motifs derived from the CNN filters revealed biologically meaningful patterns, which were further validated using MEME, TOMTOM, and HOMER against known transcription factor binding sites. Our analysis identified motifs associated with transcription factor families that discriminate WT from KO sequences, demonstrating that CNN-guided saliency mapping is a powerful approach for uncovering functional sequence features.
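The step from saliency maps to candidate k-mers can be sketched as a sliding-window scan. This is a simplified illustration, not the WTKO-CNN pipeline: the function names, the window-sum scoring rule, and the toy sequences with hand-made saliency values are all assumptions, and the clustering step is reduced to simple counting.

```python
from collections import Counter

def top_saliency_kmers(seq, saliency, k=7, top_n=1):
    """Return the k-mers whose windows have the highest summed saliency (windows may overlap)."""
    windows = []
    for start in range(len(seq) - k + 1):
        score = sum(saliency[start:start + k])
        windows.append((score, start, seq[start:start + k]))
    # Highest score first; ties broken by leftmost position
    windows.sort(key=lambda w: (-w[0], w[1]))
    return [(kmer, start, score) for score, start, kmer in windows[:top_n]]

def count_kmers(sequences, saliencies, k=7, top_n=1):
    """Count how often each top-scoring k-mer recurs across sequences."""
    counts = Counter()
    for seq, sal in zip(sequences, saliencies):
        for kmer, _, _ in top_saliency_kmers(seq, sal, k, top_n):
            counts[kmer] += 1
    return counts

# Toy data: the same 7-mer is planted in two sequences and given high saliency by hand
planted = "TGACTCA"
seqs = ["AAAA" + planted + "AAA", "CC" + planted + "GGGGG"]
sals = [[0.0] * 4 + [1.0] * 7 + [0.0] * 3,
        [0.0] * 2 + [1.0] * 7 + [0.0] * 5]
kmer_counts = count_kmers(seqs, sals)
```

Recurrent high-saliency k-mers found this way would then be candidates for comparison against known binding sites with external tools.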
q-bio.GN 2026-05-21 1 theorem

Homomorphic encryption on blockchain computes private polygenic risk scores

by Kimon Antonios Provatas, Christos Galanopoulos +1 more

bioETH-PRS: Confidential Polygenic Risk Scoring without a Trusted Evaluator via Fully Homomorphic Encryption on a Programmable Blockchain

Four-contract protocol keeps genotypes and weights hidden while releasing only category outputs.

Polygenic risk scores (PRSs) aggregate genetic effect estimates to predict disease susceptibility, yet clinical deployment often exposes raw genotype data to third-party compute infrastructure. Prior homomorphic-encryption approaches still require trust in a designated evaluator. We present bioETH-PRS, a protocol that replaces that evaluator role with immutable smart contracts on a blockchain supporting Fully Homomorphic Encryption (fhEVM). Using the integer-exact TFHE scheme, bioETH-PRS computes the PRS dot product entirely within the encrypted domain, keeping both genotype dosage vectors and GWAS weight vectors hidden from external parties throughout execution. We introduce a three-step fixed-point quantisation scheme for representing signed GWAS weights as unsigned 64-bit integers, achieving machine-epsilon reconstruction accuracy on validated fixtures. A four-contract architecture separates data custody, model publication, computation, and output release, and supports both a classic chunked path and a streaming path, with the latter reducing mock-measured gas by 37%. An on-chain noisy output oracle emits an encrypted noisy-score handle and a publicly decryptable ternary category, reducing raw score exposure and probing risk. Prototype evaluation on real GWAS fixtures confirms linear gas scaling and suggests that the approach may be cost-competitive in low-gas deployment environments.
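To illustrate why signed weights need special handling in an unsigned integer domain, the sketch below encodes weights with a scale factor and an offset, computes the PRS dot product in integers, and decodes the result. It is an assumption-laden toy, not the paper's three-step scheme: `SCALE`, `OFFSET`, and the function names are hypothetical, and no encryption is performed here.

```python
SCALE = 10**6      # hypothetical fixed-point scale (six decimal places)
OFFSET = 1 << 32   # hypothetical offset that shifts signed values into the unsigned range
UINT64_MAX = 2**64 - 1

def quantise(weights):
    """Map signed float weights to unsigned integers: round(w * SCALE) + OFFSET."""
    q = [round(w * SCALE) + OFFSET for w in weights]
    assert all(0 <= x <= UINT64_MAX for x in q), "weight outside representable range"
    return q

def integer_prs(q_weights, dosages):
    """Dot product using only unsigned integer arithmetic (the part FHE would evaluate)."""
    acc = sum(q * d for q, d in zip(q_weights, dosages))
    assert acc <= UINT64_MAX, "accumulator would overflow 64 bits"
    return acc, sum(dosages)

def decode(acc, dosage_sum):
    """Remove the accumulated offset and rescale back to a signed float score."""
    return (acc - OFFSET * dosage_sum) / SCALE

weights = [0.125, -0.5, 0.25]   # toy GWAS effect sizes
dosages = [2, 1, 0]             # toy genotype dosages in {0, 1, 2}
acc, dosage_sum = integer_prs(quantise(weights), dosages)
score = decode(acc, dosage_sum)  # 2*0.125 + 1*(-0.5) + 0*0.25 = -0.25
```

The offset contributes `OFFSET * sum(dosages)` to the accumulator, so the decoder needs the dosage sum as well; in an encrypted setting that sum would have to be handled without revealing the genotypes.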
cs.LG 2026-05-21 Recognition

Latent GP and optimal transport track cell changes over time

by Mehmet Yigit Balik, Harri Lähdesmäki

Modeling Temporal scRNA-seq Data with Latent Gaussian Process and Optimal Transport

By aligning population distributions the framework infers trajectories and separates developmental asynchrony from cell-type paths in static

Single-cell RNA sequencing provides insights into gene expression at single-cell resolution, yet inferring temporal processes from these static snapshot measurements remains a fundamental challenge. Current approaches utilizing neural differential equations and flows are sensitive to overfitting and lack careful consideration of biological variability. In this work, we propose a generative framework that models population trends using a latent heteroscedastic Gaussian process (GP) approximated by Hilbert space methods. To address the absence of genuine cell trajectories, we leverage an optimal transport (OT) objective that aligns generated and observed population distributions. Our method explicitly captures biological heterogeneity by incorporating cell-specific latent time and cell type conditioning to disentangle temporal asynchrony and trajectories to different cell types. We demonstrate state-of-the-art performance on complex interpolation and extrapolation benchmarks and introduce a novel gradient-based strategy for inferring perturbation trajectories.
q-bio.GN 2026-05-21

Machine learning links lncRNA features to type 2 diabetes in two cohorts

by Ashwani Siwach, Sanjeev Narayan Sharma +1 more

Multi-Modal Machine Learning for Population- and Subject-Specific lncRNA-Type 2 Diabetes Association Analysis

Expression and sequence traits from MEG3 and other lncRNAs show cohort-specific associations while SHAP ranks MEG3 highest in both.

Long non-coding RNAs (lncRNAs) are emerging regulatory molecules implicated in chronic disease pathogenesis, including Type 2 Diabetes Mellitus (T2D). We investigated ten literature-reported lncRNAs associated with T2D (MALAT1, MEG3, MIAT, ANRIL, GAS5, KCNQ1OT1, H19, BCYRN1, XIST, and HOTAIR) across two independent population-based RNA-seq cohorts. Single-omics approaches provide an incomplete view of disease biology; therefore, an integrative multi-feature framework was developed, extracting expression, secondary-structure, and sequence features for each lncRNA. Eight machine learning (ML) classifiers were evaluated under stratified k-fold, leave-one-out cross-validation (LOOCV), and repeated hold-out schemes to ensure robust performance estimation. SHAP analysis was applied for subject-level association interpretation. In one cohort, GAS5 and XIST expression features, along with GAS5, MEG3, and ANRIL sequence features, were found to be associated with T2D, while MALAT1 expression and KCNQ1OT1, ANRIL, and MEG3 sequence features were found to be associated in the second cohort. MEG3 was identified by SHAP as the dominant lncRNA in both cohorts. ML results were consistent with established statistical methods while additionally providing population- and subject-level disease association profiles linked to specific molecular feature types. The proposed framework advances mechanistic understanding of T2D and supports lncRNA-based precision medicine.
q-bio.GN 2026-05-19 Recognition

Algorithm extracts active TF sites from mutation groups

by Doruk Efe Gökmen, Rosalind Wenshan Pan +5 more

Informational blueprints reveal condition-dependent gene regulatory architectures

Optimised filters across full promoters identify correlated mutations with biggest collective impact on expression under given conditions.

While coding regions in the genome have a direct interpretation in terms of protein products, significant fractions are non-coding and yet control essential biological functions. Unlike the genetic code, there is no "lookup table" that identifies where regulatory proteins, known as transcription factors (TFs), bind. Here, we extract these binding sites by distilling sequences of nucleotide letters into collective coordinates (hyperletters) representing the binding sites that are active under specific environmental conditions. Going beyond local information footprints between individual bases and expression levels, our information blueprint algorithm compresses the global information by optimising filters that simultaneously scan an entire promoter sequence. Inspired by renormalisation-group techniques, we identify TF binding sites as coarse-grained variables combining groups of correlated mutations with the highest collective impact on gene expression. We validate our approach on experimental data for E. coli and discover novel regulatory elements illustrating its deployment at scale across growth conditions.
q-bio.GN 2026-05-19

Geometry-aware bridges recover single-cell trajectories more accurately

by Chenglei Yu, Chuanrui Wang +2 more

PACE: Geometry-Aware Bridge Transport for Single-Cell Trajectory Inference

Reduces average reconstruction error by 23.7 percent across seven datasets by penalizing off-tangent movement without cell pairings.

Single-cell trajectory inference from destructive time-course snapshots is fundamentally ill-posed: neither cross-time cell correspondences nor continuous trajectories are observed, so the snapshot distributions alone do not uniquely determine the underlying dynamics. Existing optimal transport and flow-based methods typically couple cells by Euclidean proximity at observed clock times, which can misalign trajectories when development is asynchronous and cells sampled at the same experimental time occupy different latent pseudotime stages. We propose PACE, a trajectory inference framework that recovers geometry-consistent continuous transport dynamics from destructive time-course snapshots through three coupled components. First, PACE constructs a state- and time-dependent anisotropic Riemannian metric that assigns low transport cost along locally supported tangent directions while penalizing normal velocity components. Second, it alternates between refining cross-time couplings under the induced path-action cost and fitting endpoint-preserving neural bridges between adjacent snapshots. Third, it distills the learned bridge dynamics into a global continuous-time velocity field over cellular states. Across seven controlled and biological datasets covering nine held-out reconstruction experiments, PACE achieves the strongest overall reconstruction performance, reducing MMD, Wasserstein-1 distance, and Wasserstein-2 distance by 23.7% on average relative to the strongest competing baseline. PACE also improves RNA-velocity alignment by 15.4% on an embryoid body differentiation benchmark, without requiring explicit cell pairing, lineage tracing, or RNA-velocity supervision during training. Code is available at https://github.com/AI4Science-WestlakeU/PACE.
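The notion of a transport cost that is cheap along tangent directions and expensive along normal ones can be illustrated with a fixed tangent vector. This is a deliberately reduced sketch, not the PACE metric: the function name and `normal_penalty` are hypothetical, and the tangent direction is supplied by hand rather than estimated from data as a state- and time-dependent quantity.

```python
import math

def anisotropic_cost(velocity, tangent, normal_penalty=10.0):
    """Squared speed under a metric that penalizes the off-tangent velocity component."""
    t_norm = math.sqrt(sum(t * t for t in tangent))
    t_hat = [t / t_norm for t in tangent]
    # Split the velocity into a component along the tangent and a normal remainder
    along = sum(v * t for v, t in zip(velocity, t_hat))
    v_normal = [v - along * t for v, t in zip(velocity, t_hat)]
    return along ** 2 + normal_penalty * sum(x * x for x in v_normal)

tangent = [2.0, 0.0]                                  # local data direction (toy, along the x-axis)
cost_on = anisotropic_cost([1.0, 0.0], tangent)       # motion along the tangent
cost_off = anisotropic_cost([0.0, 1.0], tangent)      # motion normal to the tangent
cost_mixed = anisotropic_cost([1.0, 1.0], tangent)    # one unit of each
```

With a penalty of 10, a unit step off the tangent costs ten times as much as a unit step along it, so low-cost paths tend to follow the locally supported directions.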
q-bio.GN 2026-05-18 2 theorems

Protein-augmented diffusion predicts single-cell drug effects

by Peiting Shi, Ningfeng Que +3 more

StateXDiff: Cell State-Contextualized Multimodal Diffusion for Single-Cell Perturbation Prediction

StateXDiff combines RNA profiles with inferred proteins in a conditional diffusion model to generalize across new cell lines, drugs, and

Predicting drug-induced cellular state changes at single-cell resolution remains a central challenge in virtual cell modeling, particularly under out-of-distribution (OOD) conditions. Current approaches predominantly rely on RNA-based assays, which often fail to adequately capture the diverse cellular states underlying drug responses. Moreover, conditional distribution shifts and low signal-to-noise ratios frequently cause models to learn spurious correlations rather than genuine state transitions. To address these limitations, we introduce StateXDiff, a cell State-contextualized multimodal (X) Diffusion framework for predicting single-cell responses to drug perturbations. The framework operates sequentially: first, it learns a disentangled, multimodal representation of cellular state by integrating transcriptomic profiles with inferred protein features; second, it employs a conditional diffusion model to generate perturbation-specific changes. Our approach introduces a Virtual Multimodal Cell State, which augments RNA-based representations with protein-level context, and a Mechanism-aware Drug-Gene Template, which consolidates multi-source biological knowledge for accurate drug representation. Generation is driven by a latent-space diffusion Transformer, regularized through quality-aware triplet constraints, including positive drug-protein pairs or protein-drug mismatched pairs, and explicit protein-reliability weighting. Extensive evaluation demonstrates that StateXDiff consistently enhances generalization performance across three challenging settings: unseen cell lines, unseen drugs, and combinatorial perturbations.
cs.LG 2026-05-13 1 theorem

Resampling yields reliable networks from scarce high-dimensional samples

by Ziwei Huang, Zeyuan Song +2 more

A Resampling-Based Framework for Network Structure Learning in High-Dimensional Data

RSNet estimates partial correlations and signed graphlet structures via bootstrap and subsampling in R

RSNet is an open-source R package that provides a resampling-based framework for robust and interpretable network inference, designed to address the limited-sample-size challenges common in high-dimensional data. It supports both the estimation of partial correlation networks modeled as Gaussian networks and conditional Gaussian Bayesian networks for mixed data types that combine continuous and discrete variables. The framework incorporates multiple resampling strategies, including bootstrap, subsampling, and cluster-based approaches, to accommodate both independent and correlated observations. To enhance interpretability, RSNet integrates graphlet-based topology analysis that captures higher-order connectivity and edge sign information, enabling single-node and subnetwork-level insights. Notably, RSNet is the first R package to efficiently construct signed graphlet degree vector matrices (GDVMs) in near-constant time for sparse networks, providing scalable analysis of higher-order network structure. Collectively, RSNet offers a versatile tool for statistically reliable and interpretable network inference in high-dimensional data.
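The general idea of resampling-based edge selection can be sketched in a few lines. RSNet itself is an R package, and the code below is neither its API nor its algorithm: it is a Python toy, with hypothetical function names and thresholds, that bootstraps three variables and records how often each partial correlation exceeds a cutoff.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def partial_corr(a, b, c):
    """Partial correlation of a and b given a single conditioning variable c."""
    r_ab, r_ac, r_bc = pearson(a, b), pearson(a, c), pearson(b, c)
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

def edge_stability(data, n_boot=200, threshold=0.3, seed=0):
    """Fraction of bootstrap resamples in which |partial correlation| exceeds the threshold."""
    rng = random.Random(seed)
    names = list(data)
    n = len(data[names[0]])
    triples = [(0, 1, 2), (0, 2, 1), (1, 2, 0)]   # (i, j, conditioning variable)
    hits = {(names[i], names[j]): 0 for i, j, _ in triples}
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # sample rows with replacement
        sample = {k: [v[i] for i in idx] for k, v in data.items()}
        for i, j, k in triples:
            pc = partial_corr(sample[names[i]], sample[names[j]], sample[names[k]])
            if abs(pc) > threshold:
                hits[(names[i], names[j])] += 1
    return {edge: count / n_boot for edge, count in hits.items()}

# Toy chain x -> z -> y: x and y are correlated only through z
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(200)]
z = [xi + rng.gauss(0, 0.5) for xi in x]
y = [zi + rng.gauss(0, 0.5) for zi in z]
stability = edge_stability({"x": x, "y": y, "z": z})
```

In this chain the direct edges x-z and z-y should be selected in nearly every resample, while the indirect x-y edge should rarely pass the cutoff, which is the kind of stability evidence a resampling framework reports.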
cs.LG 2026-05-13 2 theorems

Reeb graphs from diffusion detect single-cell shapes more accurately than baselines

by Andrew J Steindl, João Felipe Rocha +14 more

scShapeBench: Discovering geometry from high dimensional scRNAseq data

scShapeBench supplies synthetic and expert-labeled data so automated methods can match geometry to the right analysis pipeline instead of ad

High-dimensional point cloud data arise across many scientific domains, especially single-cell biology. The shapes or topologies of these datasets determine the types of information that can be extracted. For example, clustered data supports cell-type identification, trajectory structures support transition analysis, and archetypal structures capture continua of cellular behaviors. Existing analysis pipelines often assume a specific shape. The standard Seurat pipeline combines UMAP visualization with Louvain clustering and therefore assumes clustered data, while tools such as Monocle and SPADE assume tree-like structures, and flow-based models such as MIOFlow and Conditional Flow Matching target trajectories. Choosing which pipeline to apply is therefore often left to bioinformaticians who visually inspect datasets before selecting an analysis strategy. With the rise of agentic AI scientists, automating shape detection is increasingly important for selecting downstream analysis pipelines. To address this problem, we introduce scShapeBench, a benchmark dataset for shape detection containing both synthetic and expert-annotated single-cell datasets. Synthetic datasets are sampled from ground-truth skeleton graphs with controlled variance. Real single-cell datasets are curated from diverse sources and annotated by experts into four categories: clusters, single trajectory, multi-branching, and archetypal. We additionally introduce scReebTower, a baseline method that uses diffusion geometry to extract Reeb graphs and connect visualization with pipeline selection. We provide topology-aware evaluation metrics and compare scReebTower against PAGA and Mapper on synthetic and real data. Our results indicate that scReebTower outperforms existing baselines. Overall, our contributions span benchmarks, evaluation metrics, and a baseline for automated shape detection in single-cell data.
q-bio.GN 2026-05-13 Recognition

Genome embeddings predict microbiome abundances for novel species

by Younhun Kim, Georg K. Gerber +1 more

Set-Aggregated Genome Embeddings for Microbiome Abundance Prediction

Set-aggregated representations from genomic language models generalize better than classical bioinformatics methods on unseen genomes.

Microbiome functions are encoded within the genes of the community-wide metagenome. A natural question is whether properties of a microbial community can be predicted just from knowing the raw DNA sequences of its members. In this work, we employ set-aggregated genome embeddings (SAGE) to predict community-level abundance profiles, exploiting the few-shot learning capabilities of genomic language models (GLMs). We benchmark this approach and show improved generalization on novel genomes compared to classical bioinformatics approaches. Model ablation shows that community-level latent representations directly result in improved performance. Lastly, we demonstrate the benefits of intermediate transformations between latent representations and characterize the differences between GLM embedding choices.
cs.LG 2026-05-13 Recognition

Local discrete programs re-rank DNA edits for higher rewards

by Jeongchan Kim, Yunkyung Ko +1 more

LPDP: Inference-Time Reward Control for Variable-Length DNA Generation with Edit Flows

A training-free operator solves small bounded problems around edit actions to steer variable-length DNA generators without retraining.

We study the application of recent Edit Flows to inference-time reward control for DNA sequence generation. Unlike most reward-guided DNA generation frameworks, which operate on fixed-length sequence spaces, Edit Flows have the potential to generate variable-length DNA through biologically plausible insertion, deletion, and substitution operations. In particular, we propose Local Perturbation Discrete Programming (LPDP), a training-free, intermediate-state and action-aware local re-solving operator for variable-length DNA edit-action generators at inference time. More specifically, at each guided rollout step, LPDP scores one-step root edits, retains a near-best root band, and re-ranks each retained root by solving a bounded local discrete program around its child sequence. This local program uses the typed geometry of edit actions to focus on coherent substitution, insertion, or deletion subgraphs, and aggregates local continuations with either a hard Max backup or a soft log-sum-exponential (LSE) backup. We instantiate LPDP in two regimes: front-loaded reward tilting for enhancer optimization, where early edits are critical for establishing global regulatory sequence structure, and back-loaded reward tilting for exon-intron-exon inpainting, where late edits fine-tune splice-boundary contexts.
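The difference between the hard Max backup and the soft LSE backup mentioned in this abstract can be sketched as follows. The temperature parameter, the mean normalization inside the logarithm, and the reward values are assumptions chosen for illustration; the abstract does not specify how LPDP parameterizes its LSE backup.

```python
import numpy as np

def max_backup(values):
    """Hard backup: keep only the best local continuation."""
    return float(np.max(values))

def lse_backup(values, temperature=1.0):
    """Soft backup: temperature * log(mean(exp(values / temperature))).

    Computed with the usual max-shift for numerical stability.
    """
    scaled = np.asarray(values, dtype=float) / temperature
    shift = np.max(scaled)
    return float(temperature * (shift + np.log(np.mean(np.exp(scaled - shift)))))

# Hypothetical scores of local continuations below one retained root edit.
continuation_rewards = [0.2, 0.9, 0.85, 0.1]

hard = max_backup(continuation_rewards)
soft_warm = lse_backup(continuation_rewards, temperature=1.0)
soft_cold = lse_backup(continuation_rewards, temperature=0.01)
```

With this normalization the soft value always lies between the mean and the maximum of the continuation scores. A high temperature rewards roots whose continuations are good on average, and a low temperature approaches the hard Max backup.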
q-bio.GN 2026-05-11 2 theorems

Nonlinear correction fixes RNA-seq sample biases

by Christopher Thron, Farhad Jafari

Detecting and Correcting Sample-by-Sample Scale Distortion in RNA Sequencing Data

Transforms based on local averaging reduce variance and lift subpopulation test sensitivity by 3-5 percent in simulations.

RNA sequencing (RNA-seq) is the conventional genome-scale approach used to capture the expression levels of all detectable genes in a biological sample. It is now regularly used for population-based studies designed to identify genetic determinants of various diseases. Naturally, the accuracy of these tests should be verified and improved if possible. In this study, we aimed to detect and correct for expression-level-dependent errors that vary from sample to sample and are not corrected by conventional normalization techniques. We examined several RNA-seq datasets from the Cancer Genome Atlas (TCGA), Stand Up 2 Cancer (SU2C), and GTEx databases with various types of preprocessing. By applying local averaging, we found sample-by-sample expression-level-dependent biases in all datasets studied. Using simulations, we show that these biases corrupt gene-gene correlation estimates and $t$ tests between subpopulations. To mitigate these biases, we introduce two different nonlinear transforms based on statistical considerations. We demonstrate that these transforms effectively remove the observed per-sample biases, reduce sample-to-sample variance, and improve the characteristics of gene-gene correlation distributions. Using a novel simulation methodology that creates controlled differences between subpopulations, we show that these transforms reduce variability and increase the sensitivity of two-population tests. The improvements in sensitivity and specificity were on the order of 3-5% in most instances after the data were corrected for bias. Altogether, these results improve our capacity to understand gene-gene relationships, and may lead to novel ways to utilize the information derived from clinical tests.
q-bio.GN 2026-05-11 2 theorems

Protein embeddings classify bacterial operons at 0.71 ROC-AUC

by Akarsh Gupta, Kenneth Rodrigues +1 more

SCOPE: Siamese Contrastive Operon Pair Embeddings for Functional Sequence Representation and Classification

Siamese MLP on pre-trained models matches top DGEB entries while outperforming physicochemical baselines for scalable microbial genome work.

Identifying operons is a fundamental step in understanding prokaryotic gene regulation, as classifying genes into operons supports the reconstruction of regulatory networks, functional annotation of unannotated genes, and drug candidate development. Experimental approaches such as RT-PCR and RNA-seq provide precise evidence of operon structure, but are laborious and largely limited to well-studied model organisms, making scalable computational methods essential for genome-wide operon identification. Prior computational approaches have employed traditional classifiers such as logistic regression and decision trees, motivating our use of these as physicochemical baselines. The DGEB benchmark evaluates operonic pair classification by embedding each sequence independently with a pre-trained protein language model and computing pairwise cosine similarity. In contrast, our Siamese MLP learns a classifier over the fused embedding space, which is theoretically better motivated for binary classification, as cosine similarity can yield meaningless scores depending on the regularization of the embedding model. While protein language model embeddings substantially outperform physicochemical features in ROC-AUC, a learned Siamese MLP head does not significantly improve over unsupervised cosine similarity in Average Precision, suggesting that the geometry of the embedding space already captures the functional relationships needed for this task. Nonetheless, our Siamese MLP achieves a ROC-AUC of 0.71, competitive with state-of-the-art models on the DGEB leaderboard. These findings indicate that protein language model embeddings are a viable, scalable foundation for operonic pair classification across diverse microbial genomes, with implications for automated genome annotation, regulatory network reconstruction, and characterization of organisms lacking experimental operon annotations.
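The unsupervised baseline described in this abstract (embed each sequence independently, then score a gene pair by cosine similarity) can be sketched in a few lines. The three-dimensional embeddings and labels below are hypothetical stand-ins for protein language model outputs, and the ROC-AUC helper uses the standard pairwise-ranking definition rather than any DGEB code.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def roc_auc(scores, labels):
    """Probability that a random positive pair outscores a random negative pair."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical (embedding_a, embedding_b, same_operon) triples.
pairs = [
    ([1.0, 0.0, 0.0], [0.9, 0.1, 0.0], 1),
    ([0.0, 1.0, 0.0], [0.0, 0.8, 0.2], 1),
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], 0),
    ([0.0, 0.0, 1.0], [0.7, 0.7, 0.0], 0),
]
scores = [cosine_similarity(a, b) for a, b, _ in pairs]
labels = [y for _, _, y in pairs]
auc = roc_auc(scores, labels)
```

A learned Siamese head would replace `cosine_similarity` with a trained classifier over the fused pair of embeddings. The evaluation step stays the same.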
cs.LG 2026-05-11 2 theorems

Expert fusion model lifts microbial operon accuracy

by Seungik Cho

MicroFuse: Protein-to-Genome Expert Fusion for Microbial Operon Reasoning

Largest gains appear exactly where protein sequence misleads but genomic layout resolves true regulation.

Predicting microbial operon co-membership requires integrating two complementary biological signals: protein-scale molecular identity and genome-context organization. While recent biological foundation models provide powerful representations of each view independently, naive concatenation of these modalities ignores a key biological property -- protein identity and genomic context may agree when adjacent genes form a coherent functional module, or conflict when sequence similarity is misleading but genomic layout indicates independent regulation. We present MicroFuse, a protein-to-genome expert fusion framework that integrates structure-aware protein representations from ProstT5 with genome-context representations from Bacformer through a four-expert Mixture-of-Experts module (protein, genome-context, agreement, and conflict experts) with a learned soft router. Training combines binary cross-entropy with symmetric cross-modal InfoNCE alignment and disagreement-weighted supervised contrastive shaping. We further construct OG-Operon100K, a 100,000-pair scaffold-level benchmark from the OMG metagenomic corpus with biologically grounded positive and negative criteria. On OG-Operon100K, MicroFuse achieves the strongest AUROC, AUPRC, mAP, and mAR among ProstT5-only, Bacformer-only, and Concat MLP baselines. Ablations identify cross-modal contrastive alignment as the dominant component, and a hard sequence-conflict subset reveals MicroFuse's largest gains precisely in biologically ambiguous cases where protein identity alone is misleading.
q-bio.GN 2026-05-08 2 theorems

Hybrid model lowers error in grapevine trait prediction across years

by Yibin Wang, Murukarthick Jayakodi +3 more

A Linear-Transformer Hybrid for SNP-Based Genotype-to-Phenotype Prediction in Grapevine

LiT-G2P blends additive genetic effects with Transformer nonlinear interactions to beat baselines on hair density and trichome density in 2-

Robust genotype-to-phenotype (G2P) prediction is essential for accelerating breeding decisions and genetic gain. However, it remains challenging to measure complex traits under variable field conditions and across years. In this study, we propose a linear-Transformer approach, LiT-G2P (Linear-Transformer Genotype-to-Phenotype), an automated predictive framework that integrates additive genetic variance effects with Transformer-based nonlinear interactions using genome-wide single-nucleotide polymorphism (SNP) data. We evaluated LiT-G2P on a panel of diverse grape accessions, genotyped with SNP markers and measured for phenotypes across two consecutive years. Target phenotypic traits include leaf hair density and trichome density of grapevines. Across both single-year and cross-year testing scenarios, LiT-G2P consistently improves prediction performance compared with baseline models. For hair density, LiT-G2P achieves the lowest error in both single-year and cross-year evaluations, with RMSEs of 0.469 and 0.454, respectively, while maintaining strong tolerance accuracies of 79.2% and 74.6%, respectively. For trichome density, LiT-G2P also presents the best overall G2P performance. In addition, we extract model-prioritized SNPs from attention weights and apply genotype-stratified analysis to provide interpretable candidate markers for downstream validation. These results demonstrate that integrating stable additive effects with learned interaction patterns can enhance cross-year robustness and support practical SNP-based predictive modeling for genomic selection.
cs.LG 2026-05-08

Logistic regression outperforms complex models on rare breast cancer subtypes

by Meena Al Hasani

Feature Dimensionality Outweighs Model Complexity in Breast Cancer Subtype Classification Using TCGA-BRCA Gene Expression Data

In TCGA-BRCA gene expression tests, feature dimensionality affects performance more than choosing a complex model like random forest or SVM.

Accurate classification of breast cancer subtypes from gene expression data is critical for diagnosis and treatment selection. However, such datasets are characterized by high dimensionality and limited sample size, posing challenges for machine learning models. In this study, we evaluate the impact of model complexity and feature selection on subtype classification performance using TCGA-BRCA gene expression data. Logistic regression, random forest, and support vector machine (SVM) models were trained using varying numbers of highly variable genes (50 to 20,518). Performance was evaluated using stratified 5-fold cross-validation and assessed with accuracy and macro F1 score. While all models achieved high accuracy, macro F1 analysis revealed substantial differences in subtype-level performance. Logistic regression demonstrated the most stable and balanced performance across subtypes, including improved detection of rare classes. Random forest underperformed on minority subtypes despite strong overall accuracy, while SVM showed sensitivity to feature dimensionality. These findings highlight the importance of model simplicity, evaluation metrics, and feature selection in high-dimensional biological classification tasks.
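The contrast between accuracy and macro F1 highlighted in this abstract is easy to see on a toy example. The class names and counts below are hypothetical and are not taken from TCGA-BRCA: a classifier that never predicts the rare class still reaches 90% accuracy, while its macro F1 drops below 0.5.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 score treating `label` as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no correct positive predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Hypothetical imbalanced labels: 90 common-subtype samples, 10 rare-subtype samples.
y_true = ["common"] * 90 + ["rare"] * 10
y_pred = ["common"] * 100  # classifier ignores the rare subtype entirely

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = macro_f1(y_true, y_pred)
```

Here `accuracy` is 0.9 and `macro` is about 0.474, because the rare class contributes an F1 of zero to the unweighted average.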
cs.AI 2026-05-08 2 theorems

AI agent outperforms physicians on rare disease diagnosis

by Tianyu Liu, Wangjie Zheng +13 more

A Versatile AI Agent for Rare Disease Diagnosis and Risk Gene Prioritization

Hygieia routes multi-modal genetic, phenotypic and clinical data to raise accuracy 12-60 percent while prioritizing risk genes and cutting 1

Accurate and timely diagnosis is essential for effective treatment, particularly in the context of rare diseases. However, current diagnostic workflows often lead to prolonged assessment times and low accuracy. To address these limitations, we introduce Hygieia, a multi-modal AI agent system designed to support precision disease diagnosis by integrating diverse data sources, including phenotypic features, genetic profiles, and clinical records. Hygieia features a router-based and knowledge-enhanced framework that mitigates hallucination and tailors diagnostic strategies to different disease categories. Notably, it prioritizes risk-related genomic factors for rare diseases and provides confidence scores to assist clinical decision-making. We conducted a comprehensive evaluation demonstrating that Hygieia achieves state-of-the-art performance across multiple diagnostic benchmarks. In collaboration with clinical experts from Yale School of Medicine and Duke-NUS Medical School, we further validated its practical utility by showing (1) Hygieia's superior diagnostic performance compared to physicians, with an improvement of 12% to 60%, and (2) its effectiveness in assisting clinicians with medical records for handling real-world cases. Our findings indicate that Hygieia not only enhances diagnostic accuracy and interpretability but also significantly reduces clinician workload, highlighting its potential as a valuable tool in clinical decision support systems.
q-bio.GN 2026-05-08 Recognition

Multimodal LLM reasons with omics numbers and language together

by Maciej Sypetkowski, Joanna Krawczyk +5 more

OmicsLM: A Multimodal Large Language Model for Multi-Sample Omics Reasoning

OmicsLM matches specialized models on predictions and leads on multi-sample questions from real GEO studies.

Interpreting transcriptomic data is one of the most common analytical tasks in modern biology. Yet most current models either consume expression profiles without producing natural-language biological explanations, or reason in language without direct access to quantitative omics measurements. We introduce OmicsLM, a multimodal LLM that connects quantitative omics profiles with natural-language biological tasks. OmicsLM represents each transcriptomic profile as a compact continuous representation within the LLM context. This interface preserves quantitative expression signal while allowing natural-language instructions, explicit gene mentions, and multiple interleaved biological samples to be processed together in one model context. We train OmicsLM on more than 5.5 million instruction-following examples spanning over 70 task types, combining continuous transcriptomic inputs, experimental data rendered through diverse language templates, and free-text biological knowledge and question-answering data. This mixture covers cell type annotation, perturbation prediction, clinical prediction, pathway reasoning, and open-ended biological question answering. Existing benchmarks evaluate either profile-level prediction or text-only biological QA, leaving language-guided, multi-sample reasoning over real expression profiles unmeasured. To close this gap, we introduce GEO-OmicsQA, a benchmark for multi-sample biological question answering built from real Gene Expression Omnibus (GEO) studies. We demonstrate that OmicsLM can use expression profiles directly and perform comparably to specialized omics models on profile-level tasks, while outperforming both omics-specialized models and general LLMs on language-guided biological reasoning over expression data.
cs.LG 2026-05-07 2 theorems

Causal GRN methods beat correlations only in clean data

by Miguel Fernandez-de-Retana, Ruben Sanchez-Corcuera +3 more

When Does Gene Regulatory Network Inference Break? A Controlled Diagnostic Study of Causal and Correlational Methods on Single-Cell Data

Isolated tests show dropout and confounders erase causal advantages over simple correlations in single-cell data

Despite theoretical advantages, causal methods for Gene Regulatory Network (GRN) inference from single-cell RNA-seq data consistently fail to match or outperform correlation-based baselines in many realistic benchmarks, a persistent puzzle which casts doubt on the value of causality for this task. We argue that existing benchmarks are insufficiently controlled to answer this question because they evaluate on real or semi-real data where multiple pathologies co-occur, confounding failure modes, and obscuring the specific conditions under which different inference methods excel or fail. To address this gap, we introduce a controlled diagnostic framework that isolates seven biologically motivated pathologies (dropout, latent confounders, cell-type mixing, feedback loops, network density, sample size, and pseudotime drift) and measure how six representative methods spanning three inference paradigms degrade as each pathology intensifies. Across 6,120 controlled experiments, we find that causal methods genuinely dominate in clean and structurally favorable regimes, but specific pathologies (notably dropout and latent confounders) selectively neutralize their advantages. We further introduce an error-type decomposition that reveals methods with similar aggregate accuracy commit qualitatively different errors. To probe whether single-pathology effects persist when multiple stressors co-occur, we perform an interaction sweep over the three most impactful pathologies and find that their joint effects are sub-additive, while also exposing density-conditional cross-overs invisible to single-dial analysis. Our findings offer a nuanced understanding of when and why different methods succeed or fail for GRN inference, providing actionable insights for method development and practical guidance for practitioners.
math.ST 2026-05-05

Moments of group functions computed from Fourier coefficients alone

by Matthew A. Herman, Stephen Doro

Statistics of a multi-factor function from its Fourier transform

Each moment expands into products of exactly m coefficients whose indices sum to zero, acting as a natural filter on contributing terms.

For a phenomenon $\boldsymbol{f}$ that is a function of $n$ factors, defined on a finite abelian group $G$, we derive its population statistics solely from its Fourier transform $\hat{\boldsymbol{f}}$. Our main result is an $m$-Coefficient/Index Annihilation Theorem: the $m$th moment of $\boldsymbol{f}$ becomes a series of terms, each with precisely $m$ Fourier coefficients --- and surprisingly, the coefficient indices in each term sum to zero under group addition. This condition acts like a filter, limiting which terms appear in the Fourier domain, and can reveal deeper relationships between the variables driving $\boldsymbol{f}$. These techniques can also be used as an analytical/design tool, or as a feasibility constraint in search algorithms. For functions defined on $\mathbb{Z}_2^n$, we show how the skew, kurtosis, etc. of a binomial distribution can be derived from the Fourier domain. Several other examples are presented.
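The index-sum-to-zero condition in this abstract can be checked numerically on a small case. The sketch below assumes $G = \mathbb{Z}_2^3$ (so group addition is bitwise XOR), raw moments (the population mean of $f^m$), and the normalization $\hat{f}(k) = \frac{1}{|G|}\sum_x f(x)(-1)^{k \cdot x}$. These conventions and all function names are illustrative assumptions and may differ from the paper's.

```python
import itertools
import numpy as np

def fourier_coefficients(f):
    """Walsh-Hadamard coefficients on Z_2^n with 1/|G| normalization."""
    size = len(f)
    coeffs = np.empty(size)
    for k in range(size):
        # Character value (-1)^(k . x), with k and x encoded as bit strings.
        signs = np.array([(-1) ** bin(k & x).count("1") for x in range(size)])
        coeffs[k] = np.mean(f * signs)
    return coeffs

def moment_from_coefficients(coeffs, m):
    """m-th raw moment as a sum of products of exactly m coefficients
    whose indices XOR to zero."""
    size = len(coeffs)
    total = 0.0
    for free in itertools.product(range(size), repeat=m - 1):
        last = 0
        for k in free:
            last ^= k  # the final index is forced by the sum-to-zero condition
        term = coeffs[last]
        for k in free:
            term *= coeffs[k]
        total += term
    return total

rng = np.random.default_rng(0)
f = rng.normal(size=8)  # arbitrary function on Z_2^3
fhat = fourier_coefficients(f)

direct_moments = [float(np.mean(f ** m)) for m in (1, 2, 3, 4)]
fourier_moments = [moment_from_coefficients(fhat, m) for m in (1, 2, 3, 4)]
```

For $m = 1$ the only admissible term is $\hat{f}(0)$, and for $m = 2$ the sum reduces to Parseval's identity. The two lists should agree to floating-point precision for every $m$.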
q-bio.GN 2026-05-05

Transformer learns directional gene-program influences from unperturbed single-cell data

by Yuechen Wang, Lina Jia +2 more

ORBIT: Learning Gene Program Co-Activation Structure for Cell-Type-Stratified Pathway Rewiring Analysis in Single-Cell Transcriptomics

Intervention-consistent training on observational RNA-seq recovers Alzheimer's rewiring and classifies cell types nearly as well as the full

Gene programs co-activate within cells, but existing single-cell methods either treat programs independently or require experimental perturbation data to model their interactions. We introduce ORBIT, a self-supervised transformer that learns asymmetric dependencies among gene programs from observational single-cell RNA-sequencing data alone, quantifying how strongly each program influences every other program. The key mechanism is an intervention-consistent training objective: the model learns each program's directional influence on every other program by predicting how the others change when that program is removed, yielding attention weights that reflect asymmetric influence rather than symmetric co-occurrence. Applied to 191,890 prefrontal cortex nuclei across three pathway vocabularies, ORBIT recovers co-activation structure consistent with established Alzheimer's disease vulnerability signatures, identifies cell-type-specific rewiring invisible to differential expression, and achieves 0.984 macro F1 on cell-type classification from 220 pathway scores, which is within 0.3 points of a state-of-the-art classifier using all 22,088 genes.
q-bio.GN 2026-05-04

Data fusion lifts migraine prediction AUC from 0.644 to 0.688

by Muhammad Muneeb, David B. Ascher

EFGPP: Exploratory framework for genotype-phenotype prediction

Framework integrates genotype features, covariates and risk scores from migraine and depression GWAS to beat single sources in 733 UK Biobank

Predicting complex human traits from genetic data is challenging because different genetic, clinical, and molecular data sources often contain different parts of the signal. Here, we present EFGPP, a reproducible framework for generating, ranking, and combining multiple types of data for genotype-to-phenotype prediction. We applied EFGPP to migraine prediction using UK Biobank data from 733 individuals. The framework combined genotype-derived features, principal components, clinical and metabolomic covariates, and polygenic risk scores generated from migraine and depression GWAS using PLINK, PRSice-2, AnnoPred, and LDAK-GWAS. The best single data type achieved a test AUC of 0.644, while combining multiple data types improved performance to 0.688 using migraine-focused inputs and 0.663 using cross-trait depression-derived inputs. Genetic features alone did not outperform the covariates-only baseline, but genotype-derived features performed better than PRS alone, and depression-derived PRS showed useful predictive signal. Overall, EFGPP provides a practical proof-of-concept framework for prioritising and integrating heterogeneous genetic data sources for complex phenotype prediction.
q-bio.GN 2026-05-04

Pipeline recovers 98.4% of known phenotype genes from 13 databases

by Muhammad Muneeb, David B. Ascher

PhenotypeToGeneDownloaderR: automated multi-source retrieval and validation of phenotype-associated genes

It validates 87.6% of symbols and supplies ready gene lists for risk scoring and variant interpretation.

Identifying phenotype-associated genes is a common first step in polygenic risk score construction, enrichment testing, target prioritisation and variant interpretation, but relevant evidence is distributed across heterogeneous databases with different interfaces, formats and evidence models. Here, we present PhenotypeToGeneDownloaderR, a phenotype-guided R/Python pipeline for automated gene retrieval, harmonisation, symbol validation and cross-source summary analysis. Given a phenotype term, the pipeline queries integrated biological databases, standardises per-source outputs, combines gene lists, validates retrieved symbols against the NCBI human gene reference and generates summary tables and visualisations. Across 13 clinically relevant phenotypes and 13 databases, PhenotypeToGeneDownloaderR generated 136,487 raw gene retrievals, with at least one source returning genes for every phenotype. Across all 13 phenotypes, 100,175 of 114,345 combined input symbols were retained after direct or synonym-based validation, corresponding to an 87.6% validation rate. Cross-source overlap was low, supporting the complementarity of integrated evidence sources. Against an HPO/ClinVar/OMIM-derived gold standard, the pipeline recovered 1,039 of 1,056 known phenotype-associated genes, corresponding to 98.4% recall. PhenotypeToGeneDownloaderR provides a lightweight, reproducible upstream framework for generating candidate gene sets for downstream prioritisation and interpretation. The pipeline is implemented in R and Python, released under the MIT licence, and available at https://github.com/MuhammadMuneeb007/PhenotypeToGeneDownloaderR.
cs.LG 2026-05-04

USB models cell birth and death as discrete single-cell jumps

by Junda Ying, Yuxuan Wang +3 more

Beyond Continuity: Simulation-free Reconstruction of Discrete Branching Dynamics from Single-cell Snapshots

Solves the branching Schrödinger bridge problem without simulations to reconstruct trajectories while capturing proliferation and apoptosis

Inferring cellular trajectories from destructive snapshots is complicated by the challenges of stochasticity and non-conservative mass dynamics such as cell proliferation and apoptosis. Existing unbalanced Optimal Transport (OT) methods treat mass as a continuous fluid, performing inference at the population level. However, this macroscopic view often fails to capture the discrete, jump-like nature of birth-death events at single-cell resolution, which is essential for understanding lineage branching and fate decisions. We present Unbalanced Schrödinger Bridge (USB), a simulation-free framework for learning underlying dynamics that integrates both stochastic and unbalanced effects and also models the discrete, jump-like birth-death dynamics at single-cell resolution. Theoretically, USB provides a tractable solution to the Branching Schrödinger Bridge (BSB) problem, offering a rigorous microscopic interpretation where individual cells undergo both Brownian motion and discrete birth-death jumps. Technically, the method implements an efficient solver by introducing a simulation-free training objective that effectively scales to high-dimensional omics data. Empirically, we demonstrate on both simulated and real-world datasets that USB not only achieves trajectory reconstruction performance better than or comparable to deterministic baselines but also uniquely enables realistic discrete simulation of birth-death dynamics at single-cell resolution.
q-bio.GN 2026-05-01

MCMC steering improves single-cell perturbation predictions

by Andac Demir, Erik W. Anderson +2 more

CellxPert: Inference-Time MCMC Steering of a Multi-Omics Single-Cell Foundation Model for In-Silico Perturbation

A Metropolis-Hastings sampler using masked conditionals avoids out-of-distribution artifacts when forecasting gene-knockout responses.

In this work, we introduce CellxPert, a scalable multimodal foundation model that unifies single-cell and spatial multi-omics within a common representation space. CellxPert jointly encodes transcriptomic (scRNA-seq), chromatin-accessibility (ATAC-seq), and surface-proteomic (CITE-seq) measurements, while directly incorporating MERFISH and imaging mass-cytometry data as 2D or 3D spatial-visual layers. CellxPert facilitates four key downstream tasks out of the box: (i) cell-type annotation across a broad ontology of 154 largely overlapping identities -- the largest label space addressed to date and a stringent test of fine-grained discrimination, (ii) efficient fine-tuning using Low Rank Adaptation (LoRA), (iii) genome-wide transcriptomic response prediction to in-silico perturbations (ISP), and (iv) seamless multi-omic integration across various assays and platforms. Unlike current single-cell foundation models, which approximate gene perturbations by deleting or reordering tokenized gene expression ranks, CellxPert employs a Metropolis-Hastings sampler whose proposal kernel uses the model's masked conditional distributions to transition to new transcriptomic states conditioned on the perturbed genes. This Markov-chain procedure mitigates out-of-distribution artifacts introduced by abrupt token manipulation and produces trajectories that are biologically interpretable. Evaluations on PBMC68K, Replogle Perturb-seq, Systema, and BMMC benchmarks show that CellxPert surpasses classical and state-of-the-art baselines in cell-type annotation, perturbation response prediction, and multi-omic integration.
q-bio.GN 2026-05-01

Fused signals and conformal calibration certify zero-miss DNA hazard screening under new-f

by Najmul Hasan

CRC-Screen: Certified DNA-Synthesis Hazard Screening Under Taxonomic Shift

Three public annotation signals combined and threshold-calibrated on leave-one-family-out folds bound expected miss rate at 5 percent with 0

DNA-synthesis providers screen incoming orders by searching the requested sequence against curated hazard lists. We show that this baseline collapses to a 100% false-flag rate when the hazardous sequence comes from a taxonomic family absent from the reference set: under Conformal Risk Control's certified miss-rate constraint, a low-discrimination signal forces the threshold below the entire test-benign mass. We compose three signals derived from a synthesis order's public annotation: $k$-mer Jaccard similarity to known toxins, the trimmed-mean score of a five-LLM judge panel, and cosine similarity to clustered embedding centroids. Fused under a monotone logistic aggregator and calibrated by Conformal Risk Control, the resulting screener certifies $\mathbb{E}[\mathrm{FNR}] \le \alpha$. Across ten leave-one-taxonomic-family-out folds at $\alpha=0.05$ on UniProt KW-0800 reviewed toxins, the calibrated screener achieves 0% test miss rate on every fold and 0% test false-flag rate on nine of ten folds. The bound's finite-sample slack $1/(n_{\mathrm{cal}}+1)$ caps the certifiable miss rate at 1.77% on our 200-hazard subsample; reaching procurement-grade $\alpha=10^{-3}$ requires an $18\times$ larger calibration set, which the full reviewed UniProt KW-0800 corpus is large enough to deliver. The binding constraint on certifiable DNA-synthesis screening is calibration data, not algorithms. Code: https://github.com/najmulhasan-code/crc-screen
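The calibration-size argument can be checked with a few lines of arithmetic. The sketch below uses only the slack term $1/(n_{\mathrm{cal}}+1)$ quoted in the abstract; the function names are hypothetical, and the calibration size of roughly 56 is inferred here from the 1.77% figure rather than stated by the authors.

```python
import math


def min_certifiable_alpha(n_cal):
    """Smallest miss-rate level the slack term 1/(n_cal + 1) allows."""
    return 1.0 / (n_cal + 1)


def calibration_size_for(alpha):
    """Smallest n_cal with 1/(n_cal + 1) <= alpha.

    The small tolerance guards against floating-point round-off in 1/alpha.
    """
    return math.ceil(1.0 / alpha - 1.0 - 1e-9)


implied_n_cal = calibration_size_for(0.0177)  # inferred from the 1.77% cap
target_n_cal = calibration_size_for(1e-3)     # procurement-grade alpha
scale_up = target_n_cal / implied_n_cal       # compare with the quoted 18x
```

Under this reading, the ratio `scale_up` comes out close to the $18\times$ increase quoted in the abstract, which is consistent with calibration data, rather than the algorithm, being the limiting factor.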
cs.LG 2026-04-30

HyCNNs approximate quadratics with exponentially fewer parameters

by Shayan Hundrieser, Insung Kong +1 more

Hyper Input Convex Neural Networks for Shape Constrained Learning and Optimal Transport

Maxout units added to input-convex networks cut parameter needs while raising accuracy on regression and transport map tasks.

We introduce Hyper Input Convex Neural Networks (HyCNNs), a novel neural network architecture designed for learning convex functions. HyCNNs combine the principles of Maxout networks with input convex neural networks (ICNNs) to create a neural network that is always convex in the input, theoretically capable of leveraging depth, and performs reliably when trained at scale compared to ICNNs. Concretely, we prove that HyCNNs require exponentially fewer parameters than ICNNs to approximate quadratic functions up to a given precision. Through a series of synthetic experiments, we demonstrate that HyCNNs outperform existing ICNNs and MLPs in terms of predictive performance for convex regression and interpolation tasks. We further apply HyCNNs to learn high-dimensional optimal transport maps for synthetic examples and for single-cell RNA sequencing data, where they oftentimes outperform ICNN-based neural optimal transport methods and other baselines across a wide range of settings.
q-bio.GN 2026-04-29

A generalized gene co-expression method applied to AMD RNA-Seq data identifies stable…

by Brayan Gutierrez, Rinki Ratnapriya +1 more

Robust Clustering Analysis of Genes Related to Age-related Macular Degeneration using RNA-Seq

Enhanced clustering with stability checks recovers known AMD genes and surfaces fresh candidates for mechanism and therapy research.

Identifying genes associated with diseases is crucial to understanding disease mechanisms and developing therapies. However, identification of individual genes associated with a disease often needs to be supplemented with clustering analysis to understand the relationships between genes and identify gene modules beyond individual gene-level relationships. Gene co-expression networks are widely used as a graph theoretic approach to the clustering analysis of genes. In our work, we perform robust clustering analysis on RNA-Seq data of Age-related Macular Degeneration (AMD) patients and controls by generalizing one such framework, Multiscale Embedded Gene Co-Expression Network Analysis (MEGENA). We propose a carefully curated set of module quality evaluation metrics to choose appropriate statistical distance-based or information theoretic similarity measures over simple linear correlation to represent the similarities between genes. Furthermore, we design and implement a stability test to ensure the robustness of the detected hub genes in the presence of noise. Finally, we propose differential module eigengene analysis for a deeper understanding of upregulation and downregulation of each module with respect to the disease and control groups for a comprehensive understanding of the clustering analysis. Besides detecting robust hub genes and modules that are supported by prior findings, we also identify previously undiscovered hub genes that can potentially lead to further biomedical research into understanding the AMD disease mechanism and developing new treatments.
math.OC 2026-04-29

Metaheuristic selects reactions to fit GEMs on 9-28 media

by Philip Kilby, Sevvandi Kandanaarachchi +5 more

A Combinatorial Optimisation Approach to Multi-factorial Gap-filling in Genome-scale Metabolic Models (GEMs)

Using only continuous linear programs, the method improves Kendall Tau by 7.3% and RMS error by 13.3% over sequential single-medium gap-fll

Genome-Scale Metabolic Models (GEMs) describe the interactions between genes, proteins, and the biochemical reactions that underpin an organism's metabolism, aiming to computationally simulate functions at the cellular level. While many metabolic reactions can be inferred from genome analysis, constructing GEMs often involves incorporating reactions unsupported by genomic data to improve prediction accuracy. This is known as gap-filling, a process that can be performed manually (a time-consuming task) or computationally. Traditional computational gap-filling approaches aim to correct GEM predictions for a single environmental condition (medium) by solving a large Integer Linear Programming problem. Sequential application across multiple media can produce a more robust model, but often introduces unrealistic predictions in other media. Such approaches are also slow to run. In this paper, we study multi-factorial gap-filling, which aims to gap-fill GEMs across typically 10 or more input media simultaneously, while improving their overall predictive accuracy and minimising unrealistic behaviour. We view the selection of the set of reactions as a combinatorial optimisation problem, and describe a method based on classic metaheuristic approaches which requires the solution of continuous Linear Programming problems only. This paper provides an introduction of this problem to an audience whose speciality lies outside biology, and suggests a practical first-cut solution method. We demonstrate the method by gap-filling GEMs for three bacterial strains, selecting 3000 to 4000 reactions from a database of more than 11000 reactions, while attempting to match the empirically measured performance on 9 to 28 separate media conditions. We show that our method outperforms conventional approaches on multiple metrics, including Kendall Tau and RMS Error, by an average of 7.3% and 13.3%, respectively.
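The two evaluation metrics named in the abstract can be sketched as follows. This is an illustrative reading, not the authors' evaluation code: the tau-a variant (no tie correction) is an assumption, since the abstract does not say which variant is used, and the growth values are made-up numbers for a hypothetical set of five media.

```python
import math
from itertools import combinations


def kendall_tau_a(predicted, measured):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    pairs = list(combinations(range(len(predicted)), 2))
    score = 0
    for i, j in pairs:
        s = (predicted[i] - predicted[j]) * (measured[i] - measured[j])
        score += (s > 0) - (s < 0)  # +1 concordant, -1 discordant, 0 tied
    return score / len(pairs)


def rms_error(predicted, measured):
    """Root-mean-square difference between predictions and measurements."""
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical growth rates on five media (illustrative values only).
measured = [0.10, 0.35, 0.50, 0.80, 0.95]
predicted = [0.15, 0.30, 0.60, 0.70, 0.90]

tau = kendall_tau_a(predicted, measured)
rmse = rms_error(predicted, measured)
```

Kendall tau rewards getting the ranking of media right, while RMS error penalises the size of each miss, so a gap-filled model can improve on one without improving on the other.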
q-bio.GN 2026-04-29

Over 1000 TCR sequences flag long COVID vs recovery

by Zachary Montague, Rhea M Grover +4 more

T-cell repertoire response in individuals with post-acute sequelae of COVID-19

Repertoire analysis in 120 patients links motifs and clone changes to persistent symptoms after SARS-CoV-2 infection.

T-cells are central to SARS-CoV-2 clearance and immunological memory, yet their contribution to the persistence of post-acute sequelae of COVID-19 (PASC) remains poorly understood. The immunological features that distinguish individuals who develop PASC from those who recover fully are unresolved, in part due to the phenotypic heterogeneity of the condition and the likely multiplicity of its underlying mechanisms. Here, we profiled longitudinal bulk TCR$\beta$ repertoires from 120 individuals in the INCOV cohort--71 with PASC and 49 without--sampled at two to three time points spanning the acute and post-acute phases of infection. Using robust statistical modeling of repertoire composition and clonal dynamics, we found that global statistics such as V and J gene usage and CDR3 length do not differ between groups, but that locally enriched sequence motifs and differentially dynamic clones reveal distinct T-cell signatures associated with PASC status. Clones contracting following the peak of the acute response were significantly enriched for SARS-CoV-2 specificity in both groups. Interestingly, Influenza A-specific TCRs were disproportionately enriched among contracting clones in PASC$^+$ repertoires, implicating viral co-infection as a potential contributor to early disease severity and, possibly, PASC pathogenesis. Rare public TCR clones were markedly enriched for SARS-CoV-2 specificity, with PASC$^+$ individuals harboring a modestly but significantly higher proportion than PASC$^-$ individuals. Together, we identified over 1,000 candidate TCR$\beta$ receptors potentially discriminating PASC$^+$ from PASC$^-$ immune responses, opening a path toward the identification of disease-relevant T-cell specificities and the development of T-cell-based immunological biomarkers for long COVID.
cs.LG 2026-04-28

Frozen confidence scores boost multi-omics cancer subtyping

by Boyang Fan, Hengchuang Yin +6 more

CMGL: Confidence-guided Multi-omics Graph Learning for Cancer Subtype Classification

Per-patient reliability estimates guide graph building, delivering 4 percent accuracy gains and enabling transfer between cancers.

Motivation: Multi-omics integration can improve cancer subtyping, but modality informativeness and noise vary across cancer types and patients. Existing graph-based methods optimize modality weights jointly with the classification objective and therefore lack independent reliability estimates, so low-quality omics distort patient similarity graphs and amplify noise through message passing. Results: We propose CMGL, a two-stage framework that estimates per-sample modality reliability through evidential deep learning and uses the frozen confidence scores to guide cross-omics fusion and graph construction. On four MLOmics cancer-subtype tasks and the 32-class pan-cancer task, CMGL consistently improves over the strongest baseline, surpassing it by 4.03% in average accuracy on the four single-cancer tasks. Its representations recover the PAM50 intrinsic subtypes of breast invasive carcinoma (BRCA), and the BRCA-trained model transfers without fine-tuning to kidney renal clear cell carcinoma (KIRC), stratifying patients into prognostically distinct groups.
q-bio.GN 2026-04-27

Radiomic features distinguish molecular subtypes in tongue cancer

by Hao Pan, Peipei Wang +10 more

Imaging Exploration of Molecular Subtypes in Tongue Squamous Cell Carcinoma

Ten wavelet texture measures from preoperative scans align with transcriptomic clusters that differ in immune and differentiation pathways.

Tongue squamous cell carcinoma (TSCC) is an aggressive malignancy with marked biological heterogeneity and variable clinical outcomes. Although molecular profiling has improved understanding of TSCC heterogeneity, its clinical use remains constrained by invasive tissue sampling and limited representation of whole-tumor spatial complexity. Meanwhile, most radiomics studies in TSCC have focused on downstream clinical endpoints, and whether imaging can non-invasively reflect intrinsic molecular subtypes remains unclear. In this study, an integrated transcriptomic-radiomics framework was used to investigate the relationship between preoperative imaging phenotypes and molecular subtypes in TSCC. Transcriptomic data from 60 TSCC cases in The Cancer Genome Atlas were analyzed using unsupervised consensus clustering, followed by differential expression and functional enrichment analyses. Matched preoperative imaging data from The Cancer Imaging Archive were manually annotated for primary tumor regions, and radiomic features were extracted using PyRadiomics; group differences were assessed with the U-test. Two stable molecular subtypes, C1 and C2, were identified. Their biological differences were mainly associated with squamous epithelial differentiation, inflammatory signaling, and lipid metabolism, with C2 showing greater enrichment of immune-related pathways. In addition, 10 radiomic features differed significantly between the two subtypes, mainly wavelet-derived texture features from gray-level size zone, dependence, co-occurrence, and run length matrices (P=0.00202-0.0162). These findings support the potential of radiomics as a non-invasive approach for characterizing molecular heterogeneity in TSCC and provide an initial radiogenomic framework for biologically informed preoperative assessment.
q-bio.GN 2026-04-27

Cathaya genome links defense gene loss to slow growth and symbiosis

by Yun Wang, Peng Xie +13 more

The Cathaya argyrophylla Genome Reveals the Evolutionary Trade-offs of a Living Fossil

Contractions in immunity pathways and transport expansions explain the living fossil's vulnerabilities and microbial dependence.

Cathaya argyrophylla is an endangered paleoendemic gymnosperm characterized by restricted ecological adaptability and high pathogen susceptibility. To elucidate its genomic architecture and evolutionary history, a de novo chromosome-level genome assembly was constructed using PacBio High-Fidelity long reads and Hi-C scaffolding. The resulting 22.73 Gb assembly resolves into 12 pseudochromosomes, demonstrating genome gigantism driven primarily by a 72.92 percent repeat sequence content and extensive intron expansion. Phylogenomic analysis using single-copy orthologs identifies C. argyrophylla as a sister lineage to the Pinus clade, with an estimated divergence time of 102.8 million years ago. Analysis of gene family dynamics reveals significant expansions in pathways related to membrane lipid metabolism, transmembrane transport, and translation machinery, indicating specific molecular adaptations for cellular homeostasis in resource-limited environments. Conversely, the genome exhibits massive contractions in endogenous defense networks, including plant-pathogen interactions, brassinosteroid signaling, and DNA repair mechanisms. This distinct genomic reduction correlates directly with the slow growth rate and weak innate immunity observed in the species, while the expanded transmembrane transport networks suggest an obligate physiological reliance on symbiotic microbiomes for survival. Ultimately, this reference genome establishes a critical molecular resource for future conservation and breeding programs.
q-bio.GN 2026-04-24

Supregraphs capture full read information in assembly graphs

by Anton Bankevich

Supregraph: Enabling Information-Optimal Assembly Graph Representation of a Read Set

The new graph type avoids data loss and forced breaks that plague existing methods, supporting optimal assemblies under natural assumptions.

The first step in any genome assembly algorithm entails the conversion from the domain of strings and overlaps to the language of graphs and paths, typically using one of the two conventional methods: de Bruijn graphs or overlap graphs. However, both standard approaches are known to have limitations. De Bruijn graphs fail to represent complete information from reads, while the overlap graphs often produce artificial breaks in contigs due to the necessity to discard contained reads as a preliminary step. In this work we present a mathematical model for genome assembly that provides a formal framework to determine what constitutes a correct conversion of a read set into an assembly graph under the assumption of error-free reads. We prove that a correct representation of a read set exists in the form of a new class of assembly graphs, which we call supregraphs. We show that supregraphs can be constructed by iteratively transforming de Bruijn graphs using the multiplexing procedure, previously employed in the genome assemblers LJA and Verkko. Finally, we demonstrate that, under a set of natural assumptions, supregraphs provide a foundation for constructing theoretically optimal genome assemblies.
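The information loss attributed to de Bruijn graphs can be seen in a small example. The sketch below is a standard textbook construction, not the supregraph or multiplexing procedure from the paper, and the reads are made up for illustration: two different error-free read sets that share the same k-mers collapse to the same graph, so the graph alone cannot say which reads were observed.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # prefix node -> suffix node
    return dict(graph)


# Two hypothetical read sets that disagree on how the shared "TGC" is flanked.
reads_a = ["ATGCA", "CTGCG"]
reads_b = ["ATGCG", "CTGCA"]

graph_a = de_bruijn(reads_a, k=3)
graph_b = de_bruijn(reads_b, k=3)
same_graph = graph_a == graph_b  # True: read-level linkage is not recorded
```

Both read sets span the shared k-mer `TGC`, yet the graph keeps only the branching at `GC` and discards which entry pairs with which exit. That linkage is the read information an information-optimal representation would need to retain.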
cs.DC 2026-04-23

GPU runs 20,000 GWAS phenotypes in 20 minutes

by Xingzhong Zhao, Ziqian Xie +6 more

TorchGWAS: GPU-accelerated GWAS for thousands of quantitative phenotypes

TorchGWAS reuses the genotype matrix on one A100 to deliver 300- to 1700-fold higher throughput than CPU tools.

Motivation: Modern bioinformatics workflows, particularly in imaging and representation learning, can generate thousands to tens of thousands of quantitative phenotypes from a single cohort. In such settings, running genome-wide association analyses trait by trait rapidly becomes a computational bottleneck. While established GWAS tools are highly effective for individual traits, they are not optimized for phenotype-rich screening workflows in which the same genotype matrix is reused across a large phenotype panel. Results: We present TorchGWAS, a framework for high-throughput association testing of large phenotype panels through hardware acceleration. The current public release provides stable Python and command-line workflows for linear GWAS and multivariate phenotype screening, supports NumPy, PLINK, and BGEN genotype inputs, aligns phenotype and covariate tables by sample identifier, and performs covariate adjustment internally. In a benchmark with 8.9 million markers and 23,000 samples, fastGWA required approximately 100 seconds per phenotype on an AMD EPYC 7763 64-core CPU, whereas TorchGWAS completed 2,048 phenotypes in 10 minutes and 20,480 phenotypes in 20 minutes on a single NVIDIA A100 GPU, corresponding to an approximately 300- to 1700-fold increase in phenotype throughput. TorchGWAS therefore makes large-scale GWAS screening practical in phenotype-rich settings where thousands of quantitative traits must be evaluated efficiently. Availability and implementation: TorchGWAS is implemented in Python and distributed as a documented source repository at https://github.com/ZhiGroup/TorchGWAS. The current release provides a command-line interface, packaged source code, tutorials, benchmark scripts, and example workflows.
q-bio.GN 2026-04-23

Tree-guided diffusion creates cell-specific DNA regulators

by Animesh Awasthi, Raphael Bednarsky +2 more

Conditional Monte Carlo Tree Diffusion for Designing Cell-Type-Specific and Biologically Faithful Regulatory DNA

The approach beats diffusion, autoregressive, and optimization baselines on specificity and natural sequence fidelity for human cell lines.

Designing regulatory DNA elements with precise cell-type-specific activity is broadly relevant for cell engineering and gene therapy. Deep generative models can generate functional gene-regulatory elements, but existing methods struggle to achieve high specificity against undesired cell types while adhering to the genome's natural regulatory grammar. Here, we introduce DNA-CRAFT, a generative framework that integrates class-conditioned discrete diffusion with Monte Carlo tree search to design cell-type-specific and biologically faithful regulatory elements. We first train a discrete diffusion model on the ENCODE registry of 3.2 million candidate regulatory elements. Second, we condition the model to learn class-specific regulatory grammars of naturally occurring DNA sequences, including enhancers and promoters. Third, we employ conditional Monte Carlo tree guidance, an inference-time alignment algorithm designed to maximize the differential regulatory activity between desired and undesired cell types. By benchmarking DNA-CRAFT on regulatory sequence design tasks for human cell lines and immune cell types, we demonstrate that our model generates sequences with high predicted cell-type-specific activity and biological fidelity, achieving the best trade-offs compared to methods that use diffusion, autoregressive models, and gradient-based optimization.