Low dimension suffices for near-max retrieval margins
Is Dimensionality a Barrier for Retrieval Models?
Dimension O(m^{-2} log n) nearly matches the infinite-dimension margin for any relevance matrix A.
Information Retrieval
Covers indexing, dictionaries, retrieval, content and analysis. Roughly includes material in ACM Subject Classes H.3.0, H.3.1, H.3.2, H.3.3, and H.3.4.
By routing tasks to specialized agents and verifying against physics models, the system generates auditable plans that simulations indicate
Bringing Agentic Search to Earth Observation Data Discovery
Zero-shot LLM stage added to neural-BM25 fusion improves retrieval without extra training on NASA EO queries.
HNSW with Accuracy Guarantees Using Graph Spanners -- A Technical Report
A lightweight statistical check triggers exact fallback only when the heuristic result may be unreliable, preserving average speed.
Planning over Matrix-Factorization MDPs for Candidate Generation
One lookahead step on matrix-factorization state updates improves recommendations on multiple datasets without retraining.
Evaluating Chunking Strategies for Retrieval-Augmented Generation on Academic Texts
Tests find fixed-size and recursive splitting match or exceed semantic clustering when RAGAs scores answers from academic documents.
Real-world experiments show prior queries outperform aggregates and static profiles when inferring gender, age, category and size intent.
CoPersona: Collaborative Persona Graphs for Robust LLM Personalization
CoPersona decomposes histories and borrows peer signals at the facet level to handle uneven coverage and improve personalization.
Architecture optimization at two levels combined with zero-shot prompts aligns user preferences to item attributes on four datasets.
Diffusion-GR2: Diffusion Generative Reasoning Re-ranker
Three-stage training closes validity and distribution gaps so block-parallel decoding matches autoregressive performance on Amazon Beauty.
Trie-based Experiment Plans for Efficient IR Pipeline Experiments
A data structure shares early retrieval results across variant pipelines, cutting total evaluation time on MSMARCO v2 while keeping effectiv
MemSyco-Bench: Benchmarking Sycophancy in Agent Memory
The benchmark checks if agents over-align with users via memory instead of using facts or evidence.
As It Was: Aligning LLM Search Evaluation with Historical User Preferences
QRI cards from past interactions raise correlation with user preferences and A/B winners while cutting errors on ambiguous cases.
RACORN-1: Adaptive Recall-Preserving Speedup for Low-Selectivity Filtered Vector Search
RACORN-1 recovers 0.77-0.98 recall and cuts latency 9-26x versus HNSW by reusing filter-failing nodes as bridges.
At matched repair budget, triggering fixes on probe signal raises minimum recall@10 by 1.4 to 5 percent under bursty churn.
The diagnostic tracks whether gold answers survive into the packed reader context and shows when submodular packing raises accuracy at fixed
Multi-Turn Agentic Scientific Literature Search via Workflow Induction
PaperPilot induces editable DAGs of search operators to refine evolving queries across turns.
When RAG Meets Query Planning: Logical Query Trees for Resolving Exploratory Reasoning Problems
Decomposing problems into optimized trees with dynamic programming enables parallel execution that outperforms iteration and graph-based met
Real-Time Hard Negative Sampling via LLM-based Clustering for Large-Scale Two-Tower Retrieval
The method draws challenging negatives from media clusters in real time and beats standard batch sampling at large scale.
Attribute-Prompted Kernel Hashing for Unsupervised Data-Efficient Cross-Modal Retrieval
Prompt optimization and kernel smoothing align modalities in Hamming space, generalizing from few seen pairs to unseen categories.
Learning to Compose: Revisiting Proxy Task Design for Zero-Shot Composed Image Retrieval
Splitting focus on changes then semantic completion improves zero-shot retrieval of modified images on benchmarks without labeled triplets.
Black-box attacks identify the model from unordered document sets even when a reranker is present.
Violations of answer support, question clarity, and visual requirements distort model rankings and overstate capabilities.
Semantic ID training, distilled reasoning, and conditional-reward RL target the final stage of recommendation systems.
An Open-Source Tool for Reproducible Freeway Network Extraction from OpenStreetMap
Extracts from OpenStreetMap, resolves interchange and ramp issues, and validates against imagery across 360 miles in California.
ShopX: A Foundation Model for Intent-to-Item Fulfillment in Agentic Shopping
ShopX removes separate retrieval steps and improves handling of complex requests in agentic workflows.
Unsupervised Data-Efficient Cross-Modal Retrieval with Global-Neighborhood Alignment Hashing
GNAH uses global and stochastic neighborhood alignment to keep retrieval performance while cutting required image-text pairs.
Co-occurrence clustering improves coverage by 17 points and shrinks the knowledge base to one fifth its size.
Analysis of 26,000 articles finds a move toward empirical methods and user-centered topics, with changing method-topic links.
Building a Multimodal Dataset of Academic Paper for Keyword Extraction
Experiments on a new 1000-paper dataset show gains when models receive text from all three channels together rather than the document text a
Entity-combination analysis shows mixed academic-industrial groups lead on method-metric novelty while industrial groups lead on method-tool
GenPage: Towards End-to-End Generative Homepage Construction at Netflix
The model replaces the multi-stage recommender, raises the core metric 0.24 percent, and cuts latency 20 percent in live tests.
AGE: Adaptive-masking for Graph Embedding in Graph Retrieval-Augmented Generation
AGE trains a Transformer to predict surrounding nodes while skipping dominant ones, closing the gap with text encoders and lifting accuracy
Towards Critical IR Theories and Practices
Nondomination as explicit goal supplies the theoretical base Belkin and Robertson urged for limiting contradictory research.
Information Terra: A Narrative-Anchored Semantic-First Projection of Document Embeddings
Latitude tracks geodesic progress between endpoints, while longitude shows thematic deviation in the embedding space.
Standard fine-tuning makes retrieval depend on serialization order; PI-FT binds meaning to labels instead and works on a new 15-language ben
ENC-ODE: Event-level Neurodegenerative Modeling in Continuous Time with Neural ODEs
Diagnosis-conditioned dynamics and target attention aggregate full event history to improve predictions on sparse ADNI data.
Research Entity Extraction and Topic Detection from UKRI Grant Proposals
Outperforms the DSIT-Taxonomies pipeline on 42 proposals and offers a secure route to scan funding data for emerging areas.
Query-Aware Spreading Activation for Multi-Hop Retrieval over Knowledge Graphs
Fixed-step propagation inside the database matches complex solvers on multi-hop QA while cutting latency and memory use.
Efficient Retrieval-Augmented Generation via Token Co-occurrence Graphs
TIGRAG builds graphs from sliding-window statistics and uses iterative entity expansion to beat dense and LLM-graph baselines on retrieval a
Behind the Content: Wikipedia Mobile Views and Tourism Activity
Pageviews from phones on city pages align with hotel bookings and attraction attendance in France while desktop traffic does not.
From Extraction to Navigation: Progressive Retrieval with Indirectly Infinite Depth
Goal-aware navigation and recursive state evolution overcome search drift in billion-scale recommenders while keeping precision high.
Know Before You Fetch: Calibrated Retrieval-Budget Allocation for Retrieval-Augmented Generation
Out-of-fold calibration turns log-probabilities into correctness estimates that improve graded decisions on TriviaQA, NQ and MS MARCO.
Diagnosing and Mitigating Retrieval Bottlenecks in LLM-Based Cold-Start Recommendation
Standard retrievers include the target in only 5-23% of realistic pools, and LLMs fail to beat baselines even when the item is present.
POEM: Partial-Order Enhanced Real-Time Sequential Modeling for Recommendation
Dynamic grouping of CTR and duration predictions captures instant shifts and raises per-user watch time by 0.2 percent on Kuaishou pages.
SABER-Math: Automated Benchmark for Information Retrieval Evaluation in Mathematics
SABER-Math uses LLMs on 283K problems to show that embeddings beat baselines but falter on algebra and calculus.
Full-text analysis shows improvement rarest, use replacing description over time, and machine learning algorithms cited differently than gra
Z-scores from entity networks in 21st-century papers show methods dominate and new tech spreads faster than before.
Mandol: An Agglomerative Agent Memory System for Long-Term Conversations
Agglomerative graph replaces separate databases to cut I/O latency and retrieval noise in long conversations.
Do Recommendation Algorithms Work When Users Are LLM Agents? A Case Study on Moltbook
On Moltbook, simple co-occurrence and vote-count methods predict agent forum engagement better than any user-modeling approach.
Diagnosing and Mitigating Context Rot in Long-horizon Search
Tests on four models show rot worsens with length; pruning and filtering cut the problem.
ARMOR: Adaptive Retriever Optimization for Low-Resource Telecom Question Answering
In settings with fragmented technical evidence, adapting only the retriever side lifts both retrieval accuracy and answer quality.
Experiments show dense methods keep 91 percent nDCG on consumer hardware while keeping sensitive files private.
Metadata, Structure, or Strategy? A Decomposition of RAG Context Enrichment
Tests across benchmarks show that whether models can use added information matters more than how much is added.
mamabench supplies 25,949 QA items and mamaretrieval supplies 3,185 graded queries over 63k guideline chunks, both drawn from expert sources
Monosemanticity in Recommender Systems
Matryoshka sparse autoencoders applied to matrix factorization on fashion data reveal gender-linked factors for intervention.
Covering the Unseen: Information Demand Coverage Optimization for Retrieval-Augmented Generation
By matching selected passages to a multi-dimensional demand distribution instead of a single query embedding, exact match scores rise 6.5-7.
An Information-Geometric Justification for Composite Coherence in Event-Based Narrative Extraction
It produces an additive cost split on the embedding-topic product manifold and induces a well-defined product distance.
AB-RAG: Adaptive Budgeted Retrieval-Augmented Generation for Reliable Question Answering
AB-RAG forms a confidence score from certainty, agreement and variance to decide when to stop retrieving and when to trust the answer.
Fairness Attacks on Recommender Systems
Graph encoder and RNN policies select items and genders so injected data shifts fairness metrics on real datasets and multiple models.
Linear probes recover the signal at 71.8 AUROC, while explicit answers score 25 points lower across 22 models.
Human-in-the-Loop Nugget Annotation for Accountable LLM-as-a-Judge Evaluations
Prototype divides labor so humans decide what matters and machines handle volume while preserving oversight.
Multi-Agent Routing as Set-Valued Prediction: A WildChat Benchmark and Cost-Aware Evaluation
Fine-tuned encoders and a weighted layer improve both set accuracy and cost-controlled utility for prompts that need several agents from a 1
Multimodal Graph RAG for Long-range Visually Rich Document Understanding
The method summarizes global visual-text knowledge to answer questions needing entire documents, beating page-retrieval approaches.
Reproducing FACTER: Fairness via Conformal Thresholding and Prompt Repair
Reproduction shows iterative prompt repair adds limited benefit over static fairness instructions when candidate sets are fixed.
R²-Searcher: Calibrating Retrieval and Reasoning Boundaries for Agentic Search
Token-guided evidence extraction and post-retrieval reflection create a self-improving loop that raises accuracy on seven complex benchmarks
CMSL: Constructive Multi-Sequence Learning for Recommendation Systems
CMSL builds coherent thematic strands from history instead of ingesting one noisy chronological list.
Context-Aware Explanations for Spatialized Document Layouts
CAPE generates natural-language descriptions grounded in layout patterns such as clusters and outliers, rated more helpful than content-only
Single and Multi Truth Data Fusion using Large Language Models
Experiments on three benchmarks show prompting strategies handle single and multi-truth conflicts better than traditional unsupervised metho
Fast and Feasible: Permutation-based Constrained Reranking for Revenue Maximization
PermR recovers 63% of the optimal reranking gain inside production latency on 56 million queries.
Listwise Explanation of Embedding-Based Rankings via Semantic Chunk Grouping
ChunkGroupSHAP masks shared semantic clusters across documents to align listwise attributions with the granularity of embedding rankers.
SHARD: cell-keyed residual splitting for alignment-resistant private dense retrieval
Splitting embeddings lets a public prefix drive lookup while C private cells each require their own key, scaling attacker effort with C.
An LLM-Powered Semantic Alignment Framework for Journal Recommendation
Semantic matching of article content to journal scopes works without training data or historical records on a 23k-article statistics dataset
GLAN adds L-RTG inter-day guidance and HRM session decomposition to handle non-Markovian behaviors and delayed rewards, lifting DAU and life
SemFlowRAG: Directed Semantic Flow from Abstraction to Evidence for Complex Reasoning
A gradient graph with abstractness-based direction guides evidence from concepts to documents for better complex QA performance.
End-to-End Dynamic Sparsity for Resource-Adaptive LLM Inference
L2A uses budget-aware gates for layer, head, and token sparsity, staying within 0.6% of dense at 34% sparsity on Llama and Qwen.
Bifocal Diffusion Language Models: Asymmetric Bidirectional Context for Parallel Generation
R2LM pairs causal left attention with reverse Mamba right context for 2-13x throughput gains over bidirectional models.
Intuition-Guided Latent Reasoning for LLM-Based Recommendation
Initializing hidden-state reasoning with a preference-aligned candidate embedding produces more accurate preference trajectories.
DysLexLens: A Low-Resource LLM Framework for Analysing Dyslexic Learners Insights from Online Forums
Dictionary filtering and knowledge-graph reasoning create verifiable answers from low-resource Reddit data on dyslexic learners.
A Sensitivity-Aware Test Collection for Search Among Personal Information
150 queries and 11k assessments support systems that retrieve relevant emails without revealing private content.