Language flips which jailbreaks work on frontier MLLMs
Spanish reduces role-play success but increases visual attack success, reversing safety rankings across models.
Computation and Language
Covers natural language processing. Roughly includes material in ACM Subject Class I.2.7. Note that work on artificial languages (programming languages, logics, formal systems) that does not explicitly address natural-language issues broadly construed (natural-language processing, computational linguistics, speech, text retrieval, etc.) is not appropriate for this area.
EntmaxKV: Support-Aware Decoding for Entmax Attention
When selected pages capture the entmax support, sparse decoding matches the full version exactly and error vanishes with the dropped mass.
MM-Conv: A Multimodal Dataset and Benchmark for Context-Aware Grounding in 3D Dialogue
Two-stage pipeline resolves dialogue ambiguity before detecting objects in dynamic VR scenes.
LACUNA: A Testbed for Evaluating Localization Precision for LLM Unlearning
LACUNA places PII in known weights so researchers can measure whether methods erase knowledge at the source or only change outputs.
Program-as-Weights: A Programming Paradigm for Fuzzy Functions
A 0.6B interpreter running compiler-generated LoRA adapters matches a 32B model on fuzzy text tasks at 1/50th the memory, entirely offline.
Online Safety Monitoring for LLMs
Risk-calibrated thresholding on external verifier signals performs competitively on reasoning and red teaming tasks.
Dual-channel tests show relational pressures create decision divergence absent from isolated prompts
Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas
Multimodal tool-use lets it handle short lines where audio alone fails, on a new 532K-line benchmark.
Towards Robustness against Typographic Attack with Training-free Concept Localization
Sampling attribution finds lexical-encoding circuits in ViT; direct weight adjustments raise accuracy on attacked images without retraining.
Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning
RL training forces models to correct errors using visual evidence instead of text patterns alone.
Audio-Based Understanding of Audiobook Narration Appeal
Vocal features extracted from recordings remain tied to view-rate and engagement after title controls are applied.
TestEvo-Bench: An Executable and Live Benchmark for Test and Code Co-Evolution
TestEvo-Bench uses real commit data and execution checks to measure agent success at 77 percent on generation and 74 percent on updates.
Will Scaling Improve Social Simulation with LLMs?
Tests on 120 models show rapid gains for common opinions yet slower or absent progress on forecasts and risk aversion.
Language Models as Measurement Apparatus for Culture
The apparatus of model, data, annotation, and evaluation draws boundaries that define what counts as cultural reality.
EvoPolicyGym: Evaluating Autonomous Policy Evolution in Interactive Environments
Benchmark tests iterative policy editing in 16 RL environments and finds top models succeed by discovering task mechanisms under budget cons
Agreement hits 0.89 on basic questions but drops at higher complexity levels, guiding when to mix AI and human review.
Established authors lose 19pp at main ACL tracks; new authors raise ML share from 5% to 21% due to citation premiums.
Know Your Source: A Public Knowledge Store for Media Background Checks
Enables reproducible LLM evaluations across 200 sources and improves credibility assessment quality.
HULAT2 at MER-TRANS 2026: Governed Multi-Agent Simplification for Spanish Easy-to-Read Generation
Signal-guided workflow beats the team's linear baseline by more than five points on the official metric.
World Wide Models: Literary Tools for Cultural AI
Concepts like macrostructure and untranslatability from world literature address the monolingual limits of current models.
SkillFuzz: Fuzzing Skill Composition for Implicit Intents Discovery in Open Skill Marketplaces
SkillFuzz uses planning artifacts and guided search to flag risky skill pairs before execution, with over 80 percent later confirmed.
HNSW with Accuracy Guarantees Using Graph Spanners -- A Technical Report
A lightweight statistical check triggers exact fallback only when the heuristic result may be unreliable, preserving average speed.
On the Role of Directionality in Structural Generalization
CCG parser beats prior best on SLOG directional tests; larger encoders then close the recursion gap by addressing a separate weakness.
HERMES: A Multi-Granularity Labeling Substrate for Pre-training Data Mixtures
Equal coverage rule boosts 16-task average by 0.0253 at one prefix length but loses edge when pools shrink 5x.
CheckRLM: Effective Knowledge-Thought Coherence Checking in Retrieval-Augmented Reasoning
Extracting claims and retrieving external facts stops small mistakes from growing across long outputs at modest cost.
BamiBERT: A New BERT-based Language Model for Vietnamese
Trained on 129GB raw text for 20 epochs with 2048-token context, it outperforms prior base models without word segmentation.
AgenticSTS: A Bounded-Memory Testbed for Long-Horizon LLM Agents
Fresh prompts built only from typed retrieval keep context fixed and let any memory layer be tested alone across hundreds of decisions.
Review of 33 papers finds overtrust and single-model reliance, urging improved validation for low-resource languages.
A single speech pre-training round plus the text tuning delta yields capable speech instruction followers without dedicated speech tuning da
Bayesian Sparse Low-Rank Adaptation for Large Language Model Uncertainty Estimation
Stochastic masking shifts Bayesian uncertainty estimation to the lightweight adapter ranks, keeping reasoning accuracy intact.
HaloGuard 1.0: An Open Weights Constitutional Classifier for Multilingual AI Safety
HaloGuard 1.0 beats 27B baselines on seven prompt-safety tests while holding low error rates across 46 languages.
SPLIT: Cross-Lingual Empathy and Cultural Grounding in English and Ukrainian LLM Responses
New 500-prompt benchmark shows models differ sharply in cultural fit, and human judges diverge from AI raters on grounding quality.
OpenSafeIntent: Evaluating Intent-Calibrated Safe Completion Across Dual-Use Prompt Sets
Controlled sets with benign, dual-use and malicious versions of the same task expose hidden inconsistencies missed by standard tests.
PACE: A Proxy for Agentic Capability Evaluation
Regression on a selected subset from 19 cheap benchmarks forecasts 4 agentic targets with under 4% error and over 0.8 correlation.
EduArt: An educational-level benchmark for evaluating art history knowledge in large language models
Real exam questions reveal that question format controls whether LLMs can deploy knowledge or only recognize answers.
Using embeddings to predict spoken word duration and pitch in Mandarin monosyllabic words
Predictions exceed chance at token level and turn normalized pitch into realistic millisecond contours.
Multimodal Knowledge Edit-Scoped Generalization for Online Recursive MLLM Editing
The split improves cross-modal transfer while limiting leakage to unrelated inputs and keeps per-edit cost constant.
Object Aligner approximates bijections with color refinement so LLM outputs can be scored without label sensitivity.
Towards a Phonology-Informed Evaluation of Multilingual TTS
A classifier trained on human speech flags when synthesized output loses the ATR distinctions that mark grammatical forms.
Beyond Supervised Clarification: Input Rewriting with LLMs for Dialogue Discourse Parsing
Parser-agnostic edits break discourse cues, and many errors remain unfixable even after repeated attempts.
NAVER LABS Europe Submission to the Instruction-following 2026 Short Track
SpeechMapper projector and LLM-generated scientific talks let a smaller system with weaker backbone outperform prior best.
Robust for the Wrong Reasons: The Representational Geometry of LLM Robustness to Science Skepticism
Behavioral tests alone cannot tell whether a model understands doubt or simply does not detect it.
PhysMani: Physics-principled 3D World Model for Dynamic Object Manipulation
Online optimization of the velocity field supplies physically grounded futures to a cross-attention policy on a new 16-task benchmark.
AIriskEval-edu: New Dataset for Risk Assessment in AI-mediated K-12 Educational Explanations
New dataset of 1,639 explanations with risk annotations lets small models audit AI teaching content privately.
TUDUM: A Turkish-Thinking Reasoning Pipeline for Qwen3.5-27B
Visible <think> block becomes consistently Turkish after supervised fine-tuning on 16k examples, with RL recovering some math accuracy but n
Lexical dependencies average longer and vary with word order while functional ones stay near distance 1.71 across 122 languages.
Spec-AUF: Accept-Until-Fail Training under Train-Inference Misalignment for Masked Block Drafters
Truncating loss support at the first predicted failure aligns training with inference acceptance without changing the model or pipeline.
Driver and Navigator roles using compiler and renderer feedback lift Blender executability from 0.20 to 0.78 and TikZ rates by 10-30 points.
SkillCoach: Self-Evolving Rubrics for Evaluating and Enhancing Agentic Skill-Use
SkillCoach evaluates process quality separately from task success to better train and assess reusable skills in LLM agents.
Safety Targeted Embedding Exploit via Refinement
A gradient-guided method translates refusal words into low-resource languages, achieving up to 96.7% attack success on benchmarks and transf
Evaluating Chunking Strategies for Retrieval-Augmented Generation on Academic Texts
Tests find fixed-size and recursive splitting match or exceed semantic clustering when RAGAs scores answers from academic documents.
Analysis of 5,281 papers from 81 countries shows unique national profiles narrowing over time.
Pre-Flight: A Benchmark for Evaluating Large Language Models on Aviation Operational Knowledge
Gap persists as scores rise only gradually from 75% in early 2025.
Study of three journals shows method choices differ by gender even within the same topics, suggesting design influences.
Self-Supervised Test-Time Tuning for Packet Loss Concealment
Self-supervised synthetic masking on arrived signals improves concealment of true losses without extra data or model changes.
On the Limits of Steering Vectors for Preference-Aligned Generation
Effectiveness varies by trait, drops when transferred to new tasks, and falls further with added vectors.
Do LLMs Truly Generalize in the Molecular Domain? A Perturbation-Based Analysis
Perturbation tests reveal narrow trust regions around training examples, with in-context tuning offering only partial relief.
PARTREP: Learning What to Repeat for Decoder-only LLMs
A lightweight gate on early hidden states selects only high-NLL tokens to repeat, matching full-prompt benefits across eight benchmarks at l
Subliminal Clocks: Latent Time Modelling in Diffusion Language Models
Probes read the internal timestep signal and steering along it alters output confidence and entropy.
Denser ≠ Better: Limits of On-Policy Self-Distillation for Continual Post-Training
SDPO shows more drift and collapse than GRPO, suggesting on-policy data alone is not enough to stabilize post-training.
Rethinking Speech-LLM Integration for ASR: Effective Joint Speech-Text Training by Interleaving
The method matches real domain text performance without synthetic pairs while keeping language model generation behavior.
Beyond Pixel Diffs: Benchmarking Image Change Captioning for Web UI Visual Regression Testing
Benchmark shows trained methods already ignore rendering artifacts more selectively than traditional comparisons in web testing pipelines.
The law fitted on low budgets predicts high-budget results, yet source expansion is preferable when total samples are matched at scale.
Epistemic Goggles: A Pretrained Module that Induces an Epistemic Frame via Gradient Editing
A one-time trained module achieves 91 percent correct identification and resists reversal by later training.
AgenticDataBench: A Comprehensive Benchmark for Data Agents
AgenticDataBench draws real tasks and clustered skills from Stack Overflow to measure agent performance at fine granularity.
Selective TMR on critical partitions via partial reconfiguration outperforms static and reactive methods on composite latency-energy-reliabi
BOUNDARY_SYNC: Measuring Communication-Induced Representational Coupling in Multi-Agent LLM Systems
New protocol finds communication cuts divergence by roughly 20 percent in controlled GPT-4o tests.
PASE generates recovery plans from semantic primitives, simulates them for safety, and adapts prompts with DRL to handle unknown faults fast
ADVENT: LLM-Driven Automatic Predicate Invention for ILP
Automatic creation of reusable rules via LLM and Prolog loop enables cross-task learning on relational data.
It separates pedagogical content from exposure content more accurately than baselines and matches human ratings at r=0.958.
DiPS: Dialogue Policy Selection for High-Stakes Persuasion Agents
DiPS critic picks policies from recent utterances and beats zero-shot LLMs in simulated and real fire-rescue talks.
Can Language Models Actually Retrieve In-Context? Drowning in Documents at Million Token Scale
Length-aware softmax and sparse attention let a 0.6B model equal or beat vector retrieval on MS MARCO and NQ, and triple scores on LIMIT.
Multi-Head Recurrent Memory Agents
Architectural split into shielded heads prevents overwriting and lifts accuracy on long contexts without training or extra tokens.
Parameter Golf: What Really Works?
Analysis of 1,430 submissions shows most techniques lose effect in competition, isolating only a few reliable gains.
From Monolingual to Multilingual: Evaluating Mamba for ASR in South African Languages
Multilingual training improves results further while language embeddings aid cross-corpus robustness but not in-domain accuracy.
Comparing Architectures for Supervised Political Scaling
Experiments find that predicting ideological positions together beats separate predictions and bridges classification with regression.
Prompt grounding alone reaches zero detections at low temperature; full layers maintain low rates across models and temperatures.
On the Utility and Factual Reliability of Pruned Mixture-of-Experts Models in the Biomedical Domain
Utility and reliability hold at moderate ratios but degrade rapidly outside the domain or at extremes, making utility checks alone insuffici
FaithMed: Training LLMs For Faithful Evidence-Based Medical Reasoning
FaithMed applies clinician rubrics at each reasoning step inside reinforcement learning, beating outcome-only training on seven benchmarks.
Isomorphic problem pairs reveal reasoning improvements are mostly knowledge-dependent rather than structure-invariant.