archive

Every paper Pith has read. Search by title, abstract, or pith.

11171 papers in cs.CL · page 1

cs.CL 2026-07-02 reviewed

Testbed shows unlearning often misses the parameters holding data
LACUNA: A Testbed for Evaluating Localization Precision for LLM Unlearning

Matteo Boglioni +4
cs.LG 2026-07-02 reviewed

Compile fuzzy functions into 23 MB weights
Program-as-Weights: A Programming Paradigm for Fuzzy Functions

Wentao Zhang +5
cs.AI 2026-07-02 reviewed

Simple threshold monitor matches advanced LLM safety checks
Online Safety Monitoring for LLMs

Mona Schirmer +5
cs.AI 2026-07-02 reviewed

Social context produces 40% public-private split in LLM agents
What LLM Agents Say When No One Is Watching: Social Structure and Latent Objective Emergence in Multi-Agent Debates

Arman Ghaffarizadeh +3
cs.CL 2026-07-02 reviewed

Reasoning model boosts speaker ID accuracy in TV dramas
Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas

Yuxuan Li +8
cs.CV 2026-07-02 reviewed

Tuning a few attention heads blocks text attacks on vision models
Towards Robustness against Typographic Attack with Training-free Concept Localization

Bohan Liu +5
cs.CL 2026-07-02 reviewed

Masked prefixes and replay boost VLM accuracy on new images
Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning

Liyan Tang +2
cs.CL 2026-07-02 reviewed

Narration acoustics predict audiobook appeal beyond title
Audio-Based Understanding of Audiobook Narration Appeal

Shahar Elisha +2
cs.SE 2026-07-02 reviewed

Benchmark tracks test changes after code commits
TestEvo-Bench: An Executable and Live Benchmark for Test and Code Co-Evolution

Jiale Amber Wang +2
cs.CL 2026-07-02 reviewed

Scaling boosts most LLM social simulations but stalls on biases
Will Scaling Improve Social Simulation with LLMs?

Caleb Ziems +5
cs.CL 2026-07-02 reviewed

Language models shape the culture they measure
Language Models as Measurement Apparatus for Culture

Kent K. Chang
cs.AI 2026-07-02 reviewed

GPT-5.5 tops EvoPolicyGym on autonomous policy evolution
EvoPolicyGym: Evaluating Autonomous Policy Evolution in Interactive Environments

Zhilin Wang +15
cs.AI 2026-07-02 reviewed

Gemini matches experts grading bash commands with rubrics
Automated grading of Linux/bash examinations using large language models: a four-level cognitive taxonomy approach

Manuel Alonso-Carracedo +4
cs.CL 2026-07-02 reviewed

NLP authors shift from core conferences to ML venues
The Future of NLP may not be at NLP Conferences: Scholarly Migration Patterns in Natural Language Processing

David Jurgens
cs.CL 2026-07-02 reviewed

Public document store replaces paid APIs for media background checks
Know Your Source: A Public Knowledge Store for Media Background Checks

Benjamin Nichols +2
cs.CL 2026-07-02 reviewed

Multi-agent routing scores 44.05 SARI on Spanish Easy-to-Read task
HULAT2 at MER-TRANS 2026: Governed Multi-Agent Simplification for Spanish Easy-to-Read Generation

Lourdes Moreno +3
cs.CL 2026-07-02 reviewed

Literary tools can make AI culturally literate
World Wide Models: Literary Tools for Cultural AI

Nina Begus
cs.SE 2026-07-02 reviewed

Fuzzing spots 1000+ hidden intents in combined AI skills
SkillFuzz: Fuzzing Skill Composition for Implicit Intents Discovery in Open Skill Marketplaces

Jinwei Hu +3
cs.DB 2026-07-02 reviewed

HNSW searches gain worst-case correctness via spanner bounds
HNSW with Accuracy Guarantees Using Graph Spanners -- A Technical Report

Minghao Li +5
cs.CL 2026-07-02 reviewed

Directed types lift position-shift accuracy 30 points over undirected algebra
On the Role of Directionality in Structural Generalization

Zichao Wei
cs.LG 2026-07-02 reviewed

Granularity hierarchy reveals mixing rule gain at one level only
HERMES: A Multi-Granularity Labeling Substrate for Pre-training Data Mixtures

Ziyun Qiao +3
cs.CL 2026-07-02 reviewed

CheckRLM fixes factual errors inside AI reasoning chains
CheckRLM: Effective Knowledge-Thought Coherence Checking in Retrieval-Augmented Reasoning

Dingling Xu +10
cs.CL 2026-07-02 reviewed

BamiBERT leads on 11 of 15 Vietnamese NLP metrics
BamiBERT: A New BERT-based Language Model for Vietnamese

Dat Quoc Nguyen +3
cs.AI 2026-07-02 reviewed

Bounded typed-retrieval memory raises agent wins from 3/10 to 6/10
AgenticSTS: A Bounded-Memory Testbed for Long-Horizon LLM Agents

Xiangchen Cheng +9
cs.CL 2026-07-02 reviewed

LLM judges yield inconsistent results in multilingual settings
Challenges and Recommendations for LLMs-as-a-Judge in Multilingual Settings and Low-Resource Languages

A. Seza Doğruöz +5
cs.CL 2026-07-02 reviewed

Weight addition transfers instruction following to speech models
Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning

Congrui Du +3
cs.LG 2026-07-02 reviewed

LoRA rank masking calibrates uncertainty in LLMs
Bayesian Sparse Low-Rank Adaptation for Large Language Model Uncertainty Estimation

Jijie Zhang +3
cs.CL 2026-07-02 reviewed

0.8B open model tops safety benchmarks at 90.9 F1
HaloGuard 1.0: An Open Weights Constitutional Classifier for Multilingual AI Safety

Navaneeth Sangameswaran +2
cs.CL 2026-07-02 reviewed

Two LLMs weaken on Ukrainian crisis support while one stays stable
SPLIT: Cross-Lingual Empathy and Cultural Grounding in English and Ukrainian LLM Responses

Anna Chorna
cs.CL 2026-07-02 reviewed

Benchmark shows models fail to calibrate safety across intent variants
OpenSafeIntent: Evaluating Intent-Calibrated Safe Completion Across Dual-Use Prompt Sets

Rheeya Uppaal +3
cs.AI 2026-07-02 reviewed

Curated non-agentic tests predict agentic scores at low cost
PACE: A Proxy for Agentic Capability Evaluation

Yueqi Song +10
cs.CL 2026-07-02 reviewed

Models ace multiple choice but fail open art history tasks
EduArt: An educational-level benchmark for evaluating art history knowledge in large language models

Gianmarco Spinaci +2
cs.CL 2026-07-02 reviewed

Embeddings predict Mandarin word durations above chance
Using embeddings to predict spoken word duration and pitch in Mandarin monosyllabic words

Xiaoyun Jin +2
cs.AI 2026-07-02 reviewed

ScopeEdit splits MLLM edits into local and gated branches
Multimodal Knowledge Edit-Scoped Generalization for Online Recursive MLLM Editing

Siyuan Li +4
cs.CL 2026-07-02 reviewed

JSON graph scorer stays invariant under identifier swaps
Object Aligner: A Configurable JSON Schema Similarity Score for Graphs, Applied to LLM Prompt Optimization

Jan Drchal
cs.CL 2026-07-02 reviewed

TTS systems neutralize Assamese vowel contrasts in one-third of cases
Towards a Phonology-Informed Evaluation of Multilingual TTS

Sneha Ray Barman +2
cs.CL 2026-07-02 reviewed

LLM rewriting hurts discourse parsing more than it helps
Beyond Supervised Clarification: Input Rewriting with LLMs for Dialogue Discourse Parsing

Yiming Liu +6
cs.CL 2026-07-02 reviewed

Compact model tops speech instruction track with synthetic data
NAVER LABS Europe Submission to the Instruction-following 2026 Short Track

Marcely Zanon Boito +3
physics.soc-ph 2026-07-02 reviewed

LLMs resist skepticism by failing to represent the signal
Robust for the Wrong Reasons: The Representational Geometry of LLM Robustness to Science Skepticism

Minjong Cheon
cs.RO 2026-07-02 reviewed

Divergence-free 3D Gaussian field improves robot dynamic manipulation
PhysMani: Physics-principled 3D World Model for Dynamic Object Manipulation

Peng Yun +5
cs.CL 2026-07-02 reviewed

Fine-tuned local LLM matches frontier models on K-12 risk detection
AIriskEval-edu: New Dataset for Risk Assessment in AI-mediated K-12 Educational Explanations

Javier Irigoyen +7
cs.CL 2026-07-02 reviewed

SFT and RL train Turkish reasoning traces into 27B model
TUDUM: A Turkish-Thinking Reasoning Pipeline for Qwen3.5-27B

Baran Bingol +1
cs.CL 2026-07-02 reviewed

Grammar fixes short functional links in all languages
The Grammar Does the Work: Functional vs. Lexical Dependency Length Minimization Across Universal Dependencies

Kim Gerdes +2
cs.AI 2026-07-02 reviewed

AUF training lifts drafter emitted length from 2.40 to 2.61
Spec-AUF: Accept-Until-Fail Training under Train-Inference Misalignment for Masked Block Drafters

Tianjian Yang +1
cs.CL 2026-07-02 reviewed

Pair programming agents raise code artifact success rates
PairCoder++: Pair Programming as a Universal Paradigm for Verified Code-Driven Multimodal and Structured-Artifact Generation

Junhao Chen +11
cs.AI 2026-07-02 reviewed

Evolved rubrics reveal agent skill failures missed by accuracy checks
SkillCoach: Self-Evolving Rubrics for Evaluating and Enhancing Agentic Skill-Use

Jiayin Zhu +6
cs.AI 2026-07-02 reviewed

English-only safety leaves LLMs open to low-resource language attacks
Safety Targeted Embedding Exploit via Refinement

Joshua Adrian Cahyono
cs.IR 2026-07-02 reviewed

Cluster chunking shows no gain over basic methods on thesis RAG
Evaluating Chunking Strategies for Retrieval-Augmented Generation on Academic Texts

Valentin J. J. Kreileder +2
cs.DL 2026-07-02 reviewed

LIS research methods differ by country but converge over 30 years
Non-synchronism in Global Usage of Research Methods in Library and Information Science from 1990 to 2019

Chengzhi Zhang +1
cs.AI 2026-07-02 reviewed

LLMs top out at 82.7% on aviation benchmark vs expert 95%
Pre-Flight: A Benchmark for Evaluating Large Language Models on Aviation Operational Knowledge

Alex Brooker +1