archive
Every paper Pith has read.
11171 papers in cs.CL · page 1
-
Testbed shows unlearning often misses the parameters holding data
LACUNA: A Testbed for Evaluating Localization Precision for LLM Unlearning
-
Compile fuzzy functions into 23 MB weights
Program-as-Weights: A Programming Paradigm for Fuzzy Functions
-
Simple threshold monitor matches advanced LLM safety checks
Online Safety Monitoring for LLMs
-
Social context produces 40% public-private split in LLM agents
What LLM Agents Say When No One Is Watching: Social Structure and Latent Objective Emergence in Multi-Agent Debates
-
Reasoning model boosts speaker ID accuracy in TV dramas
Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas
-
Tuning a few attention heads blocks text attacks on vision models
Towards Robustness against Typographic Attack with Training-free Concept Localization
-
Masked prefixes and replay boost VLM accuracy on new images
Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning
-
Narration acoustics predict audiobook appeal beyond title
Audio-Based Understanding of Audiobook Narration Appeal
-
Benchmark tracks test changes after code commits
TestEvo-Bench: An Executable and Live Benchmark for Test and Code Co-Evolution
-
Scaling boosts most LLM social simulations but stalls on biases
Will Scaling Improve Social Simulation with LLMs?
-
Language models shape the culture they measure
Language Models as Measurement Apparatus for Culture
-
GPT-5.5 tops EvoPolicyGym on autonomous policy evolution
EvoPolicyGym: Evaluating Autonomous Policy Evolution in Interactive Environments
-
Gemini matches experts grading bash commands with rubrics
Automated grading of Linux/bash examinations using large language models: a four-level cognitive taxonomy approach
-
NLP authors shift from core conferences to ML venues
The Future of NLP may not be at NLP Conferences: Scholarly Migration Patterns in Natural Language Processing
-
Public document store replaces paid APIs for media background checks
Know Your Source: A Public Knowledge Store for Media Background Checks
-
Multi-agent routing scores 44.05 SARI on Spanish Easy-to-Read task
HULAT2 at MER-TRANS 2026: Governed Multi-Agent Simplification for Spanish Easy-to-Read Generation
-
Literary tools can make AI culturally literate
World Wide Models: Literary Tools for Cultural AI
-
Fuzzing spots 1000+ hidden intents in combined AI skills
SkillFuzz: Fuzzing Skill Composition for Implicit Intents Discovery in Open Skill Marketplaces
-
HNSW searches gain worst-case correctness via spanner bounds
HNSW with Accuracy Guarantees Using Graph Spanners -- A Technical Report
-
Directed types lift position-shift accuracy 30 points over undirected algebra
On the Role of Directionality in Structural Generalization
-
Granularity hierarchy reveals mixing rule gain at one level only
HERMES: A Multi-Granularity Labeling Substrate for Pre-training Data Mixtures
-
CheckRLM fixes factual errors inside AI reasoning chains
CheckRLM: Effective Knowledge-Thought Coherence Checking in Retrieval-Augmented Reasoning
-
BamiBERT leads on 11 of 15 Vietnamese NLP metrics
BamiBERT: A New BERT-based Language Model for Vietnamese
-
Bounded typed-retrieval memory raises agent wins from 3/10 to 6/10
AgenticSTS: A Bounded-Memory Testbed for Long-Horizon LLM Agents
-
LLM judges yield inconsistent results in multilingual settings
Challenges and Recommendations for LLMs-as-a-Judge in Multilingual Settings and Low-Resource Languages
-
Weight addition transfers instruction following to speech models
Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning
-
LoRA rank masking calibrates uncertainty in LLMs
Bayesian Sparse Low-Rank Adaptation for Large Language Model Uncertainty Estimation
-
0.8B open model tops safety benchmarks at 90.9 F1
HaloGuard 1.0: An Open Weights Constitutional Classifier for Multilingual AI Safety
-
Two LLMs weaken on Ukrainian crisis support while one stays stable
SPLIT: Cross-Lingual Empathy and Cultural Grounding in English and Ukrainian LLM Responses
-
Benchmark shows models fail to calibrate safety across intent variants
OpenSafeIntent: Evaluating Intent-Calibrated Safe Completion Across Dual-Use Prompt Sets
-
Curated non-agentic tests predict agentic scores at low cost
PACE: A Proxy for Agentic Capability Evaluation
-
Models ace multiple choice but fail open art history tasks
EduArt: An educational-level benchmark for evaluating art history knowledge in large language models
-
Embeddings predict Mandarin word durations above chance
Using embeddings to predict spoken word duration and pitch in Mandarin monosyllabic words
-
ScopeEdit splits MLLM edits into local and gated branches
Multimodal Knowledge Edit-Scoped Generalization for Online Recursive MLLM Editing
-
JSON graph scorer stays invariant under identifier swaps
Object Aligner: A Configurable JSON Schema Similarity Score for Graphs, Applied to LLM Prompt Optimization
-
TTS systems neutralize Assamese vowel contrasts in one-third of cases
Towards a Phonology-Informed Evaluation of Multilingual TTS
-
LLM rewriting hurts discourse parsing more than it helps
Beyond Supervised Clarification: Input Rewriting with LLMs for Dialogue Discourse Parsing
-
Compact model tops speech instruction track with synthetic data
NAVER LABS Europe Submission to the Instruction-following 2026 Short Track
-
LLMs resist skepticism by failing to represent the signal
Robust for the Wrong Reasons: The Representational Geometry of LLM Robustness to Science Skepticism
-
Divergence-free 3D Gaussian field improves robot dynamic manipulation
PhysMani: Physics-principled 3D World Model for Dynamic Object Manipulation
-
Fine-tuned local LLM matches frontier models on K-12 risk detection
AIriskEval-edu: New Dataset for Risk Assessment in AI-mediated K-12 Educational Explanations
-
SFT and RL train Turkish reasoning traces into 27B model
TUDUM: A Turkish-Thinking Reasoning Pipeline for Qwen3.5-27B
-
Grammar fixes short functional links in all languages
The Grammar Does the Work: Functional vs. Lexical Dependency Length Minimization Across Universal Dependencies
-
AUF training lifts drafter emitted length from 2.40 to 2.61
Spec-AUF: Accept-Until-Fail Training under Train-Inference Misalignment for Masked Block Drafters
-
Pair programming agents raise code artifact success rates
PairCoder++: Pair Programming as a Universal Paradigm for Verified Code-Driven Multimodal and Structured-Artifact Generation
-
Evolved rubrics reveal agent skill failures missed by accuracy checks
SkillCoach: Self-Evolving Rubrics for Evaluating and Enhancing Agentic Skill-Use
-
English-only safety leaves LLMs open to low-resource language attacks
Safety Targeted Embedding Exploit via Refinement
-
Cluster chunking shows no gain over basic methods on thesis RAG
Evaluating Chunking Strategies for Retrieval-Augmented Generation on Academic Texts
-
LIS research methods differ by country but converge over 30 years
Non-synchronism in Global Usage of Research Methods in Library and Information Science from 1990 to 2019
-
LLMs top out at 82.7% on aviation benchmark vs expert 95%
Pre-Flight: A Benchmark for Evaluating Large Language Models on Aviation Operational Knowledge