archive

Every paper Pith has read. Search by title, abstract, or pith.

11171 papers in cs.CL · page 1

  1. cs.CL 2026-07-02 reviewed
    Testbed shows unlearning often misses the parameters holding data

    LACUNA: A Testbed for Evaluating Localization Precision for LLM Unlearning

    Matteo Boglioni +4

  2. cs.LG 2026-07-02 reviewed
    Compile fuzzy functions into 23 MB weights

    Program-as-Weights: A Programming Paradigm for Fuzzy Functions

    Wentao Zhang +5

  3. cs.AI 2026-07-02 reviewed
    Simple threshold monitor matches advanced LLM safety checks

    Online Safety Monitoring for LLMs

    Mona Schirmer +5

  4. cs.AI 2026-07-02 reviewed
    Social context produces 40% public-private split in LLM agents

    What LLM Agents Say When No One Is Watching: Social Structure and Latent Objective Emergence in Multi-Agent Debates

    Arman Ghaffarizadeh +3

  5. cs.CL 2026-07-02 reviewed
    Reasoning model boosts speaker ID accuracy in TV dramas

    Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas

    Yuxuan Li +8

  6. cs.CV 2026-07-02 reviewed
    Tuning a few attention heads blocks text attacks on vision models

    Towards Robustness against Typographic Attack with Training-free Concept Localization

    Bohan Liu +5

  7. cs.CL 2026-07-02 reviewed
    Masked prefixes and replay boost VLM accuracy on new images

    Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning

    Liyan Tang +2

  8. cs.CL 2026-07-02 reviewed
    Narration acoustics predict audiobook appeal beyond title

    Audio-Based Understanding of Audiobook Narration Appeal

    Shahar Elisha +2

  9. cs.SE 2026-07-02 reviewed
    Benchmark tracks test changes after code commits

    TestEvo-Bench: An Executable and Live Benchmark for Test and Code Co-Evolution

    Jiale Amber Wang +2

  10. cs.CL 2026-07-02 reviewed
    Scaling boosts most LLM social simulations but stalls on biases

    Will Scaling Improve Social Simulation with LLMs?

    Caleb Ziems +5

  11. cs.CL 2026-07-02 reviewed
    Language models shape the culture they measure

    Language Models as Measurement Apparatus for Culture

    Kent K. Chang

  12. cs.AI 2026-07-02 reviewed
    GPT-5.5 tops EvoPolicyGym on autonomous policy evolution

    EvoPolicyGym: Evaluating Autonomous Policy Evolution in Interactive Environments

    Zhilin Wang +15

  13. cs.AI 2026-07-02 reviewed
    Gemini matches experts grading bash commands with rubrics

    Automated grading of Linux/bash examinations using large language models: a four-level cognitive taxonomy approach

    Manuel Alonso-Carracedo +4

  14. cs.CL 2026-07-02 reviewed
    NLP authors shift from core conferences to ML venues

    The Future of NLP may not be at NLP Conferences: Scholarly Migration Patterns in Natural Language Processing

    David Jurgens

  15. cs.CL 2026-07-02 reviewed
    Public document store replaces paid APIs for media background checks

    Know Your Source: A Public Knowledge Store for Media Background Checks

    Benjamin Nichols +2

  16. cs.CL 2026-07-02 reviewed
    Multi-agent routing scores 44.05 SARI on Spanish Easy-to-Read task

    HULAT2 at MER-TRANS 2026: Governed Multi-Agent Simplification for Spanish Easy-to-Read Generation

    Lourdes Moreno +3

  17. cs.CL 2026-07-02 reviewed
    Literary tools can make AI culturally literate

    World Wide Models: Literary Tools for Cultural AI

    Nina Begus

  18. cs.SE 2026-07-02 reviewed
    Fuzzing spots 1000+ hidden intents in combined AI skills

    SkillFuzz: Fuzzing Skill Composition for Implicit Intents Discovery in Open Skill Marketplaces

    Jinwei Hu +3

  19. cs.DB 2026-07-02 reviewed
    HNSW searches gain worst-case correctness via spanner bounds

    HNSW with Accuracy Guarantees Using Graph Spanners -- A Technical Report

    Minghao Li +5

  20. cs.CL 2026-07-02 reviewed
    Directed types lift position-shift accuracy 30 points over undirected algebra

    On the Role of Directionality in Structural Generalization

    Zichao Wei

  21. cs.LG 2026-07-02 reviewed
    Granularity hierarchy reveals mixing rule gain at one level only

    HERMES: A Multi-Granularity Labeling Substrate for Pre-training Data Mixtures

    Ziyun Qiao +3

  22. cs.CL 2026-07-02 reviewed
    CheckRLM fixes factual errors inside AI reasoning chains

    CheckRLM: Effective Knowledge-Thought Coherence Checking in Retrieval-Augmented Reasoning

    Dingling Xu +10

  23. cs.CL 2026-07-02 reviewed
    BamiBERT leads on 11 of 15 Vietnamese NLP metrics

    BamiBERT: A New BERT-based Language Model for Vietnamese

    Dat Quoc Nguyen +3

  24. cs.AI 2026-07-02 reviewed
    Bounded typed-retrieval memory raises agent wins from 3/10 to 6/10

    AgenticSTS: A Bounded-Memory Testbed for Long-Horizon LLM Agents

    Xiangchen Cheng +9

  25. cs.CL 2026-07-02 reviewed
    LLM judges yield inconsistent results in multilingual settings

    Challenges and Recommendations for LLMs-as-a-Judge in Multilingual Settings and Low-Resource Languages

    A. Seza Doğruöz +5

  26. cs.CL 2026-07-02 reviewed
    Weight addition transfers instruction following to speech models

    Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning

    Congrui Du +3

  27. cs.LG 2026-07-02 reviewed
    LoRA rank masking calibrates uncertainty in LLMs

    Bayesian Sparse Low-Rank Adaptation for Large Language Model Uncertainty Estimation

    Jijie Zhang +3

  28. cs.CL 2026-07-02 reviewed
    0.8B open model tops safety benchmarks at 90.9 F1

    HaloGuard 1.0: An Open Weights Constitutional Classifier for Multilingual AI Safety

    Navaneeth Sangameswaran +2

  29. cs.CL 2026-07-02 reviewed
    Two LLMs weaken on Ukrainian crisis support while one stays stable

    SPLIT: Cross-Lingual Empathy and Cultural Grounding in English and Ukrainian LLM Responses

    Anna Chorna

  30. cs.CL 2026-07-02 reviewed
    Benchmark shows models fail to calibrate safety across intent variants

    OpenSafeIntent: Evaluating Intent-Calibrated Safe Completion Across Dual-Use Prompt Sets

    Rheeya Uppaal +3

  31. cs.AI 2026-07-02 reviewed
    Curated non-agentic tests predict agentic scores at low cost

    PACE: A Proxy for Agentic Capability Evaluation

    Yueqi Song +10

  32. cs.CL 2026-07-02 reviewed
    Models ace multiple choice but fail open art history tasks

    EduArt: An educational-level benchmark for evaluating art history knowledge in large language models

    Gianmarco Spinaci +2

  33. cs.CL 2026-07-02 reviewed
    Embeddings predict Mandarin word durations above chance

    Using embeddings to predict spoken word duration and pitch in Mandarin monosyllabic words

    Xiaoyun Jin +2

  34. cs.AI 2026-07-02 reviewed
    ScopeEdit splits MLLM edits into local and gated branches

    Multimodal Knowledge Edit-Scoped Generalization for Online Recursive MLLM Editing

    Siyuan Li +4

  35. cs.CL 2026-07-02 reviewed
    JSON graph scorer stays invariant under identifier swaps

    Object Aligner: A Configurable JSON Schema Similarity Score for Graphs, Applied to LLM Prompt Optimization

    Jan Drchal

  36. cs.CL 2026-07-02 reviewed
    TTS systems neutralize Assamese vowel contrasts in one-third of cases

    Towards a Phonology-Informed Evaluation of Multilingual TTS

    Sneha Ray Barman +2

  37. cs.CL 2026-07-02 reviewed
    LLM rewriting hurts discourse parsing more than it helps

    Beyond Supervised Clarification: Input Rewriting with LLMs for Dialogue Discourse Parsing

    Yiming Liu +6

  38. cs.CL 2026-07-02 reviewed
    Compact model tops speech instruction track with synthetic data

    NAVER LABS Europe Submission to the Instruction-following 2026 Short Track

    Marcely Zanon Boito +3

  39. physics.soc-ph 2026-07-02 reviewed
    LLMs resist skepticism by failing to represent the signal

    Robust for the Wrong Reasons: The Representational Geometry of LLM Robustness to Science Skepticism

    Minjong Cheon

  40. cs.RO 2026-07-02 reviewed
    Divergence-free 3D Gaussian field improves robot dynamic manipulation

    PhysMani: Physics-principled 3D World Model for Dynamic Object Manipulation

    Peng Yun +5

  41. cs.CL 2026-07-02 reviewed
    Fine-tuned local LLM matches frontier models on K-12 risk detection

    AIriskEval-edu: New Dataset for Risk Assessment in AI-mediated K-12 Educational Explanations

    Javier Irigoyen +7

  42. cs.CL 2026-07-02 reviewed
    SFT and RL train Turkish reasoning traces into 27B model

    TUDUM: A Turkish-Thinking Reasoning Pipeline for Qwen3.5-27B

    Baran Bingol +1

  43. cs.CL 2026-07-02 reviewed
    Grammar fixes short functional links in all languages

    The Grammar Does the Work: Functional vs. Lexical Dependency Length Minimization Across Universal Dependencies

    Kim Gerdes +2

  44. cs.AI 2026-07-02 reviewed
    AUF training lifts drafter emitted length from 2.40 to 2.61

    Spec-AUF: Accept-Until-Fail Training under Train-Inference Misalignment for Masked Block Drafters

    Tianjian Yang +1

  45. cs.CL 2026-07-02 reviewed
    Pair programming agents raise code artifact success rates

    PairCoder++: Pair Programming as a Universal Paradigm for Verified Code-Driven Multimodal and Structured-Artifact Generation

    Junhao Chen +11

  46. cs.AI 2026-07-02 reviewed
    Evolved rubrics reveal agent skill failures missed by accuracy checks

    SkillCoach: Self-Evolving Rubrics for Evaluating and Enhancing Agentic Skill-Use

    Jiayin Zhu +6

  47. cs.AI 2026-07-02 reviewed
    English-only safety leaves LLMs open to low-resource language attacks

    Safety Targeted Embedding Exploit via Refinement

    Joshua Adrian Cahyono

  48. cs.IR 2026-07-02 reviewed
    Cluster chunking shows no gain over basic methods on thesis RAG

    Evaluating Chunking Strategies for Retrieval-Augmented Generation on Academic Texts

    Valentin J. J. Kreileder +2

  49. cs.DL 2026-07-02 reviewed
    LIS research methods differ by country but converge over 30 years

    Non-synchronism in Global Usage of Research Methods in Library and Information Science from 1990 to 2019

    Chengzhi Zhang +1

  50. cs.AI 2026-07-02 reviewed
    LLMs top out at 82.7% on aviation benchmark vs expert 95%

    Pre-Flight: A Benchmark for Evaluating Large Language Models on Aviation Operational Knowledge

    Alex Brooker +1