pith.

archive

Every paper Pith has read. Search by title, abstract, or pith.

2587 papers in cs.SE · page 1

  1. cs.SE 2026-07-02 reviewed
    Benchmark tracks test changes after code commits

    TestEvo-Bench: An Executable and Live Benchmark for Test and Code Co-Evolution

    Jiale Amber Wang +2

  2. cs.SE 2026-07-02 reviewed
    Traffic model spots REST API attacks at 82% recall without docs

    HTTP REST API Structure Learning

    Ran Dubin +1

  3. cs.SE 2026-07-02 reviewed
    Reasoning effort raises perfect agent code runs from 28% to 89%

    Reasoning effort, not tool access, buys first-try reliability in agentic code generation: an observational study

    Achint Mehta

  4. cs.AI 2026-07-02 reviewed
    Constraints lift coding-agent backdoor recall from 54.5% to 90.9%

    Steerability via constraints: a substrate for scalable oversight of coding agents

    Thomas Winninger

  5. cs.SE 2026-07-02 reviewed
    Agents fix specific LLVM missed opts but often mismatch developer scope

    Understanding Agent-Based Patching of Compiler Missed Optimizations

    Batu Guan +2

  6. cs.CR 2026-07-02 reviewed
    Static scanners miss cloaked malicious agent skills

    Cloak and Detonate: Scanner Evasion and Dynamic Detection of Agent Skill Malware

    Zimo Ji +6

  7. cs.SE 2026-07-02 reviewed
    Fuzzing spots 1000+ hidden intents in combined AI skills

    SkillFuzz: Fuzzing Skill Composition for Implicit Intents Discovery in Open Skill Marketplaces

    Jinwei Hu +3

  8. cs.SE 2026-07-02 reviewed
    Mixing AI coding modes reduces efficiency gains

    Developers' Experience with Generative AI Beyond Productivity Assessment -- Insights from an Empirical Mixed-Methods Field Study

    Charlotte Brandebusemeyer +4

  9. cs.SE 2026-07-02 reviewed
    VLP lifts LLM code pass rates from 29-73% to 65-93%

    Guiding Human Validation of LLM-Generated Code via Verifiable Literate Programming

    Ziqi Yuan +4

  10. cs.CV 2026-07-02 reviewed
    Optimized synthetic scenes expose 10x more VLM errors in cars

    Search-based Testing of Vision Language Models for In-Car Scene Understanding

    Lev Sorokin +3

  11. cs.SE 2026-07-02 reviewed
    Coding agents guess actions on vague DevOps instructions 56-68% of the time

    Coding Agents Are Guessing: Measuring Action-Boundary Violations in Underspecified DevOps Instructions

    Zimo Ji +6

  12. cs.SE 2026-07-02 reviewed
    File copying removes four dependency signals

    File-Level Copying Is an Implicit Dependency in Open Source

    Runzhi He +3

  13. cs.SE 2026-07-02 reviewed
    Prompt coverage uncovers over 30% more faults than code coverage

    Prompt Coverage Adequacy

    Florian Tambon +5

  14. cs.SE 2026-07-02 reviewed
    Model editing cuts LLM package hallucinations by 80 percent

    Mitigating Package Hallucinations in Large Language Models via Model Editing

    Shuhan Liu +5

  15. cs.SE 2026-07-02 reviewed
    Benchmark supplies 40 scalable quantum programs for testing experiments

    Benchmarking Quantum Software Testing with Scalable Quantum Programs

    Yuechen Li +4

  16. cs.SE 2026-07-02 reviewed
    Epic-organized Gherkin beats requirement-aligned on expert quality ratings

    Epic-Organized vs. Requirement-Aligned Gherkin: An Empirical Evaluation of LLM-Based Acceptance Criteria Generation

    Shahbaz Siddeeq +6

  17. cs.SE 2026-07-02 reviewed
    LLMs collapse to one wrong code solution on ambiguous tasks

    Underspecification does not imply Incoherence: The Risks of Semantic Collapse in Coding Models

    Cedric Richter +1

  18. cs.SE 2026-07-02 reviewed
    Visual graphs raise code-agent success on issue resolution

    Beyond Textual Repository Exploration: Dual-Modal Structural Reasoning for Agentic Issue Resolution

    Jiayi Zhang +3

  19. cs.SE 2026-07-02 reviewed
    AI mandate doubles developer throughput to 2.09x baseline

    AI Writes Faster Than Humans Can Review: A Longitudinal Study of an Enterprise 2x Mandate

    Hao He +5

  20. cs.AI 2026-07-02 reviewed
    Prompt metrics add independent signal beyond code size in LLM apps

    Rethinking Complexity Metrics for LLM-Integrated Applications: Beyond Source Code

    Zihao Xu +4

  21. cs.SE 2026-07-02 reviewed
    Detected LLM code share fell in repos from 2021-2025

    An Exploratory Study on LLM-Generated Code and Comments in Code Repositories

    Yongyi Ji +4

  22. cs.SE 2026-07-02 reviewed
    Verification Gate raises final-turn LLM code quality on every model

    Regression Accumulation in Multi-Turn LLM Programming Conversations

    Yonghui (Andie) Huang +5

  23. cs.SE 2026-07-02 reviewed
    Friction metric flags maintenance hotspots in industrial codebases

    Technical Debt Friction for Maintenance Prioritization: An Industrial Multi-Case Study

    Simeon Tverdal +5

  24. cs.SE 2026-07-02 reviewed
    Uncertainty signals weaken when defect predictors move across projects

    Understanding Software Defect Prediction: A Large-scale Empirical Study Across Uncertainty Quantification and Performance Evaluation

    Ranjun Peng +4

  25. cs.SE 2026-07-02 reviewed
    AI coding agents raise code complexity without cutting newcomer inflow

    Decoupling Code Complexity from Newcomer Participation: A Causal Study of AI Coding Agent Adoption in OSS

    Weiwei Xu +3

  26. cs.SE 2026-07-02 reviewed
    Archer flags semantic bugs in 21% of open LLVM PRs

    Archer: Towards Agentic Review for Compiler Optimizations

    Yunbo Ni +1

  27. cs.SE 2026-07-02 reviewed
    Multi-agent system localizes microservice root causes at 0.88 accuracy

    KRCA: An Efficient Root Cause Analysis System in Hyper-Scale Microservice Systems via Agentic AI

    Jiamin Jiang +11

  28. cs.SE 2026-07-02 reviewed
    Refploit repairs trajectories to reproduce 80% of Java exploits

    Refploit: Facilitating Exploit Construction via Code-Agent Trajectory Repair

    Zirui Chen +5

  29. cs.CR 2026-07-02 reviewed
    Evolved rules from few examples beat large models at smart contract checks

    Knowledge Over Parameters: Evolving Smart Contract Vulnerability Detection

    Yuqiang Sun +6

  30. cs.CV 2026-07-02 reviewed
    Captioning models filter UI noise better than pixel diffs

    Beyond Pixel Diffs: Benchmarking Image Change Captioning for Web UI Visual Regression Testing

    Licheng Zhang +3

  31. cs.SE 2026-07-02 reviewed
    Tool detects infinite loops in 47 LLM agent projects

    When Agents Do Not Stop: Uncovering Infinite Agentic Loops in LLM Agents

    Xinyi Hou +3

  32. cs.SE 2026-07-02 reviewed
    AgentFlow maps 238 prompt-to-tool risks via dependency graphs

    AgentFlow: Building Agent Dependency Graphs for Static Analysis of Agent Programs

    Shenao Wang +4

  33. cs.SE 2026-07-02 reviewed
    Fusing repeated edits across candidates solves 41 bugs no single one fixes

    A Single Patch Is Not Enough: Deterministic Fusion of Repair Candidates

    Boyang Yang +6

  34. cs.AI 2026-07-02 reviewed
    Hawk lifts NPU kernel accuracy from 49% to 80%

    Hawk: Harnessing Hardware-Aware Knowledge for High-Performance NPU Kernel Generation

    Junyi Wen +9

  35. cs.SE 2026-07-02 reviewed
    Parr curve overlaid on team capacity forecasts agile completion

    A Capacity-Aware Parr Model for Agile Projects

    Pedro E. Colla

  36. cs.SE 2026-07-01 reviewed
    Kani verifies 16000+ Rust harnesses per stdlib change

    Kani: A Model Checker for Rust

    Rémi Delmas +11

  37. cs.SE 2026-07-01 reviewed
    GitHub issues expose four key hurdles for Matter IoT standard

    Insights from GitHub Community on the Matter Standard: Developer Perspectives and Challenges

    Muhammad Hassan +3

  38. cs.SE 2026-07-01 reviewed
    99% of SKILL.md files contain persistent skill smells

    From Anatomy to Smells: An Empirical Study of SKILL.md in Agent Skills

    David Boram Hong +2

  39. cs.SE 2026-07-01 reviewed
    Risk coverage drops sharply for AI-native teams at boundaries

    Risk Architecture for AI-Native Engineering Teams: An Organizational Framework for Agentic System Governance

    Laxmipriya Ganesh Iyer

  40. cs.SE 2026-07-01 reviewed
    CLI AI agents raise merged PRs by 24 percent

    Adoption and Impact of Command-Line AI Coding Agents: A Study of Microsoft's Early 2026 Rollout of Claude Code and GitHub Copilot CLI

    Emerson Murphy-Hill +2

  41. cs.SE 2026-07-01 reviewed
    Wrapper classifies GPU failures at 0.997 F1 with 3ms overhead

    GPUAlert: A Zero-Instrumentation Process-Boundary Monitor for Diagnosing GPU Training-Job Failures

    Parv Agarwal +1

  42. cs.SE 2026-07-01 reviewed
    Brain model signals show no link to YouTube replays

    A global predicted-fMRI drive signal from TRIBE does not predict YouTube replay heatmaps

    Barada Sahu +1

  43. cs.SE 2026-07-01 reviewed
    Benchmark tracks LLM code repairs through feedback stages

    Benchmarking Code Improvement with Progressive, Adaptive, and Interactive Feedback

    Cuong Chi Le +3

  44. cs.AI 2026-07-01 reviewed
    Rewrite method certifies 105 of 185 expert problems at 91% precision

    Theoria: Rewrite-Acceptability Verification over Informal Reasoning States

    Ben Slivinski +1

  45. cs.SE 2026-07-01 reviewed
    LLM agents rescue 41.5% of drifted repos with test edits blocked

    RepoRescue: An Empirical Study of LLM Agents on Whole-Repository Compatibility Rescue

    Zhihao Lin +6

  46. cs.SE 2026-07-01 reviewed
    Coding-agent benchmarks unreliable due to machine variance and solved tasks

    Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?

    Zhi Chen +4

  47. cs.CL 2026-07-01 reviewed
    Benchmark separates model inability from policy confusion in safety tests

    Adversarial Pragmatics for AI Safety Evaluation: A Benchmark for Instruction Conflict, Embedded Commands, and Policy Ambiguity

    Brett Reynolds

  48. cs.SE 2026-07-01 reviewed
    Agent skills form hidden supply chains with reuse and risk patterns

    Skills Are Not Islands: Measuring Dependency and Risk in Agent Skill Supply Chains

    Changguo Jia +3

  49. cs.SE 2026-07-01 reviewed
    Graph layer turns prompts into reliable diagram edits

    SAGE: Structured Agentic Graph Editing for Software Diagrams

    Tyler Sivertsen +2

  50. cs.SE 2026-07-01 reviewed
    LexTester graph of Lex chats detects four times more faults

    A Model-based Testing Technique for Amazon Lex Task-based Chatbots

    Diego Clerissi +2