archive

Every paper Pith has read. Search by title, abstract, or pith.

2587 papers in cs.SE · page 1

cs.SE 2026-07-02 reviewed

Benchmark tracks test changes after code commits
TestEvo-Bench: An Executable and Live Benchmark for Test and Code Co-Evolution

Jiale Amber Wang +2
cs.SE 2026-07-02 reviewed

Traffic model spots REST API attacks at 82% recall without docs
HTTP REST API Structure Learning

Ran Dubin +1
cs.SE 2026-07-02 reviewed

Reasoning effort raises perfect agent code runs from 28% to 89%
Reasoning effort, not tool access, buys first-try reliability in agentic code generation: an observational study

Achint Mehta
cs.AI 2026-07-02 reviewed

Constraints lift coding-agent backdoor recall from 54.5% to 90.9%
Steerability via constraints: a substrate for scalable oversight of coding agents

Thomas Winninger
cs.SE 2026-07-02 reviewed

Agents fix specific LLVM missed opts but often mismatch developer scope
Understanding Agent-Based Patching of Compiler Missed Optimizations

Batu Guan +2
cs.CR 2026-07-02 reviewed

Static scanners miss cloaked malicious agent skills
Cloak and Detonate: Scanner Evasion and Dynamic Detection of Agent Skill Malware

Zimo Ji +6
cs.SE 2026-07-02 reviewed

Fuzzing spots 1000+ hidden intents in combined AI skills
SkillFuzz: Fuzzing Skill Composition for Implicit Intents Discovery in Open Skill Marketplaces

Jinwei Hu +3
cs.SE 2026-07-02 reviewed

Mixing AI coding modes reduces efficiency gains
Developers' Experience with Generative AI Beyond Productivity Assessment -- Insights from an Empirical Mixed-Methods Field Study

Charlotte Brandebusemeyer +4
cs.SE 2026-07-02 reviewed

VLP lifts LLM code pass rates from 29-73% to 65-93%
Guiding Human Validation of LLM-Generated Code via Verifiable Literate Programming

Ziqi Yuan +4
cs.CV 2026-07-02 reviewed

Optimized synthetic scenes expose 10x more VLM errors in cars
Search-based Testing of Vision Language Models for In-Car Scene Understanding

Lev Sorokin +3
cs.SE 2026-07-02 reviewed

Coding agents guess actions on vague DevOps instructions 56-68% of the time
Coding Agents Are Guessing: Measuring Action-Boundary Violations in Underspecified DevOps Instructions

Zimo Ji +6
cs.SE 2026-07-02 reviewed

File copying removes four dependency signals
File-Level Copying Is an Implicit Dependency in Open Source

Runzhi He +3
cs.SE 2026-07-02 reviewed

Prompt coverage uncovers over 30% more faults than code coverage
Prompt Coverage Adequacy

Florian Tambon +5
cs.SE 2026-07-02 reviewed

Model editing cuts LLM package hallucinations by 80 percent
Mitigating Package Hallucinations in Large Language Models via Model Editing

Shuhan Liu +5
cs.SE 2026-07-02 reviewed

Benchmark supplies 40 scalable quantum programs for testing experiments
Benchmarking Quantum Software Testing with Scalable Quantum Programs

Yuechen Li +4
cs.SE 2026-07-02 reviewed

Epic-organized Gherkin beats requirement-aligned on expert quality ratings
Epic-Organized vs. Requirement-Aligned Gherkin: An Empirical Evaluation of LLM-Based Acceptance Criteria Generation

Shahbaz Siddeeq +6
cs.SE 2026-07-02 reviewed

LLMs collapse to one wrong code solution on ambiguous tasks
Underspecification does not imply Incoherence: The Risks of Semantic Collapse in Coding Models

Cedric Richter +1
cs.SE 2026-07-02 reviewed

Visual graphs raise code-agent success on issue resolution
Beyond Textual Repository Exploration: Dual-Modal Structural Reasoning for Agentic Issue Resolution

Jiayi Zhang +3
cs.SE 2026-07-02 reviewed

AI mandate doubles developer throughput to 2.09x baseline
AI Writes Faster Than Humans Can Review: A Longitudinal Study of an Enterprise 2x Mandate

Hao He +5
cs.AI 2026-07-02 reviewed

Prompt metrics add independent signal beyond code size in LLM apps
Rethinking Complexity Metrics for LLM-Integrated Applications: Beyond Source Code

Zihao Xu +4
cs.SE 2026-07-02 reviewed

Detected LLM code share fell in repos from 2021-2025
An Exploratory Study on LLM-Generated Code and Comments in Code Repositories

Yongyi Ji +4
cs.SE 2026-07-02 reviewed

Verification Gate raises final-turn LLM code quality on every model
Regression Accumulation in Multi-Turn LLM Programming Conversations

Yonghui (Andie) Huang +5
cs.SE 2026-07-02 reviewed

Friction metric flags maintenance hotspots in industrial codebases
Technical Debt Friction for Maintenance Prioritization: An Industrial Multi-Case Study

Simeon Tverdal +5
cs.SE 2026-07-02 reviewed

Uncertainty signals weaken when defect predictors move across projects
Understanding Software Defect Prediction: A Large-scale Empirical Study Across Uncertainty Quantification and Performance Evaluation

Ranjun Peng +4
cs.SE 2026-07-02 reviewed

AI coding agents raise code complexity without cutting newcomer inflow
Decoupling Code Complexity from Newcomer Participation: A Causal Study of AI Coding Agent Adoption in OSS

Weiwei Xu +3
cs.SE 2026-07-02 reviewed

Archer flags semantic bugs in 21% of open LLVM PRs
Archer: Towards Agentic Review for Compiler Optimizations

Yunbo Ni +1
cs.SE 2026-07-02 reviewed

Multi-agent system localizes microservice root causes at 0.88 accuracy
KRCA: An Efficient Root Cause Analysis System in Hyper-Scale Microservice Systems via Agentic AI

Jiamin Jiang +11
cs.SE 2026-07-02 reviewed

Refploit repairs trajectories to reproduce 80% of Java exploits
Refploit: Facilitating Exploit Construction via Code-Agent Trajectory Repair

Zirui Chen +5
cs.CR 2026-07-02 reviewed

Evolved rules from few examples beat large models at smart contract checks
Knowledge Over Parameters: Evolving Smart Contract Vulnerability Detection

Yuqiang Sun +6
cs.CV 2026-07-02 reviewed

Captioning models filter UI noise better than pixel diffs
Beyond Pixel Diffs: Benchmarking Image Change Captioning for Web UI Visual Regression Testing

Licheng Zhang +3
cs.SE 2026-07-02 reviewed

Tool detects infinite loops in 47 LLM agent projects
When Agents Do Not Stop: Uncovering Infinite Agentic Loops in LLM Agents

Xinyi Hou +3
cs.SE 2026-07-02 reviewed

AgentFlow maps 238 prompt-to-tool risks via dependency graphs
AgentFlow: Building Agent Dependency Graphs for Static Analysis of Agent Programs

Shenao Wang +4
cs.SE 2026-07-02 reviewed

Fusing repeated edits across candidates solves 41 bugs no single one fixes
A Single Patch Is Not Enough: Deterministic Fusion of Repair Candidates

Boyang Yang +6
cs.AI 2026-07-02 reviewed

Hawk lifts NPU kernel accuracy from 49% to 80%
Hawk: Harnessing Hardware-Aware Knowledge for High-Performance NPU Kernel Generation

Junyi Wen +9
cs.SE 2026-07-02 reviewed

Parr curve overlaid on team capacity forecasts agile completion
A Capacity-Aware Parr Model for Agile Projects

Pedro E. Colla
cs.SE 2026-07-01 reviewed

Kani verifies 16000+ Rust harnesses per stdlib change
Kani: A Model Checker for Rust

Rémi Delmas +11
cs.SE 2026-07-01 reviewed

GitHub issues expose four key hurdles for Matter IoT standard
Insights from GitHub Community on the Matter Standard: Developer Perspectives and Challenges

Muhammad Hassan +3
cs.SE 2026-07-01 reviewed

99% of SKILL.md files contain persistent skill smells
From Anatomy to Smells: An Empirical Study of SKILL.md in Agent Skills

David Boram Hong +2
cs.SE 2026-07-01 reviewed

Risk coverage drops sharply for AI-native teams at boundaries
Risk Architecture for AI-Native Engineering Teams: An Organizational Framework for Agentic System Governance

Laxmipriya Ganesh Iyer
cs.SE 2026-07-01 reviewed

CLI AI agents raise merged PRs by 24 percent
Adoption and Impact of Command-Line AI Coding Agents: A Study of Microsoft's Early 2026 Rollout of Claude Code and GitHub Copilot CLI

Emerson Murphy-Hill +2
cs.SE 2026-07-01 reviewed

Wrapper classifies GPU failures at 0.997 F1 with 3ms overhead
GPUAlert: A Zero-Instrumentation Process-Boundary Monitor for Diagnosing GPU Training-Job Failures

Parv Agarwal +1
cs.SE 2026-07-01 reviewed

Brain model signals show no link to YouTube replays
A global predicted-fMRI drive signal from TRIBE does not predict YouTube replay heatmaps

Barada Sahu +1
cs.SE 2026-07-01 reviewed

Benchmark tracks LLM code repairs through feedback stages
Benchmarking Code Improvement with Progressive, Adaptive, and Interactive Feedback

Cuong Chi Le +3
cs.AI 2026-07-01 reviewed

Rewrite method certifies 105 of 185 expert problems at 91% precision
Theoria: Rewrite-Acceptability Verification over Informal Reasoning States

Ben Slivinski +1
cs.SE 2026-07-01 reviewed

LLM agents rescue 41.5% of drifted repos with test edits blocked
RepoRescue: An Empirical Study of LLM Agents on Whole-Repository Compatibility Rescue

Zhihao Lin +6
cs.SE 2026-07-01 reviewed

Coding-agent benchmarks unreliable due to machine variance and solved tasks
Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?

Zhi Chen +4
cs.CL 2026-07-01 reviewed

Benchmark separates model inability from policy confusion in safety tests
Adversarial Pragmatics for AI Safety Evaluation: A Benchmark for Instruction Conflict, Embedded Commands, and Policy Ambiguity

Brett Reynolds
cs.SE 2026-07-01 reviewed

Agent skills form hidden supply chains with reuse and risk patterns
Skills Are Not Islands: Measuring Dependency and Risk in Agent Skill Supply Chains

Changguo Jia +3
cs.SE 2026-07-01 reviewed

Graph layer turns prompts into reliable diagram edits
SAGE: Structured Agentic Graph Editing for Software Diagrams

Tyler Sivertsen +2
cs.SE 2026-07-01 reviewed

LexTester graph of Lex chats detects four times more faults
A Model-based Testing Technique for Amazon Lex Task-based Chatbots

Diego Clerissi +2