archive
Every paper Pith has read.
2587 papers in cs.SE
-
Benchmark tracks test changes after code commits
TestEvo-Bench: An Executable and Live Benchmark for Test and Code Co-Evolution
-
Traffic model spots REST API attacks at 82% recall without docs
HTTP REST API Structure Learning
-
Reasoning effort raises perfect agent code runs from 28% to 89%
Reasoning effort, not tool access, buys first-try reliability in agentic code generation: an observational study
-
Constraints lift coding-agent backdoor recall from 54.5% to 90.9%
Steerability via constraints: a substrate for scalable oversight of coding agents
-
Agents fix specific LLVM missed opts but often mismatch developer scope
Understanding Agent-Based Patching of Compiler Missed Optimizations
-
Static scanners miss cloaked malicious agent skills
Cloak and Detonate: Scanner Evasion and Dynamic Detection of Agent Skill Malware
-
Fuzzing spots 1000+ hidden intents in combined AI skills
SkillFuzz: Fuzzing Skill Composition for Implicit Intents Discovery in Open Skill Marketplaces
-
Mixing AI coding modes reduces efficiency gains
Developers' Experience with Generative AI Beyond Productivity Assessment -- Insights from an Empirical Mixed-Methods Field Study
-
VLP lifts LLM code pass rates from 29-73% to 65-93%
Guiding Human Validation of LLM-Generated Code via Verifiable Literate Programming
-
Optimized synthetic scenes expose 10x more VLM errors in cars
Search-based Testing of Vision Language Models for In-Car Scene Understanding
-
Coding agents guess actions on vague DevOps instructions 56-68% of the time
Coding Agents Are Guessing: Measuring Action-Boundary Violations in Underspecified DevOps Instructions
-
File copying removes four dependency signals
File-Level Copying Is an Implicit Dependency in Open Source
-
Prompt coverage uncovers over 30% more faults than code coverage
Prompt Coverage Adequacy
-
Model editing cuts LLM package hallucinations by 80 percent
Mitigating Package Hallucinations in Large Language Models via Model Editing
-
Benchmark supplies 40 scalable quantum programs for testing experiments
Benchmarking Quantum Software Testing with Scalable Quantum Programs
-
Epic-organized Gherkin beats requirement-aligned on expert quality ratings
Epic-Organized vs. Requirement-Aligned Gherkin: An Empirical Evaluation of LLM-Based Acceptance Criteria Generation
-
LLMs collapse to one wrong code solution on ambiguous tasks
Underspecification does not imply Incoherence: The Risks of Semantic Collapse in Coding Models
-
Visual graphs raise code-agent success on issue resolution
Beyond Textual Repository Exploration: Dual-Modal Structural Reasoning for Agentic Issue Resolution
-
AI mandate doubles developer throughput to 2.09x baseline
AI Writes Faster Than Humans Can Review: A Longitudinal Study of an Enterprise 2x Mandate
-
Prompt metrics add independent signal beyond code size in LLM apps
Rethinking Complexity Metrics for LLM-Integrated Applications: Beyond Source Code
-
Detected LLM code share fell in repos from 2021-2025
An Exploratory Study on LLM-Generated Code and Comments in Code Repositories
-
Verification Gate raises final-turn LLM code quality on every model
Regression Accumulation in Multi-Turn LLM Programming Conversations
-
Friction metric flags maintenance hotspots in industrial codebases
Technical Debt Friction for Maintenance Prioritization: An Industrial Multi-Case Study
-
Uncertainty signals weaken when defect predictors move across projects
Understanding Software Defect Prediction: A Large-scale Empirical Study Across Uncertainty Quantification and Performance Evaluation
-
AI coding agents raise code complexity without cutting newcomer inflow
Decoupling Code Complexity from Newcomer Participation: A Causal Study of AI Coding Agent Adoption in OSS
-
Archer flags semantic bugs in 21% of open LLVM PRs
Archer: Towards Agentic Review for Compiler Optimizations
-
Multi-agent system localizes microservice root causes at 0.88 accuracy
KRCA: An Efficient Root Cause Analysis System in Hyper-Scale Microservice Systems via Agentic AI
-
Refploit repairs trajectories to reproduce 80% of Java exploits
Refploit: Facilitating Exploit Construction via Code-Agent Trajectory Repair
-
Evolved rules from few examples beat large models at smart contract checks
Knowledge Over Parameters: Evolving Smart Contract Vulnerability Detection
-
Captioning models filter UI noise better than pixel diffs
Beyond Pixel Diffs: Benchmarking Image Change Captioning for Web UI Visual Regression Testing
-
Tool detects infinite loops in 47 LLM agent projects
When Agents Do Not Stop: Uncovering Infinite Agentic Loops in LLM Agents
-
AgentFlow maps 238 prompt-to-tool risks via dependency graphs
AgentFlow: Building Agent Dependency Graphs for Static Analysis of Agent Programs
-
Fusing repeated edits across candidates solves 41 bugs no single one fixes
A Single Patch Is Not Enough: Deterministic Fusion of Repair Candidates
-
Hawk lifts NPU kernel accuracy from 49% to 80%
Hawk: Harnessing Hardware-Aware Knowledge for High-Performance NPU Kernel Generation
-
Parr curve overlaid on team capacity forecasts agile completion
A Capacity-Aware Parr Model for Agile Projects
-
Kani verifies 16000+ Rust harnesses per stdlib change
Kani: A Model Checker for Rust
-
GitHub issues expose four key hurdles for Matter IoT standard
Insights from GitHub Community on the Matter Standard: Developer Perspectives and Challenges
-
99% of SKILL.md files contain persistent skill smells
From Anatomy to Smells: An Empirical Study of SKILL.md in Agent Skills
-
Risk coverage drops sharply for AI-native teams at boundaries
Risk Architecture for AI-Native Engineering Teams: An Organizational Framework for Agentic System Governance
-
CLI AI agents raise merged PRs by 24 percent
Adoption and Impact of Command-Line AI Coding Agents: A Study of Microsoft's Early 2026 Rollout of Claude Code and GitHub Copilot CLI
-
Wrapper classifies GPU failures at 0.997 F1 with 3ms overhead
GPUAlert: A Zero-Instrumentation Process-Boundary Monitor for Diagnosing GPU Training-Job Failures
-
Brain model signals show no link to YouTube replays
A global predicted-fMRI drive signal from TRIBE does not predict YouTube replay heatmaps
-
Benchmark tracks LLM code repairs through feedback stages
Benchmarking Code Improvement with Progressive, Adaptive, and Interactive Feedback
-
Rewrite method certifies 105 of 185 expert problems at 91% precision
Theoria: Rewrite-Acceptability Verification over Informal Reasoning States
-
LLM agents rescue 41.5% of drifted repos with test edits blocked
RepoRescue: An Empirical Study of LLM Agents on Whole-Repository Compatibility Rescue
-
Coding-agent benchmarks unreliable due to machine variance and solved tasks
Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?
-
Benchmark separates model inability from policy confusion in safety tests
Adversarial Pragmatics for AI Safety Evaluation: A Benchmark for Instruction Conflict, Embedded Commands, and Policy Ambiguity
-
Agent skills form hidden supply chains with reuse and risk patterns
Skills Are Not Islands: Measuring Dependency and Risk in Agent Skill Supply Chains
-
Graph layer turns prompts into reliable diagram edits
SAGE: Structured Agentic Graph Editing for Software Diagrams
-
LexTester graph of Lex chats detects four times more faults
A Model-based Testing Technique for Amazon Lex Task-based Chatbots