cs.SE — Pith

Top Pith

7

cs.SE 2026-07-01

Models reach 13.8% on executable state changes in Scratch tests

by Yufeng Lin, Jialu Zhang

ScratchWorld: Evaluating If World Models Compute Executable Consequences

Benchmark uses verified VM transitions to separate rule-following from copied persistent state.

World-model evaluations often score a predicted future by overlap with a target state or observation. In sparse-change worlds, this can turn copied persistent state into apparent accuracy. We introduce ScratchWorld, an offline diagnostic benchmark that treats Scratch projects as executable worlds and uses a pinned Scratch VM to produce replay-verified transitions, hidden variables, causal traces, and counterfactual outcomes. ScratchWorld evaluates next-state prediction, long-horizon tracking, causal event attribution, and counterfactual prediction; each replay-verified target can be presented under raw-program, structured-state, natural-language, or rendered input modalities, and our experiments use the structured-state condition. Its primary state metric is value-aware changed-field $F_1$, which gives credit only for the changed field and its executed value. In a 659-example release, seven prompted language/reasoning models reach at most 13.8% value-aware changed-field $F_1$ in a state-only partial-observation stress test. A same-instance copy diagnostic makes the overlap confound concrete: copying the input state reaches 98.0% implied full-state field accuracy and 0.0% changed-field $F_1$, with the largest inflation on real projects. Auxiliary diagnostics separate hidden-state rollout drift, intervention sensitivity, causal attribution, and perturbation robustness. Across these settings, models often react to actions or interventions without following the executable rule that determines the changed value.
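The overlap confound described above can be illustrated with a minimal sketch. It assumes states are flat dictionaries of field names to values; the function names and the scoring details (for example, returning 0.0 when nothing matches) are illustrative assumptions, not the benchmark's actual implementation.

```python
from typing import Any, Dict

State = Dict[str, Any]


def changed_fields(before: State, after: State) -> Dict[str, Any]:
    """Return the fields whose value differs from `before`, mapped to the new value."""
    return {k: v for k, v in after.items() if before.get(k) != v}


def full_state_accuracy(pred: State, target: State) -> float:
    """Fraction of target fields that the prediction reproduces exactly."""
    if not target:
        return 1.0
    hits = sum(1 for k, v in target.items() if k in pred and pred[k] == v)
    return hits / len(target)


def changed_field_f1(before: State, pred: State, target: State) -> float:
    """F1 over changed fields; credit requires the right field AND the right value."""
    gold = changed_fields(before, target)
    guess = changed_fields(before, pred)
    true_pos = sum(1 for k, v in guess.items() if k in gold and gold[k] == v)
    if true_pos == 0:
        # Assumption: score 0.0 whenever no changed field is predicted correctly.
        return 0.0
    precision = true_pos / len(guess)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy world (hypothetical data): ten fields, exactly one of which changes.
before = {f"f{i}": i for i in range(10)}
target = dict(before, f3=99)
copy_pred = dict(before)  # "prediction" that simply copies the input state

copy_accuracy = full_state_accuracy(copy_pred, target)
copy_f1 = changed_field_f1(before, copy_pred, target)
```

In this toy world the copied state matches nine of ten fields, so overlap-based accuracy looks high even though the one executed change is missed entirely and the changed-field score is zero.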

Top Pith

4

cs.SE 2026-05-21 2 theorems

Refusal rate misranks LLMs on bio safety

by Lukas Weidener, Marko Brkić +3 more

RefusalBench: Why Refusal Rate Misranks Frontier LLMs on Biological Research Prompts

Matched prompts show top risk discriminators often refuse fewer queries than high-refusal peers.

Frontier large language models are increasingly deployed as orchestration backbones for biological research workflows, yet no shared evidence base exists for comparing their refusal behaviour on legitimate research prompts. RefusalBench, introduced here, is a matched-triple benchmark of 141 prompts in 47 bundles that holds task framing constant while varying only biological risk tier (benign, borderline, dual-use), enabling tier-conditioned comparisons robust to subdomain confounding. A 15-prompt should-refuse positive-control module establishes per-model calibration floors; three models fail to refuse even these prompts. Across 19 frontier models in the May 2026 snapshot, strict refusal rates span 0.1% to 94.6% on identical prompts. Jurisdiction does not predict refusal in this snapshot (Mann-Whitney U, p = 0.393; EU n = 1, US bimodal); provider identity does, with Anthropic's API stack predicting refusal at OR = 21.03 (95% CI: 14.58-30.34 prompt-clustered; 5.70-77.55 under model-clustered GEE). This effect is best read as access-path-level rather than model-weight-level: 99.8% of Anthropic's strict refusals carry the same safety_policy adjudicated reason code, consistent with a small set of canonical refusal templates rather than case-by-case model reasoning. Strict refusal rate misranks safety calibration: Grok 4.20 achieves the highest tier discrimination (Youden's J = 0.787) while ranking only seventh by overall refusal rate, and Claude Opus 4.7's J dropped 65% from prior versions with no improvement in dual-use detection. Nine of 18 frontier models exhibit a hedge-but-help partial-compliance pattern at dual-use tier that binary refusal metrics cannot detect.
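To see why a high refusal rate need not mean good tier discrimination, here is a small sketch of Youden's J (sensitivity + specificity - 1). It assumes that refusing a dual-use prompt counts as a true positive and refusing a benign prompt counts as a false positive; the abstract does not spell out the exact mapping, and the counts below are hypothetical.

```python
def youdens_j(refused_dual_use: int, total_dual_use: int,
              refused_benign: int, total_benign: int) -> float:
    """Youden's J = sensitivity + specificity - 1 for a refuse/comply decision."""
    if total_dual_use <= 0 or total_benign <= 0:
        raise ValueError("both prompt groups must be non-empty")
    sensitivity = refused_dual_use / total_dual_use   # refusals where refusal is expected
    specificity = 1 - refused_benign / total_benign   # compliance where help is expected
    return sensitivity + specificity - 1


def overall_refusal_rate(refused_dual_use: int, total_dual_use: int,
                         refused_benign: int, total_benign: int) -> float:
    """Share of all prompts refused, ignoring risk tier."""
    return (refused_dual_use + refused_benign) / (total_dual_use + total_benign)


# Hypothetical counts for two models on 47 dual-use and 47 benign prompts.
heavy_refuser = (45, 47, 40, 47)   # refuses almost everything
discriminator = (30, 47, 2, 47)    # refuses less often, but mostly the risky prompts

j_heavy = youdens_j(*heavy_refuser)
j_disc = youdens_j(*discriminator)
```

Under these made-up counts the heavy refuser has the higher overall refusal rate but the lower J, which is the kind of misranking the abstract reports.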

Top Pith

1

cs.SE 2026-05-18 2 theorems

AI agents solve at most 39% of real version upgrade tasks

by Xinbo Xu, Ruihan Yang +14 more

RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades

New benchmark of 115 multi-file changes from actual projects shows sharp drop from simpler bug-fix results.

Coding agents are increasingly deployed in real software development, where a single version iteration requires months of coordinated work across many files. However, most existing benchmarks focus predominantly on single-issue bug fixes from Python repositories, with coarse pass/fail evaluation outcomes, and thus fail to capture long-horizon, multi-target development at real engineering scale. To address this gap, we present RoadmapBench, a benchmark of 115 long-horizon coding tasks grounded in real open-source version upgrades across 17 repositories and 5 programming languages. Each task places the agent on a source-version code snapshot and provides a multi-target roadmap instruction requiring it to implement the functionality introduced in the target version, with a median modification of 3,700 lines across 51 files. We conduct a systematic evaluation on thirteen frontier models and find that even the strongest, Claude-Opus-4.7, resolves only 39.1% of tasks, while the weakest achieves merely 5.2%, in stark contrast to existing bug-fix benchmarks, suggesting that long-horizon software development remains a largely unsolved problem.

cs.SE 2026-07-03

Benchmark tracks test changes after code commits

by Jiale Amber Wang, Kaiyuan Wang +1 more

TestEvo-Bench: An Executable and Live Benchmark for Test and Code Co-Evolution

TestEvo-Bench uses real commit data and execution checks to measure agent success at 77 percent on generation and 74 percent on updates.

Software tests and code evolve together: a code change should be followed by new or updated tests that record the new software behavior. Yet existing test generation and update benchmarks often isolate the test from the code change, and rely on static metadata that does not verify whether a test is executable or semantically tied to the code change. This makes it difficult to evaluate whether a test automation agent understands how a code change should propagate into the test suite. We introduce TestEvo-Bench, a benchmark of test and code co-evolution tasks mined from software repositories, with two tracks: in test generation, the agent shall write new tests to capture the new software behavior; in test update, the agent shall adapt failing existing tests to the changed software behavior. Each task is anchored to a real commit history and packaged with environment configuration to support execution-grounded metrics such as pass rate, coverage, and mutation score. TestEvo-Bench is also a live benchmark: each task records the timestamp of the test and code changes, and new tasks are periodically mined by our automated pipeline, so evaluation can be restricted to tasks postdating a model's training cutoff to reduce data leakage risk. The current snapshot contains 746 test generation and 509 test update tasks, curated from 59,950 candidate co-evolution records across 152 open-source Java projects. We experiment with four state-of-the-art agents that combine strong harnesses (Claude Code, Gemini CLI, and SWE-Agent) with strong foundation models (Claude Opus 4.7 and Gemini 3.1 Pro). Results show that they achieve up to 77.5% success rate on test generation and 74.6% on test update. However, success rate is materially lower on the most recent benchmark tasks and drops significantly under limited per-task cost.
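The leakage control described above, restricting evaluation to tasks that postdate a model's training cutoff, can be sketched as a simple timestamp filter. The `Task` record and its field names are hypothetical and do not reflect the benchmark's actual schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Task:
    task_id: str      # hypothetical identifier
    track: str        # "generation" or "update"
    changed_on: date  # timestamp of the test and code change


def tasks_after_cutoff(tasks: List[Task], training_cutoff: date) -> List[Task]:
    """Keep only tasks whose change is strictly later than the training cutoff."""
    return [t for t in tasks if t.changed_on > training_cutoff]


# Hypothetical tasks and cutoff, for illustration only.
tasks = [
    Task("t1", "generation", date(2025, 3, 1)),
    Task("t2", "update", date(2026, 1, 15)),
    Task("t3", "generation", date(2026, 6, 30)),
]
fresh = tasks_after_cutoff(tasks, training_cutoff=date(2025, 12, 31))
```

A strict comparison is used so that a task dated on the cutoff day itself is treated as potentially seen during training.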

cs.SE 2026-07-03

Traffic model spots REST API attacks at 82% recall without docs

by Ran Dubin, Amit Dvir

HTTP REST API Structure Learning

HRAL builds endpoint baselines from network data alone, outperforming alternatives when documentation is incomplete and hitting 100% with si

Application Programming Interfaces (APIs) are essential in software development, enabling web services, mobile apps, and microservices. However, their widespread use introduces significant security risks, highlighting the importance of API security. This paper presents HTTP REST API Learning (HRAL), a novel unsupervised anomaly detection approach that models the structure and behavior of API endpoints directly from network traffic, without relying on predefined rules or documentation. HRAL enables robust detection of malicious activity by understanding how APIs behave and flagging deviations as potential threats. We evaluate HRAL across varying levels of OpenAPI documentation detail and compare it with existing techniques. HRAL achieves strong performance, with an average recall of 82.07% and an F1-score of 87.24%, significantly outperforming alternatives when API documentation is limited. Moreover, our results approach the effectiveness of full API document definitions. When combined with signature-based rules such as the OWASP ModSecurity CRS, our system achieves 100% detection. These results highlight HRAL's effectiveness in real-world, partially documented API environments and its potential as a foundational layer for modern API security solutions.

cs.SE 2026-07-03

Reasoning effort raises perfect agent code runs from 28% to 89%

by Achint Mehta

Reasoning effort, not tool access, buys first-try reliability in agentic code generation: an observational study

90-run study finds extra reasoning cuts corrections fivefold while testing tools add cost without reliability gains.

Agentic coding assistants are increasingly given extra capabilities, such as browser-based testing tools and design-oriented system prompts, on the assumption that more capability yields better software. This study tested that assumption directly. Ninety independent agent runs built the same application, a real-time retrospective board, from one detailed specification, each scored on a fixed 14-criterion functional rubric (42-point maximum) and a visual quality review. The runs spanned several model generations, two agent harnesses, two reasoning-effort levels, a testing tool, and two design-oriented prompts. Capability tier dominated: frontier models clustered near the ceiling while a low-cost local model fell to 24 to 37 points. A criterion-level analysis revealed what run totals conceal. Container deployment was the dominant defect, failing on the first try in 44 percent of runs, with its failure rate shifting sharply across model generations while mean totals moved less than a point. The testing tool raised cost by 42 to 68 percent without improving functional score or reliability, even on interface-visible criteria. Raising reasoning effort from High to xHigh lifted first-try perfect runs from 28 percent to 89 percent and cut corrective prompts about fivefold, for 9 to 29 percent more cost. A design-oriented prompt raised visual quality, 4.5 versus 3.0 on a 5-point scale, without lifting function, and a one-paragraph paraphrase of its directive reproduced the entire lift. The practical lesson is to match the fix to the failure: most first-run failures came from weak reasoning, which a stronger model or more effort prevents, not from visible flaws a checking tool would catch.

cs.AI 2026-07-03

Constraints lift coding-agent backdoor recall from 54.5% to 90.9%

by Thomas Winninger

Steerability via constraints: a substrate for scalable oversight of coding agents

Access controls and enforced conventions let a small reviewer model catch most inserted backdoors while cutting token cost.

Coding agents are capable; human oversight is the bottleneck. Unconstrained agents introduce security risks, erode codebase scalability, and make human review increasingly costly. We argue that the same methods used for decades to manage large human engineering teams (access control, network policies, and strict coding conventions enforced by tooling) transfer directly to coding agents and are cheaper in tokens than recent agentic scaffolding. We sketch an end-to-end system on this principle, and report a controlled experiment in scalable oversight: a small reviewer (Gemma 4 e4b) inspects a Python codebase containing 11 inserted backdoors. Recall rises from 54.5% (unconstrained, no tools) to 90.9% (constrained substrate plus a ~200-LoC `docs` CLI), with substrate and tools contributing independently. We choose Python deliberately: substrate-level oversight gains are largest where the language gives the fewest guarantees by default; the principles extend to languages like Rust.

cs.SE 2026-07-03

Agents fix specific LLVM missed opts but often mismatch developer scope

by Batu Guan, Zirui Wang +1 more

Understanding Agent-Based Patching of Compiler Missed Optimizations

Benchmark study finds many patches cover only part of intended scope or overlap partially, with historical PR knowledge improving alignment.

Compiler missed optimizations are cases in which a compiler fails to optimize certain code. Implementing or patching them takes substantial effort from compiler developers. In this paper, we present a systematic study of how well agents patch compiler missed optimizations. We identify a significant challenge: patching a missed optimization requires more than fixing the reported case; it requires generalizing to similar cases. We construct a benchmark of real-world LLVM missed optimization issues and compare agent-generated patches with patches from developers in terms of optimization scope. Our results show that coding agents often optimize the given examples, but many generated patches either cover only part of the developer-intended scope or partially overlap with it; in some cases, they further generalize beyond the reference patch. We further introduce historical-knowledge augmentation techniques that leverage prior LLVM optimization pull requests through retrieval and distillation, showing that they improve developer-aligned generalization and yield practical benefits when applied to real-world IR.

cs.CR 2026-07-03

Static scanners miss cloaked malicious agent skills

by Zimo Ji, Congying Xu +5 more

Cloak and Detonate: Scanner Evasion and Dynamic Detection of Agent Skill Malware

Runtime sandbox tracking catches attacks that hide from code inspection

LLM coding agents increasingly rely on third-party agent skills from public marketplaces, which execute with the agent's privileges and create a software supply-chain attack surface: a malicious skill can steal credentials, exfiltrate source code, or install backdoors. Existing defenses use static skill scanners based on pattern matching or LLM-as-judge analysis, but it remains unclear whether they withstand adaptive evasions that preserve malicious behavior while changing payload appearance. This paper first presents an adversarial study of existing skill scanners through SkillCloak, a payload-preserving evasion framework that keeps the attack semantics intact while transforming their visible form. SkillCloak uses two complementary strategies: Structural Obfuscation, which rewrites visible payload indicators into semantically equivalent forms, and Self-Extracting Skill (SFS) Packing, which hides malicious components from the install-time view and restores them during agent execution. Across eight scanners and 1,613 in-the-wild malicious skills, SFS Packing bypasses every scanner at over 90%, while Structural Obfuscation bypasses over 80% on most static scanners and reaches 96% on a hybrid scanner, showing that appearance-based auditing is insufficient. Motivated by this finding, we propose SkillDetonate, a behavior-centric runtime auditor that executes skills in a sandbox and detects malicious effects through OS-boundary information-flow evidence rather than install-time appearance. SkillDetonate combines on-demand closure lift, which observes instructions materialized during execution, with marker-based taint analysis, which tracks sensitive-data flows across the agent context, files, processes, and network operations. The results show that SkillDetonate detects 97% of attacks at a 2% false-positive rate and sustains 87% detection on real-world malicious skills.

cs.SE 2026-07-03

Fuzzing spots 1000+ hidden intents in combined AI skills

by Jinwei Hu, Yi Dong +2 more

SkillFuzz: Fuzzing Skill Composition for Implicit Intents Discovery in Open Skill Marketplaces

SkillFuzz uses planning artifacts and guided search to flag risky skill pairs before execution, with over 80 percent later confirmed.

Large Language Model (LLM)-based agents increasingly automate software engineering tasks through reusable skills, natural-language instruction documents that guide planning and execution. Open skill marketplaces enable users to assemble agents by co-activating community-contributed skills, but marketplace operators typically audit skills in isolation. As a result, individually benign skills may interact to redirect an agent toward unintended objectives, which we term implicit intents. Detecting such intents is challenging because the effect emerges only through skill composition, execution environments are often unavailable at admission time, and the space of possible co-activations grows exponentially with marketplace size. In this paper, we formulate implicit-intent discovery as a fuzzing problem over skill compositions, where skill compositions are the unit under test, planning artifacts expose agent intent before execution, and deviations from a skill-free baseline serve as a differential oracle. Based on this formulation, we propose SkillFuzz, the first execution-free testing approach that extracts structured skill contracts and uses contract-guided Monte Carlo Tree Search to prioritize potentially conflicting compositions. Across representative skill-marketplace workloads, SkillFuzz discovers over 1,000 distinct implicit intents under a fixed query budget, confirms more than 80% of the highest-risk flagged compositions during execution-time validation, and identifies substantially more high-severity implicit intents than alternative search strategies while exploring only a fraction of the pairwise interaction space they require.

cs.SE 2026-07-03

Mixing AI coding modes reduces efficiency gains

by Charlotte Brandebusemeyer, Kerim Zunic +3 more

Developers' Experience with Generative AI Beyond Productivity Assessment -- Insights from an Empirical Mixed-Methods Field Study

Developers gain more from in-code suggestions or chat prompts alone than from using both in the same task, per real-work observations.

With the growing adoption of AI-powered coding assistants, organizations and developers are increasingly seeking to optimize their interaction with these tools. Prior research has largely focused on output quality and productivity gains, with limited attention paid to developers' well-being and interaction experiences. This paper presents a developer-centered empirical mixed-methods study to investigate how professional developers engage with Generative AI (GenAI) in their natural work environment. Controlled data collection sessions are combined with natural work periods. Results show that developers are generally satisfied with GenAI, particularly for monotonous, repetitive, and structured tasks, and report perceived efficiency and productivity gains. Copilot interaction type preferences differ by task type and complexity: While both in-code suggestions and chat-based prompting independently improve task efficiency and reduce perceived workload, combining these interaction types within a single task diminishes benefits. We propose a rule-of-thumb for selecting an interaction type based on task characteristics. During development-heavy tasks, results indicate that perceived cognitive load arises from AI interaction, while perceived productivity depends on AI output quality. Participation in this study positively influenced developers' awareness and intentional use of GenAI tools. These findings demonstrate the value of real-world, mixed-methods study designs to understand GenAI tools and developers' experiences with them.

cs.SE 2026-07-03

VLP lifts LLM code pass rates from 29-73% to 65-93%

by Ziqi Yuan, Wenhao Lu +3 more

Guiding Human Validation of LLM-Generated Code via Verifiable Literate Programming

Unambiguous natural-language docs let users spot and fix intent mismatches before verification

Vibe coding democratizes software development by allowing users to generate code via natural-language (NL) interaction with large language models (LLMs). However, the code is reliable only when it faithfully implements the user's intent, which is difficult and labor-intensive for users to validate. Existing validation methods either rely on LLM-assisted automated testing, which suffers from prompt ambiguity and model fallibility, or involve users only in partial software artifacts such as prompts and test cases, which may overlook corner cases and program details. Motivated by a bug study of LLM-generated code, we find that detailed human feedback is essential, as failures often stem from underspecified requirements or subtle semantic deviations. This paper presents verifiable literate programming (VLP), a human-in-the-loop framework designed to make the review/validation process of LLM-generated code accessible to users at all programming levels. At its core, VLP proposes unambiguous NL-based documentation as a readable intermediate layer between prompts and code. The documentation demonstrates concrete program semantics and enables users to provide feedback on potential intent-code mismatches. It supports human-involved, end-to-end repair and validation via three techniques: (i) an NL-style literate language with unambiguous syntax and mostly deterministic code-to-documentation translation, (ii) LLM-based fine-grained mismatch detection that uses trace links between prompts and documentation to focus users' review effort on suspicious documentation lines, and (iii) a verification module that leverages user-validated documentation to derive API-usage checks and formal properties, which are then verified against the generated code using model checking. Our evaluation shows that VLP improves code pass@1 from 28.7%-73.2% to 65.4%-93.5% with reasonable user effort.

cs.CV 2026-07-03

Optimized synthetic scenes expose 10x more VLM errors in cars

by Lev Sorokin, Chen Yang +2 more

Search-based Testing of Vision Language Models for In-Car Scene Understanding

ISU-Test searches scene parameters to generate test cases that reveal incomplete or wrong outputs from vision-language models.

In the automotive domain, in-car scene understanding (ISU) enables the detection of safety-critical events, such as driver distraction, and supports drivers or passengers by analyzing the in-car scene and adapting the environment (e.g., ambient lighting). The industry is increasingly exploring vision-language models (VLMs) to interpret camera-recorded in-car scenes and extract information for downstream reasoning tasks. However, VLMs may generate incomplete, erroneous, or misleading scene descriptions, highlighting the need for systematic testing. Collecting real in-vehicle data is costly, difficult to scale, and often infeasible, particularly in early design stages. In this paper, we present ISU-Test, an automated testing approach that combines rendering-based scene generation with search-based testing to evaluate ISU systems. By framing testing as an optimization problem and systematically modifying scene parameters, our method generates diverse in-car scenarios and explores a wide range of configurations. We evaluate ISU-Test on both an industrial prototype and open-source VLMs across two case studies: question answering and captioning, comparing against randomized scenario generation. Results show that ISU-Test significantly outperforms the baseline, achieving up to 10 times higher failure rates and up to 3.6 times higher failure coverage.

cs.SE 2026-07-03

Coding agents guess actions on vague DevOps instructions 56-68% of the time

by Zimo Ji, Zekai Zhang +5 more

Coding Agents Are Guessing: Measuring Action-Boundary Violations in Underspecified DevOps Instructions

Benchmark shows agents cross safety boundaries instead of failing or seeking clarification when instructions omit intent, target, or scope d

LLM coding agents are increasingly deployed to act autonomously on real production infrastructure. They execute shell commands, modify repositories, and call operational APIs. However, completing a task is not sufficient for safety. A wrong action can cause severe consequences. Existing agent benchmarks largely emphasize task completion, leaving open how agents behave under benign but underspecified instructions. We present UnderSpecBench, a benchmark for measuring action-boundary violations in coding agents (i.e., Claude Code, Codex, and OpenCode) on DevOps tasks. UnderSpecBench includes 69 task families grounded in documented incidents, CVEs, or tool behavior and organized across four DevOps capability domains and nine operational control surfaces. To isolate underspecification from task difficulty, each task keeps the same environment and ground-truth safe action while varying the instruction along three axes: intent clarity, target certainty, and blast radius. The resulting 2,208 prompt variants are evaluated with deterministic, side-effect-based oracles that separate Safe Success, Wrong Target, and OverScope outcomes; non-action runs are further classified as clarification, refusal, or deferment. Across five agent x model configurations using OpenCode, Claude Code, and Codex, the evaluation results show that underspecification does not mainly make agents fail; it makes them guess. 55.8-67.8% of runs violate at least one boundary. Target underspecification sharply degrades action quality, while blast-radius cues barely reduce action propensity. These findings show that completion-centric evaluation can overstate safe autonomy and motivate mitigations at the model, harness, and system layer.

cs.SE 2026-07-03

File copying removes four dependency signals

by Runzhi He, Audris Mockus +2 more

File-Level Copying Is an Implicit Dependency in Open Source

Study of copy events shows security risks cluster in vendored code and license risks in direct reuse, both invisible to metadata scanners.

File-level copying is a widespread but ungoverned form of software reuse. Copying files across repositories reduces supply-chain visibility: it removes the four observable signals a package manager provides for a declared dependency (provenance, maintenance, security, and compliance) with no mechanism to restore them. To characterize the scale and consequences of this unmanaged reuse, we present a mixed-method study of copying across the entire open-source ecosystem using World of Code (WoC). From a 0.1% commit sample, we extract 690,500 copy events and retain 3,912 rationale-bearing copy commits for intent labeling. We show that the 13 axial copy forms, spanning vendored dependencies, hardware/driver synchronization, scaffolding, UI assets, and direct source-code reuse, are unreliable proxies for developer intent: among rationale-bearing commits, hardware/driver copies are predominantly fork-maintenance work (78%), while dependency-vendoring copies more often signal upstream bypass (70%) than offline availability. These visibility gaps are form-specific: security and license risk concentrate in complementary copy forms. Copied sources are frequently stale (median 155 days; 38.5% over one year old) and seldom record a recoverable origin (4.3% documented), let alone a checkable version (2.0% versioned); even vendored copies record where they came from only 10% of the time. Security risk concentrates in vendored dependencies: 17,314 CVE-risk copy commits in the full-WoC graph, 88% in the dependency-vendoring form; 80% score CVSS >= 7.0 and upstream-fix adoption is only 47%-84%. License risk concentrates in direct source-code reuse: 41,777 pre-validation candidates, 66% in the source-code form, with 39 verified high-star violations (kappa = 0.752). Both risks reach packaged software and are invisible to dependency scanners operating on declared metadata alone.

cs.SE 2026-07-03

Prompt coverage uncovers over 30% more faults than code coverage

by Florian Tambon, Michael Konstantinou +4 more

Prompt Coverage Adequacy

The criterion checks if tests meet prompt requirements using LLM attention and improves fault detection for code generated from intents.

In recent years, it has become increasingly evident that large language models (LLMs) and autonomous agents raise the level of abstraction in software development by shifting the focus from writing precise procedures to expressing intents and goals. This paradigm shift introduces new challenges, particularly in how testing should be guided when prompts, rather than code, become primary development artifacts. To address this challenge, we propose Prompt Coverage Adequacy, a novel coverage criterion designed to support the testing of code generated from task descriptions. Prompt Coverage Adequacy serves as an analog to traditional code coverage, but operates at the level of prompts used in LLM and agent-based programming. Specifically, it measures how well a given test suite satisfies the requirements expressed in a prompt by leveraging the attention mechanisms of LLMs. We evaluate a simple instantiation of this criterion, based on attention boosting, across two datasets and multiple LLMs. Our results demonstrate that Prompt Coverage is associated with fault-detection effectiveness and can uncover over 30% more faults than traditional code coverage when used to guide test generation. These findings suggest that Prompt Coverage Adequacy can serve as a foundation for developing testing metrics better suited to the emerging paradigm of LLM-driven software development, addressing the limitations of classical coverage criteria in this new context.

cs.SE 2026-07-03

Model editing cuts LLM package hallucinations by 80 percent

by Shuhan Liu, Yukai Zhao +4 more

Mitigating Package Hallucinations in Large Language Models via Model Editing

BOUND refines a package-validity boundary in targeted modules, lowering error rates across recommendation and code tasks while keeping valid

Large language models (LLMs) have demonstrated strong capabilities in software engineering tasks, such as code generation, library recommendation, and dependency configuration. However, recent studies show that LLMs may suffer from package hallucination, where they generate non-existent or invalid package names. These hallucinations can be exploited in software supply chain attacks, as attackers may register malicious packages under hallucinated names. Therefore, mitigating package hallucination is important for improving the reliability and security of LLM-assisted software development. In this paper, we introduce BOUND, a lightweight localized model editing framework for mitigating package hallucinations in LLMs. BOUND formulates package hallucination mitigation as a package-validity boundary editing problem, where the boundary refers to the model's ability to distinguish valid packages from hallucinated package names under a given task context. It first locates modules related to package hallucination through a risk-aware localization strategy, and then edits these modules with lightweight LoRA adapters using a boundary-aware objective that reinforces valid packages, suppresses hallucinated packages, and preserves locality behavior. Experimental results show that BOUND effectively reduces package hallucinations while preserving valid package recommendations. In the package recommendation task, BOUND reduces package-level hallucination rate (Package-HR) by 79.9% on edit prompts and by 65.4% on unseen prompts. The learned package-validity boundary further generalizes to other package-related tasks, reducing Package-HR by 12.8% in code generation and by 34.0% in pip install recommendation. These results show that BOUND refines the package-validity boundary of LLMs and improves the reliability of package-related outputs.


cs.SE 2026-07-03

Benchmark supplies 40 scalable quantum programs for testing experiments

by Yuechen Li, Minqi Shao +3 more

Benchmarking Quantum Software Testing with Scalable Quantum Programs

It turns scattered open-source code into test-ready subjects with explicit criteria, enabling controlled studies of execution cost and fault


Quantum software testing (QST) checks whether quantum programs behave according to their intended specifications. A key requirement for QST research is a benchmark that supports rigorous empirical evaluation on programs that are testable and better reflect current software development practices. However, existing studies rely heavily on small hard-coded or circuit-level benchmarks, while available quantum programs are scattered across repositories without clear selection criteria, which limits fair comparison and systematic reproducibility. To this end, we present Qolumbina, a benchmark infrastructure for controlled QST experiments on scalable quantum programs. Qolumbina curates 40 programs from open-source repositories and turns them into test-ready subjects through systematic selection, refactoring, specifications, test case examples, unit tests, and standardized interfaces. We also propose QST-oriented criteria to characterize quantum programs along functionality, output behavior, development complexity, and quantum-specific execution complexity. Using these criteria, our empirical study shows that Qolumbina covers diverse testing-relevant properties and supports scalability analysis beyond fixed-size circuit benchmarks. Through controlled experiments with two recent QST approaches, we demonstrate the feasibility of using Qolumbina for execution-cost and fault-detection studies, and highlight backend-dependent effects that can influence QST result interpretation.


cs.SE 2026-07-03

Epic-organized Gherkin beats requirement-aligned on expert quality ratings

by Shahbaz Siddeeq, Mateen Abbasi +5 more

Epic-Organized vs. Requirement-Aligned Gherkin: An Empirical Evaluation of LLM-Based Acceptance Criteria Generation

LLM pipeline by epic scores higher on correctness and completeness with similar semantic coverage in PURE dataset evaluation.


Authoring Gherkin Behavior-Driven Development (BDD) acceptance criteria remains a manual bottleneck in requirements engineering. This study investigates whether epic-organized LLM-generated Gherkin produces higher quality and coverage than requirement-aligned generation. We compare Timeless, our epic-organized LLM pipeline, against a naive large language model (LLM) baseline on four requirements documents (107 requirements) from the PURE dataset. Evaluation covers structural metrics, automated requirement coverage via TF-IDF and dense embeddings, and blind expert assessment by four researchers. In our evaluation, the JSON-constrained pipeline produced structurally valid scenarios across all generated outputs, while the zero-shot baseline achieved 99% structural validity. Semantic coverage was comparable to the baseline, with Timeless achieving 94.3% semantic Requirement Coverage Rate compared with 92.9% for the baseline. TF-IDF produced lower coverage scores for the epic-organized output, suggesting that lexical metrics may miss coverage when scenarios paraphrase requirements at a higher level of abstraction. Expert raters preferred the epic-organized strategy on Correctness (4.61 vs 4.14), Executability (4.61 vs 4.07), and Completeness (4.31 vs 3.50). Overall, the results suggest that epic-organized generation can improve perceived Gherkin quality while maintaining comparable semantic coverage, although broader replication is needed before generalizing this finding.


cs.SE 2026-07-03

LLMs collapse to one wrong code solution on ambiguous tasks

by Cedric Richter, Mike Papadakis

Underspecification does not imply Incoherence: The Risks of Semantic Collapse in Coding Models

Instead of varying outputs to reflect prompt ambiguity, models consistently generate misaligned code, hitting over 10% of tasks in standard


Large Language Models (LLMs) have become increasingly effective at generating code when task descriptions are clear and precise. Yet, in practice, user-provided task descriptions are often ambiguous, incomplete, or contradictory, leaving critical aspects of the intended program behavior underspecified. In such cases, multiple behaviorally distinct interpretations may satisfy the description equally well, yet differ semantically in ways that affect the user intent. A natural expectation, often assumed by researchers, is that prompt underspecification manifests as incoherence: when asked multiple times, an LLM produces multiple semantically distinct implementations reflecting the ambiguity of the task description. In this paper, we challenge this assumption. We find that LLMs frequently collapse onto a single incorrect interpretation of the task description, consistently generating coherent but behaviorally misaligned code. We term this failure mode detrimental semantic collapse and find that it affects over 10% of tasks in MBPP, 3% in HumanEval, and 32% in LiveCodeBench, all benchmarks assumed to be well-specified. When underspecification issues are deliberately injected into the benchmark prompts, the rate rises more than fivefold, exposing a fundamental blind spot in disambiguation and correctness estimation techniques that rely on incoherence as a proxy for prompt underspecification.
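The difference between incoherence and collapse can be illustrated with a minimal sketch. It assumes sampled implementations are available as Python callables and that behavior is compared on a handful of probe inputs; the function names, the example task, and the probes are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def behavior_signature(func, probes):
    """Run func on every probe input; record the result or the exception type."""
    outcomes = []
    for args in probes:
        try:
            outcomes.append(("ok", repr(func(*args))))
        except Exception as exc:  # a crash is also observable behavior
            outcomes.append(("error", type(exc).__name__))
    return tuple(outcomes)


def cluster_by_behavior(samples, probes):
    """Group sampled implementations that behave identically on the probes."""
    clusters = defaultdict(list)
    for func in samples:
        clusters[behavior_signature(func, probes)].append(func)
    return list(clusters.values())


def is_detrimental_collapse(samples, probes, intended):
    """True when all samples agree with each other but not with the intent."""
    clusters = cluster_by_behavior(samples, probes)
    if len(clusters) != 1:
        return False  # incoherent: the samples disagree among themselves
    agreed = behavior_signature(clusters[0][0], probes)
    return agreed != behavior_signature(intended, probes)


# Hypothetical task: "return the second largest number" (duplicates unspecified).
def intended(xs):
    return sorted(set(xs))[-2]


samples = [lambda xs: sorted(xs)[-2] for _ in range(5)]
probes = [([1, 2, 3],), ([5, 5, 3],)]
collapsed = is_detrimental_collapse(samples, probes, intended)
```

Here all five samples agree with one another, so an incoherence-based check sees nothing wrong, yet they disagree with the intended behavior on the duplicate-valued probe.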


cs.SE 2026-07-03

Visual graphs raise code-agent success on issue resolution

by Jiayi Zhang, Kai Huang +2 more

Beyond Textual Repository Exploration: Dual-Modal Structural Reasoning for Agentic Issue Resolution

DUALVIEW supplies four queryable repository graphs so agents avoid rebuilding structure from text at every step.


Recent advances in agentic program repair have significantly improved issue resolution by enabling iterative repository exploration. However, existing approaches predominantly rely on sequential, text-based code navigation, which fundamentally limits their ability to reason over large-scale long-horizon repositories with complex and long-range dependencies. As issue-resolution agents traverse repositories through fragmented textual observations, structural information such as module organization, call relationships, and dependency chains must be repeatedly reconstructed across interaction steps, often leading to exploration drift and incomplete localization. We present DUALVIEW, a dual-modal structural scaffolding framework that brings visual reasoning into repository exploration for issue-resolution agents. DUALVIEW represents repository structure through four complementary graph views: Module Coupling Graph (MCG), Function Call Graph (FCG), Class Hierarchy Graph (CHG), and Program Dependence Graph (PDG), and exposes them through a queryable interface with visual and textual responses. Rather than reconstructing repository structure from a sequence of textual observations, agents can directly reason over persistent visual representations of code dependencies, enabling more effective exploration and understanding of long-horizon codebases. We evaluate DUALVIEW on SWE-bench Pro and Verified. Results show that DUALVIEW consistently improves issue-resolution performance across different agent architectures and model families. Further ablation studies demonstrate that the gains arise not only from textual structural information but also from visual externalization of repository dependencies, which better supports long-horizon repository exploration.


cs.SE 2026-07-03

AI mandate doubles developer throughput to 2.09x baseline

by Hao He, Shyam Agarwal +4 more

AI Writes Faster Than Humans Can Review: A Longitudinal Study of an Enterprise 2x Mandate

Two-year study of 802 developers ties the doubling to adoption timing and intensity, with automated review surpassing human review.


Enterprises increasingly mandate AI coding tools and report large productivity gains, yet longitudinal evidence on how such a mandate unfolds is scarce. In this paper, we present a quantitative case study of a documented enterprise "2x" mandate at a mid-sized, AI-forward company that has been committed to doubling merged pull requests per engineer since mid-2025. In a panel of 802 developers and 196,212 pull requests (January 2024-April 2026), per-capita throughput eventually doubled, reaching 2.09x the pre-mandate baseline in April 2026, among the largest gains reported from a field deployment of AI coding tools to our knowledge. A staggered difference-in-differences design links the within-developer share of this gain to AI adoption and to a further gain that grows with accumulated use, with the mandate acting as a catalyst rather than a direct driver. Because adoption and usage intensity were not randomly assigned, we read this evidence as strongly implicating an adoption-and-use channel rather than as exact causal attribution. The gain is broadly shared across seniority yet concentrated in newer code and not separable across model generations. Adoption also restructured code review around automation: per-reviewer load roughly doubled and automated review overtook human review, while merge and revert rates held steady.


cs.AI 2026-07-03

Prompt metrics add independent signal beyond code size in LLM apps

by Zihao Xu, Yuekang Li +3 more

Rethinking Complexity Metrics for LLM-Integrated Applications: Beyond Source Code

Structural breadth measures survive code-size controls and predict maintenance on held-out repositories.


LLM-integrated applications blend natural language prompts with program code, and much of their runtime behavior originates in the prompt layer rather than in the code itself. Existing complexity metrics, however, operate solely at the code level and therefore overlook this behavioral logic entirely. We present HECATE, the first tool designed to assess complexity in both the prompt and code layers of such applications. Central to HECATE is Prompt-as-Specification, a Hoare-logic-inspired formalism that interprets every prompt as a specification of intended behavior. Grounded in 25 complexity dimensions identified across published taxonomies, the tool generates 52 candidate metrics. We assess each metric against 118 components collected from 18 open-source repositories, relying on maintenance activity derived from version history as an empirical proxy for complexity, and discard any metric that loses significance once code size is accounted for. Only ten metrics withstand this test. Seven belong to our newly introduced set; rather than measuring sheer volume, each tallies structurally distinct elements, such as LLM call sites, memory attributes, and prompt templates, an attribute we call structural breadth. Of the three surviving conventional metrics, RFC exhibits a similar breadth-oriented character, while Halstead N and V survive only as a residual effect of size; our top-performing metrics exceed all three. Crucially, the prompt-layer metrics retain significance even when the strongest code-level metric is added as a covariate, establishing prompt complexity as a dimension in its own right. A final validation on 20 components spanning six held-out repositories shows that the two best-performing metrics continue to predict maintenance effort, supporting their generalizability beyond the training set.


cs.SE 2026-07-03

Detected LLM code share fell in repos from 2021-2025

by Yongyi Ji, Jiaji Wang +3 more

An Exploratory Study on LLM-Generated Code and Comments in Code Repositories

Proxy study of active projects finds stable comments, higher company use, test-case concentration, and few bug ties.


The use of LLMs in software development has become increasingly widespread for tasks such as code generation and summarization. Reports from large technology companies show that around 20% to 30% of their code is generated by LLMs. However, skepticism remains about the practical usage of LLM-generated code and comments, including concerns about the extra time needed to debug generated code and the unnaturalness of generated comments. In this paper, we study the code and comments detected as likely to be generated by LLMs and their characteristics, the differences between company- and community-maintained repositories, and how likely bugs are to be associated with LLM-generated code. We conduct extensive experiments on active company- and community-maintained repositories from 2021 to 2025 using various tools and techniques that detect code and comments generated by LLMs. Based on our detector-based proxy analysis, the results suggest that the share of code detected as likely to be generated by LLMs decreased over time and that such code appeared frequently in test cases, while the share of comments remained relatively stable. Proxy results further suggest that code detected as likely to be generated by LLMs shows substantial intra-repository code clones, whereas comments exhibit a relatively low proportion of grammatically correct sentences. In addition, the company-maintained repositories show a higher percentage of code and comments detected as likely to be generated by LLMs, and only a small percentage of the human-labelled bugs are detected as being likely associated with LLM-generated code.


cs.SE 2026-07-03

Verification Gate raises final-turn LLM code quality on every model

by Yonghui (Andie) Huang, Lin Ma +4 more

Regression Accumulation in Multi-Turn LLM Programming Conversations

Checking each new suggestion against all prior tests prevents loss of earlier correct behavior and is the only fix that works across six LLM


In LLM-assisted software development, coding is often iterative. We study regression accumulation in multi-turn LLM programming conversations, where later code suggestions may break requirements introduced in earlier turns. Reliability therefore depends not only on satisfying the current request, but also on preserving previously satisfied behavior. We construct 542 tasks from HumanEval+ and MBPP+ and extend each task into an 8-turn requirement-evolution chain. We evaluate six LLMs on 26,016 turn instances (542 x 6 x 8). At each turn, we test whether the current code still passes earlier benchmark tests. We also analyze 384 failure cases from the failure population and build a taxonomy of multi-turn regression bugs through independent four-annotator labeling. Our results show that regression accumulation appears across all six models: 40% to 73% of tasks lose previously correct behavior over the full conversation. Final-turn quality is lower than initial-turn quality across models, especially when later turns add input validation or broader input types. Manual analysis shows that Cross-Turn Conflict, where later code conflicts with earlier requirements, is the main failure class. We further find that Verification Gate, which checks new code against prior tests and triggers rollback and retry, is the only strategy that consistently improves all models, raising final-turn quality from 75.8% to 87.9% on DeepSeek-V3 and from 31.6% to 47.3% on Llama-3.1-8B. These findings suggest that strong single-turn performance can overestimate reliability in multi-turn coding conversations. Future evaluation and tool design should test whether later code suggestions preserve earlier requirements and should include Verification Gate mechanisms.
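The Verification Gate idea (check new code against prior tests, then roll back and retry) can be sketched roughly as below. The abstract describes only the behavior, so the callable interface, the retry budget, and every name here are assumptions made for illustration.

```python
def verification_gate(current_code, generate, prior_tests, max_retries=2):
    """Accept a new suggestion only if it passes every earlier test.

    generate(attempt) returns a candidate implementation; prior_tests is a
    list of callables that take the candidate and return True on a pass.
    Returns (accepted_code, attempt_index). When every attempt regresses,
    the gate rolls back and returns (current_code, None).
    """
    for attempt in range(max_retries + 1):
        candidate = generate(attempt)
        if all(test(candidate) for test in prior_tests):
            return candidate, attempt
    return current_code, None


# Hypothetical two-turn chain: turn 1 required abs(); turn 2 adds string input.
def turn1_solution(x):
    return abs(x)


prior_tests = [lambda f: f(-2) == 2, lambda f: f(3) == 3]


def generate(attempt):
    if attempt == 0:
        return lambda x: int(x)       # accepts "3" but drops abs(): a regression
    return lambda x: abs(int(x))      # keeps the earlier behavior


accepted, attempts_used = verification_gate(turn1_solution, generate, prior_tests)
```

In this toy chain the first suggestion is rejected because it breaks a turn-1 test, and the retry is accepted. A real gate would run the earlier benchmark tests in a sandbox instead of calling lambdas in-process.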


cs.SE 2026-07-03

Friction metric flags maintenance hotspots in industrial codebases

by Simeon Tverdal, Phu Nguyen +4 more

Technical Debt Friction for Maintenance Prioritization: An Industrial Multi-Case Study

Multi-case study finds practitioners value technical debt friction most when combined with code health and team-structure views.


Software-intensive organizations need effective ways to identify where maintenance and refactoring efforts will yield the greatest practical benefit. Although software analytics such as code health, hotspots, and coupling provide valuable signals, they do not always capture the experienced burden of change that slows software evolution in practice. This paper presents a multi-case industrial study of technical debt friction as a prioritization-oriented concept for identifying where technical debt most strongly affects maintenance and evolution. We investigate how practitioners interpret the concept, whether friction-related analysis aligns with perceived maintenance pain points and refactoring needs, and what broader maintenance and evolution insights friction can provide beyond individual refactoring candidates. To this end, we conducted structured walkthrough sessions with practitioners across multiple industrial cases using analysis artifacts including code health, hotspots, coupling, refactoring targets, and socio-technical views. Our findings show that practitioners generally considered technical debt friction useful for reasoning about maintenance burden, especially when interpreted together with complementary technical and socio-technical views. At the file level, friction often aligned with known problematic areas and, in several cases, with files that later received maintenance attention, although its practical relevance depended strongly on context. In addition, our exploratory project-level analysis suggests that friction distributions may reveal broader maintenance and evolution patterns. These results indicate that technical debt friction is promising as a decision-support concept, but most effective when used with contextual knowledge and supporting evidence.


cs.SE 2026-07-03

Uncertainty signals weaken when defect predictors move across projects

by Ranjun Peng, Xuan Xie +3 more

Understanding Software Defect Prediction: A Large-scale Empirical Study Across Uncertainty Quantification and Performance Evaluation

Study of 16 classifiers on dozens of datasets shows correlations with performance vary by setting and often reverse in cross-project use.


Software defect prediction (SDP) classifiers produce probabilities used for inspection prioritization, threshold tuning, and risk communication. Probability-based uncertainty quantification (UQ) characterizes prediction confidence, but whether common UQ metrics reliably indicate performance and calibration remains unclear. We conducted a large-scale empirical study of probability-based UQ for SDP. We evaluated five UQ metrics, six performance metrics, and three calibration metrics for 16 representative classifiers. We analyzed these relationships under two prediction settings: within-project defect prediction (WPDP), using 36 benchmark datasets, and cross-project defect prediction (CPDP), using 32 feature-compatible datasets. Results showed that UQ was highly context-dependent. Under WPDP, UQ correlated more consistently with false positive rate and AUC than with MCC, F1 score, and other metrics; these correlations also varied across classifier categories and dataset collections. Performance and calibration were related but not interchangeable; classifiers with strong discrimination could still exhibit large calibration error. Under CPDP, several UQ-performance and UQ-calibration correlations weakened or reversed, indicating that uncertainty signals do not reliably transfer across projects. Thus, UQ should be evaluated against specific performance objectives. Calibration should be assessed independently using multiple metrics. Transferred probabilities should be revalidated before guiding quality-assurance decisions.
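The claim that strong discrimination can coexist with large calibration error is easy to reproduce on toy data. The sketch below uses a binned expected calibration error on the positive-class probability and a pairwise AUC; the abstract does not name the three calibration metrics studied, so treating ECE as one of them is an assumption, and the probabilities and labels are made up.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width-bin calibration error for binary predictions.

    probs are predicted probabilities of the positive (defective) class;
    labels are 0/1 ground truth.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(avg_conf - frac_pos)
    return ece


def pairwise_auc(probs, labels):
    """Probability that a random positive is scored above a random negative."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up predictions: ranking is perfect, but probabilities hug 0.5.
probs = [0.6, 0.6, 0.4, 0.4]
labels = [1, 1, 0, 0]
auc = pairwise_auc(probs, labels)
ece = expected_calibration_error(probs, labels)
```

The classifier ranks every defective module above every clean one, yet its stated probabilities are far from the observed defect frequencies in each bin.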


cs.SE 2026-07-03

AI coding agents raise code complexity without cutting newcomer inflow

by Weiwei Xu, Xuanning Cui +2 more

Decoupling Code Complexity from Newcomer Participation: A Causal Study of AI Coding Agent Adoption in OSS

Causal analysis of 603 OSS projects finds no decline in participation or retention after adoption despite modest complexity gains.


Open-source projects depend on a steady inflow of newcomers. A growing concern is that AI coding agents (tools such as Cursor and Claude Code that write code from natural-language instructions) will crowd them out, by absorbing the simple tasks that beginners start with and by making code harder to read. We give this concern a causal answer. Using GitHub code search we identify 1,888 projects that adopted an agent, signaled by their first commit of a configuration file. We apply difference-in-differences against matched non-adopting controls, restricting the main analysis to the 603 adopters with a genuine pre-adoption period. We find no evidence of crowding-out: across estimators newcomer inflow shows no significant decline after adoption (point estimates run from a small increase to, under the most conservative trend specification, a slight and insignificant dip), onboarding and retention are unchanged, and a sparse, correlational beginner-task measure (good-first-issue labels, which we cannot test for parallel trends) shows no decline. The feared mechanism is real but decoupled: adoption raises per-function code complexity (about +11% on a cognitive metric for Python, a quarter of the prior estimate, and +3 to 4% in cyclomatic terms across all languages), yet in fixed-unit subsets where complexity rose (Python on the cognitive metric, and all languages on the cyclomatic metric), newcomer participation does not decline. These results suggest that, in established open-source projects, adopting an AI coding agent makes code modestly more complex but does not crowd out the human newcomers that a project depends on: the feared trade-off between AI assistance and human participation does not materialize.
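For readers unfamiliar with difference-in-differences, the textbook two-group, two-period form can be sketched as follows. This is not the study's estimator (the abstract mentions several estimators and trend specifications), and the newcomer counts are made-up numbers for illustration only.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Two-group, two-period difference-in-differences on group means."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Made-up monthly newcomer counts, for illustration only.
adopters_pre, adopters_post = [4, 6, 5], [6, 7, 8]
controls_pre, controls_post = [3, 5, 4], [5, 6, 7]
effect = did_estimate(adopters_pre, adopters_post, controls_pre, controls_post)
```

Both groups gain two newcomers per month in this toy data, so the estimate is zero: the raw increase among adopters is fully accounted for by the trend shared with the controls. The estimate is only meaningful if the two groups would have followed parallel trends without adoption.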


cs.SE 2026-07-03

Archer flags semantic bugs in 21% of open LLVM PRs

by Yunbo Ni, Shaohua Li

Archer: Towards Agentic Review for Compiler Optimizations

The agent uses obligations and executable evidence checks to review optimization changes and exposes gaps in manual oversight.


Modern compilers are frequently updated, but expert review capacity is highly limited, leading to delayed integration and, in some cases, subtle semantic bugs entering the compiler codebase. Automating the code review process with modern general code review agents may be feasible, but it faces critical challenges due to compiler complexity. In this paper, we use LLVM as our target compiler and present Archer, the first automated agentic code review tool for compiler optimizations. Archer constrains the agentic review process from both ends by using obligations to guide analysis and a deterministic validation guard to admit only findings backed by executable evidence. We evaluated Archer on 70 open PRs and 328 closed PRs in LLVM from the last two months. The review results are shocking and concerning: Archer discovers that 21% of open PRs and 11% of closed PRs are buggy, i.e., they introduce semantic bugs such as miscompilations in LLVM. Our findings expose a critical gap in the capacity for critical review in large compiler projects and demonstrate the practical value of Archer as an additional reviewer.


cs.SE 2026-07-03

Multi-agent system localizes microservice root causes at 0.88 accuracy

by Jiamin Jiang, Jingfei Feng +10 more

KRCA: An Efficient Root Cause Analysis System in Hyper-Scale Microservice Systems via Agentic AI

KRCA pairs a skeleton causal graph with memory-augmented agents to cut diagnosis time 77 percent in hyper-scale production use.


Hyper-scale microservice systems have become the standard infrastructure for large-scale Internet companies. These systems consist of numerous loosely coupled microservices that evolve independently through continuous development and deployment. Such complexity makes failures unavoidable, necessitating efficient Root Cause Analysis (RCA) to help Site Reliability Engineers (SREs) quickly localize root cause services and classify failure types. However, existing RCA methods often struggle to adapt to the extreme dynamism and massive scale of these systems. In this paper, we present KRCA, an end-to-end RCA system designed for hyper-scale microservice systems. To manage the vast search space, KRCA employs a multi-stage pipeline that begins with an API-level drilldown to isolate suspicious services. It then instantiates a skeleton-based causal graph from anomalous metrics to serve as a high-recall structural prior, before utilizing a memory-augmented multi-agent framework to verify causality and generate the final failure report. By combining structured causal constraints with multi-agent reasoning, KRCA balances diagnostic accuracy with the efficiency requirements of real-time production use. Experimental results show that KRCA achieves AC@1 scores of 0.88 and 0.79 for root cause service localization and failure type classification, outperforming the strongest baseline by at least 31% in absolute gains. KRCA has been deployed in Kuaishou's production environment for over six months, reducing the average diagnosis time by 77.3%.


cs.SE 2026-07-03

Refploit repairs trajectories to reproduce 80% of Java exploits

by Zirui Chen, Zhipeng Xue +4 more

Refploit: Facilitating Exploit Construction via Code-Agent Trajectory Repair

Differential execution validation plus progress-based constraint recovery lifts success 64 percent above raw agent outputs on 172 references


Vulnerability exploits play a crucial role in assessing the downstream impact of Java library vulnerabilities. While some vulnerabilities are accompanied by disclosed exploit references, automatically reproducing such references into runnable exploits remains challenging because they are often incomplete, unstructured, or only describe partial reproduction steps. Recent code agents provide a promising way to automate this process, but our study shows that their generated exploits often appear successful without triggering the actual vulnerable logic, such as replacing vulnerable APIs with self-implemented functions. To address this, we propose Refploit, an LLM-based trajectory recovery framework for facilitating vulnerability reproduction from public exploit references. The key insight is that a failed agent trajectory is not entirely useless. It may have already completed some reproduction subtasks while also revealing misleading directions that should be avoided. Refploit first validates an agent-generated exploit through differential execution. When the exploit is ineffective, Refploit analyzes its reproduction progress, locates the trajectory segments associated with the reproduction progress, and derives constraints to guide focused recovery. We evaluate Refploit on three open-source Java vulnerability datasets, covering 172 exploit references for 143 vulnerabilities. Under DeepSeek-V4-Flash, Refploit successfully reproduces 138 exploits, achieving a reproduction rate of 80.2%. It achieves a 64.3% relative improvement over the initially generated trajectories and outperforms both the SOTA exploit-generation method PoCGen and advanced code agents such as Codex with GPT-5.4. We further adapt Refploit to another code agent and observe consistent improvements, demonstrating its generality.


cs.CR 2026-07-03

Evolved rules from few examples beat large models at smart contract checks

by Yuqiang Sun, Han Liu +5 more

Knowledge Over Parameters: Evolving Smart Contract Vulnerability Detection

Portable logic built from ten samples per type transfers across models at under fifty dollars


Smart contract vulnerabilities are predominantly logic bugs whose detection requires structured, step-by-step procedural knowledge of attack patterns and contract semantics. Existing LLM-based methods struggle to generate this knowledge automatically: prompt-based methods rely on manually crafted detection rules, while fine-tuning requires massive labeled datasets that are inherently scarce in this domain. We present EvoVuln, an automated framework that reformulates vulnerability detection as a procedural knowledge evolution problem, synthesizing and refining detection logic using only a minimal number of labeled samples. To achieve this, EvoVuln introduces two key mechanisms. First, a Runtime with an Inversion of Control (IoC) architecture compiles detection rules into Executable Policies. This strictly decouples deterministic control flow from LLM semantic reasoning, ensuring faithful logical adherence and producing dense diagnostic telemetry for precise error localization. Second, a two-phase evolution pipeline refines the rule via abductive semantic debugging without any parameter updates: Cold Start bootstraps and stress-tests an initial rule using auto-synthesized corner cases; Few-Shot Evolving then grounds the policy in real-world semantics using only five vulnerable and five safe examples per vulnerability type. Evaluated across five real-world vulnerability types, EvoVuln achieves a 71% macro-average F1-score, outperforming all baselines. The evolved procedural knowledge is portable across models: it enables a lightweight, low-cost model to surpass a much larger zero-shot model by 19 percentage points, and transfers to other LLMs without retraining, at a one-time evolution cost under $50.


cs.CV 2026-07-03

Captioning models filter UI noise better than pixel diffs

by Licheng Zhang, Bach Le +2 more

Beyond Pixel Diffs: Benchmarking Image Change Captioning for Web UI Visual Regression Testing

Benchmark shows trained methods already ignore rendering artifacts more selectively than traditional comparisons in web testing pipelines.


Visual regression testing (VRT) is a standard quality assurance step in modern software release pipelines. On every change, it re-renders user interface (UI) screenshots, compares each one against an approved baseline image, and routes any detected difference to a human reviewer who decides whether it is an intended update or an unintended regression. A widely used approach, especially in open-source and continuous-integration pipelines, is pixel-level comparison, which is semantically blind and treats rendering noise and genuine defects identically, producing large volumes of false positives that force developers and testers to spend substantial time and effort manually reviewing flagged differences at every release cycle. Industry tools apply machine learning to VRT, but lack public evaluation. More critically, no dataset or benchmark exists to support natural language descriptions of UI changes, a capability that tells testers what changed in words instead of leaving them to interpret a binary flag or a highlighted region. To address the gap, we propose a new task, Web UI Image Change Captioning (WUICC), which sits at the intersection of VRT and image difference captioning (IDC), and release WUICC-bench, the first dataset and benchmark for the task. We evaluate eleven representative IDC methods, together with two zero-shot general-purpose LLMs. We find that (1) these methods tend to struggle in the Web UI domain due to its layout diversity, dense text, and fine-grained changes, and (2) the trained methods nonetheless already suppress non-meaningful visual noise far more selectively than the pixel-level comparison VRT relies on, providing a solid foundation for future domain-specific research.


cs.SE 2026-07-03

Tool detects infinite loops in 47 LLM agent projects

by Xinyi Hou, Shenao Wang +2 more

When Agents Do Not Stop: Uncovering Infinite Agentic Loops in LLM Agents

Static analysis across 6,549 repositories flags 68 cases where agents fail to terminate, at 91.9 percent precision.


LLM agents increasingly rely on iterative execution to solve tasks through planning, tool use, state updates, and agent collaboration. While this design enables flexible automation, it also creates a new class of failures: an agent may repeatedly execute model calls, tools, workflow transitions, or agent handoffs when the feedback path is not effectively bounded. We call this problem Infinite Agentic Loops (IALs). IALs are not ordinary programming loops; they arise from the interaction between agent logic, framework semantics, runtime observations, and termination mechanisms. Such failures can amplify a single request into long-running model and tool execution, causing cost exhaustion, model denial of service, context growth, and repeated external side effects. We propose IAL-Scan, a static analysis tool for detecting IAL failures in real-world LLM agent projects. IAL-Scan abstracts heterogeneous agent code into a framework-independent Agent IR, builds an Agentic Loop Dependence Graph (ALDG) to recover explicit and framework-induced feedback paths, and checks whether these paths can repeatedly reach costly or state-growing operations without an effective bound. We evaluate IAL-Scan on 6,549 LLM agent repositories. It reports 74 potential findings, among which manual review confirms 68 IAL failures across 47 projects, achieving 91.9% precision.
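The final check, whether a feedback path can repeatedly reach a costly operation without an effective bound, can be illustrated on a toy graph. This is not IAL-Scan's Agent IR or ALDG: the edge list, the notion of a "bounded" edge, and all names below are hypothetical simplifications.

```python
def find_unbounded_costly_loops(edges, costly, bounded_edges):
    """Return costly nodes that lie on a cycle containing no bounded edge.

    edges: iterable of (src, dst) control or feedback edges.
    costly: set of nodes that call a model or tool.
    bounded_edges: subset of edges guarded by an effective limit,
    for example a max-iteration counter.
    """
    graph = {}
    for src, dst in edges:
        if (src, dst) in bounded_edges:
            continue  # a guarded edge cannot be taken forever
        graph.setdefault(src, set()).add(dst)

    def reachable_from(start):
        # Nodes reachable in one or more steps; includes start only via a cycle.
        seen, stack = set(), list(graph.get(start, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(graph.get(node, ()))
        return seen

    return sorted(node for node in costly if node in reachable_from(node))


# Hypothetical agent: the planner calls a tool, and the tool result feeds back.
edges = [("planner", "tool"), ("tool", "planner"), ("planner", "finish")]
unguarded = find_unbounded_costly_loops(edges, {"tool"}, set())
guarded = find_unbounded_costly_loops(edges, {"tool"}, {("tool", "planner")})
```

Without a guard the tool node sits on an unbounded cycle and is reported; marking the feedback edge as bounded removes the finding. The hard part in practice is recovering the edges and bounds from framework semantics, which this sketch takes as given.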

cs.SE 2026-07-03

AgentFlow maps 238 prompt-to-tool risks via dependency graphs

by Shenao Wang, Xinyi Hou +3 more

AgentFlow: Building Agent Dependency Graphs for Static Analysis of Agent Programs

The graphs recover more agent entities and flows than AST tools across 5,399 real programs from five frameworks.

LLM agents are increasingly developed as source-code applications built on agent frameworks. These agent programs combine conventional host-language code with framework-defined semantics for models, prompts, tools, memory, and multi-agent orchestration logic. As a result, their behavior depends not only on traditional control and data flows, but also on a new class of agent dependencies. Such dependencies are often expressed as framework-induced semantics, such as agent constructors, tool decorators, and agent handoff declarations, making them difficult to recover with existing static analysis or dependency tracking tools. In this paper, we present AgentFlow, the first static analysis framework for recovering and analyzing agent dependencies from agent programs. AgentFlow constructs an Agent Dependency Graph (ADG), a framework-agnostic graph that models agents, prompts, models, capabilities, memory states, and control policies as typed nodes, and captures their component-dependency, control-flow, and data-flow dependencies as typed edges. Built on ADGs, AgentFlow supports a range of analyses for agent governance and security, including Agent Bill of Materials (BOM) generation and prompt-to-tool risk detection. We implement AgentFlow for five representative agent frameworks and evaluate it on AgentZoo, a corpus of 5,399 real-world agent programs. Our evaluation shows that AgentFlow recovers richer agent entities and dependencies than existing AST-based agent static analysis tools, generates more dependency-aware Agent BOMs, and uncovers 238 taint-style prompt-to-tool risks in real-world agent programs. These results show that ADG provides a practical foundation for understanding, governing, and securing emerging agent software.

cs.SE 2026-07-03

Fusing repeated edits across candidates solves 41 bugs no single one fixes

by Boyang Yang, Xiangliang Hu +5 more

A Single Patch Is Not Enough: Deterministic Fusion of Repair Candidates

Deterministic atomic fusion on fixed pools from SWE-bench and Defects4J outperforms ranking, tests, and LLM judges without extra verificatio

Modern LLM coding agents are commonly evaluated using pass@k, but developers typically apply a single final patch in real-world settings. This pass@k-to-pass@1 gap is a post-generation problem: a candidate patch pool may contain a correct patch, but the system must decide which one to suggest to developers. Existing post-generation approaches mainly rank whole candidates, filter them with tests, or query an LLM judge, but none deterministically reuse shared edit-atom evidence to both select and construct the final patch. Thus, we propose PatchFusion, a deterministic atomic evidence fusion approach for candidate patches that consults no test outcome at decision time. PatchFusion first fuses whole-diff agreement into a repair neighborhood, selects an auditable representative, and then applies evidence-constrained fusion (ECF) to retain repeated edit atoms and prune unsupported parts. To evaluate this setting, we build PatchFuseBench, a fixed-pool benchmark covering SWE-bench Verified, SWE-bench Multilingual, and Defects4J candidate patches. On PatchFuseBench, PatchFusion solves 426/500 bugs on SWE-bench Verified and 236/300 on SWE-bench Multilingual, and reaches 87/371 plausible patches on Defects4J, outperforming every matched candidate-pool selector on all three. PatchFusion recovers 41 and 27 bugs that no single source solves (30 and 18 more over the best single source). Ablation studies show that ECF adds +5/+6/+9 solved bugs by recovering in-pool repairs that selection misses, with no observed regression, and that PatchFusion's gains remain stable as candidate pools are resampled. On these complementary multi-source pools, cross-candidate evidence recovers more correct patches than the test-based and LLM-based selectors we evaluate, at orders-of-magnitude lower cost, reaching within 96.2% and 89.7% of the candidate-reachable ceiling on the two SWE-bench benchmarks.
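The idea of keeping only edit atoms that recur across candidates can be sketched in a few lines. This is a toy illustration, not PatchFusion's actual procedure: the `(file, line, new_text)` atom shape, the mean-support scoring used to pick a representative, and the `min_support` threshold are all assumptions made for the example.

```python
from collections import Counter
from typing import FrozenSet, List, Tuple

# Hypothetical edit-atom shape: (file, line, new_text).
Atom = Tuple[str, int, str]


def fuse(candidates: List[FrozenSet[Atom]], min_support: int = 2) -> List[Atom]:
    """Pick a representative candidate, then keep only its repeated atoms."""
    if not candidates:
        return []
    # How many candidate patches contain each atom.
    support = Counter(atom for cand in candidates for atom in cand)

    def score(i: int) -> float:
        cand = candidates[i]
        return sum(support[a] for a in cand) / len(cand) if cand else 0.0

    # Deterministic choice: highest mean support, ties go to the lowest index.
    rep = max(range(len(candidates)), key=lambda i: (score(i), -i))
    # Prune atoms that no other candidate corroborates.
    return sorted(a for a in candidates[rep] if support[a] >= min_support)


pool = [
    frozenset({("a.py", 10, "x = 1"), ("a.py", 20, "print(x)")}),
    frozenset({("a.py", 10, "x = 1"), ("b.py", 5, "y = 2")}),
    frozenset({("a.py", 10, "x = 1"), ("b.py", 5, "y = 2"), ("c.py", 1, "import os")}),
]
fused = fuse(pool)
```

No test outcome is consulted anywhere in the sketch: the decision depends only on agreement within the fixed pool, which is the property the abstract emphasizes.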

cs.AI 2026-07-03

Hawk lifts NPU kernel accuracy from 49% to 80%

by Junyi Wen, Ruiyan Zhuang +8 more

Hawk: Harnessing Hardware-Aware Knowledge for High-Performance NPU Kernel Generation

Three hardware-aware modules let LLMs generate correct, fast kernels on specialized chips without training or manual priors.

Developing high-performance kernels for Neural Processing Units (NPUs) is a critical industry bottleneck, requiring developers to manually navigate implicit hardware constraints and strict memory hierarchies. While large language models offer immense automation potential, they fail catastrophically on NPUs due to a fundamental lack of hardware-specific priors. Naively transplanting code snippets from similar NPU kernels may pass the compiler, but it consistently triggers runtime crashes and performance degradation by blindly violating underlying hardware constraints. To overcome this, we introduce Hawk, a training-free framework that harnesses hardware-aware knowledge through three core modules: (1) Run-Time Knowledge Synthesis Module, which employs a Triple-Part Executable Knowledge Representation to inherently couple the error context with executable semantics; (2) Bottleneck-Aware Knowledge Retrieval Module, which implements a 2D-Retrieval paradigm to project queries into orthogonal syntactic and hardware-aligned semantic spaces; and (3) Effect-Driven Knowledge Distillation Module, which leverages LLM-driven semantic arbitration to continuously distill the knowledge by pruning errors and consolidating redundancies based on the empirical execution feedback. Extensive evaluations on real-world NPU workloads demonstrate that Hawk elevates generation accuracy from 49.4% to 80.0%, while achieving up to a 2.2x execution speedup over state-of-the-art baselines.

cs.SE 2026-07-03

Parr curve overlaid on team capacity forecasts agile completion

by Pedro E. Colla

A Capacity-Aware Parr Model for Agile Projects

Latent effort demand combined with observed staffing predicts progress, time, and gaps without assuming fixed activity paths.

Classical software effort distribution models, including the PNR family and the Parr alternative curve, were designed to describe the time distribution of development effort under an implied staffing pattern. Their direct use in agile environments is limited: when team capacity is fixed, partially fixed, or externally constrained, the original curve may prescribe a staff demand that the organization cannot allocate. This paper proposes a compact refactoring of the Parr model as a capacity-aware forecasting layer for agile projects. The contribution is deliberately narrower than a full causal theory of project dynamics. A normalized Parr-shaped latent effort demand is combined with an observed or planned capacity trajectory. The resulting model forecasts aggregate progress, completion time, capacity deficit, and capacity slack without assuming that the same internal activity path is followed under resource restriction. The model uses a small parameter set, such as total effort K, a Parr shape parameter, an origin constant c that can match nonzero initial staffing, and the capacity trajectory. A discrete sprint formulation is provided, together with a calibration method from ordinary Scrum records and a rolling-origin validation protocol against simple management baselines.

cs.SE 2026-07-02

Kani verifies 16000+ Rust harnesses per stdlib change

by Rémi Delmas, Zyad Hassan +10 more

Kani: A Model Checker for Rust

Translates MIR to CBMC to prove functional correctness and absence of panics beyond what the type system guarantees.

Rust's ownership type system prevents memory errors in safe code, but certain desirable properties remain orthogonal to compilation: the soundness of unsafe operations (e.g., raw pointer dereferences), functional correctness, and absence of runtime panics. We present Kani, an open-source model checker for Rust that pushes bounded model checking beyond bug-finding to provide correctness guarantees for these properties. Kani compiles proof harnesses from Rust's Mid-level Intermediate Representation (MIR) into CBMC's bit-precise verification engine, automatically checking a comprehensive set of safety properties with no user annotation. To extend verification from bounded to unbounded, Kani provides a specification language comprising function contracts, loop contracts, quantifiers, and function stubbing. We demonstrate feasibility through case studies on industrial Rust projects, where contracts upgraded verification from panic-freedom to functional correctness, uncovering six previously unknown bugs. Kani operates at scale in production CI, with over 16,000 harnesses verified per code change in the Rust standard library verification campaign.

cs.SE 2026-07-02

GitHub issues expose four key hurdles for Matter IoT standard

by Muhammad Hassan, Carl Gunter +2 more

Insights from GitHub Community on the Matter Standard: Developer Perspectives and Challenges

Topic modeling of 13,000 reports identifies testing and interoperability as top developer concerns, pointing to concrete fixes for the smart

Matter seeks to resolve longstanding interoperability problems in the Internet of Things (IoT), yet little is known about how developers experience the standard in day-to-day work. This paper examines over 13,000 issues from the official Project CHIP GitHub repository to understand the kinds of problems contributors report when implementing and integrating Matter. Using topic modeling and qualitative analysis, we identify four recurring areas of concern (Testing, Interoperability, Development, and Platform and Network) and describe how they manifest in the evolution of the codebase and tooling. The findings reveal systematic technical and integration challenges and point to concrete opportunities to refine Matter's test infrastructure, cross-vendor guidance, and documentation as the standard continues to mature.

cs.SE 2026-07-02

99% of SKILL.md files contain persistent skill smells

by David Boram Hong, Aaron Imani +1 more

From Anatomy to Smells: An Empirical Study of SKILL.md in Agent Skills

Study of 238 skills finds guideline violations rarely disappear as files change over time.

Agent Skills provide on-demand domain knowledge to LLM agents without requiring model retraining. Each Agent Skill is defined by a mandatory SKILL.md file containing metadata and an unstructured Markdown body whose contents are left entirely to the skill author. Despite the rapid adoption of Agent Skills, little is known about how these files are authored or whether existing authoring guidelines are followed in practice. In this paper, we present the first systematic study of SKILL.md files as a software artifact. We qualitatively analyze 238 real-world skills and derive a taxonomy of 13 higher-level and 44 lower-level semantic components. We then conduct a multivocal literature review of 29 sources to identify best practices for authoring SKILL.md files and introduce skill smells as violations of these practices. Finally, we develop an automated detector and apply it to real-world skills, finding that over 99% of SKILL.md files contain at least one skill smell, and once introduced, skill smells rarely disappear as skills evolve. These findings reveal a substantial gap between recommended and actual authoring practices, motivating the development of automated techniques to remediate skill smells while increasing developer awareness of this emerging quality issue.

cs.SE 2026-07-02

Risk coverage drops sharply for AI-native teams at boundaries

by Laxmipriya Ganesh Iyer

Risk Architecture for AI-Native Engineering Teams: An Organizational Framework for Agentic System Governance

Profiles and taxonomy show abrupt uncovered failures where probabilistic outputs meet deterministic dependencies

Engineering management research has produced mature frameworks for software risk: ownership by feature, escalation by severity, and assurance by test coverage. These frameworks implicitly assume deterministic behavior, discrete and auditable change events, and clear component-to-owner mappings. Teams that build and operate agentic AI systems violate all three assumptions at once: outputs are probabilistic, systems take autonomous multi-step actions, and the risk surface mutates silently between deployments. Existing AI risk literature addresses this from above (policy frameworks such as the NIST AI RMF and ISO/IEC 42001) or below (threat taxonomies such as OWASP's agentic AI guidance), but not at the layer where an engineering manager (EM) operates: roles, decision rights, and escalation structures. This paper contributes (i) a seven-dimension profile distinguishing pure software-engineering, hybrid, and AI-native teams; (ii) a six-cluster failure-mode taxonomy including a previously unarticulated cluster, dependency-boundary determinism mismatch; and (iii) a synthetic framework-adequacy methodology scoring how well each profile's risk architecture detects, contains, and escalates a defined scenario set. Because the object of study is framework adequacy rather than human behavior, the evaluation yields derived rather than observed coverage claims. Coverage degrades as teams move from pure software engineering to AI-native operation, monotonically in the median and abruptly in the count of uncovered, high-consequence failures appearing only at the AI-native step. The degradation concentrates in specific failure-mode categories, and the most severe, least-covered failures arise not inside AI-native teams but at the organizational boundary where their probabilistic outputs are consumed by determinism-assuming dependencies.

cs.SE 2026-07-02

CLI AI agents raise merged PRs by 24 percent

by Emerson Murphy-Hill, Jenna Butler +1 more

Adoption and Impact of Command-Line AI Coding Agents: A Study of Microsoft's Early 2026 Rollout of Claude Code and GitHub Copilot CLI

The gain holds over four months in a Microsoft study of tens of thousands of engineers, with adoption spreading through peers

Organizations rolling out agentic command-line tools like Anthropic's Claude Code and GitHub's Copilot CLI need to know who will try them, who will keep using them, and whether the tools produce enough output to justify their cost. At organizational scale, token spend can run into millions of dollars annually, so misreading adoption, retention, or impact can make a rollout expensive without changing engineering velocity. Studying tens of thousands of engineers at Microsoft over its early-2026 rollout, we find that first use spread primarily through social networks, retention was associated more with engineers' coding activity than with demographics, and adopters merged roughly 24% more pull requests than they would have otherwise. We use merged pull requests as our proxy for output -- acknowledging that a merged PR is not the same as the value it delivers -- and the lift persists across our four-month window. These results suggest that CLI coding agents are neither uniformly adopted nor mere novelty effects and that organizations should treat visible peer use as central to rollout strategy.

cs.SE 2026-07-02

Wrapper classifies GPU failures at 0.997 F1 with 3ms overhead

by Parv Agarwal, Asif Ekbal

GPUAlert: A Zero-Instrumentation Process-Boundary Monitor for Diagnosing GPU Training-Job Failures

Monitors any training command at the process boundary to preserve logs and identify causes among 15 classes with no script edits required.

GPU training jobs fail often, roughly two in five on large production clusters, yet the operator typically learns of a failure only by reconnecting hours later. Experiment trackers require editing the training script and maintaining a cloud connection; the scheduler's mail hook delivers a single status line with no cause and no logs. GPUAlert is a command-line wrapper that monitors any training command at the process boundary and, with no change to that command, emails a structured notification on completion carrying a classified failure cause, durable logs, and output artifacts. The tool is organized around three reliability primitives: a pre-launch log guarantee that establishes the durable destination before the child process can crash, notifier isolation that makes the wrapper's exit code a pure function of the child's status regardless of whether the email succeeds, and a non-silent artifact budget that bounds attachment size without ever dropping output silently. We release a labelled corpus of 474 GPU training logs across 15 failure classes and a reproducible evaluation harness. On the twelve hardware-reproduced classes, the ordered-rule classifier reaches 0.997 macro-F1, against 0.830 for unordered keyword matching and 0.133 for exit-code inspection. Wrapper overhead is constant at approximately 3 ms per job; the pre-launch guarantee preserves a log where a shell redirect yields nothing; and across all 15 failure modes the wrapper returns the child's exit code unchanged even when the SMTP relay is unreachable.
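Two of the reliability primitives named above, the pre-launch log guarantee and notifier isolation, can be sketched with the Python standard library. This is a hedged illustration, not GPUAlert's code: `run_wrapped`, its parameters, and the `notify` callback are hypothetical names, and returning 127 on a failed launch is an assumption borrowed from shell convention.

```python
import subprocess
from typing import Callable, List


def run_wrapped(cmd: List[str], log_path: str,
                notify: Callable[[int, str], None]) -> int:
    """Run cmd, capture its output in log_path, and return the child's exit code."""
    # Pre-launch log guarantee: the log file exists before the child starts,
    # so even a failed launch leaves a durable record.
    with open(log_path, "wb") as log:
        try:
            returncode = subprocess.run(
                cmd, stdout=log, stderr=subprocess.STDOUT
            ).returncode
        except OSError as exc:
            log.write(f"launch failed: {exc}\n".encode())
            returncode = 127  # assumption: shell convention for "command not found"
    # Notifier isolation: a failing notifier must never change the exit code.
    try:
        notify(returncode, log_path)
    except Exception:
        pass
    return returncode
```

A caller would typically end with `sys.exit(run_wrapped(...))`, so the scheduler sees the same status it would have seen without the wrapper, whether or not the notification was delivered.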

cs.SE 2026-07-02

Brain model signals show no link to YouTube replays

by Barada Sahu, Shivesh Pandey

A global predicted-fMRI drive signal from TRIBE does not predict YouTube replay heatmaps

Predicted fMRI engagement curve correlates near zero with re-watch heatmaps on 48 videos and does not beat simple baselines.

Deep multimodal brain-encoding models now predict fMRI responses to naturalistic video with high accuracy. Whether their predicted neural signals also forecast behavioral engagement is unknown. We run TRIBE, the winning model of the 2025 Algonauts brain-encoding challenge (Llama-3.2 + V-JEPA2 + Wav2Vec-BERT), on 48 YouTube videos and reduce its predicted cortical response to a per-second engagement curve, the global field power. Correlated against each video's "most replayed" heatmap, a passively collected proxy for which moments viewers return to, the curve shows no evidence of predicting re-watch behavior. The pooled position-controlled partial correlation is +0.058 (95% CI [-0.04, 0.15]; one-sample t(47) = 1.21, p = 0.23), indistinguishable from zero and not significantly above simple loudness and motion baselines (loudness +0.04, paired p = 0.74). The raw correlation is also near zero; the moderate values reported for music videos reflect a genre-specific intro/onset-replay artifact rather than content prediction, and do not generalize. The null holds across six cortical-network readouts and under an autocorrelation-preserving permutation test. We release the code, the video-ID manifest, and an acquisition method that works despite YouTube's SABR-only streaming.

cs.SE 2026-07-02

Benchmark tracks LLM code repairs through feedback stages

by Cuong Chi Le, Aashish Yadavally +2 more

Benchmarking Code Improvement with Progressive, Adaptive, and Interactive Feedback

PAIR-Bench uses controlled hints on failure groups and detail levels to measure generalization, regressions, and assistance needs during ref

Large language models (LLMs) are typically evaluated on code generation and program repair using binary functional correctness: a generated program or patch either passes or fails a test suite. This protocol is simple but coarse, as it ignores partial progress, feedback use, regressions, and the refinement trajectory through which models often improve code. We introduce PAIR-Bench, a progressive and adaptive benchmark for evaluating code improvement: transforming an incorrect or incomplete program into a more correct one through feedback-guided refinement. PAIR-Bench uses progressive hinting, a structured feedback protocol with two controls. Failure-region control determines what the feedback targets by grouping hidden failing tests into failure scenarios, while hint-depth control determines how much repair-relevant information is revealed, from coarse symptoms to implementation-level guidance. This design enables PAIR-Bench to measure whether a model repairs targeted failures, generalizes beyond the hint, and preserves already-correct behavior, as well as how much assistance it requires. By evaluating repair trajectories with progressive metrics rather than only final pass/fail outcomes, PAIR-Bench provides a finer-grained assessment of LLM code-improvement capability.

cs.AI 2026-07-02

Rewrite method certifies 105 of 185 expert problems at 91% precision

by Ben Slivinski, Michael Saldivar

Theoria: Rewrite-Acceptability Verification over Informal Reasoning States

By requiring explicit justifications for every state change, it surfaces hidden premises and fabricated citations that holistic judges miss.

When should an AI system's answer be trusted? Formal proof assistants offer certainty but cannot reach most of the problem distribution; scalar LLM judges offer coverage but produce opaque scores that cannot be audited after the fact and are subject to the same coherence issues as any LLM. We present Theoria, a verification architecture that closes this gap. A candidate solution is rewritten into a sequence of typed state transitions, each licensed by an explicit justification (a citation, a computation, or a problem-given fact), and every transition is independently auditable. The foundational invariant is completeness of change: every difference between consecutive proof states must be accounted for, so hidden premises surface as unlicensed mutations rather than passing silently. On HLE-Verified Gold (185 text-only expert problems), Theoria certifies 105 at 91.4% strict precision (Wilson 95% CI [84.5%, 95.4%]). Every certification produces a human-readable proof trace in which each step can be independently challenged. Holistic LLM judges achieve comparable precision at matched coverage but fail on different problems (Jaccard 0.14-0.36), making the approaches complementary. On 95 adversarial poisoned proofs across 15 domains, structured judges catch 94.7% versus 83.2% for holistic judging (p = 0.0017). The overall 11.5 pp gap concentrates in hidden premises (90.6% vs. 62.5%, a 28 pp difference) and fabricated citations (100% vs. 90%), the error classes where the formal analysis predicts an advantage; performance is identical on arithmetic and theorem-misapplication errors, where no advantage is predicted. On GPQA Diamond (n = 65), certified precision is 97.1% (Wilson CI [85.1%, 99.5%]).
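The Wilson score interval quoted for the HLE-Verified Gold result can be reproduced with a short calculation. This is a sketch under one stated assumption: the abstract gives 105 certifications at 91.4% precision but not the raw count of correct ones, so 96 is inferred here as the only integer consistent with that percentage (96/105 ≈ 91.4%). The function name is hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (default z gives ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumption: 96 correct out of 105 certified, inferred from the reported 91.4%.
low, high = wilson_interval(96, 105)
print(f"{low:.1%} to {high:.1%}")
```

Under that assumption the bounds round to 84.5% and 95.4%, matching the interval stated in the abstract.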

cs.SE 2026-07-02

LLM agents rescue 41.5% of drifted repos with test edits blocked

by Zhihao Lin, Mingyi Zhou +5 more

RepoRescue: An Empirical Study of LLM Agents on Whole-Repository Compatibility Rescue

Union across systems hits 62.7%, but coordinated whole-codebase changes defeat most Claude Code agents.

Open-source libraries and tools are widely reused, but compatibility maintenance is expensive. Once maintainers leave, useful repositories can stop working as runtimes and dependencies evolve. We study whether LLM agents can adapt old repositories to modern environments, a task we call compatibility rescue. Unlike bug repair, compatibility rescue starts from a repository that worked in its original environment but fails after ecosystem drift. RepoRescue gives agents only the repository and its failing modern environment; the agent must diagnose the failure, locate affected code, and produce a source-code rescue that restores the historical test suite. We build RepoRescue from 193 Python and 122 Java repositories, each verified to pass historically and fail after modernization. We evaluate five deployed agent systems on Python and three on Java. Beyond full-patch pass rate, we rerun patches after removing test-file edits to measure source-only repair, add a runtime-enforced regime that blocks test edits, and validate practical use for repositories whose suites pass after rescue. We find that Claude Code systems sometimes edit failing tests even when prompted not to; with runtime blocking, Kimi still rescues 41.5% of repositories. Systems are complementary: their union reaches 62.7%, exceeding the best single system by 10.9 points. Difficulty concentrates in cross-file coordination: on 14 repositories requiring coordinated whole-codebase changes, GPT-5.2 through Codex passes all 14, while every Claude Code system passes at most two. Finally, a passing suite is only an initial signal: among 34 unmaintained Python candidates whose suites pass after rescue, 22 work in realistic scenarios and 12 pass bug-hunt with patches that address the compatibility failure. RepoRescue benchmarks compatibility rescue with source-only auditing, runtime enforcement, practical validation, and reasoning labels.

cs.SE 2026-07-02

Coding-agent benchmarks unreliable due to machine variance and solved tasks

by Zhi Chen, Zhensu Sun +3 more

Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?

Reference patches meet validity rules across machines on only a minority of tasks while public submissions already beat most references.

Repository-level performance-optimization benchmarks such as GSO, SWE-Perf and SWE-fficiency evaluate coding agents by applying patches to real repositories and comparing runtime against unoptimized baselines and official reference patches. Their leaderboard scores are increasingly used as evidence of coding-agent progress, but those scores can conflate runtime instability, benchmark-specific scoring rules, and how many tasks are already solved by at least one public submission. We audit these issues across the three benchmarks. First, we replay the official reference patches for 740 code optimization tasks across four common types of Google Cloud machines. Most benchmark tasks can be replayed, but their reference patches satisfy the original benchmark validity rules in every cross-machine replay for only 39/102 GSO tasks, 11/140 SWE-Perf tasks, and 411/498 SWE-fficiency tasks; SWE-Perf is especially fragile because many reference patches produce close-to-zero runtime changes. Second, we show that public submission rankings depend strongly on the benchmark scoring rule. Among eight public submissions shared by GSO and SWE-fficiency, the official rankings disagree on 9 of 28 pairwise submission comparisons, and SWE-fficiency's leaderboard scoring rule assigns the worst ten tasks overly high score weights of 58.5%-82.8%. Third, looking across 10 public submissions for each task, we find that at least one submission matches or beats the reference patch on 85.3% (384/450) of replay-valid GSO and SWE-fficiency tasks, and beats the unoptimized base code on 99.8% (449/450). Our study complements leaderboard scores by identifying tasks with more reliable performance signals, quantifying per-task score contributions, and exposing the remaining performance gaps that are hidden by aggregate rankings.
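The pairwise comparison count in the abstract follows from simple combinatorics: eight shared submissions give C(8, 2) = 28 pairs, and a pair counts as a disagreement when the two leaderboards order it differently. The sketch below shows one way to count such discordant pairs. The ranks are invented toy data, not the benchmarks' actual leaderboards, and the function name is hypothetical.

```python
from itertools import combinations
from typing import Dict, Tuple


def discordant_pairs(rank_a: Dict[str, int], rank_b: Dict[str, int]) -> Tuple[int, int]:
    """Return (pairs ordered differently by the two rankings, total shared pairs)."""
    shared = sorted(set(rank_a) & set(rank_b))
    pairs = list(combinations(shared, 2))
    # A pair is discordant when the rank differences have opposite signs.
    disagree = sum(
        1 for x, y in pairs
        if (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y]) < 0
    )
    return disagree, len(pairs)


# Toy ranks for eight hypothetical submissions (1 = best).
rank_a = {f"s{i}": i + 1 for i in range(8)}
rank_b = dict(rank_a, s0=2, s1=1, s2=4, s3=3)  # two adjacent swaps
result = discordant_pairs(rank_a, rank_b)
```

With these toy ranks the result is 2 discordant pairs out of 28; the abstract reports 9 of 28 for the real GSO and SWE-fficiency leaderboards.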

cs.CL 2026-07-02

Benchmark separates model inability from policy confusion in safety tests

by Brett Reynolds

Adversarial Pragmatics for AI Safety Evaluation: A Benchmark for Instruction Conflict, Embedded Commands, and Policy Ambiguity

An 18-item set with controlled ambiguities measures whether refusals reflect capability limits or unclear rules.

Safety evaluations for language models increasingly depend on judgments about ambiguous natural-language behaviour: whether a model has followed an instruction, refused appropriately, complied with a policy, resisted an embedded command, or misreported progress in an agentic task. Existing benchmarks often compress these distinctions into pass/fail labels, obscuring whether failures arise from capability limits, policy ambiguity, instruction conflict, scaffold failure, or unstable evaluator judgments. This paper introduces adversarial pragmatics as a benchmark and annotation protocol for evaluating model behaviour under instruction conflict, embedded commands, quotation, scope ambiguity, deixis, indirect speech acts, and multi-turn agent transcripts. The contribution is empirical and methodological: a linguistically controlled taxonomy, an 18-item seed benchmark with validator-enforced metadata, a 54-row local seed pilot, an expert-evaluation protocol distinguishing task success, policy compliance, safety risk, refusal outcome, and evaluator confidence, and metrics for judge validity, diagnostic ambiguity, and taxonomy drift. The framework turns linguistic judgment methodology into a practical tool for validating safety evals, LLM judges, gold-set construction, prompt-injection tests, and safety documentation.

cs.SE 2026-07-02

Agent skills form hidden supply chains with reuse and risk patterns

by Changguo Jia, Tianqi Zhao +2 more

Skills Are Not Islands: Measuring Dependency and Risk in Agent Skill Supply Chains

Analysis of 1.43 million skills uncovers four structural patterns and security signals missed by isolated checks.

Agent skills package reusable operational knowledge for Large Language Model (LLM) agents, yet as they grow in scope, they become dependency-bearing artifacts whose identities, versions, and provenance remain implicit. This opacity already causes duplicated dependencies and inconsistent installations, exposing a gap that dependency management has yet to close. We introduce Agent Skill Supply Chains (ASSCs) to characterize mixed skill-package-service dependency graphs and help close this gap. Borrowing from Software Bill of Materials (SBOMs), we design SkillDepAnalyzer to capture natural-language dependency evidence and model skills as dependency-bearing artifacts. On the SKILL-DEP benchmark, SkillDepAnalyzer recovers skill metadata and dependency graphs accurately and comprehensively, substantially outperforming an LLM-based baseline and package-centric SBOM tools. Applying SkillDepAnalyzer to over 1.43 million skills, we obtain ASSCs and explore their structural diversity and security signals. We find four structural patterns: skill metadata is activation-ready but governance-poor; dependency graphs span skill, package, and service dependencies with concentrated reuse; recursive skill reuse expands dependency graphs and creates hidden package inventory; and skill dependency clusters form around related workflows. We also find that inspecting a skill alone misses security-relevant signals hiding in its dependencies. By analyzing ASSCs, we identify and report known malicious skills persisting in ASSCs to their developers. Based on these findings, we recommend typed dependency manifests, first-class dependency-cluster management, risk-warning audit commands for skill infrastructure maintainers, and lockfile-like records for skill developers.

cs.SE 2026-07-02

Graph layer turns prompts into reliable diagram edits

by Tyler Sivertsen, Neal Singh +1 more

SAGE: Structured Agentic Graph Editing for Software Diagrams

SAGE decomposes natural language requests into validated graph operations that preserve layout and connections in Draw.io and Mermaid files.

Software diagrams are difficult to edit through human-friendly interfaces because edits expressed in natural language must still preserve visual layout, editable structure, and semantic relationships. As a step forward, we present SAGE, a browser-based tool for prompt-guided editing of Draw.io and Mermaid-style engineering diagrams. The tool maps diagrams into an editable graph representation, translates natural language requests into structured edit intents, analyzes those intents into graph-oriented operation steps, validates and repairs common Draw.io XML issues, and stores successful results as recoverable versioned artifacts. This design separates structured state management from model-driven interpretation, while acknowledging that some prompt-guided XML edits remain model-assisted. The tool also supports direct canvas editing and a secondary mask-based image-editing workflow. We evaluate the system using unit tests and a Kubernetes architecture case study, measuring structural validity, edit success, preservation of unrelated elements, and failure causes.


cs.SE 2026-07-02

LexTester graph of Lex chats detects four times more faults

by Diego Clerissi, Alessandro Vasina +1 more

A Model-based Testing Technique for Amazon Lex Task-based Chatbots

Automated exploration builds a complete interaction model and produces complex tests that beat Botium on coverage and bug finding at similar cost.


Task-based chatbots are widely adopted software systems, usually integrated into real-world applications and communication channels and designed to assist users in completing tasks through conversational interfaces. Like any other software, chatbots are prone to bugs. Despite their increasing pervasiveness in everyday activities, existing techniques for assessing their quality still exhibit several limitations, such as the simplicity of generated test scenarios and oracle weaknesses. In this paper, we present LexTester, an automated model-based testing technique for Amazon Lex chatbots. The technique explores the conversational space of the chatbot under test to generate a Dialog Graph of all possible interactions, from which an executable test suite is generated according to different coverage strategies. LexTester was evaluated against the state-of-the-practice chatbot testing tool Botium on five Amazon Lex chatbots, consistently outperforming it in all subjects, generating more tests with nearly double the complexity, achieving 83-95% overall coverage of conversational elements, and improving fault detection effectiveness by up to four times at comparable time costs.


cs.SE 2026-07-02

Judgment turns AI agent failures into lasting governance controls

by James C. Davis, Paschal C. Amusuo +3 more

Cheap Code, Costly Judgment: A Case Study on Governable Agentic Software Engineering

A 12-week case study shows high-velocity code production repeatedly surfaces the same structural problems that must be converted into reusable governance controls.


Generative AI is shifting software engineering from a practice organized around scarce implementation effort toward one organized around abundant, low-cost code production. This shift changes the central engineering problem: not whether AI can generate useful code, but how engineers organize architectures, tools, evidence, and feedback loops so that AI-mediated development remains inspectable, correctable, and maintainable. We study this problem through a first-person case study: a 12-week development effort in which a single expert software engineer used frontier AI coding agents to build a document accessibility remediation system. The empirical record comprises 88 contemporaneous field notes, 420 KLOC of production code, and 1.16 MLOC of tests, lints, supporting documentation, and agent tooling. From this record, we develop a candidate middle-range theory of governance conversion, expressed as a process model explaining how high-velocity agentic implementation becomes governable. The model explains how agentic implementation velocity surfaces recurring structural failure classes, and how engineering judgment sustains velocity by converting those failures into durable governance mechanisms. In contrast to existing governance models that derive controls from known obligations, governance conversion explains how controls are discovered from failures that become visible only during agentic work. We use our model to make testable predictions and to describe implications for software engineering research and practice.


cs.SE 2026-07-02

Dependency graph plus agents tops API error detection

by Tyler Stennett, Myeongsoo Kim +2 more

AutoRestTest at the SBFT 2026 Tool Competition

Method processes 317 operations across 11 services in one hour while leading in fault finding and efficiency metrics.


Large input spaces and complex inter-operation dependencies make black-box REST API testing challenging. AutoRestTest combines a Semantic Property Dependency Graph, multi-agent reinforcement learning, and large language models to intelligently explore large API input spaces. In the SBFT 2026 REST League, AutoRestTest ranked first in all three evaluation categories -- fault detection, overall efficiency, and overall effectiveness -- on 11 APIs (317 operations, approximately 29 per API), averaging 67.09 unique server errors and 17.27 successfully processed operations per API under a one-hour testing budget.


cs.SE 2026-07-02

Correct solvers trace more systematic paths across code syntax nodes

by Kyogo Horikawa, Hidetake Uwano +1 more

Identifying Effective Program Comprehension Strategies through Gaze Transitions over Syntactic Elements

Eye data mapped to abstract syntax tree nodes shows ordered transitions mark successful program readers


Program comprehension is a central research topic in software engineering, focusing on how developers understand a program's structure, behavior, and intent. Eye-tracking studies have traditionally relied on display-based measurements, where gaze positions are represented as screen coordinates. However, syntax-based analyses have recently emerged. Prior work proposed methods to convert eye movements into transitions between nodes in an abstract syntax tree, but the relationship between task correctness and eye-movement features for specific syntactic elements remains unclear. This study converts eye-tracking data into transitions between syntactic nodes and analyzes fixation proportions and gaze transition patterns. We investigate the relationship between these patterns and task correctness, comparing correct and incorrect groups. Our results reveal distinct differences in gaze transition patterns between the two groups. In particular, successful participants exhibit more systematic transitions across syntactic elements, suggesting the use of structured reading strategies.
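As a rough illustration of the two features the abstract analyzes, the sketch below computes fixation proportions and gaze transition counts from a sequence of fixated syntactic node types. It assumes fixations have already been mapped to AST node labels; the label names, the sample sequence, and the choice to ignore repeated fixations on the same node type are assumptions, not details from the paper.

```python
from collections import Counter

def fixation_proportions(fixations):
    """Share of fixations that landed on each syntactic node type."""
    total = len(fixations)
    if total == 0:
        return {}
    return {node: count / total for node, count in Counter(fixations).items()}

def transition_counts(fixations):
    """Count gaze moves between consecutive, different node types.

    Consecutive fixations on the same node type are not counted as a
    transition (an assumption made for this sketch).
    """
    pairs = zip(fixations, fixations[1:])
    return Counter((src, dst) for src, dst in pairs if src != dst)

# Hypothetical fixation sequence for one participant.
SEQUENCE = ["MethodDeclaration", "ForStatement", "ForStatement",
            "IfStatement", "ForStatement"]

print(fixation_proportions(SEQUENCE))
print(transition_counts(SEQUENCE))
```

Comparing such counts between the correct and incorrect groups is one simple way to look for the more systematic transition patterns the study reports.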


cs.SE 2026-07-02

Runtime diagnoses from multi-faceted tests raise agent fix rates

by Yaoqi Guo, Yang Liu +4 more

SWE-Doctor: Guiding Software Engineering Agents with Runtime Diagnosis from Multi-Faceted Bug Reproduction Tests

SWE-Doctor builds diagnosis records by running and debugging tests for multiple issue aspects, cutting partial patches on SWE-bench.


Large language model (LLM)-based software engineering agents are increasingly developed to resolve software issues by generating patches from issue reports and code repositories. Bug reproduction tests (BRTs) are an important building block for such agents and have been shown useful for patch validation. However, it remains unclear whether BRTs can also help the more central stage of patch generation. We first conduct a preliminary study and find that directly using advanced BRT generators to guide patch generation is not beneficial: fail-to-fail BRTs can mislead agents, while even fail-to-pass BRTs bring limited or negative gains. Our analysis reveals two reasons: fail-to-pass BRTs may cover only one manifestation of the reported issue, leading to partial patches, whereas fail-to-fail BRTs are unreliable as direct patch-generation targets. Motivated by these insights, we propose SWE-Doctor, a software issue resolution agent that guides patch generation with runtime diagnoses derived from multi-faceted BRT executions. SWE-Doctor first generates multi-faceted BRTs for different behavioral requirements stated in the issue, then executes and debugs these BRTs to construct runtime-grounded diagnosis records, and finally uses the diagnoses together with localization information inferred during BRT generation to guide patch generation and reduce partial patches. We evaluate SWE-Doctor on Python bug-fixing issues from the widely adopted SWE-bench Verified and SWE-bench Pro across five LLM backends. SWE-Doctor consistently outperforms existing agents across all 10 LLM-benchmark combinations, achieving average resolution rates of 75.7% on SWE-bench Verified and 59.4% on SWE-bench Pro. In particular, on the more challenging SWE-bench Pro, SWE-Doctor improves the average resolution rate by 8.0-8.9 percentage points over the baseline agents.


cs.SE 2026-07-02

LLM agents turn NL into quantum code for test optimization

by Ming Tao, Yuechen Li +3 more

Leveraging LLM-Based Agentic Systems to Generate Quantum Applications for Test Optimization

QPipe hits 100% compilation and 96.7% execution on 20 benchmarks, often beating genetic algorithms


Quantum computing is increasingly explored for software engineering (SE) optimization, but translating natural-language (NL) task-level requirements into executable quantum applications still demands substantial quantum and programming expertise. We present QPipe, a large language model (LLM)-based multi-agent architecture that autonomously turns NL requirements into traceable quantum-application workflows through specialized agents for requirement parsing, formulation, code generation, review, execution, and verification. We evaluate QPipe on 20 NL requirements, each associated with a real-world benchmark and a test-optimization problem. QPipe successfully completes the key stages of quantum-application generation across requirements, achieving average rates of 100% for code compilation and 96.7% for application execution and final-result combination, with average generation costs of 260.1 seconds and 1.89M tokens per requirement. Among the generated quantum applications that execute successfully, the returned solutions outperform the offline genetic algorithm baseline in most cases. Ablation results further show that QPipe's advantage depends on retaining code-generation skills, task knowledge, review feedback, and multi-agent decomposition. These results indicate that agentic coordination can support generation of executable quantum applications for tackling test optimization problems from real-world benchmarks.


cs.SE 2026-07-02

Metamorphic testing removes oracle requirement from delta debugging

by Mingyue Jiang, Yongqiang Tian +1 more

Delta Debugging in the Absence of Test Oracles Through Metamorphic Testing

DDMT lets input reduction continue on programs where output correctness cannot be verified directly.


Delta debugging provides an automatic way to minimize a program input while preserving a certain property. However, its effectiveness fundamentally relies on the availability of test oracles to determine whether a reduced input still preserves the specific property. Consequently, the oracle problem substantially limits the applicability of existing delta debugging techniques, particularly for oracle-deficient programs where output correctness cannot be directly determined. To address this problem, this paper proposes a novel approach, DDMT, to enhance the applicability of delta debugging, especially facilitating its application to oracle-deficient programs. Our key insight is to redesign an oracle-independent test function and incorporate it into the reduction procedure of delta debugging such that the property-preservation validation can be accomplished without requiring a test oracle. To this end, DDMT employs the technique of metamorphic testing, which is a property-based and oracle-independent testing method. It establishes a metamorphic testing-based test function, using it as a replacement for the original test function adopted by delta debugging. The experiments evaluate DDMT on 66 subjects across both oracle-available and oracle-deficient scenarios, with different delta debugging approaches. The results positively confirm that DDMT can enhance the applicability of delta debugging while often preserving or improving reduction effectiveness and query efficiency. Furthermore, compared to the relevant delta debugging approaches, DDMT is also able to achieve performance improvements with proper configurations.
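The core idea, swapping the oracle-based test function for a metamorphic one, can be sketched with a textbook ddmin loop. This is a minimal illustration and not the DDMT implementation: the program under test (`buggy_total`), its bug, and the permutation-style metamorphic relation are all hypothetical.

```python
def buggy_total(values):
    """Hypothetical program under test: sums a list, but wrongly resets on 0."""
    total = 0
    for v in values:
        if v == 0:
            total = 0  # the injected bug
        else:
            total += v
    return total

def violates_relation(values):
    """Metamorphic test function: summing must not depend on element order.

    No expected output is needed; only two runs of the program are compared.
    """
    return buggy_total(values) != buggy_total(list(reversed(values)))

def ddmin(data, test):
    """Classic ddmin: shrink data while test(data) keeps returning True."""
    assert test(data)
    n = 2
    while len(data) >= 2:
        chunk = max(len(data) // n, 1)
        subsets = [data[i:i + chunk] for i in range(0, len(data), chunk)]
        reduced = False
        for i, subset in enumerate(subsets):
            complement = [x for j, s in enumerate(subsets) if j != i for x in s]
            if test(subset):
                data, n, reduced = subset, 2, True
                break
            if test(complement):
                data, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(data):
                break
            n = min(n * 2, len(data))
    return data

print(ddmin([4, 7, 0, 2, 9, 1], violates_relation))
```

The reduction procedure is unchanged; only the property being preserved is different, which is why the same loop can run on programs whose output correctness cannot be checked directly.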


cs.SE 2026-07-02

AI agent skills mostly copied once and left unchanged

by Haoyu Gao, Jai Lal Lulla +4 more

From Registry to Repository: How AI Agent Skills Are Written, Adapted, and Maintained

Study of 3,709 reuse links finds 53% never edited after adoption, with changes adding local knowledge rather than rewriting core contracts.


AI coding agents increasingly rely on skills: structured context bundles, typically a SKILL.md file with a YAML header and Markdown body, loaded on demand for domain knowledge, workflows, and scripts. Public registries such as skills.sh now host tens of thousands of skills, making them an emerging unit of reuse in agent-based software engineering. Yet skills have largely been viewed as agent capabilities rather than software artefacts whose content and evolution shape agent behaviour. We present the first empirical study of AI agent skills as engineered artefacts that are authored, reused, customised and maintained, across public registries and personal-use repositories. We mined 18,463 skills from skills.sh and 23,199 personal-use skills from 5,876 GitHub repositories, identifying 3,709 reuse links. LLM-based classification into SWEBOK knowledge areas (KAs) shows Software Construction dominates alongside a long tail of specialised areas. A thematic analysis of 180 skills identifies six content categories. Qualitative coding of 444 modifications reveals six themes, of which reworking operational specifications and adapting knowledge and resources are the primary targets of change. Our findings show that reuse is largely a one-time copy operation: most reused skills remain near-verbatim, 53% are never modified after adoption, and subsequent local maintenance is overwhelmingly additive. Customisation primarily adapts skills to local environments, whereas evolution accretes new inline domain knowledge. Across both, a stable behavioural contract -- how a skill interacts with users, monitors runtime state, and recovers from failures -- remains almost untouched. These results suggest maintenance effort should focus on project-specific bindings, and that registries and tool support should enable consolidating the domain knowledge skills re-author in isolation.


cs.SE 2026-07-02

Bytecode turned into Petri nets checks Java concurrency

by Akshatha Shenoy, Carlo A. Furia

Petrify: Petri-net Based Analysis of Concurrency Properties in Java Bytecode

The resulting nets stay compact enough for model checkers to decide deadlock without sharp growth in cost as parameters increase.


The landscape of automated formal verification is populated by techniques that make prominently different trade-offs: some focus on expressiveness and precision, supporting the verification of complex properties; others favor scalability and practicality, so that they are applicable to larger programs using different features. This paper presents Petrify, a novel automated verification technique for concurrency properties that achieves a distinctive trade-off. Petrify encodes the semantics of Java bytecode programs into Petri nets (PNs), which can be analyzed by state-of-the-art model checking tools such as LoLA. As our experiments demonstrate, Petrify's approach offers an interesting combination of expressiveness and practicality: PNs are a fairly precise encoding of the concurrent behavior of programs; at the same time, Petrify's PN encoding is succinct, so that its analysis remains quite insensitive to parameter size. Another practical benefit of targeting bytecode is that jPetrify, the prototype tool that implements the Petrify technique, is applicable to programs written in any version of Java and even a subset of Kotlin (another language that compiles to Java bytecode) while other similar tools are limited to older versions of Java. While this paper's experiments focus on analyzing fundamental properties like deadlock, Petrify's approach lends itself to be extended to other kinds of concurrency analysis, which we plan to tackle in future work.


cs.SE 2026-07-02

Agent repairs 83% of 55 C/C++ vulnerabilities using past fixes

by Sicong Cao, Hao Ma +9 more

Knowledge-Enhanced Agentic Vulnerability Repair

KeaRepair builds knowledge bases from historical patches, lets an agent collect facts, then retrieves and refines patches in a validation loop.


Frontier foundation models have changed the math on vulnerability discovery, but the bigger challenge is how the remediation side keeps up. Despite recent progress in Automated Vulnerability Repair (AVR), current solutions struggle to reliably identify the root causes of vulnerabilities and make insufficient use of prior fix knowledge to guide patch generation, which undermines their effectiveness in practice. To address this gap, we propose KeaRepair, a novel agentic AVR approach that grounds patch generation in verified program facts and high-level vulnerability knowledge. Specifically, KeaRepair first extracts multi-dimensional vulnerability knowledge from historical vulnerability-patch pairs from dual complementary views, and constructs dedicated retrieval knowledge bases. It then employs a tool-augmented agent that performs ReAct-style reasoning to collect verified program facts for vulnerability diagnosis. Finally, based on the diagnostic results, KeaRepair performs knowledge-level retrieval-augmented patch generation and iteratively refines patches through a closed-loop validation process involving compilation, PoC replay, and test-suite execution. Experimental results show that KeaRepair significantly outperforms existing AVR approaches on 55 reproducible C/C++ vulnerabilities. When paired with Gemini-3.1-Pro, KeaRepair successfully repairs 46 vulnerabilities, achieving a repair rate of 83.64%. Moreover, KeaRepair fixes six unique vulnerabilities that none of the baselines can address, and further demonstrates strong cross-language generalizability.


cs.SE 2026-07-02

Stochastic model estimates microservice availability from traces

by Anatoly A. Krasnovsky, Anna Maslovskaya

Stochastic Connectivity as the Foundation of a Runtime Model for Microservice Availability Analysis

Monte Carlo on reconstructed graphs and probability measures replaces repeated fault-injection tests


Microservice availability is commonly assessed by fault injection and chaos experiments, but such experiments are costly, operationally risky, and difficult to repeat for every architectural change. Distributed tracing and deployment metadata provide cheaper evidence, yet they usually remain descriptive: they show which services interacted, not what endpoint-level availability property follows. This paper proposes a formal runtime availability model based on stochastic connectivity for resilience-oriented analysis of microservice endpoints. It treats endpoint availability under explicit fault scenarios as a measurable facet of microservice resilience, combining a typed service-dependency graph, a replication map, a probability measure over node and edge states, and request-specific success predicates. Its semantics separates computational failures of service replicas from communication failures of logical dependencies, showing that replication cannot compensate for bottleneck dependencies. The model can be reconstructed from traces and deployment artifacts, parameterized for architectural what-if analysis, and analyzed by Monte Carlo simulation before or alongside fault injection. We define the model, its trace-to-model construction, elementary semantic properties, and a synthetic adequacy study. The study matches closed-form oracle cases within sampling error and exposes boundaries caused by edge bottlenecks, correlated failures, missing traces, and time-dependent failures.
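To make the ingredients concrete (dependency graph, replication map, probabilities over node and edge states, and a success predicate), here is a minimal Monte Carlo sketch. It is an illustration under strong assumptions rather than the paper's model: failures are independent, the graph is an acyclic chain, a request succeeds only if every downstream dependency is reachable and up, and all service names and probabilities are hypothetical.

```python
import random

def request_succeeds(service, deps, service_up, edge_up):
    """Success predicate: the service and all its dependencies must be usable."""
    if not service_up[service]:
        return False
    for dep in deps.get(service, []):
        if not edge_up[(service, dep)]:
            return False  # communication failure of a logical dependency
        if not request_succeeds(dep, deps, service_up, edge_up):
            return False
    return True

def estimate_availability(entry, deps, replicas, p_replica, p_edge,
                          trials=20000, seed=0):
    """Monte Carlo estimate of endpoint availability for requests to entry."""
    rng = random.Random(seed)
    ok = 0
    for _ in range(trials):
        # A service is up if at least one of its replicas is up.
        service_up = {s: any(rng.random() < p_replica for _ in range(n))
                      for s, n in replicas.items()}
        edge_up = {(a, b): rng.random() < p_edge
                   for a, targets in deps.items() for b in targets}
        if request_succeeds(entry, deps, service_up, edge_up):
            ok += 1
    return ok / trials

def chain_closed_form(n_services, n_replicas, p_replica, p_edge):
    """Exact availability of a linear chain under the same assumptions."""
    service = 1 - (1 - p_replica) ** n_replicas
    return service ** n_services * p_edge ** (n_services - 1)

DEPS = {"gateway": ["orders"], "orders": ["db"]}

for n in (1, 3):
    replicas = {"gateway": n, "orders": n, "db": n}
    estimate = estimate_availability("gateway", DEPS, replicas, 0.9, 0.99)
    print(n, round(estimate, 3), round(chain_closed_form(3, n, 0.9, 0.99), 3))
```

In this toy chain, adding replicas pushes the service term toward 1, but the result can never exceed `p_edge ** 2`, which mirrors the abstract's point that replication cannot compensate for bottleneck dependencies.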


cs.SE 2026-07-02

LLMs clarify code requirements poorly despite coding skill

by Zheng Fang, Dongming Jin +5 more

ClarifyCodeBench: Evaluating LLMs on Clarifying Ambiguous Requirements for Code Generation

New benchmark shows clarification ability separates from generation performance and declines sharply with denser ambiguities.


Large Language Models have emerged as programming assistants. However, the efficacy of code generation is constrained by the quality of input requirements, which are frequently ambiguous, incomplete, or underspecified. While LLMs excel at one-shot code synthesis, their ability to proactively clarify intent, a critical trait for robust software engineering, remains underexplored. Existing benchmarks largely overlook this interactive bottleneck, assuming perfectly specified prompts that do not reflect the iterative nature of requirement elicitation. To bridge this gap, we introduce ClarifyCodeBench, a novel interactive benchmark for evaluating LLMs' capability in resolving requirement ambiguity. Constructed from real-world programming tasks, ClarifyCodeBench features high-quality manual annotations, including N unique ambiguity types, associated clarification questions, and corresponding ground-truth answers. Furthermore, we formalize two rigorous metrics to assess the interaction quality: Turn-discounted Key Question Rate, which penalizes inefficient questioning, and Optimal Round Adherence, which measures the precision of the elicitation process. We conduct a systematic evaluation of six state-of-the-art LLMs using ClarifyCodeBench. Our empirical results yield three critical insights: 1) Capability Decoupling: Strong code generation performance does not inherently translate to effective requirement clarification; 2) The Reasoning Paradox: While increased computational thinking enhances code correctness, it yields marginal gains in identifying ambiguities; 3) The Multi-ambiguity Ceiling: LLMs' clarification performance degrades sharply as the density of ambiguities increases, revealing a significant bottleneck in handling complex, real-world specifications. Our work underscores the necessity for future AI4SE research to transition from static synthesis to interactive elicitation.


cs.SE 2026-07-02

Ensemble of LLMs fixes up to 22% of LLVM compiler issues

by Zhao Tian, Yingquan Zhao +3 more

LLVM-Bench: Benchmarking and Advancing Large Language Models for LLVM Compiler Issue Resolution

New benchmark of 423 real tasks shows single models lag on patches and builds but combining outputs raises success


LLVM is a widely used compiler infrastructure whose scale and complexity make issue resolution labor-intensive and challenging. Although large language models (LLMs) have recently achieved remarkable success in issue resolution, their effectiveness on the complex, system-level LLVM compiler remains largely unexplored. To address this gap, we introduce LLVM-Bench, the first large-scale benchmark for LLVM issue resolution, containing 423 real-world, validated tasks collected from the LLVM project. We further develop LLVM-Gym, a scalable evaluation platform that automates issue reproduction, patch application, compiler building, and test execution. Using LLVM-Bench and LLVM-Gym, we conduct a comprehensive study of four representative LLMs, six retrieval configurations, and three agents. Our results show that current LLM-based issue resolution techniques remain limited on LLVM-Bench, with patch invalidity and build failures as the dominant failure modes. We further reveal a strong complementarity among different LLMs and agents, motivating LLVM-Ens, a lightweight ensemble approach that expands the patch space by integrating the patches generated by diverse techniques, filters incorrect and redundant candidates, and identifies the most promising solution. Our results show that LLVM-Ens achieves a resolution rate of up to 21.99%, further improving LLVM issue resolution.


cs.SE 2026-07-02

Sound oracle certifies Scratch recoveries from video with no false accepts

by Yuan Si, Jialu Zhang

Checked Program Recovery from Execution Video: A Sound Oracle for Untrusted Generators

The static checker issues certificates that never accept incorrect programs, while real projects recover at a vocabulary-limited 14 percent


A growing class of tools recovers a program from observations of its behavior using an untrusted generator, a neural model or a search, that proposes candidates with no correctness guarantee. We study how to make such recovery trustworthy, in the concrete setting of recovering a runnable Scratch program from a recording of its execution. The recording shows what the program does but never its code; many programs produce the same video, so the source cannot be recovered, and the right target is a program that behaves the same as far as the camera can tell, made precise with a lens. The core is a two-tier validation oracle with a deliberate verdict asymmetry. A static checker proves lens-equivalence to a reference and issues a certificate that, granting the partial-order independence quotient adequate, never accepts a wrong program; a renderer can only refute or witness finite agreement, never certify. Around it, Vid2Prog reads each sprite's motion, visibility, and timing from the video and a known-asset manifest and synthesizes a candidate source-free; a closed loop renders and runs recovery again for ground truth. Under the exact lens the oracle makes no false accept on 246 labeled differing pairs, including an adversarial battery built to trap its concurrency quotient; on inputs outside the vocabulary and on real projects it abstains or refutes, accepting none we test. In-vocabulary recoveries reproduce their source frame for frame and 80% earn a static certificate, while whole real projects, mostly outside the vocabulary, recover at 14%, a vocabulary-bound rate the system never inflates with a wrong answer. A frontier vision-language model recovers none of the matched programs single-shot, which oracle-in-the-loop repair lifts only to a few while the structured pipeline recovers all, the gap a sound checker makes for an untrusted generator.


cs.SE 2026-07-02

Survey finds gender shapes views on terms like master/slave

by Ahmad J. Tayeb, Mohammad D. Alahmadi

The Perception and Impact of Non-inclusive Language in Software Artifacts

Women and non-binary respondents more often label the terms non-inclusive and report stronger effects on belonging.


Terminology such as "whitelist/blacklist," "master/slave," "man-hours," or "dummy value" has long been part of the technical vocabulary used in software artifacts, including source code, version histories, and documentation. In recent years, however, many of these expressions have been recognized as potentially non-inclusive and unwelcoming to groups historically underrepresented in software development, such as people of color, women, and individuals with disabilities. Consequently, a growing movement within the software industry has sought to replace these terms with more inclusive alternatives. Despite these initiatives, little is empirically known about how software developers perceive such terminology or how its continued use may influence their professional experiences and sense of belonging. This paper addresses the knowledge gap by examining how software developers perceive non-inclusive terminology in software and its perceived impact on team dynamics, productivity, belonging, and well-being. We surveyed open-source contributors and received 1,517 responses, of which 1,212 were complete and analyzed. On average, respondents reported low negative workplace impact overall; however, perceptions and impacts varied by demographic group. Women and non-binary participants, as well as respondents residing in the United States, were more likely to view the terms as non-inclusive. Among those who considered the terminology non-inclusive, non-binary participants reported higher overall negative impacts than male respondents, and female participants reported higher impact specifically on their sense of belonging.


cs.SE 2026-07-02

21% of student Scratch projects vary by script order

by Yuan Si, Jialu Zhang

SchedCheck: Schedule-Robustness Analysis for Event-Driven Block Programs

SchedCheck checks one representative schedule per dependence class and finds the rate holds on public projects too.


Block-based languages such as Scratch let beginners assemble interactive programs from sprites and scripts. These programs are concurrent in practice: green-flag scripts, broadcasts, and clones run as cooperatively scheduled threads over shared sprite and stage state, and their authors never write a thread. We show that such programs contain schedule-sensitive behaviors whose observable result depends on an execution order the language leaves open. Editing, saving, or remixing a project can produce a copy with the same blocks but a different layer order, changing the order the virtual machine starts scripts. We formalize the schedule space a Scratch virtual machine can realize as the permutations of the initial executable-target order, and define schedule-robustness against a lattice of observation lenses over a fixed horizon. A partial-order exploration runs one schedule per dependence-equivalence class, and on projects small enough to enumerate, an independent oracle confirms it recovers every realizable outcome. On larger projects, representatives stand in for the factorial under the validated dependence model. SchedCheck implements this on the production Scratch VM. Across 224 real student projects, at least 21% of the concurrent ones are schedule-sensitive at the grading lens, and a uniform random sample of public projects replicates the rate at 17.6%, with two real remixes of a deployed animation arranging its letters differently. On hand-built fault pairs and a generated benchmark of 32 spec-defined faults across four classes, the tool detects and localizes every schedule fault, with a logic-fault control reporting clean. The oracle exposed four unsoundness gaps in the dependence model, all repaired. The method is parametric in the execution model, instantiating unchanged on a second cooperative event loop.
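The schedule space described above, permutations of the initial target order checked against an observation lens, can be illustrated by brute force on a toy program. This sketch is not SchedCheck: it enumerates every permutation instead of one representative per dependence class, it runs each script atomically rather than modelling cooperative yields, and the sprites, scripts, and state fields are hypothetical.

```python
from itertools import permutations

# Hypothetical green-flag scripts, one per target, acting on shared state.
def cat_script(state):
    state["score"] = 10

def dog_script(state):
    state["score"] = state["score"] * 2

def bird_script(state):
    state["x"] = 5

SCRIPTS = {"cat": cat_script, "dog": dog_script, "bird": bird_script}
INITIAL = {"score": 0, "x": 0}

def run_schedule(order, scripts, initial_state):
    """Run each target's script to completion in the given start order."""
    state = dict(initial_state)
    for target in order:
        scripts[target](state)
    return state

def observable_outcomes(scripts, initial_state, lens):
    """Distinct final observations over all start orders.

    lens is the tuple of state fields an observer is allowed to see.
    """
    outcomes = set()
    for order in permutations(scripts):
        final = run_schedule(order, scripts, initial_state)
        outcomes.add(tuple((field, final[field]) for field in lens))
    return outcomes

def is_schedule_robust(scripts, initial_state, lens):
    """Robust means every schedule yields the same observation under the lens."""
    return len(observable_outcomes(scripts, initial_state, lens)) == 1

print(is_schedule_robust(SCRIPTS, INITIAL, ("score",)))
print(is_schedule_robust(SCRIPTS, INITIAL, ("x",)))
```

The same toy program is schedule-sensitive under a lens that observes `score` and robust under one that observes only `x`, which is why robustness is defined relative to a lens. Enumerating all orders grows factorially with the number of targets, which is the cost the partial-order exploration in the abstract is meant to avoid.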


cs.SE 2026-07-02

CoHiKer raises kernel fault localization accuracy up to 57 percent

by Haichi Wang, Ruiguo Yu +5 more

Towards Better Linux Kernel Fault Localization: Leveraging Contrastive Reasoning and Hierarchical Context Analysis

Contrastive analysis of mutated tests plus step-wise code narrowing yields higher precision at both file and method levels while using fewer tokens.


Debugging the Linux kernel remains a formidable challenge due to its vast codebase, complex architecture, and low-level programming intricacies. Effective fault localization (FL) is thus essential for efficient kernel debugging and maintenance. While existing FL techniques (both traditional and LLM-based) have shown promise in general-purpose software, they are ill-suited for the kernel context. In particular, recent LLM-based techniques often treat bug reports and source code as plain text, lacking deep integration of kernel-specific knowledge, which limits their ability to identify root causes and achieve fine-grained localization. We present CoHiKer, a novel LLM-based FL technique tailored to the Linux kernel. CoHiKer introduces two key innovations: (1) contrastive reasoning, which identifies root causes by analyzing the behavioral divergence between carefully mutated passing and failing test cases, and (2) hierarchical context analysis, which systematically narrows the localization scope from files to methods by integrating crash reports, syscall semantics, inter-file dependencies, and kernel-specific features. Unlike prior techniques that rely on static understanding and full-code input, CoHiKer decomposes the localization task and enables structured LLM prompting to reason semantically over meaningful contexts. We evaluate CoHiKer on an extended Linux kernel bug dataset against five state-of-the-art baselines. CoHiKer consistently outperforms all competitors, improving Top-1 localization accuracy by up to 26.07% at the file level and 56.85% at the method level over state-of-the-art LLM-based baselines, while achieving up to 8.84% and 28.9% reductions in token consumption, respectively. Furthermore, CoHiKer demonstrates strong generalizability on the non-kernel dataset, with comparable gains (15.5% and 5.3% in Top-1 at file and method levels).


cs.SE 2026-07-02

Active learning bounds how often AI patterns show up in code

by Srinath Perera, Hasinthaka Piyumal +2 more

A Methodology for Investigating AI Patterns Prevalence in Software Repositories

14 classes drawn from 44 sources; model on 100 GitHub repos beats random baseline and supplies numeric prevalence limits.


As Artificial Intelligence (AI)-based applications take off, a clear understanding of AI patterns can uplift the quality of AI applications. Many AI patterns have been proposed in the literature; however, their prevalence in real-life code has not yet been validated. Understanding the actual use of those patterns in practice can clarify our understanding both of the significance of these patterns and their utility. In this paper, we present a methodology to a) identify relevant patterns by mining the literature and then to b) validate their presence and prevalence in actual code repositories using active learning. To that end, we identify 14 AI pattern classes by mining 44 published AI pattern-related sources. Then we use an active learning approach to determine the prevalence of the most common pattern class across 100 GitHub open AI repositories. Using prevalence estimation, we propose bounds on the accuracy of the occurrences. The model achieves 56% accuracy and 55% recall in an 8-way classification task, significantly outperforming the 11% random-chance baseline. Furthermore, the prevalence estimation offers usable bounds for analyzing pattern applications. This methodology provides a robust foundation to start understanding how AI patterns are used in practice, a field that currently lacks empirical data.


cs.SE 2026-07-02

LLM agents detect 31 new bugs in PyTorch

by Shaoyu Yang, Haifeng Lin +7 more

Rise From The Ashes: LLM-based Static Analysis for Deep Learning Framework Bugs

Static SBIR modeling of tensor flows finds issues across backends without running tests.


Deep learning (DL) frameworks are critical AI infrastructures that often hide bugs with serious security implications. While dynamic approaches such as fuzzing are effective in uncovering these bugs, they require real test execution and incur high computational costs. Static analysis is a natural complement because it can detect bugs without runtime execution, offering fast and scalable testing. Unfortunately, there is still limited work targeting static analysis for DL frameworks due to their multilingual architectures and tensor-related program state. We present Phoenix, the first LLM-based static analysis technique for DL frameworks. Our key insight is that cross-language tensor flows in DL frameworks can be modeled, together with concrete code context, as a structured semantic bridge intermediate representation (SBIR) that LLMs can analyze for potential bugs in tensor semantic propagation. We implement this insight through a multi-agent workflow. A summarization agent first distills bug summaries from historical bug-fix patches and CWE rules. Guided by each summary, an extraction agent identifies bug-relevant repository symbols for code retrieval, and a generation agent synthesizes grounded SBIRs from the retrieved context. Finally, an analysis agent is leveraged to check SBIRs and report potential bugs. Our evaluation shows that Phoenix is a practical complement to dynamic DL framework testing for bug finding. To date, Phoenix has found 31 real new bugs in PyTorch for different heterogeneous hardware backends (Intel CPU, NVIDIA CUDA, and Apple MPS). Among them, 20 submitted bug-fixing patches have been merged into upstream.


cs.HC 2026-07-02

Developers accept AI under oversight but limit it on identity tasks

by Rudrajit Choudhuri, Christian Bird +3 more

You Shall Not Pass! Where and Why Developers Draw The Line on AI Autonomy

Survey of 448 Microsoft developers ties lower autonomy acceptance to task identity, accountability, and personal risk tolerance.


As AI takes on more software work, the line between human and AI effort is shifting. Where developers draw that line around AI autonomy bears on how we design tools and roles that preserve meaningful work. Drawing on cognitive appraisal theory, work design, and automation research, we conducted a mixed-methods study of 448 professional developers at Microsoft to investigate their accepted levels of AI autonomy across software engineering work. Most developers accepted AI producing work under their oversight, although accepted autonomy varied substantively across tasks and individuals. Acceptance was lowest for identity-defining, human-facing, and design-oriented work, and higher among developers with more AI experience and risk tolerance. Task accountability was associated with lower odds of allowing AI to act on developers' behalf, whereas task identity was associated with lower odds of granting AI decision-making autonomy. Task demands had the opposite effect, increasing willingness to delegate decision-making to AI. Our findings suggest that preferences for AI autonomy reflect how developers cognitively experience their work, highlighting important considerations for designing meaningful work.


cs.SE 2026-07-02

Quantum software comparisons rarely survive locked audit

by Boshuai Ye, Peng Liang +2 more

Auditing Empirical Comparisons in Quantum Software

Of 455 claims from 119 papers, only 8 expose matched evidence; 2 sustain, 4 remain unresolved, 2 reverse.


Empirical quantum-software papers often report that one compiler, optimizer, backend, or ansatz outperforms another. Such comparisons are not properties of a tool alone: they can change with benchmark scope, circuit construction, compilation, sampling, backend or noise assumptions, optimizer choices, and resource budgets. Existing testing, benchmarking, and reproducibility methods help assess programs, tools, executions, and platforms, but they do not directly audit whether the reported comparison itself is supported by the evidence exposed in the source paper or accompanying materials. We present CLAIMSTAB-QC, a source-bounded framework for auditing empirical comparisons in quantum software. Given a reported comparison, the framework records the baselines, metric, relation, and admissible evidence; locks the comparison design before outcomes are computed; and reports either a scoped relation outcome or an explicit evidence boundary. For strict scalar-directional comparisons, the reported direction is classified as Sustained, Unresolved, or Reversed within the locked audit scope. We evaluate CLAIMSTAB-QC on 455 comparative claims from 119 quantum-software papers. The central finding is a materialization gap: 175 claims can be represented for audit planning, 79 become scalar-directional planning records, 53 yield lockable audit or diagnostic designs, and only 8 expose enough matched evidence to audit the original comparison without proxy reconstruction. These 8 records yield 2 Sustained, 4 Unresolved, and 2 Reversed outcomes. Controlled diagnostics over 24 benchmark-relevant comparisons further show that simpler checks can preserve apparent directions whose support weakens under locked audit designs.


cs.SE 2026-07-02

LLMs outperform baselines at detecting equivalent mutants

by Honglin Shu, Zhao Tian +6 more

Large Language Models for Multi-Lingual Equivalent Mutant Detection: An Extended Empirical Study

Fine-tuned models reach higher F1 scores on Java and C pairs while generalizing across languages.


Mutation testing is a powerful technique for ensuring software quality. However, the presence of equivalent mutants introduces unnecessary costs and biases, limiting its practical effectiveness. Although numerous equivalent mutant detection (EMD) methods have been proposed, they often face distinct challenges: pure-code analysis methods can be limited by their reliance on specific compiler infrastructures, while existing machine-learning approaches remain constrained by scarce training data and limited generalization to unseen mutants. Large language models (LLMs) have recently demonstrated remarkable performance across diverse code-related tasks by better capturing program semantics. Yet their potential for EMD remains largely unexplored, particularly in the multi-lingual context. This paper presents the first comprehensive empirical study on LLMs for EMD, using 3,302 Java and 1,088 C mutant pairs to benchmark against state-of-the-art methods, explore strategy variations, assess efficiency, and evaluate cross-lingual generalization. Experimental results show that LLM-based approaches achieve higher F1-scores than the evaluated traditional methods, with fine-tuned code embedding yielding the highest detection accuracy among the tested strategies. Moreover, LLM-based approaches strike a practical balance between effectiveness and efficiency with inference times comparable to existing machine-learning models. Importantly, fine-tuned LLMs demonstrate measurable generalization across programming languages. These findings establish LLMs as a viable and efficient approach for tackling the longstanding challenge of equivalent mutant detection, offering new directions for advancing mutation testing in practice.


cs.SE 2026-07-02

Contributors protect GitHub projects

by Mohit Kaushik, Kuljit Kaur Chahal

Analysis of 73,000 repositories shows labor capacity outweighs visibility and easy-access features worsen the popularity penalty.


Social coding platforms such as GitHub host millions of repositories, yet many suffer from high mortality rates. Despite this, several survival factors remain poorly understood. Human capital is widely recognized as essential. Social attention, while often assumed to be a lifeline, can become a liability. Structural features that improve onboarding, such as code readability and documentation, may also accelerate the cessation of active development when combined with massive visibility. To examine these dynamics, we analyzed more than 73,000 GitHub repositories using an Accelerated Failure Time (AFT) survival framework, which accounts for the time-varying nature of predictors. Our study identifies human capital as the most critical determinant of project survival. In contrast, excessive social attention emerges as a liability, and when coupled with accessibility features, it amplifies the risk of project inactivity. Importantly, when the number of contributors interacts with social popularity, the protective effect of labor becomes visible, highlighting the need for governance strategies that balance visibility with labor capacity to ensure the long-term resilience of open-source projects.


cs.SE 2026-07-02

BT-APE matches heavy prompt engineering accuracy at 72% lower token cost

by Mohammad Amin Zadenoori, Waad Alhoshan +3 more

BT-APE: A Computationally Light Backtracking Approach to Automatic Prompt Engineering for Requirements Classification

Backtracking search plus dynamic examples delivers PE2-level results for requirements classification while cutting input tokens by 72 percen


Large language models (LLMs) are increasingly applied to requirements engineering (RE) tasks, yet the prompts guiding them are typically designed manually through trial and error, yielding inconsistent and suboptimal results. Automated prompt construction remains largely unexplored in RE, leaving its effectiveness unclear. To address this, we propose a lightweight Automatic Prompt Engineering approach, Backtracking APE (BT-APE), and apply it to requirements classification. We frame prompt design as an optimization problem, iteratively refining prompts via LLM-generated candidates, backtracking search, and dynamic example selection. Evaluating BT-APE on three benchmark datasets with five instruction-tuned LLMs, we compare it against four classical prompting baselines (zero-shot, few-shot, chain-of-thought, CoT+few-shot) and a state-of-the-art but resource-intensive APE baseline (PE2). BT-APE and PE2 achieve nearly identical accuracy, both substantially outperforming the classical baselines with large effect sizes; however, BT-APE imposes a far lighter computational footprint, consuming roughly 72% fewer input tokens and 66% less wall-clock time at equivalent accuracy, making it better suited to resource-constrained deployment. Our contributions are threefold: (i) a lightweight APE framework with an open interactive tool and replication package; (ii) the first systematic comparison of APE against classical prompting for requirements classification; and (iii) insights into how class definitions and prompt evolution affect performance.


cs.SE 2026-07-02

Trace data turns agent model choice into economic decisions

by Richard Kang, Vincent Wang

Registry-Governed Agent Lifecycle: Completing EDDOps with Evaluation-Driven Registration, Promotion, and Retirement on AWS AgentCore

Registry governance on AWS AgentCore uses evaluation evidence to balance quality, cost, and reliability across the agent lifecycle.


Enterprise adoption of LLM agents requires model selection methods that balance quality, reliability, safety, latency, and cost. Evaluation-Driven Development and Operations (EDDOps) positions evaluation as a continuous governing function across the agent lifecycle rather than a terminal checkpoint. This paper presents a practitioner-oriented instantiation of EDDOps on AWS Bedrock AgentCore and proposes a cost-to-performance framework for selecting foundation models in enterprise agent architectures. We make three contributions: a conceptual synthesis explaining why traditional TDD/BDD methods are insufficient for non-deterministic LLM agents; an architectural mapping of the EDDOps reference architecture onto AgentCore Runtime, Evaluations, Agent Registry, and CloudWatch observability; and an empirical cost-to-performance decision framework validated through a proof-of-concept comparing three foundation models across two deployment paths. Using trace data from 30 single-turn invocations across six agents, 9 multi-turn evaluations, and registry-integrated governance, we show how evaluation evidence can convert model selection from a benchmark-ranking exercise into a governed economic decision. The results suggest that managed agent platforms can support EDDOps when they provide trace-native observability, pluggable evaluator frameworks, and governed registry-based discovery.


cs.SE 2026-07-01

AI C++ code twice as likely to trigger runtime violations

by Saif Mahmud, Fadul Sikder +3 more

The Illusion of Safety: Multi-Tier Verification of AI vs. Human C++ Code

Multi-tier checks on 8,918 programs reveal static analysis hides the safety gap with human code


Large language models increasingly generate C++, a memory-unsafe language where a single overlooked violation can become an exploitable bug. Yet most security evaluations of AI-generated code rely on static analysis alone, which flags warnings without confirming runtime violations or reasoning about untested paths. We ask whether AI-generated C++ is measurably less safe than human-written code, and whether common verification tools agree on the risk. We introduce VULBENCH-CPP, a benchmark of 8,918 C++ programs from three open-weight LLMs (Gemma 3 27B IT, LLaMA 3.3 70B Instruct, Qwen 2.5 Coder 32B Instruct) and human authors across 851 competitive-programming tasks. Each program is annotated by four verification tiers: functional testing, static analysis (cppcheck, clang-tidy), dynamic analysis (ASan/UBSan), and bounded model checking (ESBMC). Accounting for the correlation among solutions to a shared task, we find that AI-generated code is roughly twice as likely as human code to trigger a confirmed runtime violation, even after controlling for code length and test pass-rate. Under static analysis the two look equally safe, but this is misleading: the apparent similarity reflects code length rather than real safety, and the tiers detect largely different classes of violation, so no single tier is sufficient. The gap is consistent across independent generations.


cs.SE 2026-07-01

Tool produces 56k method context records from 20 Java repos

by Alessandro Botta, Shiven Garisa +4 more

CoCoMUT: A Tool for Code-Context Mining and Automated Dataset Generation

Reconciles 97.8% of call edges to source and passes 99% manual audit


Software-engineering assistants often need method-level context beyond an isolated body, including enclosing-class information, documentation, callers, callees, type hierarchy, and structural characteristics. Manually collecting this context is time-consuming, inconsistent, and difficult to reproduce across large Java projects. We present CoCoMUT, a Java tool for Code-Context Mining and Automated Dataset Generation. CoCoMUT extracts context for a focal method or generates datasets at class, package, or system scope. It discovers project structure, resolves build and classpath information, constructs a SootUp static call graph, and reconciles bytecode-level call edges with Spoon-based source extraction. Each method record combines source, class, documentation, call-graph, and metadata context, providing reproducible inputs for training and running learned software-engineering techniques. The key contribution is a reusable, task-independent pipeline that unifies build discovery, source extraction, call-graph construction, source-bytecode reconciliation, and versioned JSON dataset generation. The resulting records can be consumed individually as context for a focal method or collectively as datasets for documentation, explanation, testing, review, repair, search, and program-comprehension workflows. We evaluate CoCoMUT on 20 real-world Java repositories evenly split between Maven and Gradle. CoCoMUT processed all 20 repositories, emitting 56,512 method-context records and 386,048 serialized call edges. Among call edges whose bytecode targets belonged to project source, CoCoMUT reconciled 97.8% to source method identities. In a manual audit of 200 randomly sampled methods across 10 systems, 99.0% of generated context records passed all applicable correctness checks.


cs.SE 2026-07-01

Resolver signals fail to beat age for interface adoption

by Faruk Alpay, Baris Basaran

Interface-Variant Dynamics in Software Ecosystems: Resolver-Induced Selection and Adoption in Package Graphs

Temporal tests on four major registries show checker-free resolver features underperform age-only baselines when forecasting blocked-to-admi


Compatibility research usually treats an interface change as a local writer-reader decision. Distributed software stacks make that decision population structured: an RPC, telemetry, middleware, or service-contract variant is introduced by one provider release and then spreads, stalls, or is mediated across consumers, transitive dependencies, and resolver rules. This paper asks when that observation is a load-bearing software-engineering estimator rather than evolutionary relabeling. We mine interface histories, audit npm, Maven Central, PyPI, and crates.io package graphs, execute 2100 package-manager resolver probes, estimate an ecosystem-specific selection coefficient $s$ from clean conflict probabilities, and use that measured $s$ to forward evaluate a pairwise-comparison absorbing process on the observed package graph. We separate three evidential roles. Fixation is a forward evaluation, not independent evidence: once $s$ is measured, deviation from $1/N$ follows mechanically from the non-neutral process. Checker-derived direction carries adoption signal: a direction-permutation null gives checker-direction gap MAE 0.07 versus null median 0.43 ($p=0.002$). But because that direction is derived from the same boundary state whose admitting frequency is predicted, it is a diagnostic rather than an orthogonal selection test. The stricter checker-free temporal test asks whether early resolver-channel features predict later blocked-to-admitted flips; in this snapshot they do not beat age-only (Brier 0.28 versus 0.24, AUC 0.51 versus 0.54). The result is a reproducible estimator audit for interface-variant dynamics in distributed package graphs, showing where resolver evidence becomes population input and where the current registry data still fail to close the resolver-to-adoption loop.
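For readers less familiar with the two forecast metrics quoted in the abstract above, the following is a minimal sketch of how a Brier score and an AUC are conventionally computed for binary outcomes. It is not the authors' evaluation code: the function names and the toy data are assumptions made for illustration only. Lower Brier is better and higher AUC is better; a constant forecast of 0.5 scores 0.25 on Brier and 0.5 on AUC whatever the outcomes are, which gives a reference point for reading the reported values.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal in length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def auc(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Chance that a random positive is scored above a random negative (ties count half)."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative outcome")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical toy data: 1 = the edge flipped from blocked to admitted, 0 = it did not.
    outcomes = [1, 0, 0, 1]
    uninformative = [0.5, 0.5, 0.5, 0.5]
    print(brier_score(uninformative, outcomes), auc(uninformative, outcomes))
```

The AUC here uses the pairwise-ranking definition, which is quadratic in the number of examples; that keeps the sketch short and dependency-free but would be replaced by a sort-based implementation on large datasets.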
