Agent skills: A data-driven analysis of Claude skills for extending large language model functionality
Canonical reference. 80% of citing Pith papers cite this work as background.
years: 2026 (28)
roles: background (5)
citing papers explorer
- MalSkillBench: A Runtime-Verified Benchmark of Malicious Agent Skills
  MalSkillBench supplies the first sandbox-verified dataset of malicious agent skills and shows that existing detectors achieve high recall on code injection but collapse on prompt injection and agent-control attacks.
- HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?
  Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.
- FermiLink: A Unified Agent Framework for Multidomain Autonomous Scientific Simulations
  FermiLink is a unified AI agent framework that automates multidomain scientific simulations via separated package knowledge bases and a four-layer progressive disclosure mechanism, reproducing 56% of target figures in benchmarks and generating research-grade results on unpublished problems.
- Cloak and Detonate: Scanner Evasion and Dynamic Detection of Agent Skill Malware
  SkillCloak evades existing static scanners for agent skill malware at high rates, while SkillDetonate detects 97% of attacks at 2% false-positive rate using sandboxed runtime behavior analysis.
- From Registry to Repository: How AI Agent Skills Are Written, Adapted, and Maintained
  Empirical study of 41k+ AI agent skills finds reuse is mostly one-time verbatim copying with 53% never modified afterward and maintenance focused on additive local adaptations.
- Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning
  SelSkill applies dual-granularity preference learning to selective skill-or-skip decisions, improving task success by 10.9 points and execution precision by 29.1 points on ALFWorld with Qwen3-8B.
- SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
  SkillSafetyBench is a benchmark of 155 cases across 47 tasks and 6 risk domains showing that non-user attacks via skills, artifacts, or environments can consistently induce unsafe agent behavior.
- Skill-CMIB: Multimodal Agent Skill for Consistent Action via Conditional Multimodal Information Bottleneck
  CMIB uses a conditional multimodal information bottleneck to create reusable agent skills that separate verbalizable text content from predictive perceptual residuals, improving execution stability.
- An Empirical Study of Agent Skills for Healthcare: Practice, Gaps, and Governance
  Public healthcare agent skills emphasize workflow automation over clinical diagnostics and treatments, with uneven lifecycle coverage and weak alignment between technical and clinical risk.
- Runtime Skill Audit: Targeted Runtime Probing for Agent Skill Security
  Runtime Skill Audit introduces targeted runtime probing to detect malicious LLM agent skills, reporting 90% accuracy and resilience to self-evolving attacks on 100 skills versus static baselines.
- Skill Coverage: A Test Adequacy Metric for Agent Skills
  Skill coverage is a binary test adequacy metric that extracts observable behavior constraints from skill documents and judges whether trajectories provide sufficient evidence to cover each constraint, revealing 39.90-43.98% coverage on SkillsBench.
- Workflow-to-Skill: Skill Creation via Routing-Workflow-Semantics-Attachments Decomposition
  W2S framework with RWSA decomposition converts heterogeneous traces into Skills and improves behavioral replay consistency by 10.5% over summarization baselines on 70 Skills.
- SciVisAgentSkills: Design and Evaluation of Agent Skills for Scientific Data Analysis and Visualization
  SciVisAgentSkills provides reusable agent skills that raise mean task scores on a 108-task SciVis benchmark when paired with Codex and Claude Code agents.
- Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill
  Skill-RM unifies heterogeneous reward criteria by modeling reward computation as dynamic execution of a reusable Reward-Evaluation Skill within an agent framework.
- FederatedSkill: Federated Learning for Agentic Skill Evolution
  FederatedSkill aggregates client semantic skill diffs via a server evolution agent to enable strictly personalized skill evolution, reporting up to 44.4% higher success rates and 37.5% lower compute cost than self-evolving baselines across 20 task families.
- SkillGuard: A Permission Framework for Agent Skills
  SkillGuard presents a dual-plane permission framework for agent skills that achieves 99.76% taxonomy coverage and reduces attack success rates in evaluations on 315 skills.
- AgensFlow: A Coordination-Policy Substrate for Multi-Agent Systems
  AgensFlow learns coordination policies from task trajectories and outperforms fixed pipelines on distributed-systems incident and security-advisory tasks.
- CODESKILL: Learning Self-Evolving Skills for Coding Agents
  CODESKILL trains an LLM policy via RL on hybrid rewards to extract and maintain multi-granularity skills from agent trajectories, raising pass rates 9.69 points over no-skill baselines on three coding benchmarks while keeping the skill bank compact.
- SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks
  SearchSkill improves LLM query planning on knowledge QA by using explicit skill selection from an evolving SkillBank and a two-stage SFT process that aligns training with inference-time skill-grounded execution.
- Skill Retrieval Augmentation for Agentic AI
  Introduces SRA paradigm and SRA-Bench benchmark (5,400 tasks, 26,262 skills) showing retrieval improves performance but LLMs fail to selectively incorporate retrieved skills.
- SkillHone: A Harness for Continual Agent Skill Evolution Through Persistent Decision History
  SkillHone introduces a harness that maintains persistent decision histories to support continual evolution of language-model agent skills, reporting 15.8-point gains on GAIA over a commercial deep-research agent.
- Unsupervised Skill Discovery for Agentic Data Analysis
  DataCOPE uses verifier-guided contrastive distillation from agent trajectories to discover skills, yielding average gains of 9.71% on report-style and 32.30% on reasoning-style data analysis tasks across four model settings.
- SkillComposer: Learning to Evolve Agent Skills for Specification and Generalization
  SkillComposer decomposes skill construction into create/improve/merge operations trained by rejection sampling, enabling self-evolving skills that improve agent and code task performance while generalizing to unseen domains.
- Skill0.5: Joint Skill Internalization and Utilization for Out-of-Distribution Generalization in Agentic Reinforcement Learning
  Skill0.5 is an agentic RL framework that internalizes general skills for hard tasks and utilizes task-specific skills for easy tasks via a dynamic difficulty-aware router to improve out-of-distribution generalization.
- SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution
  SkillsVote is a governance system for agent skills that profiles corpora, recommends via search, and gates updates on successful reusable outcomes, yielding benchmark gains without model changes.
- Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
  LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.
- Contractual Skills: A GovernSpec Design Framework for Enterprise AI Agents
  Contractual skills framework structures SKILL.md files as readable task contracts; A/B tests on synthetic tasks show mean quality rising from 4.692 to 4.914 and critical-error rate falling from 0.083 to 0.013 across models.
- Red Skills or Blue Skills? A Dive Into Skills Published on ClawHub
  Analysis of ClawHub shows language-based functional divides in agent skills, with over 30% flagged suspicious and submission-time documentation enabling 73% accurate risk prediction.