SkillCraft

Can LLM Agents Learn to Use Tools Skillfully?
A benchmark with 126 realistic tasks to evaluate whether LLM agents can identify, compose, cache, and reuse multi-step tool sequences — achieving higher success with dramatically fewer tokens.
Shiqi Chen1,*, Jingze Gai2,*, Ruochen Zhou3,*, Jinghan Zhang1,*, Tongyao Zhu4,*, Junlong Li1, Kangrui Wang1, Zihan Wang5, Zhengyu Chen2, Klara Kaleb1, Ning Miao1, Siyang Gao2, Cong Lu6, Manling Li3, Junxian He4, Yee Whye Teh1
1University of Oxford    2City University of Hong Kong    3Northwestern University    4Hong Kong University of Science and Technology    5National University of Singapore    6Google DeepMind    7Zhejiang University
* Equal contribution
📋 Browse Task Definitions 126 tasks across 21 families and 6 application domains, with full prompts and configurations.
📊 Compare Model Trajectories Side-by-side base vs. skill mode execution traces for 7 frontier models.
Overview
What is SkillCraft and why does it matter?

Real-world tool-using agents operate over long-horizon workflows with recurring structure and diverse demands, where effective behavior requires not only invoking atomic tools but also abstracting and reusing higher-level tool compositions. However, existing benchmarks mainly measure instance-level success under static tool sets, offering limited insight into agents' ability to acquire such reusable skills.

We address this gap by introducing SkillCraft, a benchmark explicitly designed to stress-test an agent's ability to form and reuse higher-level tool compositions, which we call Skills. SkillCraft features realistic, highly compositional tool-use scenarios whose difficulty scales along both quantitative and structural dimensions, eliciting skill abstraction and cross-task reuse.

We further propose a lightweight Skill Mode evaluation protocol that enables agents to autonomously discover, compose, cache, and execute reusable Skills, accumulating a persistent library of verified skills over time. This plug-and-play mechanism markedly improves both success and efficiency, reducing token usage by up to 79%.

SkillCraft Protocol Pipeline Overview
Figure: SkillCraft Protocol Pipeline Overview. (1) Test-Time Tool Chain Evolution: The agent explores and chains atomic tools, forming executable tool sequences. (2) Iterative Skill Composition: Successful sequences are abstracted into candidate skills, executed and verified in a coding environment; failed executions trigger re-exploration. (3) Skill Library: A growing repository of verified, reusable skills that can be retrieved in later tasks to reduce low-level tool exploration.
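The three stages in the figure can be sketched as one control loop. In the sketch below, `explore`, `abstract`, and `execute_in_sandbox` are simplified stubs standing in for the agent's LLM-driven exploration and verification, not the protocol's real implementation.

```python
# Illustrative stubs for the three pipeline stages; real steps are LLM-driven.
def explore(task):
    return task["tools"]                       # (1) chain atomic tools


def abstract(sequence):                        # (2) compose a candidate skill
    return {"name": "skill_" + "_".join(sequence), "steps": sequence}


def execute_in_sandbox(candidate):             # verify by actually executing
    return len(candidate["steps"]) > 0         # stand-in success criterion


def acquire_skill(task, library, max_attempts=3):
    for _ in range(max_attempts):
        sequence = explore(task)
        candidate = abstract(sequence)
        if execute_in_sandbox(candidate):
            library[candidate["name"]] = candidate  # (3) cache verified skill
            return candidate
        # a failed execution triggers re-exploration on the next attempt
    return None


library = {}
skill = acquire_skill({"tools": ["search", "filter"]}, library)
```

After a successful pass, the cached entry can be retrieved in later tasks, replacing the low-level exploration step entirely.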
126
Tasks
21
Task Families
8
Models Evaluated
6
Domains
79%
Max Token Savings
Benchmark Design
How SkillCraft tasks are constructed and organized
Three-stage task construction pipeline
Three-stage task construction pipeline. Stage 1 (Exploratory Phase): Survey existing benchmarks to derive task design principles. Stage 2 (Seed Task Construction): Build seed tasks from Web APIs and local tools. Stage 3 (Systematic Scaling): Scale along entity count and subtask complexity to create 126 benchmark tasks.
Task Distribution in SkillCraft
Task distribution in SkillCraft. 21 task families across 6 application domains (Food & Lifestyle, Science & Environment, Developer & Web, Education & Society, Reference, Entertainment & Gaming). Tasks span 3 difficulty levels: Easy (63), Medium (42), and Hard (21), with systematic complexity scaling through entity count and subtask multiplicity.
Main Results
Base mode vs. Skill mode comparison across 8 frontier models on 126 tasks
| Model | Skill Exec. Rate | Reuse Factor | Success (Base) | Success (Skill) | Δ Success | Token Δ | Cost Δ |
|---|---|---|---|---|---|---|---|
| Open-Source Models | | | | | | | |
| Kimi-K2-Thinking | 70% | 3.4× | 55/126 (44%) | 56/126 (44%) | +0.8% | -42% | -39% |
| DeepSeek-V3.2-EXP | 71% | 4.8× | 76/126 (60%) | 87/126 (69%) | +8.7% | -49% | -51% |
| DeepSeek-R1 | 62% | 3.4× | 89/126 (71%) | 101/126 (80%) | +9.5% | -30% | -24% |
| GLM-4.7 | 91% | 3.7× | 91/126 (72%) | 108/126 (86%) | +13.5% | -39% | -41% |
| Minimax-M2.1 | 100% | 3.2× | 117/126 (93%) | 119/126 (94%) | +1.6% | -11% | -8% |
| Closed-Source Models | | | | | | | |
| GPT-5.2 | 84% | 3.8× | 109/126 (87%) | 114/126 (90%) | +4.0% | -79% | -75% |
| Gemini 3 Pro | 93% | 3.9× | 108/126 (86%) | 116/126 (92%) | +6.3% | -54% | -49% |
| Claude 4.5 Sonnet | 81% | 3.4× | 119/126 (94%) | 121/126 (96%) | +1.6% | -71% | -74% |
Data from Table 2 of the paper. Token Δ and Cost Δ represent percentage change from base to skill mode (negative = reduction).
Key Findings
Insights from evaluating 8 frontier models on the SkillCraft benchmark
1

Skill Reuse Boosts Success

Skill Mode improves task success rate across all models. Mid-tier models show the largest gains — GLM-4.7 jumps from 72% to 86%, and DeepSeek-R1 from 71% to 80% — demonstrating that skill composition can bridge capability gaps.

Up to +13.5% success gain
2

Dramatic Efficiency Gains

Agents accomplish the same tasks with far fewer tokens by reusing cached skill compositions instead of re-exploring from scratch. GPT-5.2 achieves 79% token reduction and 75% cost reduction while improving accuracy.

Up to 79% fewer tokens
3

Skill Ability Scales with Strength

Stronger models achieve higher skill execution rates (81–100%) and reuse factors (3.2–4.8×). Success rate correlates strongly with tool composition ability, underscoring skill acquisition as a core LLM capability.

3.2–4.8× reuse factor
Why Skill Mode improves efficiency
Why can Skill Mode improve efficiency? In normal mode, token-heavy tool outputs bloat the context with extraneous data — and the same verbose output becomes the next input, compounding cost. In Skill Mode, the agent composes a reusable skill that extracts only what's needed, so each piece of information only needs to pass through once.
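A back-of-envelope calculation makes the compounding concrete. The token counts below are invented for illustration; the point is only the mechanism: when every raw tool output stays in the context, it is re-processed on every later turn of the episode.

```python
# Hypothetical numbers illustrating how verbose tool outputs compound in cost.
def context_tokens(output_tokens_per_call, calls):
    # Call i's output is processed once and then re-sent on every later turn,
    # for (calls - i) passes through the model in total.
    return sum(output_tokens_per_call * (calls - i) for i in range(calls))


base = context_tokens(2000, 10)   # base mode: raw 2k-token outputs, 10 calls
skill = context_tokens(50, 10)    # skill mode: only a ~50-token extract remains
```

With these invented numbers, the skill-mode episode processes roughly 2.5% of the base-mode tokens. The same compounding effect drives the up-to-79% reductions reported in the results table.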
Citation
If you find SkillCraft useful in your research, please cite our paper
@article{chen2026skillcraft,
  title   = {SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?},
  author  = {Shiqi Chen and Jingze Gai and Ruochen Zhou and Jinghan Zhang and
             Tongyao Zhu and Junlong Li and Kangrui Wang and Zihan Wang and
             Zhengyu Chen and Klara Kaleb and Ning Miao and Siyang Gao and
             Cong Lu and Manling Li and Junxian He and Yee Whye Teh},
  journal = {arXiv preprint},
  year    = {2026}
}