Real-world tool-using agents operate over long-horizon workflows with recurring structure and diverse demands, where effective behavior requires not only invoking atomic tools but also abstracting and reusing higher-level tool compositions. However, existing benchmarks mainly measure instance-level success under static tool sets, offering limited insight into agents' ability to acquire such reusable skills.
We address this gap by introducing SkillCraft, a benchmark explicitly designed to stress-test agents' ability to form and reuse higher-level tool compositions, which we call Skills. SkillCraft features realistic, highly compositional tool-use scenarios with difficulty scaled along both quantitative and structural dimensions, designed to elicit skill abstraction and cross-task reuse.
We further propose a lightweight Skill Mode evaluation protocol that enables agents to autonomously discover, compose, cache, and execute reusable Skills, accumulating a persistent library of verified skills over time. This plug-and-play mechanism substantially improves both efficiency and success rates, reducing token usage by up to 79%.
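The discover–compose–cache–execute loop described above can be sketched as a minimal skill library. All names here (`Skill`, `SkillLibrary`, the `verified` flag) are illustrative assumptions, not the benchmark's actual API:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Skill:
    """A named, verified composition of atomic tool calls."""
    name: str
    steps: List[str]            # ordered atomic tool names the skill composes
    run: Callable[..., object]  # the composed executable
    uses: int = 0               # how often the cached skill has been executed


class SkillLibrary:
    """Persistent store of verified skills, reused across tasks."""

    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}

    def cache(self, skill: Skill, verified: bool) -> bool:
        # Only verified compositions enter the library.
        if verified and skill.name not in self._skills:
            self._skills[skill.name] = skill
            return True
        return False

    def has(self, name: str) -> bool:
        return name in self._skills

    def execute(self, name: str, *args: object) -> object:
        # Reuse a cached composition instead of re-exploring from scratch.
        skill = self._skills[name]
        skill.uses += 1
        return skill.run(*args)

    def reuse_factor(self) -> float:
        # Mean executions per distinct cached skill; one plausible reading of
        # the "Reuse Factor" metric (the paper may define it differently).
        if not self._skills:
            return 0.0
        return sum(s.uses for s in self._skills.values()) / len(self._skills)
```

In this sketch, an agent first checks `has()` before exploring; on a cache hit it pays only the cost of `execute()`, which is where the token savings would come from.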
| Model | Skill Exec | Reuse Factor | Success (Base) | Success (Skill) | Δ Success | Token Δ | Cost Δ |
|---|---|---|---|---|---|---|---|
| **Open-Source Models** | | | | | | | |
| Kimi-K2-Thinking | 70% | 3.4× | 55/126 (44%) | 56/126 (44%) | +0.8% | -42% | -39% |
| DeepSeek-V3.2-EXP | 71% | 4.8× | 76/126 (60%) | 87/126 (69%) | +8.7% | -49% | -51% |
| DeepSeek-R1 | 62% | 3.4× | 89/126 (71%) | 101/126 (80%) | +9.5% | -30% | -24% |
| GLM-4.7 | 91% | 3.7× | 91/126 (72%) | 108/126 (86%) | +13.5% | -39% | -41% |
| Minimax-M2.1 | 100% | 3.2× | 117/126 (93%) | 119/126 (94%) | +1.6% | -11% | -8% |
| **Closed-Source Models** | | | | | | | |
| GPT-5.2 | 84% | 3.8× | 109/126 (87%) | 114/126 (90%) | +4.0% | -79% | -75% |
| Gemini 3 Pro | 93% | 3.9× | 108/126 (86%) | 116/126 (92%) | +6.3% | -54% | -49% |
| Claude 4.5 Sonnet | 81% | 3.4× | 119/126 (94%) | 121/126 (96%) | +1.6% | -71% | -74% |
- **Up to +13.5% success gain.** Skill Mode improves task success rate across all models. Mid-tier models show the largest gains: GLM-4.7 jumps from 72% to 86%, and DeepSeek-R1 from 71% to 80%, demonstrating that skill composition can bridge capability gaps.
- **Up to 79% fewer tokens.** Agents accomplish the same tasks with far fewer tokens by reusing cached skill compositions instead of re-exploring from scratch. GPT-5.2 achieves a 79% token reduction and a 75% cost reduction while improving accuracy.
- **3.2–4.8× reuse factor.** Stronger models achieve higher skill execution rates (81–100%) and reuse factors (3.2–4.8×). Success rate correlates strongly with tool composition ability, underscoring skill acquisition as a core LLM capability.
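As a sanity check on the table above, the Δ Success column is the absolute percentage-point gain over the 126 tasks, recomputable from the raw counts:

```python
# Recompute Δ Success from (base successes, skill successes, total tasks).
rows = {
    "GLM-4.7": (91, 108, 126),
    "DeepSeek-R1": (89, 101, 126),
    "GPT-5.2": (109, 114, 126),
}
for model, (base, skill, n) in rows.items():
    delta = 100 * (skill - base) / n
    print(f"{model}: +{delta:.1f}%")  # +13.5%, +9.5%, +4.0%
```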
@article{chen2026skillcraft,
title = {SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?},
author = {Shiqi Chen and Jingze Gai and Ruochen Zhou and Jinghan Zhang and
Tongyao Zhu and Junlong Li and Kangrui Wang and Zihan Wang and
Zhengyu Chen and Klara Kaleb and Ning Miao and Siyang Gao and
Cong Lu and Manling Li and Junxian He and Yee Whye Teh},
journal = {arXiv preprint},
year = {2026}
}