Real-world tool-using agents operate over long-horizon workflows with recurring structure and diverse demands, where effective behavior requires not only invoking atomic tools but also abstracting and reusing higher-level tool compositions. However, existing benchmarks mainly measure instance-level success under static tool sets, offering limited insight into agents' ability to acquire such reusable skills.
We address this gap by introducing SkillCraft, a benchmark explicitly designed to stress-test agents' ability to form and reuse higher-level tool compositions, which we call Skills. SkillCraft features realistic, highly compositional tool-use scenarios with difficulty scaled along both quantitative and structural dimensions, designed to elicit skill abstraction and cross-task reuse.
We further propose a lightweight Skill Mode evaluation protocol that enables agents to autonomously discover, compose, cache, and execute reusable Skills, accumulating a persistent library of verified skills over time. This plug-and-play mechanism substantially improves both efficiency and success rates, reducing token usage by up to 79%.
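The discover–compose–cache–execute loop described above can be sketched as a minimal skill library. All names here (`Skill`, `SkillLibrary`, the `verified` flag) are illustrative assumptions, not the benchmark's actual API:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Skill:
    """A named, verified composition of atomic tool calls."""
    name: str
    steps: List[str]            # ordered atomic tool names the skill composes
    run: Callable[..., object]  # the composed executable
    uses: int = 0               # how often the cached skill has been executed


class SkillLibrary:
    """Persistent store of verified skills, reused across tasks."""

    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}

    def cache(self, skill: Skill, verified: bool) -> bool:
        # Only verified compositions enter the library.
        if verified and skill.name not in self._skills:
            self._skills[skill.name] = skill
            return True
        return False

    def has(self, name: str) -> bool:
        return name in self._skills

    def execute(self, name: str, *args: object) -> object:
        # Reuse a cached composition instead of re-exploring from scratch.
        skill = self._skills[name]
        skill.uses += 1
        return skill.run(*args)

    def reuse_factor(self) -> float:
        # Mean executions per distinct cached skill; one plausible reading of
        # the "Reuse Factor" metric (the paper may define it differently).
        if not self._skills:
            return 0.0
        return sum(s.uses for s in self._skills.values()) / len(self._skills)
```

In this sketch, an agent first checks `has()` before exploring; on a cache hit it pays only the cost of `execute()`, which is where the token savings would come from.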
| Model | Skill Exec | Reuse Factor | Success (Base) | Success (Skill) | Δ Success | Token Δ | Cost Δ |
|---|---|---|---|---|---|---|---|
| **Open-Source Models** | | | | | | | |
| Kimi-K2-Thinking | 70% | 3.4× | 55/126 (44%) | 56/126 (44%) | +0.8% | -42% | -39% |
| DeepSeek-V3.2-EXP | 71% | 4.8× | 76/126 (60%) | 87/126 (69%) | +8.7% | -49% | -51% |
| DeepSeek-R1 | 62% | 3.4× | 89/126 (71%) | 101/126 (80%) | +9.5% | -30% | -24% |
| GLM-4.7 | 91% | 3.7× | 91/126 (72%) | 108/126 (86%) | +13.5% | -39% | -41% |
| Minimax-M2.1 | 100% | 3.2× | 117/126 (93%) | 119/126 (94%) | +1.6% | -11% | -8% |
| **Closed-Source Models** | | | | | | | |
| GPT-5.2 | 84% | 3.8× | 109/126 (87%) | 114/126 (90%) | +4.0% | -79% | -75% |
| Gemini 3 Pro | 93% | 3.9× | 108/126 (86%) | 116/126 (92%) | +6.3% | -54% | -49% |
| Claude 4.5 Sonnet | 81% | 3.4× | 119/126 (94%) | 121/126 (96%) | +1.6% | -71% | -74% |
- **Up to +13.5% success gain.** Skill Mode improves task success rate across all models. Mid-tier models show the largest gains: GLM-4.7 jumps from 72% to 86%, and DeepSeek-R1 from 71% to 80%, demonstrating that skill composition can bridge capability gaps.
- **Up to 79% fewer tokens.** Agents accomplish the same tasks with far fewer tokens by reusing cached skill compositions instead of re-exploring from scratch. GPT-5.2 achieves a 79% token reduction and a 75% cost reduction while improving accuracy.
- **3.2–4.8× reuse factor.** Stronger models achieve higher skill execution rates (81–100%) and reuse factors (3.2–4.8×). Success rate correlates strongly with tool composition ability, underscoring skill acquisition as a core LLM capability.
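As a sanity check on the table above, the Δ Success column is the absolute percentage-point gain over the 126 tasks, recomputable from the raw counts:

```python
# Recompute Δ Success from (base successes, skill successes, total tasks).
rows = {
    "GLM-4.7": (91, 108, 126),
    "DeepSeek-R1": (89, 101, 126),
    "GPT-5.2": (109, 114, 126),
}
for model, (base, skill, n) in rows.items():
    delta = 100 * (skill - base) / n
    print(f"{model}: +{delta:.1f}%")  # +13.5%, +9.5%, +4.0%
```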
@article{chen2026skillcraft,
title = {SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?},
author = {Shiqi Chen and Jingze Gai and Ruochen Zhou and Jinghan Zhang and
Tongyao Zhu and Junlong Li and Kangrui Wang and Zihan Wang and
Zhengyu Chen and Klara Kaleb and Ning Miao and Siyang Gao and
Cong Lu and Manling Li and Junxian He and Yee Whye Teh},
journal = {arXiv preprint},
year = {2026}
}