Automatic, Efficient, and General Agent Evaluation
Evaluating LLM-based agents is becoming increasingly important as these systems grow more capable and complex. However, the current evaluation landscape is highly fragmented, costly, and often focused on domain-specific tasks. In this talk, I present a line of work aimed at making agent evaluation more automatic, efficient, and general.
In our survey on the evaluation of LLM-based agents, we highlight key gaps in the field [1]. Building on this analysis, the Agentic CLEAR framework and accompanying package introduce automated, fine-grained evaluation of agent traces across multiple levels [4]. To address the high cost of benchmarking agents, we propose an approach to efficient agent evaluation based on difficulty-based splits, which substantially reduces evaluation cost while preserving reliability [5]. Finally, in a position paper we argue that agentic systems should be general [3], and we introduce a framework for benchmarking such systems, General Agent Evaluation [2].
References
[1] Survey on Evaluation of LLM-based Agents (https://arxiv.org/abs/2503.16416)
[2] General Agent Evaluation (https://arxiv.org/abs/2602.22953)
[3] Position: Agentic Systems Should be General (https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6176178)
[4] Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents (under review)
[5] Efficient Agent Evaluation using Wisdom of the Crowds (to be submitted to COLM)