MCU: A Task-centric Framework for Open-ended Agent Evaluation in Minecraft

Team CraftJarvis
1 Institute for Artificial Intelligence, Peking University

2 School of Intelligence Science and Technology, Peking University

3 Institute for AI Industry Research, Tsinghua University

4 Department of Electronic Engineering, Tsinghua University

*Indicates Corresponding Author

Abstract

To pursue the goal of creating an open-ended agent in Minecraft, an open-ended game environment with unlimited possibilities, this paper introduces MCU, a task-centric framework for Minecraft agent evaluation. MCU uses atom tasks as fundamental building blocks, which enables the generation of diverse or even arbitrary tasks. Within MCU, each task is measured with six distinct difficulty scores (time consumption, operational effort, planning complexity, intricacy, creativity, novelty). These scores assess a task from multiple angles and can therefore reveal an agent's capability on specific facets. The difficulty scores also serve as features of each task, which creates a meaningful task space and unveils the relationships between tasks. For efficient evaluation of Minecraft agents with MCU, we maintain a unified benchmark, SkillForge, which comprises representative tasks with diverse categories and difficulty distributions. We also provide convenient filters for users to select tasks that assess specific capabilities of agents. We show that MCU is expressive enough to cover all tasks used in recent literature on Minecraft agents, and that it underscores the need for advances in areas such as creativity, precise control, and out-of-distribution generalization on the way to open-ended Minecraft agents.

Compose arbitrary tasks with atom tasks

The taxonomy of atom tasks, molecule tasks, and benchmarks in the MCU framework. Atom tasks are the basic components of MCU: each tests a single minimal ability of a Minecraft agent, with the task's prerequisite dependencies already resolved. By initializing the agent with custom conditions (e.g., craft iron ingot from scratch), adding constraints to the task (e.g., combat a spider using arrows), or combining different tasks (e.g., plant a pumpkin and then carve it), atom tasks can be used to generate an unlimited number of molecule tasks. By collecting a list of tasks, custom benchmarks can be created to test particular facets of an agent's capabilities. Such benchmarks serve as targeted evaluation tools that give insight into the agent's performance in particular domains or scenarios.
With this task generation approach, MCU can generate an unlimited number of benchmarks from arbitrary task lists. To give researchers a unified benchmark for testing their agents, and to make comparisons between baselines fair under the same settings, we maintain SkillForge, a benchmark created with MCU. It comprises representative tasks from MCU, selected to cover a wide spectrum of categories and a diverse difficulty distribution.
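
The three ways of building molecule tasks described above (custom initial conditions, added constraints, and combination) can be illustrated with a minimal Python sketch. This is not the MCU implementation: the Task class and the helper names atom, with_init, with_constraint, then, and describe are all hypothetical and exist only to make the composition idea concrete.

```python
from dataclasses import dataclass, replace
from typing import Dict, Tuple


@dataclass(frozen=True)
class Task:
    """Hypothetical container: ordered atom-task names plus optional modifiers."""
    steps: Tuple[str, ...]
    init_inventory: Tuple[Tuple[str, int], ...] = ()  # custom initial condition
    constraints: Tuple[str, ...] = ()                 # extra requirements


def atom(name: str) -> Task:
    """Wrap a single atom task."""
    return Task(steps=(name,))


def with_init(task: Task, inventory: Dict[str, int]) -> Task:
    """Return a copy of the task that starts from a custom inventory."""
    return replace(task, init_inventory=tuple(sorted(inventory.items())))


def with_constraint(task: Task, constraint: str) -> Task:
    """Return a copy of the task with one more constraint attached."""
    return replace(task, constraints=task.constraints + (constraint,))


def then(first: Task, second: Task) -> Task:
    """Combine two tasks sequentially; the first task's initial condition is kept."""
    return Task(
        steps=first.steps + second.steps,
        init_inventory=first.init_inventory,
        constraints=first.constraints + second.constraints,
    )


def describe(task: Task) -> str:
    """Render a task as a short human-readable string."""
    text = ", then ".join(task.steps)
    if task.constraints:
        text += " [constraints: " + "; ".join(task.constraints) + "]"
    return text
```

Under these assumptions, then(atom("plant pumpkin"), atom("carve pumpkin")) yields a two-step molecule task, and a custom benchmark is simply a Python list of such Task objects.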

Task Difficulty

To build an open-ended agent for Minecraft, it is imperative that we analyze the current gap between the existing agents and the ultimate open-ended agent, and then improve the agents accordingly. To be specific, we have to assess the areas where they exhibit shortcomings, e.g., whether they fall short in environment perception, subgoal planning, or precise control. This leads to a disentangled difficulty analysis of tasks.
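
As a rough illustration of how the six difficulty scores named in the abstract could act as task features, the sketch below stores them in a record, measures how far apart two tasks are in the resulting task space, and filters tasks by one dimension. The Difficulty class, the function names, the use of Euclidean distance, and any score scale are assumptions made for this example, not the scoring method used by MCU.

```python
import math
from dataclasses import astuple, dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Difficulty:
    """Hypothetical six-dimensional difficulty vector for one task."""
    time_consumption: float
    operational_effort: float
    planning_complexity: float
    intricacy: float
    creativity: float
    novelty: float


def distance(a: Difficulty, b: Difficulty) -> float:
    """Euclidean distance between two tasks (an assumed choice of metric)."""
    return math.dist(astuple(a), astuple(b))


def filter_tasks(tasks: Dict[str, Difficulty], dimension: str, minimum: float) -> List[str]:
    """Return the names of tasks whose score on `dimension` is at least `minimum`."""
    return sorted(
        name for name, score in tasks.items()
        if getattr(score, dimension) >= minimum
    )
```

A call such as filter_tasks(tasks, "planning_complexity", 0.5) would then select the tasks that stress subgoal planning, which is the kind of targeted selection the benchmark filters are meant to support.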

Minecraft Textworld

Minecraft Textworld is a lightweight environment for testing an agent's ability to find a feasible task chain toward a target task by resolving its pre-tasks. Given a target task, the agent chooses one of the candidate actions in each round until the target task is achieved. Each candidate action is an atom task that can be solved under the current conditions (i.e., the action's dependencies are already satisfied).
Minecraft Textworld is designed to simplify the Minecraft gaming environment by removing the need for low-level control. While it excels in evaluation efficiency, a pure planning environment cannot supplant the original Minecraft environment, given the vital role that control plays in agent development. Instead, we envision Minecraft Textworld as a complementary tool that gives researchers a user-friendly platform to assess and refine their agents through swift, iterative testing.
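
The round-based interaction described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Minecraft Textworld code: the TextWorld class, its methods, and the rollout helper are hypothetical names, and any dependency graph passed in is supplied by the caller rather than taken from the game.

```python
from typing import Callable, Dict, List, Set


class TextWorld:
    """Hypothetical text environment over a dependency graph of atom tasks."""

    def __init__(self, dependencies: Dict[str, Set[str]], target: str):
        self.dependencies = dependencies  # task -> set of prerequisite tasks
        self.target = target
        self.completed: Set[str] = set()

    def candidates(self) -> List[str]:
        """Atom tasks not yet done whose prerequisites are all completed."""
        return sorted(
            task for task, prereqs in self.dependencies.items()
            if task not in self.completed and prereqs <= self.completed
        )

    def step(self, action: str) -> bool:
        """Apply one candidate action; return True once the target is achieved."""
        if action not in self.candidates():
            raise ValueError(f"{action!r} is not a valid candidate action")
        self.completed.add(action)
        return self.target in self.completed


def rollout(env: TextWorld, policy: Callable[[List[str]], str],
            max_rounds: int = 20) -> List[str]:
    """Let `policy` pick one candidate per round; return the chosen task chain."""
    chain: List[str] = []
    for _ in range(max_rounds):
        options = env.candidates()
        if not options:
            break
        action = policy(options)
        chain.append(action)
        if env.step(action):
            break
    return chain
```

Because no low-level control is simulated, a rollout like this finishes almost instantly, which is what makes fast, repeated testing of a planner feasible. In a real evaluation the policy argument would be the agent's planner rather than a fixed rule.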

The CraftJarvis Series

BibTeX


@article{lin2023mcu,
  title={MCU: A Task-centric Framework for Open-ended Agent Evaluation in Minecraft},
  author={Lin, Haowei and Wang, Zihao and Ma, Jianzhu and Liang, Yitao},
  journal={arXiv preprint arXiv:2310.08367},
  year={2023}
}