MCU: A Task-centric Framework for Open-ended Agent Evaluation in Minecraft

Team CraftJarvis
¹Institute for Artificial Intelligence, Peking University
²School of Intelligence Science and Technology, Peking University
³Institute for AI Industry Research, Tsinghua University
⁴Department of Electronic Engineering, Tsinghua University
*Indicates Corresponding Author

Abstract

To pursue the goal of creating an open-ended agent in Minecraft, an open-ended game environment with unlimited possibilities, this paper introduces MCU, a task-centric framework for evaluating Minecraft agents. MCU uses atom tasks as fundamental building blocks, which enables the generation of diverse or even arbitrary tasks. Within the framework, each task is measured with six distinct difficulty scores (time consumption, operational effort, planning complexity, intricacy, creativity, novelty). These scores assess a task along multiple dimensions and can therefore reveal an agent's capability on specific facets. The difficulty scores also serve as the features of each task, which create a meaningful task space and reveal the relationships between tasks. For efficient evaluation of Minecraft agents with MCU, we maintain a unified benchmark, SkillForge, which comprises representative tasks with diverse categories and difficulty distributions. We also provide convenient filters that let users select tasks for assessing specific agent capabilities. We show that MCU is expressive enough to cover all tasks used in the recent literature on Minecraft agents, and that it underscores the need for advances in areas such as creativity, precise control, and out-of-distribution generalization on the way to open-ended Minecraft agents.

Task Difficulty

To build an open-ended agent for Minecraft, it is imperative that we analyze the current gap between the existing agents and the ultimate open-ended agent, and then improve the agents accordingly. To be specific, we have to assess the areas where they exhibit shortcomings, e.g., whether they fall short in environment perception, subgoal planning, or precise control. This leads to a disentangled difficulty analysis of tasks.

The difficulty distributions of existing benchmarks and SkillForge. Darker shades indicate a higher concentration of tasks with the corresponding difficulty scores, while lighter shades signify the converse trend.

Left: Data structure for an Atomic Task: "Craft an iron_sword". In this figure, we provide a concise textual description of the task and, for simplicity, list the prerequisite tasks without displaying the full dependency graph.
Right: Visualization of the SkillForge Task Space (filtered) using t-SNE. Each task is represented by a difficulty feature vector. Notice the clustering effect in the space, where tasks of the same category tend to group together (except for the molecule tasks).
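To make the caption above more concrete, here is a minimal Python sketch of how an atomic task record, its six-dimensional difficulty feature vector, and a simple task filter might look. The six dimension names come from the abstract. Everything else (the `AtomicTask` class, its field names, the `filter_tasks` helper, the prerequisite names, and the numeric scores) is a hypothetical illustration, not the actual MCU or SkillForge data format.

```python
from dataclasses import dataclass, field

# Dimension names follow the abstract; their ordering here is an assumption.
DIFFICULTY_DIMS = (
    "time_consumption",
    "operational_effort",
    "planning_complexity",
    "intricacy",
    "creativity",
    "novelty",
)


@dataclass
class AtomicTask:
    """Hypothetical record for one atomic task (field names are illustrative)."""
    name: str
    description: str
    category: str
    prerequisites: list = field(default_factory=list)
    difficulty: dict = field(default_factory=dict)

    def feature_vector(self):
        # Fixed-order vector of the six scores; missing scores default to 0.0.
        return tuple(float(self.difficulty.get(d, 0.0)) for d in DIFFICULTY_DIMS)


def filter_tasks(tasks, category=None, max_scores=None):
    """Keep tasks in `category` whose scores do not exceed the given limits."""
    max_scores = max_scores or {}
    selected = []
    for task in tasks:
        if category is not None and task.category != category:
            continue
        if any(task.difficulty.get(dim, 0.0) > limit
               for dim, limit in max_scores.items()):
            continue
        selected.append(task)
    return selected


# Placeholder values for illustration only; they are not taken from the paper.
iron_sword = AtomicTask(
    name="craft_iron_sword",
    description="Craft an iron_sword",
    category="craft",
    prerequisites=["craft_stick", "smelt_iron_ingot"],
    difficulty={"time_consumption": 0.6, "planning_complexity": 0.7},
)
```

Feature vectors of this kind are what a projection such as t-SNE would take as input. The filter shows one plausible way to select tasks by category and difficulty ceiling.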

The CraftJarvis Series

GROOT: Learning to Follow Instructions by Watching Gameplay Videos
(Team CraftJarvis)

This work proposes to follow reference videos as instructions, which offer expressive goal specifications while eliminating the need for expensive text-gameplay annotations, and implements the agent GROOT in a simple yet effective encoder-decoder architecture based on causal transformers.

Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents
(Team CraftJarvis)
NeurIPS 2023 | ICML 2023 TEACH Workshop (Best Paper Award)

DEPS is an interactive planning approach based on Large Language Models (LLMs). It uses feedback gathered during long-horizon planning to correct errors in the plan, and it adds a sense of proximity through a goal Selector, a learnable module that ranks parallel sub-goals by their estimated steps to completion and refines the original plan accordingly.
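The ranking idea behind the goal Selector can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `select_goal_order` function and the `toy_estimates` table are hypothetical, and a fixed lookup table stands in for the learnable module that DEPS actually uses to estimate the remaining steps.

```python
def select_goal_order(sub_goals, estimate_steps):
    """Order parallel sub-goals so the one estimated to be closest comes first.

    `estimate_steps` is any callable mapping a sub-goal to an estimated number
    of steps to completion (a stand-in for a learned predictor).
    """
    return sorted(sub_goals, key=estimate_steps)


# Hypothetical estimates for illustration; these are not measured values.
toy_estimates = {"mine_log": 40, "mine_cobblestone": 300, "kill_sheep": 120}

plan = select_goal_order(list(toy_estimates), toy_estimates.get)
print(plan)
```

In this toy setting the sub-goal with the smallest estimate is attempted first. In a real agent the estimates would presumably depend on the current observation and be recomputed as the environment changes.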