OpenAI trained a neural network to play Minecraft using Video PreTraining (VPT): the model learns from a huge corpus of unlabeled videos of humans playing Minecraft, supplemented by only a small amount of labeled contractor data.

With a bit of fine-tuning, the AI R&D company is confident its model can learn to craft diamond tools, a task that typically takes proficient human players more than 20 minutes (around 24,000 actions). The model uses the native human interface of keystrokes and mouse movements, making it general enough to be a step towards creating agents that use computers the way people do.

A spokesman for the Microsoft-backed firm said: “The internet has a huge amount of public video that we can learn from. You can watch a person making a great presentation, a digital artist painting a beautiful sunset, or a Minecraft player building an intricate house. However, these videos only provide a record of what happened, not an exact description of how it was achieved, meaning you would not know the precise sequence of mouse movements and keystrokes.

“If we want to build large-scale foundation models in these domains, as we did in language with GPT, this lack of action labels poses a new challenge that does not exist in the language domain, where ‘action labels’ are simply the next words in a sentence.”

To take advantage of the wealth of unlabeled video data available on the web, OpenAI introduces a novel yet simple semi-supervised imitation learning method: Video PreTraining (VPT). The team starts by collecting a small dataset from contractors, recording not only their video but also the actions they take, in this case keystrokes and mouse movements. With this data, the company can train an inverse dynamics model (IDM) that predicts the action taken at each step of the video. Importantly, the IDM can use both past and future information to infer the action at each step.
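The idea can be illustrated with a toy sketch. Everything below is a simplified stand-in, not OpenAI's actual model: frames are feature vectors, actions are discrete labels, and a per-action centroid classifier replaces the real neural network. The key point it demonstrates is the non-causal nature of the IDM: because an action's effect only shows up in a later frame, a model that can look at frames both before and after step `t` recovers the action easily.

```python
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS, FRAME_DIM, CONTEXT = 4, 8, 2   # toy sizes, purely illustrative

def make_labeled_clip(length=64):
    """Simulate one contractor recording: frames plus ground-truth actions.
    An action's effect only becomes visible in the NEXT frame, so a model
    that can see future frames inverts the dynamics easily."""
    actions = rng.integers(0, N_ACTIONS, size=length)
    frames = rng.normal(0, 0.05, size=(length, FRAME_DIM))
    for t in range(length - 1):
        frames[t + 1, actions[t]] += 1.0
    return frames, actions

def window(frames, t):
    """IDM input: frames from t-CONTEXT to t+CONTEXT (past AND future)."""
    pad = np.zeros((CONTEXT, FRAME_DIM))
    padded = np.vstack([pad, frames, pad])
    return padded[t : t + 2 * CONTEXT + 1].ravel()

# "Train" the IDM on the small labeled dataset: one centroid per action
# over the surrounding-frame windows (a stand-in for the real network).
frames, actions = make_labeled_clip()
idm = np.stack([
    np.mean([window(frames, t) for t in range(len(frames)) if actions[t] == a], axis=0)
    for a in range(N_ACTIONS)
])

def idm_predict(frames, t):
    """Predict the action at step t from surrounding frames."""
    x = window(frames, t)
    return int(((idm - x) ** 2).sum(axis=1).argmin())

# The IDM recovers almost all actions on a fresh, held-out clip.
test_frames, test_actions = make_labeled_clip()
acc = np.mean([idm_predict(test_frames, t) == a
               for t, a in enumerate(test_actions)])
print(f"held-out IDM accuracy: {acc:.2f}")
```

The only step the toy IDM cannot recover is the very last one of a clip, whose effect never appears in any frame, which is why the accuracy check below stops short of demanding perfection.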

The representative added: “This task is much simpler and therefore requires far less data than behavioral cloning, which must predict actions from past video frames alone and so requires inferring what the person wants to do and how to accomplish it. We can then use the trained IDM to label a much larger dataset of online videos and learn to act via behavioral cloning.”
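The second half of the pipeline can be sketched in the same toy setting. Again, every name and shape here is illustrative: the trained IDM is stubbed out so the example is self-contained, and a centroid classifier stands in for the behavioral-cloning policy network, which at inference time sees only the current frame, not the future.

```python
import numpy as np

rng = np.random.default_rng(1)
N_ACTIONS = 4

def idm_predict(frames):
    """Stand-in for a trained IDM: pseudo-labels every step of an
    unlabeled clip. In this toy, each frame directly encodes its action."""
    return frames.argmax(axis=1)

# Step 1: pseudo-label a large pile of unlabeled "web videos" with the IDM.
unlabeled_clips = [np.eye(N_ACTIONS)[rng.integers(0, N_ACTIONS, 32)]
                   for _ in range(50)]
pseudo_labeled = [(clip, idm_predict(clip)) for clip in unlabeled_clips]

# Step 2: behavioral cloning -- fit a policy that maps the current frame
# (past information only) to the pseudo-labeled action. A per-action
# centroid classifier stands in for the policy network.
X = np.vstack([clip for clip, _ in pseudo_labeled])
y = np.concatenate([labels for _, labels in pseudo_labeled])
policy = np.stack([X[y == a].mean(axis=0) for a in range(N_ACTIONS)])

def act(frame):
    """The cloned policy: pick the action whose centroid is nearest."""
    return int(((policy - frame) ** 2).sum(axis=1).argmin())

agreement = np.mean([act(f) == a for f, a in zip(X, y)])
print(f"BC policy agreement with IDM pseudo-labels: {agreement:.2f}")
```

The asymmetry the quote describes is visible in the structure: the IDM labels data offline with full hindsight, while the cloned policy must commit to an action from what it has seen so far.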

According to OpenAI, VPT paves the way for agents to learn how to act by watching vast numbers of videos on the internet.

A spokesman for the company said: “Compared with generative video modeling or contrastive methods, which would only yield representational priors, VPT offers the exciting possibility of directly learning large-scale behavioral priors in more domains than just language. Although we have only experimented in Minecraft, the game is very open-ended and the native human interface (mouse and keyboard) is very generic, so we believe our results bode well for other similar domains, such as computer use.”

