Can classic video games, such as Super Mario Bros, be effective benchmarks for AI advancement? Here's what the experts think.
Researchers at the University of California San Diego's Hao AI Lab recently pitted various AI models against Super Mario Bros, revealing intriguing results.
The experiment utilised an emulated version of Super Mario Bros integrated with GamingAgent, a framework developed in-house by Hao AI Lab. The framework fed each AI model basic instructions and in-game screenshots, and the model had to respond with Python code generating the inputs that control Mario.
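The loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Hao AI Lab's actual GamingAgent code: the function names (`query_model`, `decide_action`), the prompt, and the JSON action format are all hypothetical stand-ins.

```python
import json

# Hypothetical prompt asking the model to reply with a structured action.
PROMPT = (
    "You control Mario. Given the current frame, reply with JSON: "
    '{"keys": [...], "hold_ms": <int>} using keys A, B, left, right.'
)

def query_model(prompt: str, screenshot: bytes) -> str:
    """Stand-in for a vision-language-model API call.

    A real agent would send the prompt and screenshot to a model here;
    this stub returns a canned reply so the sketch is self-contained.
    """
    return json.dumps({"keys": ["right", "A"], "hold_ms": 120})

def decide_action(screenshot: bytes) -> tuple[list[str], int]:
    """Ask the model for the next inputs and parse its JSON reply."""
    reply = json.loads(query_model(PROMPT, screenshot))
    return reply["keys"], reply["hold_ms"]

# One iteration of the screenshot -> model -> keypress loop; a real agent
# would then forward these keys to the emulator and capture the next frame.
keys, hold_ms = decide_action(b"<frame bytes>")
print(keys, hold_ms)
```

Note that every round trip through `query_model` is where the latency penalty discussed below accrues: a model that takes seconds to produce its JSON reply leaves Mario running blind in the meantime.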
The researchers found that this unique gaming environment forced each AI model to devise complex manoeuvres and gameplay strategies, testing their adaptability and problem-solving skills.
Interestingly, "reasoning" models like OpenAI's o1, which excel at step-by-step problem-solving, underperformed compared with "non-reasoning" models. A key factor turned out to be decision-making speed.
In Super Mario Bros, split-second timing can mean the difference between successfully clearing a jump and plummeting to defeat. The researchers noted that "reasoning" models often required seconds to decide on actions, a significant disadvantage in a fast-paced game.
The use of games as a benchmark to test AI capabilities is not entirely new. The practice dates back at least to the 1950s and chess-playing programs such as Claude Shannon's. And just last month, an ongoing experiment was launched to see how Anthropic's latest AI model, Claude 3.7 Sonnet, would fare in the game Pokémon Red, with the event livestreaming on Twitch.
The Pokémon benchmark involved AI models navigating the game world, engaging in turn-based battles, and making strategic decisions based on the game's complex mechanics. This test evaluated an AI model's ability to understand and apply game rules, manage resources, and develop long-term strategies.
However, the Super Mario Bros benchmark introduces a new kind of challenge: unlike Pokémon's turn-based battles, it runs in real time, punishing any model that pauses to deliberate before acting.
While using games as benchmarks for AI development is undoubtedly one of the more fun and relatable ways to test progress, there is an ongoing debate about the relevance of these tests in determining overall technological advancement, with critics arguing that success in certain games doesn't necessarily demonstrate intelligence. As Richard Socher, former Salesforce chief AI scientist, told VentureBeat in 2022: "Once AI could solve chess, it didn't really become smarter than people — it just got good at chess."
Others argue that using video games as benchmarks has less to do with research and more to do with generating PR: "If the public wasn't interested in these flashy 'milestones' that are so easy to misrepresent as steps toward superhuman general AI, researchers would be doing something else," François Chollet, AI researcher and software engineer at Google, told The Verge in 2019.
Chollet claims that the idea of success in games as a good measure of an AI model's intelligence stems from anthropomorphisation. In humans, we might assume that a good chess player has a high level of general intelligence because we understand that they developed that skill over time using their intelligence.
"They weren't designed to play chess. So we know they could direct this general intelligence to many other tasks and learn to do these tasks similarly efficiently," says Chollet. But this assumption doesn't hold for AI models, because they can be designed specifically for that one skill.
With games like Super Mario Bros introducing new challenges to AI models (that weren't trained specifically for the game), it remains to be seen whether these benchmarks will be able to reflect broader advancements in AI or remain a captivating, yet limited, measure of progress.