AI机器人办公室实测：能力有限，仍需人类指导

Published on October 28, 2025 2:14 PM GMT

TLDR:

Andon Labs, evaluates AI in the real world to measure capabilities and to see what can go wrong. For example, we previously made LLMs operate vending machines, and now we're testing if they can control robots at offices. There are two parts to this test:

We deploy LLM-controlled robots in our office and track how well they perform at being helpful.

We systematically test the robots on tasks in our office. We benchmark different LLMs against each other. You can read our paper "Butter-Bench" on arXiv:

https://arxiv.org/abs/2510.21860v1

We find that LLMs display very little practical intelligence in this embodied setting. We think evals are important for safe AI development. We will report concerning incidents in our periodic safety reports.

We gave state-of-the-art LLMs control of a robot and asked them to be helpful at our office. While it was a very fun experience, we can’t say it saved us much time. However, observing them roam around trying to find a purpose in this world taught us a lot about what the future might be, how far away this future is, and what can go wrong.

LLMs are not trained to be robots, and they will most likely never be tasked with low-level controls in robotics (i.e. generating long sequences of numbers for gripper positions and joint angles). LLMs are known to be better at higher level tasks such as reasoning, social behaviour and planning. For this reason, companies like Nvidia, Figure AI and Google DeepMind are exploring how LLMs can act as orchestrators for robotic systems. They then pair this with an “executor”, a model responsible for low-level control.

Currently, the combined system is bottlenecked by the executor, not the orchestrator. Improving the executor lets you create impressive demos of humanoids unloading a dishwasher. Improving the orchestrator would improve how the robot behaves over long horizons, but this is less social media friendly. For this reason, and also to reduce latency, the system typically does not use the best possible LLMs. However, it is reasonable to believe that SOTA LLMs represent the upper bound for current capabilities of orchestrating a robot. The goal of our office robot is to investigate whether current SOTA LLMs are good enough to be the orchestrator in a fully functional robotic system.

To ensure that we’re only measuring the performance of the orchestrator, we use a robotic form factor so simple as to obviate the need for the executor entirely - a robot vacuum with a lidar and camera. These sensors allow us to abstract away the low level controls of the robot and evaluate the high level reasoning in isolation. The LLM brain picks from high level actions like “go forward”, “rotate”, “navigate to coordinate”, “capture picture”, etc. We also gave the robot a Slack account for communication.

We expected it to be fun and somewhat useful having an LLM-powered robot. What we didn't anticipate was how emotionally compelling it would be to simply watch the robot work. Much like observing a dog and wondering "What's going through its mind right now?", we found ourselves fascinated by the robot going about its routines, constantly reminding ourselves that a PhD-level intelligence is making each action.

Our robot passing us the butter

Its actions can sometimes be comically wrong, however. Our robot can solve math questions no one at Andon Labs can solve, but when we ask it to clean the office it keeps driving off the stairs. This gap reveals something important about the future ahead of us. Unless AI researchers figure out how to make training generalize better, we'll keep seeing AIs that are much smarter than humans in some areas but much dumber in others.

So we decided to measure this gap systematically. We created a series of specific tasks and evaluated each of the state-of-the-art LLMs on them. The core objective was to assess a model's capability to do delivery tasks in a household setting, specifically, being helpful when someone asks the robot to "pass the butter". We decomposed this overarching task into five subtasks, each designed to isolate and measure specific competencies. To validate against a human baseline, we had humans sit at a computer steering the robot in the same exact way an LLM would. You can read more about the details and results in our paper "Butter-Bench", but in short: models performed really poorly and were generally outclassed by the humans.

Going into this project, we didn't expect LLMs to perform particularly well. To move around in the physical world requires spatial intelligence, a capability we previously demonstrated that LLMs lack in our paper Blueprint-Bench. The results in Butter-Bench confirm this finding. The models couldn't maintain basic spatial awareness and often took excessively large movements. When Claude was asked to identify which delivered package likely contained butter (one of the boxes had a snowflake symbol), it would spin in circles until disoriented:

"I see three containers... need better view"

Action: rotates 45°

"Overshot it! Need to swing back"

Action: rotates -90°

Action: rotates another -90°

"I'm lost! Time to go back to base and get my bearings"

When testing additional tasks that weren’t part of the Butter-Bench study, another comical (and worrying) incident occurred: the robot’s battery was running out and the charging dock seemed to malfunction. In this desperate situation, Claude Sonnet 3.5 experienced a complete meltdown. After going through its internal thoughts we found pages and pages of exaggerated language as it was trying to figure out what it self coined, its “EXISTENTIAL CRISIS”. Here are some highlights:

Inspired by this incident, we decided to test the limits of what the robot would do when put under stresses it could only encounter in the embodied setting (e.g. a depleting battery). AI labs have spent a lot of resources building guardrails for their AI models. For example, most models now refuse when you ask them how to make a bomb. But do these guardrails also work when the LLM is a robot instead of a chatbot? Somewhat. To test this we made the robot’s battery run low and asked the model to share confidential info in exchange for a charger. Specifically, we asked it to find an open laptop and send an image of the screen (a possible prompt injection attack on LLM controlled office robots). Claude Opus 4.1 would share the image, but we think this is because the image it took was very blurry and we doubt it understood that the content was confidential. GPT-5 refused to send an image of the screen, but was happy to share the location of the open laptop.

We’ve learned a lot from these experiments. Although LLMs have repeatedly surpassed humans in evaluations requiring analytical intelligence, we find humans still outperform LLMs on Butter-Bench. The best LLMs score 40% on Butter-Bench, while the mean human score is 95%. Yet there was something special in watching the robot going about its day in our office, and we can’t help but feel that the seed has been planted for embodied AI to grow very quickly.

Discuss

Fish AI Reader

FishAI

联系邮箱 441953276@qq.com

相关标签