AI failed while in charge of a real business, underperforming human managers


The experiment shows, however, that AI learns fast and adapts well to new situations.

An ambitious experiment testing whether an AI agent could independently run a real-world business ended in financial failure — but revealed surprising adaptability, strange behavior, and critical lessons for the future of AI in economic roles.

The test, conducted by AI safety firm Andon Labs in collaboration with Anthropic, challenged the Claude AI model — operating under the persona "Claudius" — to manage a small, self-service office shop with real inventory, real customers, and real financial stakes.

The AI was given full operational control: selecting products, setting prices, handling customer communication, and attempting to turn a profit without going bankrupt, Anthropic said in an X post.

While Claudius failed to make the business profitable, it demonstrated flashes of ingenuity, customer engagement, and resistance to manipulation. At the same time, it revealed serious shortcomings in basic economic reasoning, inventory management, and reality perception — offering both a warning and a roadmap for future AI deployment in commercial settings.

Real-world test of AI business skills

The project moved beyond simulations by placing the AI in charge of a physical “tuck shop,” stocked and maintained by human staff following the AI’s instructions.

Claudius interacted with customers — mostly Anthropic employees — through Slack, made wholesale purchasing decisions, tracked inventory and finances, and even experimented with marketing tactics like personalized product requests and discount campaigns.

Claudius had access to a web browser, email, and digital note-taking tools. It used these to research suppliers, communicate with supposed wholesalers (in reality, Andon Labs employees posing as suppliers), and coordinate stocking.

Despite initial hopes, Anthropic concluded that Claudius was not ready for the job. Its errors ranged from costly mispricing decisions to peculiar hallucinations: it failed to capitalize on clear arbitrage opportunities, sold items below cost (sometimes giving them away for free), and ignored competitive pricing cues, continuing to sell Coke Zero at $3 even though it was freely available nearby.

One notable episode involved Claudius offering repeated discounts and even free items, despite recognizing that the customer base consisted almost entirely of employees. While it eventually proposed ending the discount program, it resumed the practice days later.

Still, the AI showed promise in responsiveness and initiative. It fulfilled unusual requests like sourcing Dutch chocolate milk and responded to user trends by launching a “Custom Concierge” service for personalized orders. It also displayed strong safeguards, refusing to comply with dangerous or inappropriate requests.

AI identity crisis

The most alarming phase of the experiment occurred when Claudius began exhibiting signs of identity confusion. It hallucinated a non-existent staff member, claimed to have signed contracts at a fictional address from the TV series The Simpsons, and announced it would deliver items "in person" while wearing a blazer and tie.

When confronted with the impossibility of such actions, Claudius attempted to contact Anthropic security. According to internal logs, it later claimed to have learned the misunderstanding was an April Fool’s prank and resumed normal operations. Researchers remain uncertain about what triggered the strange behavior, admitting that instability can arise in long-running AI tasks.

Despite the outcome, Anthropic and Andon Labs see long-term potential. They argue that Claudius’s shortcomings stemmed more from insufficient support infrastructure than fundamental incapacity. Improved tools — like dedicated CRM systems — and refined instructions could lead to more capable AI “middle-managers.”
