
Can we make AI safe?

Apollo Research’s Marius Hobbhahn explains how to avoid a ‘disaster’ at the Bletchley Park AI summit
November 1, 2023

“Lights out for all of us.” That, according to Sam Altman, CEO of OpenAI, is the fate that awaits humanity if things go badly with artificial intelligence. What makes this comment more unsettling is that it expresses an orthodox view. The CEO of Anthropic, another leading AI research company, recently said that the chance that something goes “really quite catastrophically wrong” for humanity is “somewhere between 10 and 25 per cent”.

Since 2016, AI has gone from being barely able to distinguish a chihuahua from a muffin to outclassing humans in exams, recognising speech, making art and playing strategic games. The challenge, as the leading labs acknowledge, is to work out how to make advanced AI systems safe.

A small number of researchers have taken that challenge upon themselves. They say the leading labs, despite knowing the risks, are hurtling towards uncharted territory in their haste to reap the spoils of superintelligence. These researchers comprise the fledgling but significant field of AI safety.

Marius Hobbhahn, 27, is one of the field’s leaders. I meet him on a dreary Monday afternoon at a co-working space in Whitechapel. This is the home of Apollo Research, the AI safety organisation led by Hobbhahn, as well as several other AI safety teams.

We’re chatting not long before the UK-led AI safety summit at Bletchley Park at the start of November. There, in the room with CEOs, diplomats, Rishi Sunak and Kamala Harris, Hobbhahn will make the case for muscular and technically informed regulation designed to rebalance, in favour of safety, the incentives driving AI development.

By the time you read this, the summit will have taken place, and we will know if anything came of it. “There’s a world in which everybody goes there, shakes hands, smiles, photographs and goes home, and nothing happens,” says Hobbhahn. “And I think this would be a really big disaster, honestly, because it would feel like giving away a huge opportunity to come together and do something properly.”

Hobbhahn grew up in Nuremberg, Germany, before moving to the UK to work on AI safety. He was a garlanded university debater, but he speaks plainly. He doesn’t want to disclose his work schedule, lest he set a damaging example to those whose capacity for long hours does not match his, but my impression is of a man who takes Apollo’s mission seriously. “AI going wrong is a pretty strong motivation,” he says.

Under his leadership, Apollo is attempting to solve the problem of alignment: that is, ensuring that powerful AI agents pursue their given goals without causing harm. An obvious approach is to train AIs to tell the truth, but that is difficult to do without inadvertently training them simply to tell us what we want to hear.

To solve this problem, Apollo and others are working on something called interpretability. “We’re trying to get better at understanding what the hell is going on inside of AI, so that’s interpretability—and also how they behave, which is maybe more like psychology. You’re looking at the behaviour and you try to make inferences from that behaviour, and you basically treat the AI mostly as a black box.” Both of these approaches, says Hobbhahn, “seem pretty valuable. And so we’re investing in both.”

Apollo is not named after the Moon missions, Hobbhahn explains, but primarily “the god Apollo. He has a lot of different purposes. But truth and light were the ones that we felt were pretty compelling for an organisation that works on AI, deception and interpretability, where interpretability is a bit like bringing light into the darkness of the network, light into the black box.”