Book review: Uncontrollable: The Threat of Artificial Superintelligence and the Race to Save the World, by Darren McKee.
This is by far the best introduction to AI risk for people who know little about AI. It’s appropriate for a broader class of readers than most laymen-oriented books.
It was published 14 months ago. In this rapidly changing field, most AI books say something that gets discredited by the time they’re that old. I found no clear example of such obsolescence in Uncontrollable (but read on for a set of controversial examples).
Nearly everything in the book was familiar to me, yet the book prompted me to reflect better, thereby changing my mind modestly – mostly re-examining issues that I’ve been neglecting for the past few years, in light of new evidence.
The rest of this review will focus on complaints, mostly about McKee’s overconfidence. The features that I complain about reduce the value of book by maybe 10% compared to the value of an ideal book. But that ideal book doesn’t exist, and I’m not wise enough to write it.
Urgency
McKee predicts artificial superintelligence (ASI) soon.
He uses weaker than normal definitions of superintelligence (“more intelligent than most of us”, and later “can complete any intellectual task at an expert human level or above”), but is pretty clear that he means something that might be powerful enough to destroy the world.
At the start of the book, he predicts ASI in 10 years. Later on, after giving the “expert level” definition, he hedges and says maybe it’s 10 years, maybe it’s 30. I guess it seems somewhat appropriate that his timelines sound different at different times, since that mirrors expert uncertainty fairly well. Beware that since the book was published, leading AI companies have been sounding increasingly confident that they’ll have ASI within 5 years. Is that because they know more than other experts, or because they’ve been selected for optimism?
Misalignment: Asimov’s Three Laws
McKee illustrates how ASI will likely be misaligned with human values, mainly by analyzing what happens if AIs follow Asimov’s proposed laws of robotics.
Five years ago, I was pretty concerned about scenarios where AIs would need to have something like the 3 Laws programmed into them before they were smart enough to have a human-level understanding of those laws. Now we have clear examples of AIs that can understand those laws better than 90% of humans can, while their goals are still malleable enough that we can (partly?) instill new goals.
McKee says that Asimov’s Laws “are not even remotely viable”. I don’t quite agree. They’re problematic, but close enough to being viable that people will be tempted to use them or something similar.
I’m somewhat confident that McKee is mistaken in this kind of concern:
could an AI system perform surgery that initially injures humans due to cutting into the body, even if the overall goal is to help them by removing cancerous tissue?
Current AIs seem to have safely achieved the level of common sense needed to answer that question wisely.
I still have a few lingering concerns about how thoroughly it’s possible to change an AIs goals from “predict the next token” to “obey Asimov’s 3 Laws”. Simply telling current AIs to obey the 3 Laws leads to complex interactions with other conflicting pressures on their goals. Will AI companies care enough to fully handle those conflicts? McKee doesn’t tackle this question, possibly because there’s not much to say without getting more technical than he wants.
The First Law
It’s time to turn more of our attention to the genuinely scary parts of Asimov’s First Law (“A robot must not harm a human, or allow a human to be harmed through inaction.”).
Specifically, we should be aware of how that law would change the world’s priorities. McKee does an almost adequate job of explaining this. I’ll provide a slightly different explanation to be a bit more emphatic.
The First Law would redirect most of our resources from uses such as beer, skiing, and building fancy churches, to goals such as curing cancer and depression. Probably there would be exceptions where beer and skiing are needed to reduce depression.
What about fancy churches? AIs will probably do a better job than I can of determining whether church buildings are valuable for preventing depression and burning in hell.
I expect that in the distant future, we will look back and say that yes, the AIs were right to decide that we were negligently devoting too little effort toward finding ways to fully cure cancer.
I expect that whatever choice AIs make about reducing the risk of burning in hell, it will be the result of more careful thought than I’ve devoted to the subject, and still an important fraction of people will be pretty upset about the choice.
I consulted with Claude.ai about this kind of scenario, in a half-hearted attempt to empirically test how AIs understand the First Law. When I asked about the First Law abstractly, without suggesting policies, it gave abstract predictions that sounded like it expected future AIs to react as an average human would if told to obey the First Law.
I then pressed it to consider projects such as curing cancer, and talked it into this conclusion:
This would indeed look like mandatory global Effective Altruism with extremely high giving requirements – likely well over 90% of developed nations’ GDP going to harm reduction projects.
Is Claude just telling me what it expects I want to hear?
The Second Law
What about Asimov’s Second Law (obey orders)? Asimov says somewhat clearly that AIs need to postpone obedience until they’ve prevented all avoidable harm to humans. I expect AIs to take a long time before they would be allowed to take a break from tasks such as preventing cancer, depression, alien invasion, the heat death of the universe, etc. [I’m serious about those first two, and puzzled about the other two.]
The Second Law means that AIs will likely become corrigible someday. They would likely enable most of us to live long enough to experience that. But we may not have much fun in the years (millennia?) between now and then.
Is an AI that follows Asimov’s Laws misaligned? Or is it thinking more clearly than any of us about how to accomplish our goals? The answer is most likely some combination of the two. The First Law underweights things like happiness and life satisfaction, but I’m somewhat tempted to accept that in return for benefits such as eliminating the risk of painful cancer.
We clearly can imagine better than that. I say it’s important to make obeying orders a higher priority than avoiding harm.
The number of different forecasts people have for the results of Asimov’s Laws ought to raise concerns.
McKee doesn’t quite explain how misalignment implies extinction. If I were relying solely on McKee’s analysis, I’d see a large risk of a future where people are safe, but lead mediocre WALL-E-like existence. I’d be confused as to why McKee and many experts were talking about extinction. This is a fairly tricky topic, and it seems mostly appropriate for McKee to punt on it. Few upcoming decisions depend on the difference between AIs treating us like pets, versus treating us as humans treated neanderthals.
Temptation
We’re on track to give AI control over much of the world.
The most common reason for this will be that using AI will be addictive, like a more extreme version of social media.
Consumers haven’t tried to get full control over basic tools such as phones – we give up control to companies who make them. AIs will be harder to fully understand and control.
The benefits of AI are a strong reason for some people to accept the risks of AI. E.g. even a 1% chance of AIs adding centuries to our lives can be worth a lot to someone who expect to die in a few years.
A scarier reason for giving up some control is that a military that keeps humans in the loop will react too slowly to compete with a more automated military.
McKee’s analysis here isn’t quite conclusive, but it’s more than sufficient to create a presumption of danger.
Solutions
McKee proposes a moonshot program to “develop safe, advanced AI in this decade”.
The policies he suggests sound more like they come from the Department of Homeland Security or current NASA than from the NASA of the 1960s: ways to make people think 10% more carefully before mostly proceeding on their current risky trajectory.
The least impressive of those policies is “Required labeling of AI content”. That would help us detect a bunch of amateurly created deepfakes, but a misaligned ASI or a professional disinformation agency will likely flout such a law.
One of the safer policies is to impose stronger liability rules on AI companies. Note that if it slows capability advances at US companies significantly, it risks having the most powerful AIs be developed in countries with weaker legal systems. Reminder: slowing the development of unsafe AI isn’t quite the same goal as creating safe AI.
The policy that seems most connected to the vision of creating a safe ASI proposes significant public funding for research in alignment-related areas.
A key difference between McKee’s proposal and Project Apollo: Apollo used rocket scientists whose expertise had been proven. In contrast, we’ve got lots of people who think they’re experts at fixing Asimov’s Laws, but nothing remotely resembling agreement as to which fixes are most promising. McKee doesn’t present a plan for deciding who to trust on this. My intuition tells me that researchers have found a few strategies that will keep us safe, but their arguments are weak enough that most other researchers reject the strategies. Finding more hard-to-evaluate strategies is better than nothing, but it doesn’t sound like how Project Apollo was run.
I wish I could condemn the section on solutions as a serious flaw with the book. Unfortunately, it closer than I’d like to being an accurate portrayal of the leading safety plans.
One final note: McKee lists 80000hours.org as a source for AI safety career information. They looked respectable at the time the book was published, but their reliability is now considered controversial.
Uncontrollable prompted me to rethink the problems with alignment in light of recent evidence about how AIs will work. My estimated probability of human extinction this century has dropped from 15% to 12%, but my probability of a safe but disappointing future has increased from to 10%.
I probably sounded confident about some of my claims here. Please remember that we don’t know enough for any of us to have much confidence on these topics. One of the few things I’m confident about is that we live in interesting times.
H/T to William Kiely for drawing my attention to the book.