The Alignment Problem

Book review: The Alignment Problem: Machine Learning and Human Values, by Brian Christian.

I was initially skeptical of Christian’s focus on problems with AI as it exists today. Most writers with this focus miss the scale of catastrophe that could result from AIs that are smart enough to subjugate us.

Christian mostly writes about problems that are visible in existing AIs. Yet he organizes his discussion of near-term risks in ways that don’t pander to near-sighted concerns, and which nudge readers in the direction of wondering whether today’s mistakes represent the tip of an iceberg.

Most of the book carefully avoids alarmist or emotional tones. It’s hard to tell whether he has an opinion on how serious a threat unaligned AI will be – presumably it’s serious enough to write a book about?

Could the threat be more serious than that implies? Christian notes, without indicating his own opinion, that some people think so:

A growing chorus within the AI community … believes, if we are not sufficiently careful, that this is literally how the world will end. And – for today at least – the humans have lost the game.

Is Alignment Hard?

The book mostly focuses on why it’s hard to align software with human values.

Christian portrays this as mostly an old problem, whose importance is increasing. He gives an example of a thermostat that was poorly aligned, due to measuring temperature at what ended up being the wrong location.

A broader problem is that we have trouble deciding what we mean by human values. When people try to translate our notions of fairness into the unambiguous criteria needed by software, they end up noticing that we have conflicting intuitions about what’s fair (see Inherent Trade-Offs in the Fair Determination of Risk Scores).
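To make the conflict concrete, here is a small arithmetic sketch. It assumes a standard confusion-matrix identity rather than the exact formulation in the cited paper, and the base rates and accuracy figures are hypothetical. If two groups have different base rates, a risk score with the same precision and the same false negative rate in both groups is forced to have different false positive rates.

```python
def false_positive_rate(base_rate, ppv, fnr):
    """False positive rate implied by a group's base rate, precision (PPV),
    and false negative rate.

    Derived from the confusion-matrix definitions:
        FPR = p / (1 - p) * (1 - PPV) / PPV * (1 - FNR)
    """
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)


# Hypothetical groups: same precision and same false negative rate,
# but different base rates of the outcome being predicted.
fpr_a = false_positive_rate(base_rate=0.5, ppv=0.8, fnr=0.2)
fpr_b = false_positive_rate(base_rate=0.2, ppv=0.8, fnr=0.2)

print(f"group A false positive rate: {fpr_a:.2f}")
print(f"group B false positive rate: {fpr_b:.2f}")
```

With these made-up numbers, group A ends up with four times group B's false positive rate, even though the score looks equally "fair" by the other two measures.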

Modern AI systems learn mostly by absorbing large amounts of real-world data. Any system that learns from a broad variety of text will discover that doctors are stereotypically male, and nurses are stereotypically female. It’s hard to use such systems without contributing to a perpetuation of those stereotypes (possibly for reasons that are similar to why humans perpetuate them).

Christian describes how to remove gender bias that a neural net associates with a word. He provides enough technical detail that I could mostly figure out how to implement the proposed fixes by myself, at least for a pretty simple system.
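For a pretty simple system, a minimal sketch might look like the following. It assumes the fix is a projection that removes a word's component along a "gender direction" estimated from a pair such as he/she. The 3-dimensional vectors and the function names are made up for illustration; real embeddings have hundreds of dimensions and usually estimate the direction from many word pairs.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def gender_direction(he, she):
    """Unit vector pointing from the 'she' embedding toward the 'he' embedding."""
    return normalize([a - b for a, b in zip(he, she)])


def neutralize(word_vec, direction):
    """Subtract the component of word_vec that lies along direction."""
    scale = dot(word_vec, direction)
    return [a - scale * d for a, d in zip(word_vec, direction)]


# Toy embeddings (hypothetical values, not taken from any real model).
he = [1.0, 0.2, 0.0]
she = [-1.0, 0.2, 0.0]
doctor = [0.4, 0.5, 0.7]

g = gender_direction(he, she)
doctor_neutral = neutralize(doctor, g)

print("before:", dot(doctor, g))          # nonzero gender component
print("after: ", dot(doctor_neutral, g))  # component removed
```

The sketch says nothing about which words to pass to neutralize, and that choice is where the messiness comes in.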

But the results are messy. The example system decided that “grandmothered in” was just as appropriate as “grandfathered in”. That doesn’t sound like Wikipedia’s hoped-for neutral point of view. It’s more like we’re stuck trying to choose which viewpoint to favor.

And how should we decide which words ought to be stripped of gender stereotypes?

What about a word like “rabbi”? Whether the word had an intrinsic gender dimension depended on whether the Jewish denomination in question was, say, Orthodox or Reform.

That’s just for a system that simply learns meanings of words. A system with different goals will likely have the relevant knowledge spread out in ways that are harder to find.

Weak Parts of the Book

Christian says it’s important to develop a good curriculum in order to teach an AI. This seems not quite right.

Christian explains this better than the semi-famous computer scientist Leslie Valiant did in his own book. Christian gives good enough examples of curriculum benefits that I was able to figure out where I agree and disagree.

Part of what they talk about is the practice of starting with simple training data and progressing up to a fully complex set of training data. I suspect this is mostly a crutch to make up for using weak systems, and that as developers throw more compute at problems, the benefits to such a curriculum will diminish to insignificant levels.
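As a rough illustration of that practice, the sketch below orders training examples by a difficulty score and releases them in cumulative stages. The difficulty measure (word count) and the function name are assumptions for the example, not anything the book specifies.

```python
def curriculum_stages(examples, difficulty, n_stages):
    """Return cumulative training sets, easiest examples first.

    Stage k contains the easiest k/n_stages fraction of the data, so the
    final stage is the full training set.
    """
    ordered = sorted(examples, key=difficulty)
    stages = []
    for k in range(1, n_stages + 1):
        cutoff = len(ordered) * k // n_stages
        stages.append(ordered[:cutoff])
    return stages


# Hypothetical data: treat shorter sentences as easier.
sentences = ["a b c d", "a", "a b", "a b c"]
stages = curriculum_stages(sentences, difficulty=lambda s: len(s.split()), n_stages=2)

for i, stage in enumerate(stages, start=1):
    print(f"stage {i}: {stage}")
```

A system with enough compute could simply skip to the last stage, which is the full data set.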

But “curriculum” also refers to the goals that a system is rewarded for achieving (or, to rephrase it in schooling terms, the tests that students are expected to pass). This part of a curriculum does seem important. My only caveat here is that it shouldn’t be interpreted as needing a wise teacher. Christian hints at why the curriculum might be mostly implicit, via the example of AlphaGo, which starts by playing against dumb opponents, and progresses to beating harder and harder opponents, as a natural byproduct of playing against itself.

Christian suggests a regulation which would enable users to see and alter any model that tries to reflect their preferences. He gives an example of a recovering alcohol addict who wants not to see ads for alcohol.

I’d like to convince Facebook not to show me ads for products that contain eggs which aren’t labeled as pasture-raised. I’d buy more food that’s advertised on Facebook if I didn’t get tired of clicking through to the ingredients list and discovering it’s unacceptable. Somehow, that’s not enough incentive for Facebook to enable me to improve its model of my desires.

Having a more sophisticated AI would solve many problems of this nature, in a better way than would regulation. But some of the problem is likely due to Facebook using patterns that are complicated enough that I wouldn’t understand them with a reasonable amount of effort. They try to have an interface that would allow me to tell them what ads I’d prefer to see, but their goal of keeping it simple prevents it from being helpful.

The regulation that Christian suggests would almost certainly require that Facebook dumb down their system further. I’m unsure whether it would help the addict (a regulation targeted to that need would help, but an attempted general-purpose solution would likely end up with a cumbersome interface). It would be pretty much the opposite of what I want. In spite of all of Facebook’s problems, their ads are the best way for me to find new foods from small, health-oriented companies.

Fortunately, the book contains little advice of this nature.

Conclusion

This review is relatively short, because Christian is too cautious to make many mistakes, or to get me excited or angry. I respect the book much more than I expected to, but I find it hard to write much about his subtle, pervasive competence.

It’s very readable, while being sufficiently well-researched to mostly satisfy experts.

That makes it an excellent initial introduction to AI risks (or AI in general) for almost anyone.

Bayesian Investor Blog
