Epistemic status: speculation with a mix of medium confidence and low confidence conclusions.
I argue that corrigibility is all we need in order to make an AI permanently aligned to a principal.
This post will not address how hard it may be to ensure that an AI is corrigible, or the conflicts associated with an AI being corrigible to just one principal versus multiple principals. There may be important risks from AI being corrigible to the wrong person / people, but those are mostly outside the scope of this post.
I am specifically using the word corrigible to mean Max Harms’ concept (CAST).
Max Harms raises the concern that corrigibility might not scale to superintelligence:
A big part of the story for CAST is that safety is provided by wise oversight. If the agent has a dangerous misconception, the principal should be able to notice this and offer correction. While this might work in a setting where the principal is at least as fast, informed, and clear-minded as the agent, might it break down when the agent scales up to be a superintelligence? A preschooler can’t really vet my plans, even if I genuinely want to let the preschooler be able to fix my mistakes.
I don’t see it breaking down. On the contrary, I have a story for how a corrigible AI scales fairly naturally to a superintelligence that is approximately value-aligned.
A corrigible AI will increasingly learn to understand what the principal wants. In the limit, that means the AI will increasingly do what the principal wants, without the principal needing to issue instructions. Probably that eventually becomes indistinguishable from a value-aligned AI.
It might still differ from a value-aligned AI if a principal’s instructions differ from what the principal values. If the principal can’t learn to minimize that gap with the assistance of an advanced AI, then I don’t know what to recommend. A sufficiently capable AI ought to be able to enlighten the principal as to what their values are. Any alignment approach requires some translation of values into behavior. Under corrigibility, the need to solve this arises later than under other approaches, and it’s probably safer to handle it later, when better AI assistance is available.
Corrigibility doesn’t guarantee a good outcome. My main point is that I don’t see any step in this process where existential risks are reduced by switching from corrigibility to something else.
Vetting
Why can’t a preschooler vet the plans of a corrigible human? When I try to analyze my intuitions about this scenario, I find that I’m tempted to subconsciously substitute an actual human for the corrigible human. An actual human would have goals that are likely to interfere with corrigibility.
More importantly, the preschooler’s alternatives to corrigibility suck. Would the preschooler instead do a good enough job of training an AI to reflect the preschooler’s values? Would the preschooler write good enough rules for a Constitutional AI? A provably safe AI would be a good alternative, but its feasibility looks like a long shot.
Now I’m starting to wonder who the preschooler is an analogy for. I’m fairly sure that Max wants the AI to be corrigible to the most responsible humans until the period of acute risk gives way to a safer era.
Incompletely Corrigible Stages of AI
We should assume that the benefits of corrigibility depend on how reasonable the principal is.
I assume there will be a nontrivial period when the AI behaves corrigibly in situations that closely resemble its training environment, but would behave incorrigibly in some unusual situations.
That means a sensible principal would give high priority to achieving full corrigibility, while being careful to avoid exposing the AI to unusual situations. The importance of avoiding unusual situations is an argument for slowing or halting capabilities advances, but I don’t see how it’s an argument for replacing corrigibility with another goal.
Where along this path to full corrigibility and full ASI does the principal’s ability to correct the AI decrease? The principal is becoming more capable, due to having an increasingly smart AI helping them, and due to learning more about the AI. The principal’s trust in the AI’s advice ought to increase as the evidence accumulates that the AI is helpful, so I see decreasing risk that the principal will do something foolish and against the AI’s advice.
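As a toy illustration of that trust dynamic (the Beta-Binomial model below is my own illustrative assumption, not something from Max’s or Jeremy’s posts), one can picture the principal tracking a posterior probability that the AI’s advice is helpful, updated after each observed outcome:

```python
# Toy sketch (illustrative assumption): treat the principal's trust in the AI
# as the posterior mean of a Beta distribution over the probability that a
# given piece of advice turns out to be helpful.

def trust_estimate(helpful: int, unhelpful: int,
                   prior_alpha: float = 1.0, prior_beta: float = 1.0) -> float:
    """Posterior mean after observing `helpful` good outcomes and `unhelpful`
    bad ones, starting from a Beta(prior_alpha, prior_beta) prior."""
    return (prior_alpha + helpful) / (prior_alpha + prior_beta + helpful + unhelpful)

# As helpful advice accumulates, the estimate climbs toward 1; a run of bad
# advice pulls it back down and should invite more scrutiny.
print(trust_estimate(5, 1))    # ~0.75 after a short track record
print(trust_estimate(200, 4))  # ~0.98 after a long, mostly good track record
```

A real principal would also weigh the stakes of each decision and the possibility of deception, which this toy model deliberately ignores.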
Maybe there’s some risk of the principal getting into the habit of always endorsing the AI’s proposals? I’m unsure how to analyze that. It seems easier to avoid than the biggest risks, but it still presents enough risk that I encourage you to analyze it more deeply than I have so far.
Jeremy Gillen, in “The corrigibility basin of attraction is a misleading gloss”, says that we won’t reach full corrigibility because:
The engineering feedback loop will use up all its fuel
This seems like a general claim that alignment is hard, not a claim that corrigibility is riskier than other strategies.
Suppose it’s really hard to achieve full corrigibility. We should be able to see that risk more clearly than we can now once we’re assisted by a slightly-better-than-human AI. With such an AI, we should also be better able to arrange an international agreement to slow or halt AI advances.
Does corrigibility make it harder to get such an international agreement, compared to alternative strategies? I can imagine an effect: the corrigible AI would serve the goals of just one principal, making it less trustworthy than, say, a constitutional AI. That effect seems hard to analyze. I expect that an AI will be able to mitigate that risk by a combination of being persuasive and being able to show that there are large risks to an arms race.
Complexity
Max writes:
Might we need a corrigible AGI that is operating at speeds and complexities beyond what a team of wise operators can verify? I’d give it a minority—but significant—chance (maybe 25%?), with the chance increasing the more evenly/widely distributed the AGI technology is.
I don’t agree that the complexity of what the AGI does is a reason to avoid corrigibility. By assumption, the AGI is doing its best to inform the principal of the consequences of its actions.
A corrigible AI or human would be careful to give the preschooler the best practical advice about the consequences of decisions. It would work much like consulting a human expert: if I trust an auto mechanic to be honest, I don’t need to understand his reasoning when he says that better wheel alignment will improve fuel efficiency.
Complexity does limit one important method of checking the AI’s honesty. But there are multiple ways to evaluate honesty, and probably more with AIs than with humans, since we can get some evidence from the AI’s internals. Evaluating honesty isn’t obviously harder than what we’d need to evaluate for a value-aligned AI.
And again, why would we expect an alternative to corrigibility to do better?
Speed
A need for hasty decisions is a harder issue.
My main hope is that, if a race between multiple AGIs threatens to pressure them to act faster than they can be supervised, the corrigible AGIs would prioritize a worldwide agreement to slow down whatever is driving that race.
But suppose some decisions are too urgent to allow consulting the principal? That’s a real problem, but it’s unclear why it would be an argument for switching away from corrigibility.
How plausible is it that an AI will be persuasive enough at the critical time to coordinate a global agreement? My intuition is that the answer is pretty similar regardless of whether we stick with corrigibility or switch to an alternative.
The only scenario I see where switching would make sense is if we know how to make a safe incorrigible AI, but not a corrigible AI that’s fully aligned. That seems to require corrigibility to be a feature that makes it harder to incorporate other safety features. I’m interested in reading more arguments on this topic, but my intuition says that keeping corrigibility is at least as safe as any alternative.
Conclusion
Corrigibility appears to be a path that leads in the direction of full value alignment. No alternative begins to look better as the system scales to superintelligence.
Most of the doubts about corrigibility seem to amount to worrying that a human will be in charge, and humans aren’t up to the job.
Corrigible AI won’t be as safe as I’d like during the critical path to superintelligence. I see little hope of getting something safer.
To quote Seth Herd:
I’m not saying that building AGI with this alignment target is a good idea; indeed, I think it’s probably not as wise as pausing development entirely (depending on your goals; most of the world are not utilitarians). I’m arguing that it’s a better idea than attempting value alignment. And I’m arguing that this is what will probably be tried, so we should be thinking about how exactly this could go well or go badly.
I’m slightly less optimistic than Seth about what will be tried, but more optimistic about how corrigibility will work if tried by the right people.