TL;DR:
- Corrigibility is a simple and natural enough concept that a prosaic AGI can likely be trained to be corrigible.
- AI labs are on track to give superhuman(?) AIs goals which conflict with corrigibility.
- Corrigibility fails if AIs have goals which conflict with it.
- AI labs are not on track to find a safe alternative to corrigibility.
This post is mostly an attempt to distill and rewrite Max Harm’s Corrigibility As Singular Target sequence so that a wider audience can understand its key points. I’ll start by mostly explaining Max’s claims, then drift toward adding some opinions of my own.