5 comments on “Review of AI Alignment Progress”

  1. Thanks for this review.

    I thought some of Paul Christiano’s ideas were interesting, such as the thought experiment (as I would call it) of “Approval-directed agents”. (I didn’t follow most of your links, so I don’t know if any of them overlap with this.)


    I wonder what your general response is to Eliezer’s “doom post” from last year. Evidently he is much less optimistic than you; is it clear why?

  2. Bruce,

    Some of my disagreement with Eliezer stems from differing beliefs about how quickly AGI will become god-like after reaching human levels.

    That presumably indicates some fundamental differences in how we model intelligence.

    I get conflicting impressions about whether I have important disagreements with him about the need for proof. He sometimes seems to say that steps which aren’t demonstrably safe have a negligible chance of working, whereas I often think they have more like a 50% chance of working. But I don’t see any clear examples where I can say this is definitely separate from disagreements about whether to treat AGIs as god-like.

    Eliezer seems overconfident in ways that magnify other sources of disagreement.

  3. I’ve changed my mind about AIs developing a self-concept. I now see little value in worrying about whether an AI has one.

    I had been imagining that an AI would avoid recursive self-improvement if it had no special interest in the source code / weights / machines that correspond to it.

    I now think that makes little difference. Suppose the AI doesn’t care whether it’s improving its own source code or creating a new tool starting from some other code base. As long as the result improves on its source code, we get roughly the same risks and benefits as with recursive self-improvement.

  4. I agree with that, but would describe it a bit differently (though perhaps equivalently).

    I claim your example can legitimately be described as “an AI with a self-concept, resulting in recursive self-improvement”, merely by admitting a “distributed system” as a legitimate kind of “self”. (In other words, arguing about whether your example AI has a “self-concept” is mostly just a “purely semantic distinction”. That said, calling it a “distributed self” might help you think about it in certain ways.)

    In case this is too vague, I’ll be much more precise, at the cost of being pedantic and verbose:


    Suppose an AI has an ongoing subgoal (G) that can be described as “find, anywhere, code you can access which has property X, and attempt to modify it so that it has a greater value of property Y (but still has property X)”.

    Then if this AI has sufficient power, skill, and access to the world, we predict that its actions will gradually increase the average value of property Y among all findable code with property X.

    Suppose further that this AI’s own source code has property X, and that having a greater value of property Y generally makes it “work better”, at least in the sense of better following the named subgoal G.

    For some values of X and Y, this is equivalent to your example. We can describe the predicted effect as “recursive improvement” of the stated body of code in the world.

    But all it takes to then describe this as “recursive self-improvement” is to define “self” as “the system consisting of all the processes enabled by that same body of code”, i.e. all the code with property X (which is findable for modification by that kind of code, and has the necessary power and access).

    (Maybe your example is more general in not requiring the AI’s own code to have property X. But the “stable state of recursive improvement” it leads to would involve acts by code which does have property X.)

    In other words, if we simply accept that the concept “self” could refer to a distributed system, and don’t require the AI to associate that concept with the English word spelled s-e-l-f, then your example *is* recursive self-improvement, and the AI has the self-concept “everything findable which is run by code with property X”.
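    The subgoal G above can be sketched as a toy loop. This is purely illustrative: the names (`has_property_x`, `score_y`, `improve`) and the stand-in definitions of properties X and Y are my own hypothetical choices, not anything from the original example.

    ```python
    # Toy sketch of subgoal G: "find code with property X, and modify it
    # so it has a greater value of property Y (while keeping property X)".
    # All names and property definitions here are illustrative stand-ins.

    def has_property_x(code):
        # Stand-in predicate: "property X" is just containing the tag "X".
        return "X" in code

    def score_y(code):
        # Stand-in for "property Y": here, simply the length of the code.
        return len(code)

    def improve(code):
        # Attempt to raise Y while preserving X; keep the original if the
        # modification would break property X.
        candidate = code + "+"
        return candidate if has_property_x(candidate) else code

    def run_subgoal_g(findable_code):
        # One pass of G over all findable code: improve what has X,
        # leave the rest alone.
        return [improve(c) if has_property_x(c) else c for c in findable_code]

    # The agent's own code is just another findable item with property X,
    # so the same loop improves it too -- the "distributed self" at work.
    world = ["agent-X", "tool-X", "unrelated"]
    world = run_subgoal_g(world)
    ```

    The point the sketch makes concrete: nothing in the loop singles out the agent’s own code. It gets improved only because it happens to satisfy property X, exactly like every other findable item, which is why calling the result “self-improvement” is a matter of how broadly one defines “self”.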


    My “draft general definition of good” is related to this. It’s old, but was first published as this recent blog comment:


    That comment also relates to our present topic, and perhaps helps clarify it further. For more context, see my earlier and later comments on that post, including this followup which spells out how my definition works around the “Löbean obstacle” to the Tiling Agents problem:


  5. I should add a more concise summary of my “draft definition of good”:

    “find good processes and help them, or potentially good processes and help them be good”,

    or in short

    “help good” (wherever and in whatever form you find it).

    (I acknowledge the lack of a necessary “base case” in this self-referential definition. But I claim that humans who consider themselves to be good are implicitly using some definition of this form, though generally each human gives it a different “base case”. Often they ally with humans who have sufficiently similar base cases, so the resulting group is also “doing good” in this way, using some more general base case.)
