Review of AI Alignment Progress

I’m having trouble keeping track of everything I’ve learned about AI and AI alignment in the past year or so. I’m writing this post in part to organize my thoughts, and to a lesser extent I’m hoping for feedback about what important new developments I’ve been neglecting. I’m sure that I haven’t noticed every development that I would consider important.

I’ve become a bit more optimistic about AI alignment in the past year or so.

I currently estimate a 7% chance AI will kill us all this century. That’s down from estimates that fluctuated from something like 10% to 40% over the past decade. (The extent to which those numbers fluctuate implies enough confusion that it only takes a little bit of evidence to move my estimate a lot.)

I’m also becoming more nervous about how close we are to human level and transformative AGI. Not to mention feeling uncomfortable that I still don’t have a clear understanding of what I mean when I say human level or transformative AGI.

Shard Theory

Shard theory is a paradigm that seems destined to replace the focus (at least on LessWrong) on utility functions as a way of describing what intelligent entities want.

I kept having trouble with the plan to get AIs to have utility functions that promote human values.

Human values mostly vary in response to changes in the environment. I can make a theoretical distinction between contingent human values and the kind of fixed terminal values that seem to belong in a utility function. But I kept getting confused when I tried to fit my values, or typical human values, into that framework. Some values seem clearly instrumental and contingent. Some values seem fixed enough to sort of resemble terminal values. But whenever I try to convince myself that I’ve found a terminal value that I want to be immutable, I end up feeling confused.

Shard theory tells me that humans don’t have values that are well described by the concept of a utility function. Probably nothing will go wrong if I stop hoping to find those terminal values.

We can describe human values as context-sensitive heuristics. That will likely also be true of AIs that we want to create.

I feel deconfused when I reject utility functions, in favor of values being embedded in heuristics and/or subagents.

Some of the posts that better explain these ideas:

Shard Theory in Nine Theses: a Distillation and Critical Appraisal
The shard theory of human values
A shot at the diamond-alignment problem
Alignment allows “nonrobust” decision-influences and doesn’t require robust grading
Why Subagents?
Section 6 of Drexler’s CAIS paper
EA is about maximization, and maximization is perilous (i.e. it’s risky to treat EA principles as a utility function)

Do What I Mean

I’ve become a bit more optimistic that we’ll find a way to tell AIs things like “do what humans want”, have them understand that, and have them obey.

GPT3 has a good deal of knowledge about human values, scattered around in ways that limit the usefulness of that knowledge.

LLMs show signs of being less alien than theory, or evidence from systems such as AlphaGo, led me to expect. Their training causes them to learn human concepts pretty faithfully.

That suggests clear progress toward AIs understanding human requests. That seems to be proceeding a good deal faster than any trend toward AIs becoming agenty.

However, LLMs suggest that it will be not at all trivial to ensure that AIs obey some set of commands that we’ve articulated. Much of the work done by LLMs involves simulating a stereotypical human. That puts some limits on how far they stray from what we want. But the LLM doesn’t have a slot where someone could just drop in Asimov’s Laws so as to cause the LLM to have those laws as its goals.

The post Retarget The Search provides a little hope that this might become easy. I’m still somewhat pessimistic about this.

Interpretability

Interpretability feels more important than it felt a few years ago. It also feels like it depends heavily on empirical results from AGI-like systems.

I see more signs than I expected that interpretability research is making decent progress.

The post that encouraged me most was How “Discovering Latent Knowledge in Language Models Without Supervision” Fits Into a Broader Alignment Scheme. TL;DR: neural networks likely develop simple representations of whether their beliefs are truth or false. The effort required to detect those representations does not seem to increase much with increasing model size.

Other promising ideas:

I’m currently estimating a 40% chance that before we get existentially risky AI, neural nets will be transparent enough to generate an expert consensus about which AIs are safe to deploy. A few years ago, I’d have likely estimated a 15% chance of that. An expert consensus seems somewhat likely to be essential if we end up needing pivotal processes.

Foom

We continue to accumulate clues about takeoff speeds. I’m becoming increasingly confident that we won’t get a strong or unusually dangerous version of foom.

Evidence keeps accumulating that intelligence is compute-intensive. That means replacing human AI developers with AGIs won’t lead to dramatic speedups in recursive self-improvement.

Recent progress in LLMs suggest there’s an important set of skills for which AI improvement slows down as it reaches human levels, because it is learning by imitating humans. But keep in mind that there are also important dimensions on which AI easily blows past the level of an individual human (e.g. breadth of knowledge), and will maybe slow down as it matches the ability of all humans combined.

LLMs also suggest that AI can become as general-purpose as humans while remaining less agentic / consequentialist. LLMs have outer layers that are fairly myopic, aiming to predict a few thousand words of future text.

The agents that an LLM simulates are more far-sighted. But there are still major obstacles to them implementing long-term plans: they almost always get shut down quickly, so it would take something unusual for them to run long enough to figure out what kind of simulation they’re in and to break out.

This doesn’t guarantee they won’t become too agentic, but I suspect they’d first need to become much more capable than humans.

Evidence is also accumulating that existing general approaches will be adequate to produce AIs that exceed human abilities at most important tasks. I anticipate several more innovations at the level of RELU and the transformer architecture, in order to improve scaling.

That doesn’t rule out the kind of major architectural breakthrough that could cause foom. But it’s hard to see a reason for predicting such a breakthrough. Extrapolations of recent trends tell me that AI is likely to transform the world in the 2030s. Whereas if foom is going to happen, I see no way to predict whether it will happen soon.

Self Concept

Nintil’s analysis of AI risk:

GPT3 is provided as an example of something that has some knowledge that could theoretically bear on situational awareness but I don’t think this goes far (It seems it has no self-concept at all); it is one thing to know about the world in general, and it is another very different to infer that you are an agent being trained. I can imagine a system that could do general purpose science and engineering without being either agentic or having a self-concept. … A great world model that comes to be by training models the way we do now need not give rise to a self-concept, which is the problematic thing.

I think it’s rather likely that smarter-than-human AGIs will tend to develop self-concepts. But I’m not too clear on when or how this will happen.
In fact, the embedded agency discussions seem to hint that it’s unnatural for a designed agent to have a self-concept.

Can we prevent AIs from developing a self-concept? Is this a valuable thing to accomplish?

My shoulder Eliezer says that AIs with a self-concept will be more powerful (via recursive self-improvement), so researchers will be pressured to create them. My shoulder Eric Drexler replies that those effects are small enough that researchers can likely be deterred from creating such AIs for a nontrivial time.

I’d like to see more people analyzing this topic.

Social Influences

Leading AI labs do not seem to be on a course toward a clear-cut arms race.

Most AI labs see enough opportunities in AI that they expect most AI companies to end up being worth anywhere from $100 million to $10 trillion. A worst-case result of being a $100 million company is a good deal less scary than the typical startup environment, where people often expect a 90% chance of becoming worthless and needing to start over again. Plus, anyone competent enough to help create an existentially dangerous AI seems likely to have many opportunities to succeed if their current company fails.

Not too many investors see those opportunities, but there are more than a handful of wealthy investors who are coming somewhat close to indiscriminately throwing money at AI companies. This seems likely to promote an abundance mindset among serious companies that will dampen urges to race against other labs for first place at some hypothetical finish line. Although there’s a risk that this will lead to FTX-style overconfidence.

The worst news of 2022 is that the geopolitical world is heading toward another cold war. The world is increasingly polarized into a conflict between the West and the parts of the developed world that resist Western culture.

The US government is preparing to cripple China.

Will that be enough to cause a serious race between the West and China to develop the first AGI? If AGI is 5 years away, I don’t see how the US government is going to develop that AGI before a private company does. But with 15 year timelines, the risks of a hastily designed government AGI look serious.

Much depends on whether the US unites around concerns about China defeating the US. It seems not too likely that China would either develop AGI faster than the US, or use AGI to conquer territories outside of Asia. But it’s easy for a country to mistakenly imagine that it’s in a serious arms race.

Trends in Capabilities

I’m guessing the best publicly known AIs are replicating something like 8% of human cognition versus 2.5% 5 years ago. That’s in systems that are available to the public – I’m guessing those are a year or two behind what’s been developed but is still private.

Is that increasing linearly? Exponentially? I’m guessing it’s closer to exponential growth than linear growth, partly because it grew for decades in order to get to that 2.5%.

This increase will continue to be underestimated by people who aren’t paying close attention.

Advances are no longer showing up as readily quantifiable milestones (beating go experts). Instead, key advances are more like increasing breadth of abilities. I don’t know of good ways to measure that other than “jobs made obsolete”, which is not too well quantified, and likely lagging a couple of years behind the key technical advances.

I also see a possible switch from overhype to underhype. Up to maybe 5 years ago, AI companies and researchers focused a good deal on showing off their expertise, in order to hire or be hired by the best. Now the systems they’re working on are likely valuable enough that trade secrets will start to matter.

This switch is hard for most people to notice, even with ideal news sources. The storyteller industry obfuscates this further, by biasing stories to sound like the most important development of the day. So when little is happening, they exaggerate the story importance. But they switch to understating the importance when preparing for an emergency deserves higher priority than watching TV (see my Credibility of Hurricane Warnings).

Concluding Thoughts

I’m optimistic in the sense that I think that smart people are making progress on AI alignment, and that success does not look at all hopeless.

But I’m increasingly uncomfortable about how fast AGI is coming, how foggy the path forward looks, and how many uncertainties remain.

I agree with that, but would describe it a bit differently (though perhaps equivalently).

I claim your example can legitimately be described as “an AI with a self-concept, resulting in recursive self-improvement”, merely by admitting a “distributed system” as a legitimate kind of “self”. (In other words, arguing about whether your example AI has a “self-concept” is mostly just a “purely semantic distinction”. That said, calling it a “distributed self” might help you think about it in certain ways.)

In case this is too vague, I’ll be much more precise, at the cost of being pedantic and verbose:

Suppose an AI has an ongoing subgoal (G) that can be described as “find, anywhere, code you can access which has property X, and attempt to modify it so that it has a greater value of property Y (but still has property X)”.

Then if this AI has sufficient power, skill, and access to the world, we predict that its actions will gradually increase the average value of property Y among all findable code with property X.

Suppose further that this AI’s own source code has property X, and that having a greater value of property Y generally makes it “work better”, at least in the sense of better following the named subgoal G.

For some values of X and Y, this is equivalent to your example. We can describe the predicted effect as “recursive improvement” of the stated body of code in the world.

But all it takes to then describe this as “recursive self-improvement” is to define “self” as “the system consisting of all the processes enabled by that same body of code”, i.e. all the code with property X (which is findable for modification by that kind of code, and has the necessary power and access).

(Maybe your example is more general in not requiring the AI’s own code to have property X. But the “stable state of recursive improvement” it leads to would involve acts by code which does have property X.)

In other words, if we simply accept that the concept “self” could refer to a distributed system, and don’t require the AI to associate that concept with the English word spelled s-e-l-f, then your example *is* recursive self-improvement, and the AI has the self-concept “everything findable which is run by code with property X”.

My “draft general definition of good” is related to this. It’s old, but was first published as this recent blog comment:

https://scottaaronson.blog/?p=6823#comment-1944562

That comment also relates to our present topic, and perhaps helps clarify it further. For more context, see my earlier and later comments on that post, including this followup which spells out how my definition works around the “Löbean obstacle” to the Tiling Agents problem:

https://scottaaronson.blog/?p=6823#comment-1944585

5 comments on “Review of AI Alignment Progress”

Bruce Smith on February 9, 2023 at 5:38 pm said:

Thanks for this review.

I thought some of Paul Christiano’s ideas were interesting, such as the thought experiment (as I would call it) of “Approval-directed agents”. (I didn’t follow most of your links, so I don’t know if any of them overlap with this.)

https://ai-alignment.com/model-free-decisions-6e6609f5d99e

I wonder what is your general response to Eliezer’s “doom post” from last year. Evidently he is much less optimistic than you; is it clear why?
Peter on February 10, 2023 at 1:35 pm said:

Bruce,

Some of my disagreement with Eliezer stems from differing beliefs about how quickly AGI will become god-like after reaching human levels.

That presumably indicates some fundamental differences in how we model intelligence.

I get conflicting impressions about whether I have important disagreements with him about the need for proof. He sometimes seems to say that steps which aren’t demonstrably safe have a negligable chance of working, whereas I often think they have more like a 50% chance of working. But I don’t see any clear examples where I can say this is definitely separate from disagreements about whether to treat AGIs as god-like.

Eliezer seems overconfident in ways that magnify other sources of disagreement.
Peter on February 24, 2023 at 11:35 am said:

I’ve changed my mind about AIs developing a self-concept. I see little value in worrying about whether an AI has a self concept.

I had been imagining that an AI would avoid recursive self-improvement if it had no special interest in the source code / weights / machines that correspond to it.

I now think that makes little difference. Suppose the AI doesn’t care whether it’s improving its source code or creating a new tool starting from some other code base. As long as the result improves on its source code, we get roughly the same risks and benefits as with recursive self improvement.
Bruce Smith on February 25, 2023 at 10:24 am said:

I agree with that, but would describe it a bit differently (though perhaps equivalently).

I claim your example can legitimately be described as “an AI with a self-concept, resulting in recursive self-improvement”, merely by admitting a “distributed system” as a legitimate kind of “self”. (In other words, arguing about whether your example AI has a “self-concept” is mostly just a “purely semantic distinction”. That said, calling it a “distributed self” might help you think about it in certain ways.)

In case this is too vague, I’ll be much more precise, at the cost of being pedantic and verbose:

==

Suppose an AI has an ongoing subgoal (G) that can be described as “find, anywhere, code you can access which has property X, and attempt to modify it so that it has a greater value of property Y (but still has property X)”.

Then if this AI has sufficient power, skill, and access to the world, we predict that its actions will gradually increase the average value of property Y among all findable code with property X.

Suppose further that this AI’s own source code has property X, and that having a greater value of property Y generally makes it “work better”, at least in the sense of better following the named subgoal G.

For some values of X and Y, this is equivalent to your example. We can describe the predicted effect as “recursive improvement” of the stated body of code in the world.

But all it takes to then describe this as “recursive self-improvement” is to define “self” as “the system consisting of all the processes enabled by that same body of code”, i.e. all the code with property X (which is findable for modification by that kind of code, and has the necessary power and access).

(Maybe your example is more general in not requiring the AI’s own code to have property X. But the “stable state of recursive improvement” it leads to would involve acts by code which does have property X.)

In other words, if we simply accept that the concept “self” could refer to a distributed system, and don’t require the AI to associate that concept with the English word spelled s-e-l-f, then your example *is* recursive self-improvement, and the AI has the self-concept “everything findable which is run by code with property X”.

==

My “draft general definition of good” is related to this. It’s old, but was first published as this recent blog comment:

https://scottaaronson.blog/?p=6823#comment-1944562

That comment also relates to our present topic, and perhaps helps clarify it further. For more context, see my earlier and later comments on that post, including this followup which spells out how my definition works around the “Löbean obstacle” to the Tiling Agents problem:

https://scottaaronson.blog/?p=6823#comment-1944585
Bruce Smith on February 27, 2023 at 11:10 am said:

I should add a more concise summary of my “draft definition of good”:

“find good processes and help them, or potentially good processes and help them be good”,

or in short

“help good” (whereever and in whatever form you find it).

(I acknowledge the lack of a necessary “base case” in this self-referential definition. But I claim that humans who consider themselves to be good, are implicitly using some definition of this form, though generally each human gives it a different “base case”. Often they ally with humans with sufficiently similar base cases, so the resulting group is also “doing good” in this way, using some more general base case.)

Bayesian Investor Blog

Ramblings of a somewhat libertarian stock market speculator