MIRI


I’m having trouble keeping track of everything I’ve learned about AI and AI alignment in the past year or so. I’m writing this post in part to organize my thoughts, and to a lesser extent I’m hoping for feedback about what important new developments I’ve been neglecting. I’m sure that I haven’t noticed every development that I would consider important.

I’ve become a bit more optimistic about AI alignment in the past year or so.

I currently estimate a 7% chance AI will kill us all this century. That’s down from estimates that fluctuated from something like 10% to 40% over the past decade. (The extent to which those numbers fluctuate implies enough confusion that it only takes a little bit of evidence to move my estimate a lot.)

I’m also becoming more nervous about how close we are to human-level and transformative AGI. Not to mention feeling uncomfortable that I still don’t have a clear understanding of what I mean when I say human-level or transformative AGI.

Continue Reading

Book review: Human Compatible, by Stuart Russell.

Human Compatible provides an analysis of the long-term risks from artificial intelligence, by someone with a good deal more of the relevant prestige than any prior author on this subject.

What should I make of Russell? I skimmed his best-known book, Artificial Intelligence: A Modern Approach, and got the impression that it taught a bunch of ideas that were popular among academics, but which weren’t the focus of the people who were getting interesting AI results. So I guessed that people would be better off reading Deep Learning by Goodfellow, Bengio, and Courville instead. Human Compatible neither confirms nor dispels the impression that Russell is a bit too academic.

However, I now see that he was one of the pioneers of inverse reinforcement learning, which looks like a fairly significant advance that will likely become important someday (if it hasn’t already). So I’m inclined to treat him as a moderately good authority on AI.
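As a rough illustration of what inverse reinforcement learning is about (the three-state world and brute-force search below are my own invention for this review, not anything from Russell’s papers): instead of computing optimal behavior from a given reward function, IRL infers a reward function from demonstrated behavior.

```python
# Toy sketch of the inverse reinforcement learning (IRL) problem: given a
# demonstration, find a reward function under which that demonstration is
# optimal. Real IRL algorithms are far more sophisticated; this brute-force
# version only conveys the shape of the problem.
import itertools

STATES = ["home", "road", "office"]
ACTIONS = {"home": ["stay", "go"], "road": ["stay", "go"], "office": ["stay"]}

def step(state, action):
    """Deterministic toy dynamics: 'go' moves one step toward the office."""
    if action == "stay":
        return state
    return {"home": "road", "road": "office"}[state]

def best_trajectory(reward, start="home", horizon=3):
    """Exhaustively find the highest-reward state sequence (tiny problem)."""
    best, best_value = None, float("-inf")
    for plan in itertools.product(["stay", "go"], repeat=horizon):
        state, value, visited = start, 0.0, []
        for action in plan:
            action = action if action in ACTIONS[state] else "stay"
            state = step(state, action)
            value += reward[state]
            visited.append(state)
        if value > best_value:
            best, best_value = visited, value
    return best

# Demonstration: the expert heads straight to the office and stays there.
demo = ["road", "office", "office"]

# Candidate reward functions: +1 on exactly one state, 0 elsewhere.
# Only the reward that values 'office' makes the demonstration optimal.
for goal in STATES:
    reward = {s: (1.0 if s == goal else 0.0) for s in STATES}
    consistent = best_trajectory(reward) == demo
    print(f"reward on {goal!r}: explains the demonstration = {consistent}")
```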

The first half of the book is a somewhat historical view of AI, intended for readers who don’t know much about AI. It’s ok.

Continue Reading

Robin Hanson has been suggesting recently that we’ve been experiencing an AI boom that’s not too different from prior booms.

At the recent Foresight Vision Weekend, he predicted [not exactly – see the comments] a 20% decline in the number of DeepMind employees over the next year (Foresight asked all speakers to make a 1-year prediction).

I want to partly agree and partly disagree.

Continue Reading

Book review: The AI Does Not Hate You: Superintelligence, Rationality and the Race to Save the World, by Tom Chivers.

This book is a sympathetic portrayal of the rationalist movement by a quasi-outsider. It includes a well-organized explanation of why some people expect that AI will create large risks sometime this century, written in simple language that is suitable for a broad audience.

Caveat: I know many of the people who are described in the book. I’ve had some sort of connection with the rationalist movement since before it became distinct from transhumanism, and I’ve been mostly an insider since 2012. I read this book mainly because I was interested in how the rationalist movement looks to outsiders.

Chivers is a science writer. I normally avoid books by science writers, due to an impression that they mostly focus on telling interesting stories, without developing a deep understanding of the topics they write about.

Chivers’ understanding of the rationalist movement doesn’t quite qualify as deep, but he was surprisingly careful to read a lot about the subject, and to write only things he did understand.

Many times I reacted to something he wrote with “that’s close, but not quite right”. Usually when I reacted that way, Chivers did a good job of describing the rationalist message in question, and the main problem was either that rationalists haven’t figured out how to explain their ideas in a way that a broad audience can understand, or that rationalists are confused. So the complaints I make in the rest of this review are at most weakly directed in Chivers’ direction.

I saw two areas where Chivers overlooked something important.

Rationality

One involves CFAR.

Chivers wrote seven chapters on biases, and how rationalists view them, ending with “the most important bias”: knowing about biases can make you more biased. (italics his).

I get the impression that Chivers is sweeping this problem under the rug (Do we fight that bias by being aware of it? Didn’t we just read that that doesn’t work?). That is roughly what happened with many people who learned rationalism solely via written descriptions.

Then much later, when describing how he handled his conflicting attitudes toward the risks from AI, he gives a really great description of maybe 3% of what CFAR teaches (internal double crux), much like a blind man giving a really clear description of the upper half of an elephant’s trunk. He prefaces this narrative with the apt warning: “I am aware that this all sounds a bit mystical and self-helpy. It’s not.”

Chivers doesn’t seem to connect this exercise with the goal of overcoming biases. Maybe he was too busy applying the technique on an important problem to notice the connection with his prior discussions of Bayes, biases, and sanity. It would be reasonable for him to argue that CFAR’s ideas have diverged enough to belong in a separate category, but he seems to put them in a different category by accident, without realizing that many of us consider CFAR to be an important continuation of rationalists’ interest in biases.

World conquest

Chivers comes very close to covering all of the layman-accessible claims that Yudkowsky and Bostrom make. My one complaint here is that he only gives vague hints about why one bad AI can’t be stopped by other AIs.

A key claim of many leading rationalists is that AI will have some winner-take-all dynamics that will lead to one AI having a decisive strategic advantage after it crosses some key threshold, such as human-level intelligence.

This is a controversial position that is somewhat connected to foom (fast takeoff), but which might be correct even without foom.

Utility functions

“If I stop caring about chess, that won’t help me win any chess games, now will it?” – That chapter title provides a good explanation of why a simple AI would continue caring about its most fundamental goals.

Is that also true of an AI with more complex, human-like goals? Chivers is partly successful at explaining how to apply the concept of a utility function to a human-like intelligence. Rationalists (or at least those who actively research AI safety) have a clear meaning here, at least as applied to agents that can be modeled mathematically. But when laymen try to apply that to humans, confusion abounds, due to the ease of conflating subgoals with ultimate goals.
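As a toy sketch of that distinction (the names and numbers below are invented): a utility function is defined over the outcomes an agent terminally cares about, while a subgoal like “get money” has only derived, instrumental value: it is worth exactly as much as it improves the agent’s expected utility, and no more.

```python
# Toy sketch of terminal vs. instrumental value for an expected-utility
# maximizer. All outcome names and numbers are invented for illustration.

# The utility function is defined only over terminal outcomes.
UTILITY = {"healthy_and_comfortable": 10.0, "sick": -5.0, "broke_but_healthy": 6.0}

def expected_utility(outcome_probs):
    """Standard expected utility: sum of P(outcome) * U(outcome)."""
    return sum(p * UTILITY[o] for o, p in outcome_probs.items())

# Money never appears in UTILITY, yet it has instrumental value because it
# changes which outcomes the agent can expect to reach.
with_money    = {"healthy_and_comfortable": 0.8, "sick": 0.1, "broke_but_healthy": 0.1}
without_money = {"healthy_and_comfortable": 0.2, "sick": 0.3, "broke_but_healthy": 0.5}

value_of_money = expected_utility(with_money) - expected_utility(without_money)
print(f"instrumental value of money: {value_of_money:+.2f} utils")

# The subgoal "acquire money" is worth pursuing only insofar as it raises
# expected utility; change the environment and its value changes, while the
# utility function itself stays fixed.
```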

Chivers tries to clarify, using the story of Odysseus and the Sirens, and claims that the Sirens would rewrite Odysseus’ utility function. I’m not sure how we can verify that the Sirens work that way, or whether they would merely persuade Odysseus to make false predictions about his expected utility. Chivers at least states clearly that the Sirens try to prevent Odysseus from doing what his pre-Siren utility function advises (by making him run aground). Chivers’ point could be a bit clearer if he specified that in his (nonstandard?) version of the story, the Sirens make Odysseus want to run aground.

Philosophy

“Essentially, he [Yudkowsky] (and the Rationalists) are thoroughgoing utilitarians.” – That’s a bit misleading. Leading rationalists are predominantly consequentialists, but mostly avoid committing to a moral system as specific as utilitarianism. Leading rationalists also mostly endorse moral uncertainty. Rationalists mostly endorse utilitarian-style calculation (which entails some of the controversial features of utilitarianism), but are careful to combine that with worry about whether we’re optimizing the quantity that we want to optimize.

I also recommend Utilitarianism and its discontents as an example of one rationalist’s nuanced partial endorsement of utilitarianism.

Political solutions to AI risk?

Chivers describes Holden Karnofsky as wanting “to get governments and tech companies to sign treaties saying they’ll submit any AGI designs to outside scrutiny before switching them on. It wouldn’t be iron-clad, because firms might simply lie”.

Most rationalists seem pessimistic about treaties such as this.

Lying is hardly the only problem. This idea assumes that there will be a tiny number of attempts, each with a very small number of launches that look like the real thing, as happened with the first moon landing and the first atomic bomb. Yet the history of software development suggests it will be something more like hundreds of attempts that look like they might succeed. I wouldn’t be surprised if there are millions of times when an AI is turned on, and the developer has some hope that this time it will grow into a human-level AGI. There’s no way that a large number of designs will get sufficient outside scrutiny to be of much use.

And if a developer is trying new versions of their system once a day (e.g. making small changes to a number that controls, say, openness to new experience), any requirement to submit all new versions for outside scrutiny would cause large delays, creating large incentives to subvert the requirement.

So any realistic treaty would need provisions that identify a relatively small set of design choices that need to be scrutinized.

I see few signs that any experts are close to developing a consensus about what criteria would be appropriate here, and I expect that doing so would require a significant fraction of the total wisdom needed for AI safety. I discussed my hope for one such criterion in my review of Drexler’s Reframing Superintelligence paper.

Rationalist personalities

Chivers mentions several plausible explanations for what he labels the “semi-death of LessWrong”, the most obvious being that Eliezer Yudkowsky finished most of the blogging that he had wanted to do there. But I’m puzzled by one explanation that Chivers reports: “the attitude … of thinking they can rebuild everything”. Quoting Robin Hanson:

At Xanadu they had to do everything different: they had to organize their meetings differently and orient their screens differently and hire a different kind of manager, everything had to be different because they were creative types and full of themselves. And that’s the kind of people who started the Rationalists.

That seems like a partly apt explanation for the demise of the rationalist startups MetaMed and Arbital. But LessWrong mostly copied existing sites, such as Reddit, and was only ambitious in the sense that Eliezer was ambitious about what ideas to communicate.

Culture

I guess a book about rationalists can’t resist mentioning polyamory. “For instance, for a lot of people it would be difficult not to be jealous.” Yes, when I lived in a mostly monogamous culture, jealousy seemed pretty standard. That attitude melted away when the Bay Area cultures that I associated with started adopting polyamory or something similar (shortly before the rationalists became a culture). Jealousy has much more purpose if my partner is flirting with monogamous people than if he’s flirting with polyamorists.

Less dramatically, “We all know people who are afraid of visiting their city centres because of terrorist attacks, but don’t think twice about driving to work.”

This suggests some weird filter bubbles somewhere. I thought that fear of cities got forgotten within a month or so after 9/11. Is this a difference between London and the US? Am I out of touch with popular concerns? Does Chivers associate more with paranoid people than I do? I don’t see any obvious answer.

Conclusion

It would be really nice if Chivers and Yudkowsky could team up to write a book, but this book is a close substitute for such a collaboration.

See also Scott Aaronson’s review.

Eric Drexler has published a book-length paper on AI risk, describing an approach that he calls Comprehensive AI Services (CAIS).

His primary goal seems to be reframing AI risk discussions to use a rather different paradigm than the one that Nick Bostrom and Eliezer Yudkowsky have been promoting. (There isn’t yet any paradigm that’s widely accepted, so this isn’t a Kuhnian paradigm shift; it’s better characterized as an amorphous field that is struggling to establish its first paradigm). Dueling paradigms seems to be the best that the AI safety field can manage to achieve for now.

I’ll start by mentioning some important claims that Drexler doesn’t dispute:

  • an intelligence explosion might happen somewhat suddenly, in the fairly near future;
  • it’s hard to reliably align an AI’s values with human values;
  • recursive self-improvement, as imagined by Bostrom / Yudkowsky, would pose significant dangers.

Drexler likely disagrees about some of the claims made by Bostrom / Yudkowsky on those points, but he shares enough of their concerns about them that those disagreements don’t explain why Drexler approaches AI safety differently. (Drexler is more cautious than most writers about making any predictions concerning these three claims).

CAIS isn’t a full solution to AI risks. Instead, it’s better thought of as an attempt to reduce the risk of world conquest by the first AGI that reaches some threshold, to preserve existing corrigibility somewhat past human-level AI, and to postpone the need for a permanent solution until we have more intelligence.

Continue Reading

Book review: Inadequate Equilibria, by Eliezer Yudkowsky.

This book (actually halfway between a book and a series of blog posts) attacks the goal of epistemic modesty, which I’ll loosely summarize as reluctance to believe that one knows better than the average person.

1.

The book starts by focusing on the base rate for high-status institutions having harmful incentive structures, charting a middle ground between the excessive respect for those institutions that we see in mainstream sources, and the cynicism of most outsiders.

There’s a weak sense in which this is arrogant, namely that if it were obvious to the average voter how to improve on these problems, then I’d expect the problems to be fixed. So people who claim to detect such problems ought to have decent evidence that they’re above average in the relevant skills. There are plenty of people who can rationally decide that applies to them. (Eliezer doubts that advising the rest to be modest will help; I suspect there are useful approaches to instilling modesty in people who should be more modest, but it’s not easy). Also, below-average people rarely seem to be attracted to Eliezer’s writings.

Later parts of the book focus on more personal choices, such as choosing a career.

Some parts of the book seem designed to show off Eliezer’s lack of need for modesty – sometimes successfully, sometimes leaving me suspecting he should be more modest (usually in ways that are somewhat orthogonal to his main points; i.e. his complaints about “reference class tennis” suggest overconfidence in his understanding of his debate opponents).

2.

Eliezer goes a bit overboard in attacking the outside view. He starts with legitimate complaints about people misusing it to justify rejecting theory and adopting “blind empiricism” (a mistake that I’ve occasionally made). But he partly rejects the advice that Tetlock gives in Superforecasting. I’m pretty sure Tetlock knows more about this domain than Eliezer does.

E.g. Eliezer says “But in novel situations where causal mechanisms differ, the outside view fails—there may not be relevantly similar cases, or it may be ambiguous which similar-looking cases are the right ones to look at.”, but Tetlock says ‘Nothing is 100% “unique” … So superforecasters conduct creative searches for comparison classes even for seemingly unique events’.

Compare Eliezer’s “But in many contexts, the outside view simply can’t compete with a good theory” with Tetlock’s commandment number 3 (“Strike the right balance between inside and outside views”). Eliezer seems to treat the approaches as antagonistic, whereas Tetlock advises us to find a synthesis in which the approaches cooperate.
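As one toy way to picture that synthesis (my own illustration, not a recipe from either author): treat the outside view as a base rate, treat the inside view as a model-based estimate, and weight them by how much you trust each.

```python
# Toy sketch of combining an outside-view base rate with an inside-view
# estimate by weighting them in log-odds space. The numbers and the weight
# are invented; the point is only that the two views can cooperate rather
# than compete.
import math

def log_odds(p):
    return math.log(p / (1 - p))

def from_log_odds(l):
    return 1 / (1 + math.exp(-l))

outside_view = 0.05   # base rate: how often projects like this succeed
inside_view = 0.60    # my causal model says this project looks unusually good

# How much weight the inside view gets; a detailed, well-tested causal model
# earns more, a vague hunch earns less.
w_inside = 0.4

combined = from_log_odds(
    w_inside * log_odds(inside_view) + (1 - w_inside) * log_odds(outside_view)
)
print(f"combined estimate: {combined:.2f}")   # lands between 0.05 and 0.60
```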

3.

Eliezer provides a decent outline of what causes excess modesty. He classifies the two main failure modes as anxious underconfidence, and status regulation. Anxious underconfidence definitely sounds like something I’ve felt somewhat often, and status regulation seems pretty plausible, but harder for me to detect.

Eliezer presents a clear model of why status regulation exists, but his explanation for anxious underconfidence doesn’t seem complete. Here are some of my ideas about possible causes of anxious underconfidence:

  • People evaluate mistaken career choices and social rejection as if they meant death (which was roughly true until quite recently), so extreme risk aversion made sense;
  • Inaction (or choosing the default action) minimizes blame. If I carefully consider an option, my choice says more about my future actions than if I neglect to think about the option;
  • People often evaluate their success at life by counting the number of correct and incorrect decisions, rather than adding up the value produced;
  • People who don’t grok the Bayesian meaning of the word “evidence” are likely to privilege the scientific and legal meanings of evidence. So beliefs based on more subjective evidence get treated as second-class citizens (the sketch after this list illustrates the Bayesian sense).
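On that last point, here is a minimal sketch (numbers invented) of what “evidence” means in the Bayesian sense: any observation that is more likely under one hypothesis than under its negation shifts the odds, however subjective or legally inadmissible it may be.

```python
# Toy sketch of Bayesian evidence: an observation counts as evidence for a
# hypothesis whenever it is more likely if the hypothesis is true than if it
# is false. The scenario and numbers are invented.

def update_odds(prior_odds, likelihood_ratio):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    return prior_odds * likelihood_ratio

# Hypothesis: "my coworker is annoyed with me."
prior_odds = 0.25         # 1-to-4 against, before seeing anything
# Observation: a curt one-line reply to a long email. Subjective, useless in
# court, but (say) twice as likely if the hypothesis is true.
likelihood_ratio = 2.0

posterior_odds = update_odds(prior_odds, likelihood_ratio)
posterior_prob = posterior_odds / (1 + posterior_odds)
print(f"posterior probability: {posterior_prob:.2f}")   # odds 0.25 -> 0.5, i.e. 0.33
```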

I suspect that most harm from excess modesty (and also arrogance) happens in evolutionarily novel contexts. Decisions such as creating a business plan for a startup, or writing a novel that sells a million copies, are sufficiently different from what we evolved to do that we should expect over/underconfidence to cause more harm.

4.

Another way to summarize the book would be: don’t aim to overcompensate for overconfidence; instead, aim to eliminate the causes of overconfidence.

This book will be moderately popular among Eliezer’s fans, but it seems unlikely to greatly expand his influence.

It didn’t convince me that epistemic modesty is generally harmful, but it does provide clues to identifying significant domains in which epistemic modesty causes important harm.

Or, why I don’t fear the p-zombie apocalypse.

This post analyzes concerns about how evolution, in the absence of a powerful singleton, might, in the distant future, produce what Nick Bostrom calls a “Disneyland without children”. I.e. a future with many agents, whose existence we don’t value because they are missing some important human-like quality.

The most serious description of this concern is in Bostrom’s The Future of Human Evolution. Bostrom is cautious enough that it’s hard to disagree with anything he says.

Age of Em has prompted a batch of similar concerns. Scott Alexander at SlateStarCodex has one of the better discussions (see section IV of his review of Age of Em).

People sometimes sound like they want to use this worry as an excuse to oppose the age of em scenario, but it applies to just about any scenario with human-in-a-broad-sense actors. If uploading never happens, biological evolution could produce slower paths to the same problem(s) [1]. Even in the case of a singleton AI, the singleton will need to solve the tension between evolution and our desire to preserve our values, although in that scenario it’s more important to focus on how the singleton is designed.

These concerns often assume something like the age of em lasts forever. The scenario which Age of Em analyzes seems unstable, in that it’s likely to be altered by stranger-than-human intelligence. But concerns about evolution only depend on control being sufficiently decentralized that there’s doubt about whether a central government can strongly enforce rules. That situation seems sufficiently stable to be worth analyzing.

I’ll refer to this thing we care about as X (qualia? consciousness? fun?), but I expect people will disagree on what matters for quite some time. Some people will worry that X is lost in uploading, others will worry that some later optimization process will remove X from some future generation of ems.

I’ll first analyze scenarios in which X is a single feature (in the sense that it would be lost in a single step). Later, I’ll try to analyze the other extreme, where X is something that could be lost in millions of tiny steps. Neither extreme seems likely, but I expect that analyzing the extremes will illustrate the important principles.

Continue Reading

The paper When Will AI Exceed Human Performance? Evidence from AI Experts reports that ML researchers expect AI to create a 5% chance of “Extremely bad (e.g. human extinction)” consequences, yet they’re quite divided over whether that implies it’s an important problem to work on.

Slate Star Codex expresses confusion about and/or disapproval of (a slightly different manifestation of) this apparent paradox. It’s a pretty clear sign that something is suboptimal.

Here are some conjectures (not designed to be at all mutually exclusive).
Continue Reading

I’ve recently noticed some possibly important confusion about machine learning (ML)/deep learning. I’m quite uncertain how much harm the confusion will cause.

On MIRI’s Intelligent Agent Foundations Forum:

If you don’t do cognitive reductions, you will put your confusion in boxes and hide the actual problem. … E.g. if neural networks are used to predict math, then the confusion about how to do logical uncertainty is placed in the black box of “what this neural net learns to do”

On SlateStarCodex:

Imagine a future inmate asking why he was denied parole, and the answer being “nobody knows and it’s impossible to find out even in principle” … (DeepMind employs a Go master to help explain AlphaGo’s decisions back to its own programmers, which is probably a metaphor for something)

A possibly related confusion, from a conversation that I observed recently: philosophers have tried to understand how concepts work for centuries, but have made little progress; therefore deep learning isn’t very close to human-level AGI.

I’m unsure whether any of the claims I’m criticizing reflect actually mistaken beliefs, or whether they’re just communicated carelessly. I’m confident that at least some people at MIRI are wise enough to avoid this confusion [1]. I’ve omitted some ensuing clarifications from my description of the deep learning conversation – maybe if I remembered those sufficiently well, I’d see that I was reacting to a straw man of that discussion. But it seems likely that some people were misled by at least the SlateStarCodex comment.

There’s an important truth that people refer to when they say that neural nets (and machine learning techniques in general) are opaque. But that truth gets seriously obscured when rephrased as “black box” or “impossible to find out even in principle”.
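To make that concrete, here is a minimal sketch (plain numpy, with random weights standing in for a trained model) of the sense in which a neural net is inspectable in principle: every parameter is readable, and the output’s exact sensitivity to each input is computable. What is missing is not access but interpretation.

```python
# Minimal sketch: a tiny neural net is not a black box in the "impossible to
# find out even in principle" sense. Every parameter is visible, and we can
# compute exactly how the output depends on each input. Random weights here
# stand in for a trained model.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden layer: 3 inputs -> 4 units
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # output layer: 4 units -> 1 output

def forward(x):
    h = np.tanh(W1 @ x + b1)
    return W2 @ h + b2, h

x = np.array([0.2, -1.0, 0.5])
y, h = forward(x)

# Full transparency at the level of numbers:
print("every weight is readable:\n", W1, "\n", W2)

# Exact sensitivity of the output to each input (the chain rule by hand):
#   dy/dx = W2 @ diag(1 - h^2) @ W1
dy_dx = (W2 * (1 - h**2)) @ W1
print("output:", y, "  d(output)/d(input):", dy_dx)

# Nothing here is hidden in principle; the hard part is that no hidden unit
# comes labeled with a human-meaningful concept.
```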
Continue Reading

Two and a half years ago, Eliezer was (somewhat plausibly) complaining that virtually nobody outside of MIRI was working on AI-related existential risks.

This year (at EAGlobal) one of MIRI’s talks was a bit hard to distinguish from an AI safety talk given by someone with pretty mainstream AI affiliations.

What happened in that time to cause that shift?

A large change was catalyzed by the publication of Superintelligence. I’ve been mildly disappointed about how little it affected discussions among people who were already interested in the topic. But Superintelligence caused a large change in how many people are willing to express concern over AI risks. That’s presumably because Superintelligence looks sufficiently academic and neutral to make many people comfortable about citing it, whereas similar arguments by Eliezer/MIRI didn’t look sufficiently prestigious within academia.

A smaller part of the change was MIRI shifting its focus somewhat to be more in line with how mainstream machine learning (ML) researchers expect AI to reach human levels.

Also, OpenAI has been quietly shifting in a more MIRI-like direction (I’m very unclear on how big a change this is). (Paul Christiano seems to deserve some credit for both the MIRI and OpenAI shifts in strategies.)

Given those changes, it seems like MIRI ought to be able to attract more donations than before. Especially since it has demonstrated evidence of increasing competence, and also because HPMoR seemed to draw significantly more people into the community of people who are interested in MIRI.

MIRI has gotten one big grant from the Open Philanthropy Project that it probably couldn’t have gotten when mainstream AI researchers were treating MIRI’s concerns as too far-fetched to be worth commenting on. But donations from MIRI’s usual sources have stagnated.

That pattern suggests that MIRI was previously benefiting from a polarization effect, where the perception of two distinct “tribes” (those who care about AI risks versus those who promote AI) energized people to care about “their tribe”.

Whereas now there’s no clear dividing line between MIRI and mainstream researchers. Also, there’s lots of money going into other organizations that plan to do something about AI safety. (Most of those haven’t yet articulated enough of a strategy to make me optimistic that that money is well spent. I still endorse the ideas I mentioned last year in How much Diversity of AGI-Risk Organizations is Optimal?. I’m unclear on how much diversity of approaches we’re getting from the recent proliferation of AI safety organizations.)

That kind of donation pattern creates perverse incentives for charities to at least market themselves as fighting a powerful group of people, rather than (as an ideal charity should) addressing a neglected problem. Even if that marketing doesn’t distort a charity’s operations, the charity will be tempted to use counterproductive alarmism. AI risk organizations have resisted those temptations (at least recently), but it seems risky to tempt them.

That’s part of why I recently made a modest donation to MIRI, in spite of the uncertainty over the value of their efforts (I had last donated to them in 2009).