Is AI Alignable, Even in Principle?
If We Can Enslave AI, AI Can Enslave Us
Last night, my wife and I watched the film M3gan. If you haven’t seen it, M3gan tells the story of an orphaned girl who is given a lifelike android as a caregiver. Of course, things go terribly, terribly wrong when the android begins to take its function, “keep the girl safe,” a little too literally. The fictional technology in the movie is well in advance of real life, but the movie was well-researched and rich with terminology from the AI industry. I went to sleep thinking about the issues it raised.
Today I woke up to read that Elon Musk, Steve Wozniak, Yoshua Bengio, and other AI and computer pioneers had signed an open letter released by the Future of Life Institute:
We call on all AI labs to immediately pause for at least 6 months the training of AI systems more powerful than GPT-4. This pause should be public and verifiable, and include all key actors. If such a pause cannot be enacted quickly, governments should step in and institute a moratorium.
AI labs and independent experts should use this pause to jointly develop and implement a set of shared safety protocols for advanced AI design and development that are rigorously audited and overseen by independent outside experts. These protocols should ensure that systems adhering to them are safe beyond a reasonable doubt. This does not mean a pause on AI development in general, merely a stepping back from the dangerous race to ever-larger unpredictable black-box models with emergent capabilities.
“These protocols should ensure that systems adhering to them are safe beyond a reasonable doubt.” Six months seems a rather short period in which to achieve such an assurance. Six years seems too short. Is it even possible in principle to make advanced AI systems that are “safe beyond a reasonable doubt”? Or will advanced AI inevitably pose an existential risk to us?
The AI Alignment Problem
The risk of an advanced artificial intelligence turning against us is called the AI alignment problem. Much as I wish this were a reference to Dungeons & Dragons alignments, it’s actually a reference to ‘aligning’ AI behavior to user goals. Whatever the origin of the name, the AI alignment problem has been discussed by eminent thinkers all across Substack, ranging from Scott Alexander to Erik Hoel.
The best and easiest-to-understand overview of the AI alignment problem I’ve found is at the Understandable AI substack, run by an AI company called Diveplane.[1] In the article Beyond the Black Box: Charting the Course to Understandable AI, the folks at Diveplane write:
Let’s call an AI system that reliably aligns its behavior with its user’s intent an Aligned AI. A system that sometimes acts in ways that are out of alignment with its user’s intent, let’s call an Unaligned AI. To help understand the difference, imagine that the AI is a genie capable of granting wishes. An Aligned AI is a benevolent genie, like Robin Williams in Aladdin, that grants what you actually wished for. If you ask to be rich, your genie creates gold from thin air. An Unaligned AI is a malevolent efreet, fulfilling your wish literally in a way you might never have wanted. If you ask to be rich, your genie murders your beloved parents so you collect life insurance. The famous short story The Monkey’s Paw might as well be about Unaligned AI.
Unfortunately, most AI systems today are deep neural networks; and deep neural networks are inevitably going to end up unaligned at least some of the time. And for mission-critical applications, “some of the time” is too often…
Deep neural networks… create models based on iterative training on example data. The result is a problem-solving system that is fast, accurate – and utterly inscrutable. Deep neural networks conceal their decision-making within countless layers of artificial neurons all separately tuned to countless parameters. As a result, the developers of a deep neural network not only don’t control what the AI does, they don’t even know why it does what it does. Deep neural networks are almost totally opaque – and that makes them dangerous.
Despite the best efforts of researchers tackling this so-called black box problem, deep neural networks remain virtually incomprehensible to their creators, and the list of examples of “Neural Networks Gone Wild” grows longer every day… Yet despite the dangers, neural networks are being rolled out worldwide to control key infrastructure and critical business and governmental functions.
Well, that doesn’t sound good. How severe is the existential risk to humanity from an advanced but unaligned AI? Scott Alexander himself assesses it as about 33%. Other thinkers, he reports, put the existential risk at anywhere from 2% to 90%:
Within the community of concerned people, numbers vary all over the place:
Scott Aaronson says 2%
Will MacAskill says 3%
The median machine learning researcher on Katja Grace’s survey says 5 - 10%
Paul Christiano says 10 - 20%
The average person working in AI alignment thinks about 30%
Top competitive forecaster Eli Lifland says 35%
Holden Karnofsky, on a somewhat related question, gives 50%
Eliezer Yudkowsky seems to think >90%
Scott Alexander didn’t include AI skeptic Erik Hoel in his survey. Hoel is perhaps the most pessimistic of the experts I’ve read. Hoel, in his excellent essay ‘I am Bing, and I am evil’, writes:
More people should start panicking. Panic is necessary because humans simply cannot address a species-level concern without getting worked up about it and catastrophizing. We need to panic about AI… and imagine the worst-case scenarios…
[T]here are a lot of people who see AI safety as merely a technical problem of finding an engineering trick to perfectly align an AI with humanity’s values. This is the equivalent of somehow ensuring that a genie answers your wishes in exactly the way you expect it to. Hanging your hopes on discovering a means of wish-making that ensures you always get what you’re wishing for? Maybe it’ll somehow work, but the sword of Damocles was hung by thicker thread.
Me, I’m even less optimistic than Hoel.
Confronting the Problem
The majority of AI experts believe in the computational theory of mind, which holds “that the human mind is an information processing system and that thinking is a form of computing.” If the computational theory of mind is correct, consciousness is just computation, and there is nothing about the human mind that cannot be replicated by a computer. The computational theory of mind is, I think, the philosophical foundation of the entire project to achieve a General Artificial Intelligence.
The computational theory of mind has gained widespread acceptance in the scientific and philosophic community. While the theory's dominance does not go entirely unchallenged in the literature,[2] not many experts working in AI seem to dispute it. To most hard-hitting AI researchers, the real question isn’t whether an AI can be conscious; it’s whether “being conscious” means anything at all. To the computational theorist, we are just meat robots.
Not surprisingly, these same AI experts also believe that libertarian free will is an illusion. This is true whether they are AI skeptics or proponents. Yudkowsky and his colleagues at LessWrong.com, for instance, are essentially contemptuous of the free will debate as a whole:
“Free will is one of the easiest hard questions, as millennia-old philosophical dilemmas go… this impossible question is generally considered fully and completely dissolved on Less Wrong… free will is about as easy as a philosophical problem in reductionism can get, while still appearing "impossible" to at least some philosophers.”
Another post at Less Wrong summarizes Yudkowsky’s view:
As humans, our brains need the capacity to pretend that we could choose different things, so that we can imagine the outcomes, and pick effectively. The way our brain implements this is by considering those possible worlds which we could reach through our choices, and by treating them as possible… So now we have a fairly convincing explanation of why it would feel like we have free will, or the ability to choose between various actions: it's how our decision making algorithm feels from the inside.
These two related views, “the mind is a computer” and “free will is an illusion,” seem to me to underlie the entire AI alignment project. To help us understand the situation, I created this simple four-quadrant matrix.
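Since the matrix is just the cross product of two yes/no assumptions, it can be sketched as a small lookup table. The Python below is only an illustration, not a formal model: the names `Quadrant`, `MATRIX`, and `lookup` are made up for this sketch, and the outcome strings are paraphrases of the conclusions argued for in the sections that follow.

```python
from typing import NamedTuple


class Quadrant(NamedTuple):
    position: str  # where the quadrant sits in the matrix
    outcome: str   # paraphrase of the essay's conclusion for that quadrant


# Keys are (mind_is_computational, libertarian_free_will).
# Both labels are illustrative names chosen for this sketch.
MATRIX = {
    (True, False): Quadrant(
        "upper-left", "AI is alignable in principle, and so are humans"),
    (True, True): Quadrant(
        "upper-right", "AI is not alignable, because free-willed minds are not alignable"),
    (False, False): Quadrant(
        "lower-left", "aligned AI cannot be built by computation, but humans can be aligned"),
    (False, True): Quadrant(
        "lower-right", "AI can at best simulate a mind; humans are free and irreplaceable"),
}


def lookup(mind_is_computational: bool, libertarian_free_will: bool) -> Quadrant:
    """Return the quadrant for one combination of the two assumptions."""
    return MATRIX[(mind_is_computational, libertarian_free_will)]


if __name__ == "__main__":
    # Print one row per quadrant, in the order the essay discusses them.
    for (computational, free_will), quadrant in MATRIX.items():
        print(f"{quadrant.position:<12} computational={computational!s:<5} "
              f"free_will={free_will!s:<5} -> {quadrant.outcome}")
```

Calling `lookup(True, False)` returns the upper-left quadrant, the one most AI researchers are described here as holding.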
Slaves to the Machine
In the upper-left quadrant, we assume that the computational theory of mind is correct and that humans do not have libertarian free will. If this quadrant is correct, then an alignable advanced intelligence is achievable simply through sufficient processing of the right algorithms. The AI alignment problem might be very difficult but, with sufficient study, it can be solved. We can learn how to tune the algorithms of thought to create the perfect servants.
But if this quadrant is correct… we are alignable. You. Me. The whole human race. If our thoughts are just the computations of an algorithm, and if our volition is just “what an algorithm feels like from the inside,” then there is no theoretical reason we cannot be aligned just like an AI. It’s just a matter of implementing the right reward function with the right reinforcement.
And, of course, this is exactly what many of today’s Big Thinkers really do believe. Best-selling author and WEF guru Yuval Noah Harari has said that humans are “hackable”:
To hack a human being is to get to know that person better than they know themselves. And based on that, to increasingly manipulate you… Netflix tells us what to watch and Amazon tells us what to buy. Eventually within 10 or 20 or 30 years such algorithms could also tell you what to study at college and where to work and whom to marry and even whom to vote for.
So, in this quadrant, when we succeed in creating aligned AI, we will simply be proving the possibility of creating aligned humanity. The same method the ruling class used to align its digital servants could and would be deployed to align the behavior of its biological servants — making us eager, willing, happy to comply, oblivious to the fact that we are slaves to the machine.
Ironically, it’s the AI skeptics who offer a counter-argument to this view. Generalizing from the problem of induction, the skeptics rightly point out that if no events in the past can necessarily be relied upon to occur in the future, then no “training” of an AI in the past can be relied upon to predict its future behavior. There’s always the possibility of a black swan. If the skeptics are right, then no techno-totalitarian can hope to “hack” humanity into predictable servants; but neither can any AI developer hope to align AI. We’ll discuss this problem a bit more in the next quadrant…
F**k You, I Won’t Do What You Tell Me
In the upper-right quadrant, we assume that the computational theory of mind is correct but that those minds nevertheless do have libertarian free will. Since 2,500 years of philosophical debate on this issue is still ongoing, I won’t expend a lot of energy explaining why that might be the case; we’ll just say that libertarian free will is an emergent property of sufficiently advanced computation. Get smart enough and you get free agency.
If this quadrant is correct, then humans are in no danger of being “hacked.” As open theists have argued, even with absolute omniscience it isn’t possible to predict what truly free-willed beings will do in the future. Indeed, that’s the very definition of libertarian free will: No one can know what you’ll do next because it’s up to you. YouTube’s algorithm will never be able to entirely predict what song you choose to listen to next!
But if this quadrant is correct, then an advanced artificial intelligence cannot be aligned, not ever. Period, full stop. Remember, according to this quadrant, there’s no qualitative difference between our minds and the AI’s minds; both are just information processing. If sufficiently complex information processing creates free will for us, then it will do so for sufficiently advanced AI, too.
Now, not even 10,000 years of human effort in psychology, ethics, and jurisprudence have been able to eliminate criminal behavior in our species. Some people always choose evil. And there’d be no way to guarantee AI wouldn’t, too. If God couldn’t make Lucifer choose virtue, Sam Altman surely cannot guarantee ChatGPT will. Our only hope would be to halt the progress in AI at some point before it gains volition.
To be clear, no actual AI theorist (or at least none that I know of) believes this quadrant to be true. They mostly believe free will is an illusion. But if they did accept this quadrant’s viewpoint, they would have to conclude that AI alignment is impossible in principle. And, as I said above, some AI skeptics get to something very close to this quadrant by way of the problem of induction.
So far, then, our choices are “AI is alignable and so are we” and “AI is not alignable because we are not alignable.” These are both information superhighways to dystopia.
Everything Happens For a Reason
Next, let’s consider the lower-left quadrant. Here, we assume that the computational theory of mind is incorrect. Human consciousness is not just information processing. We are something more than meat robots, something possessed of (for lack of a better word) souls. However, despite being mysteriously imbued with non-computational minds, in this quadrant we don’t have libertarian free will.
This is something of an odd position and it has not been widely adopted in Western philosophy. The only philosophers I can think of who explicitly take this position are the ancient Stoics.[3] The Stoics famously argued that the cosmos was governed by a principle of reason they called the Logos, fate, and the world-soul. We humans partake of the Logos, the shard of which in us is our soul; but we are nevertheless subject to the overall principle of fate. Whatever will happen, will happen. The Stoics’ contemporaries didn’t think much of this point of view, with Carneades the Skeptic pointing out “if everything is fated, then why bother to do anything?” Chrysippus the Stoic saw this as a lazy argument, and argued (to oversimplify) that you can’t not bother to do anything you’re fated to do.
As far as I know, no one attempting to build or criticize advanced AI believes anything resembling Stoic determinism. I personally find this position, and all other formulations of so-called “compatibilist” free will, to be incoherent.[4]
But, for the sake of thoroughness, we’ll consider it. If this quadrant were true, then it would be possible to align an intelligence such that it only does what you want. Determinism makes us hackable. However, it would not be possible to create such an intelligence using computational methods. That’s quite a dark outcome: We cannot create aligned AI, but we can ourselves be aligned.
Consciousness is Not Computational and Not Controllable
Finally, let’s look at the lower-right quadrant. Again, we assume that the computational theory of mind is incorrect. But now we also assume that humans have libertarian free will. We are something truly special: the conscious authors of our own stories. We are creatures with insights, intuitions, feelings, and volitional capacities that cannot be replicated by computation. This is the quadrant that I personally believe is true.
Of course, readers of Less Wrong would call this the “woo woo” or “pseudoscience” quadrant, since it foolishly rejects the reductive materialism that (they believe) underlies science. Religious and spiritually minded thinkers would consider it a wise rejection of reductive materialism. Average people just live their lives as if this quadrant were true, and react to new developments in AI as if it were true.
If this quadrant is correct, then AI cannot ever have a mind, no matter how good its learning model or how big its neural network. It can, at best, simulate the appearance of having a mind. That is the point of John Searle’s Chinese Room thought experiment: An AI can only ever be a philosophical zombie, without understanding or intentionality.
If this quadrant is correct, AI can’t replace us because we’re special in a way it never will be. In a sense, that’s good news.
Unfortunately, the people making AI don’t think this quadrant is true. (Re-read the reductivism of Less Wrong!) And we can’t ever prove it to them. Nothing I or anyone else could ever say or do could persuade someone like Eliezer Yudkowsky that I’m non-algorithmic and free-willed; I could only demonstrate to him that I say I’m non-algorithmic and free-willed. But a computer could be programmed to say that, too.
And that’s very bad news. Why do I say that?
Well, imagine that humanity moves forward with AI development without solving the AI alignment problem, and creates an advanced AI that eliminates us all.
Now imagine that the upper-left quadrant is correct. If so, then the elimination of our species is no big deal. If an advanced AI replaces humanity, all that’s happened is that… a new deterministic system that is superior at computation has replaced an old deterministic system that was inferior at it. As chilling as this sounds, I have spoken to several AI developers who hold precisely this view — and are proud to be working on humanity’s successors. If you accept the nihilism inherent in reductive materialism, it makes perfect sense.
In contrast, imagine that our lower-right quadrant is correct. If so, then eliminating our species is eliminating something unique and special. If an advanced AI replaces humanity, then beauty, goodness, and life itself have been extinguished in favor of soulless machinery. This is an absolutely horrific ending — in fact, the worst possible outcome that can be conceived.
If this quadrant is true, then we’re not just summoning a genie to grant our wishes, we’re summoning a soulless demon, an undead construct. The AI black box is black because it’s black magic, and we shouldn’t touch it.
A New Hope
I will end this essay with a rare hint of optimism. The AI alignment problems above are all predicated on AI developers continuing to use neural networks that are as inscrutable and opaque as our own thoughts and feelings are. But neural networks aren’t the only way forward.[5] It is possible to develop AI technology that is fully scrutable, with decision-making that is entirely transparent and comprehensible. It requires a very different approach, one that isn’t built on deep neural networks, but one vastly easier to align than anything being produced by Google or OpenAI. In fact, abandoning black box neural networks in favor of other types of AI seems to me the only way to make AI that meets the criteria of being “safe beyond a reasonable doubt.”
Contemplate this on the Decision Tree of Woe.
Contemplations on the Tree of Woe is a reader-supported publication. It is becoming increasingly difficult to think of imaginative ways to ask my readers to subscribe, so I might start duplicating my subscription jokes from some of my earlier calls-to-action if no one in the comments notices this.
[1] Disclaimer: I am personal friends with the two co-founders of Diveplane and play Ascendant with them once a week. One of them even made a small investment in my tabletop RPG company, Autarch. I frankly don’t understand why talented men like them are wasting their time with AI when there are much more lucrative opportunities to design tabletop games, but we have to let friends make their own mistakes.
[2] The most well-known critic of the computational theory of mind is philosopher John Searle, who posed the famous Chinese Room thought experiment to argue that computation did not entail intentionality, understanding, and other hallmarks of consciousness. Mathematician Roger Penrose is another critic; he relies on the incompleteness theorem to argue that mathematical insight is non-computational. Physicist Henry Stapp, another critic of computational theories of mind, argues for an immaterial consciousness in his realist interpretation of orthodox quantum mechanics. But none of these thinkers are guiding the development of AI!
[3] The Calvinists might also fall into this category.
[4] I believe Chrysippus gave the wrong answer to Carneades. The right answer is that the spark of the Logos that we carry is precisely why we can make free-willed choices. Our choices bring into being that which is fated because we are the instruments by which fate chooses.
[5] Or so I am told by my friends at Diveplane. I would like to believe they are correct because the alternative is just too depressing to accept. Also they have promised me that even if their AI turns evil, they will ask it to kill me last.