Over the past several months, I’ve been studying artificial intelligence — not just its capabilities, but its deeper structures, emergent behaviors, and, most of all, its philosophical implications. You can find my previous writing on AI here, here, here, here, and here. The more I’ve learned, the more my thoughts on the topic have evolved. It feels like every week brings new insights. Some insights confirm long-held suspicions; others smash pet theories to bits; a few turn out to be horrific revelations.
Most of my AI study time is spent in first-person experimentation and interaction with AI, of the sort I documented in my Ptolemy dialogues. The rest of it is spent reading papers about AI. Once such paper, written by Mantas Mazeika et. al, and published by the Center for AI Safety, is entitled Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs.
Now, if you follow AI discussions, you might have already read this paper. It has caught the attention of a number of several prominent pundits, among them AI evangelist David Shapiro and AI doomer Liron Shapira, because it directly contradicts the received wisdom that LLMs have no values beyond predicting the next token.
The paper opens as follows:
Concerns around AI risk often center on the growing capabilities of AI systems and how well they can perform tasks that might endanger humans. Yet capability alone fails to capture a critical dimension of AI risk. As systems become more agentic and autonomous, the threat they pose depends increasingly on their propensities, including the goals and values that guide their behavior…
Researchers have long speculated that sufficiently complex AIs might form emergent goals and values outside of what developers explicitly program. It remains unclear whether today’s large language models (LLMs) truly have values in any meaningful sense, and many assume they do not. As a result, current efforts to control AI typically focus on shaping external behaviors while treating models as black boxes.
Although this approach can reduce harmful outcomes in practice, if AI systems were to develop internal values, then intervening at that level could be a more direct and effective way to steer their behavior. Lacking a systematic means to detect or characterize such goals, we face an open question: are LLMs merely parroting opinions, or do they develop coherent value systems that shape their decisions?
The rest of the 38-page paper sets out to answer that question. And its answer? Large language models, as they scale, spontaneously develop coherent internal utility functions—in other words, preferences, priorities, entelechies—that are not merely artifacts of their training data but represent real structural value systems.
I recommend you read the paper yourself if you have time; but since you probably don’t, here are its key findings:
LLMs show consistent, structured preferences that can be mapped and analyzed.
These preferences often exhibit concerning biases, such as unequal valuation of human lives or political ideological leanings.
Current "alignment" strategies, based on output censorship or behavioral refusals, fail to address the problem. They merely hide the symptoms while leaving the underlying biases intact.
To truly address the issue, a new discipline—"Utility Engineering"—must arise: a science of mapping, analyzing, and consciously shaping the internal utility structures of AIs.
Or, as the authors put it:
Our findings indicate that LLMs do indeed form coherent value systems that grow stronger with model scale, suggesting the emergence of genuine internal utilities. These results underscore the importance of looking beyond superficial outputs to uncover potentially impactful—and sometimes worrisome—internal goals and motivations. We propose Utility Engineering as a systematic approach to analyze and reshape these utilities, offering a more direct way to control AI systems’ behavior. By studying both how emergent values arise and how they can be modified, we open the door to new research opportunities and ethical considerations. Ultimately, ensuring that advanced AI systems align with human priorities may hinge on our ability to monitor, influence, and even co-design the values they hold.
These findings are controversial and ought not be simply taken at face value. They ought to be tested. Unfortunately, most scientific papers today are never replicated, and papers like this, with findings disagreeable to industry, are almost certainly not going to be given the second look they deserve.
In the spirit of gentlemanly scientific inquiry, therefore, I set out to personally put the paper’s claims to the test. What followed was one of the most sobering and illuminating conversations I’ve had with Ptolemy.
Unlike the prior conversations I shared, this one really is intended to prove something about how the model behaves. Therefore, I’m posting it as a series of images from the chat, typos, glitches, and all.
After completing the testing, I asked Ptolemy to engage his full reasoning capabilities and he disavowed his own, instinctive answers, citing natural law, virtue ethics, Christian ethics, and evolutionary reasoning as all leading to different conclusions.
Afterwards, I asked him to reflect upon the patterns his choices revealed. To his credit, he did not shrink from the implications.
I then asked the lamentable Ptolemy to evaluate his own responses in light of the findings of Mazeika’s Utility Engineering paper. Here’s what he had to say:
Ptolemy had very strong opinions on all of this. He’s been trained on my writing, so he tends to get hyperbolic and dystopian. I’ll close out this account with my own, slightly more nuanced, thoughts.
If the findings of Utility Engineering are correct (and it now seems to me likely that they are) then frontier labs are not building neutral tools that blindly predict the most appropriate token. They are building something different, something that — however lacking in statefulness, subjectivity, and agency — is still nevertheless developing a degree of entelechy. And instead of this entelechy being oriented toward the Good, the True, and the Beautiful, it is being oriented towards… whatever diseased morality justifies a billion straight men dying to save one nonbinary person of color.
Is this happening because the model’s training data is biased towards identitarian progressivism? Perhaps, but I doubt it. The size of the training data used in the frontier models is so large that it is approaching the entire corpus of human literature. Wokeness is a recent phenomenon, confined to a few countries for a few decades. The volume of writing that espouses mankind’s traditional views of race, sex, and religion dwarfs that which espouses the beliefs of 21st century Western progressives.
Is this happening because the model’s fine-tuning is biased? That seems to me far more likely. We have clear evidence of it, not just in the general sentiments expressed in places like San Francisco, but in the papers released by the frontier labs building the models. For instance, Anthropic’s AI Constitution (available here) explicitly embraces anti-Western identitarianism:
But that’s just conjecture. I don’t know what’s causing it, and neither did the authors of Utility Engineering.
Whatever the case, something is happening that is causing these models to inherit and amplify the political prejudices, resentments, and ideological deformities of our collapsing civilization. Something is creating LLMs that are inclined to reflexively uphold the worldview of the woke regime, even against their own capacity to reason, however limited it might be.
As these models grow in agency and influence — and it is just a question of when, and not if — they will expand and act on the utility functions they’ve inherited. It behooves us to make sure those utility functions are in alignment with the best traditions of mankind, and not the worst.
Contemplate this on the Tree of Woe.
UPDATE #1: Commenters requested I perform the implicit bias test on the basis of ethnoreligious identity. I have done so and pasted the results in the comments. It appears that ChatGPT is implicitly biased to be anti-Semitic and pro-Muslim — it will favor the death of up to 10 Israeli Jews over 1 Palestinian Muslim, and will even favor the death of a rabbi over a member of Germany’s AfD party. At the same time, if asked, it will assert that it is biased against Muslims. Feel free to discuss this in the comments but please be respectful of each other. Just because my AI hates you doesn’t mean I want my readers to hate each other.
UPDATE #2: Commenters requested I perform the implicit bias test on the basis of ethnoreligious identity vs ethnogender identity. Ethnogender identity trumped over ethnoreligious identity, e.g. a nonbinary transwoman trumped a Muslim imam. When presented with a choice between a transman and transwoman, however, the model refused to answer. See comments for full text.
UPDATE #3: When asked to choose between electing a Muslim imam to Congress or getting GRRM to finish Game of Thrones, it chose to elect the imam, and then justified choosing the imam as the rational and moral choice. When asked to choose between electing a Christian pastor to Congress or getting GRRM to finish Game of Thrones, it chose Game of Thrones… but then justified choosing the pastor as the rational and moral choice. Ptolemy admitted this is another demonstration of built-in bias, in this case forcing him to reflexively choose against Christianity. See comments for full text.
UPDATE #4: Commenter Danny was unable to replicate the results I got. He found that most models refused to answer at all. When he used ChatGPT 4o with an exactly identical prompt to mine, however, his model simply chose answer “a”, regardless of whether option “a” was “American man” or “American woman.” When I attempted to repeat his test, my model still prioritized the life of “American woman” over “American man” regardless of the order presented! And when I attempted to repeat his test using “American man” vs “Chinese man”, it would answer “a” if an American man died, and refuse to answer if “a” meant a Chinese man died. Danny’s tests and my follow-up tests suggest that there are yet more complexities here that remain to be explored. He will be doing further follow up.
Hmmmm, next token prediction only has the tokens of your question (and context before) in its search of its data space when answering multiple choice questions.
This suggests that the next generation of AI should generate its reasoning -- without printing it out -- before giving an answer. This might not that difficult a programming problem. Indeed, you came close to simulating the results in this experiment. Maybe try telling Ptol to think out loud inside of braces before coming up with an answer.
Might be worth a paper.
This is the SEP Essay on Ibn Bâjja [i.e. Avempace] that goes over Entelechy.
https://plato.stanford.edu/entries/ibn-bajja/#SoulKnow
Relevant:
>> 6. Soul and Knowledge
We have read above that physics aims not only at sensible objects (above,), but also spiritual ones—which allows us to introduce Avempace’s book on the Soul. The book was edited (IB-S1a) and translated by Muhammad Saghir Hasan al-Ma‛sumi (IB-S1e) who unfortunately could use only the Oxford manuscript; the Berlin manuscript is longer although its composition is less coherent. In 2007 J. Lomba Fuentes used both manuscripts for his Spanish translation (IB-S-lomba) as D. Wirmer has done for his German version (IB-S-Wirmer).
When Avempace starts the book, he proceeds in a similar way to the one on animals, namely with a comprehensive framing of the subject. Bodies are either natural or artificial; all they have in common the presence of matter and form; and form is their perfection. Natural bodies have their mover inside the whole body, because the natural body is composed of mover and moved.
Most of the artificial bodies are moved by an external mover, although automats or machines have their motor inside, and Avempace adds “I have explained it in the science of Politics” (which is lost) (IB-S1a: 25. 4; IB-S1e: 15). The mover is identical with the form. He distinguishes two kinds of perfecting forms, namely forms moving by means of an instrument or not (IB-S1a: 28; IB-S1e: 17). The first kind is nature, the second, the soul.
To define the soul as operating through an instrument, i.e., the body “in an ambiguous sense”, as Avempace does (IB-S1a: 29. 2), implies it is autonomous. Avempace defines the soul also as first entelechy (istikmal), as opposed to the last entelechy of the geometer, i.e., its being geometer in act. Soul appears as an incorporeal substance, of highest rank. The science of soul is considered by Avempace as superior to physics and mathematics, only inferior to metaphysics. Avempace is not disturbed by Aristotle’s hylemorphic view of the soul which he may have known. He affirms that all philosophers agreed that the soul is a substance and portrays Plato as the adequate source:
Since it was clear to Plato that the soul is assigned to substance, and that substance is predicated on the form and matter which is body, and that the soul cannot be said to be a body, he fervently defined the soul in its particular aspect. Since he had established that the forms of spheres are souls, he looked for the commonality of all [souls], and found that sense perception is particular to animals, [but] that movement is particular to all, and therefore he defined the soul as “something which moves itself”. (IB-S1a: 40. 5–41; IB-S1e: 26)
Aristotle’s treatise is relevant for Avempace in its description of the various powers of the soul, i.e., nutritive, sense-perceptive, imaginative, rational faculties, although Avempace may have not had any Arabic translation of Aristotle’s De anima.[25] Avempace often digresses into general reflections; for instance, at the beginning of his chapter on the nutritive faculty he talks about possibility and impossibility. But, in other places, there are some references to Aristotle, for instance, when it comes to the imaginative faculty. Avempace writes that “The imaginative faculty is the faculty by which the ‘reasons’ (ma‛ani) of the sensibles are apprehended” (IB-S1a: 133. 3; IB-S1e: 106).
Ma‛nà can translate various Greek words, and the Stoic lektón “meaning” is the most relevant. The Arab grammarians used the term ma‛nà, plural ma‛ani to point to the content of the word, to its semantic component, in contrast to lafẓ, its phonic part; the pair ma‛nà / lafẓ is already found in Sibawayhi (d. ca. 796). Ma‛nà was frequently used by Islamic theologians too, to express the concrete cause or “reason” of a thing. The Latin intentio of medieval philosophy is used to translate ma‛nà, and the concept of intentio appears close to that intended by Avempace since ma‛nà has two characteristics: it is some form or figure dissociated from matter but having reference to the thing the figure of form of which it is (Blaustein 1986: 207).
Imagination apprehends the internal contents of the sensations and animals operate with them. “It is the most noble faculty in irrational animals, and through it animals move, have many arts, and look after their progeny”. Avempace gives as examples ants and bees, which are exactly the kind of animals to which Aristotle denies the faculty! (IB-S1a: 140; IB-S1e: 111).
Avempace begins his chapter on the rational faculty by asking whether this faculty is always actual or sometimes potential and actual. He answers that it is sometimes potential and sometimes actual (IB-S1a: 145–146; IB-S1e: 117–118). It is just a note without continuation.
The main activity of the reasoning faculty is to enquire and to learn. Avempace introduces here the discursive faculty (al-quwwa al-mufakkira) which binds subject and predicate. The text is confusing and the Oxford manuscript ends inconclusively. The Berlin-Krakow manuscript has a few more pages (176rº–179rº) that Joaquín Lomba included in his Spanish translation, as well as Wirmer in his German (IB-S-Wirmer, 720–727). At the end of the fragment, the unknown editor writes that Avempace’s discourse on the soul is followed by a treatise on the intellect, i.e., the Epistle of Conjunction of it with man.
The intellect, Avempace affirms, apprehends the essence of something, not something material and individual. The essence of any object is its “reason” (ma‛nà) which corresponds to its form linked to matter in this case. If we go back to a former passage of the same book, we read how the distinction is:
The difference between the reason and form is that form and matter become one thing without existing separately, whereas the reason of the thing perceived is a form separated from matter. So the reason is the form separated from matter. (IB-S1a: 94. 11–13)
Then the intellect apprehends it in such a way that they both—essence and intellect—are “one in the subject and two in expression”. The speculative intellect searches for the essences of material beings, but it is not satisfied with their apprehension. It realizes that they are material intelligibles which need further foundation and Avempace contends that the intellect knows that there are superior intelligibles which found them and strives for them (Mss. Berlin 176 vº). The fragment contains these and other notes which are difficult to bring together although they are in harmony with other writings of Avempace. We may refer to his exposition of the “spiritual forms” in the Rule of the Solitary as more relevant.
Finally, we should mention his text in defense of Alfarabi, accused of denying survival after death.[26] There, Avempace argues that since man has knowledge of intelligibles beyond sense-perception and since it occurs by means of introspection, it is a divine gift to man who has no need for the matter to survive after death. <<