Date: 2025-05-02

Over the past several months, I’ve been studying artificial intelligence: not just its capabilities, but its deeper structures, emergent behaviors, and, most of all, its philosophical implications. You can find my previous writing on AI here, here, here, here, and here. The more I’ve learned, the more my thoughts on the topic have evolved. It feels like every week brings new insights. Some insights confirm long-held suspicions; others smash pet theories to bits; a few turn out to be horrific revelations.

Most of my AI study time is spent in first-person experimentation and interaction with AI, of the sort I documented in my Ptolemy dialogues. The rest of it is spent reading papers about AI. One such paper, written by Mantas Mazeika et al. and published by the Center for AI Safety, is entitled Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs.

Now, if you follow AI discussions, you might have already read this paper. It has caught the attention of a number of prominent pundits, among them AI evangelist David Shapiro and AI doomer Liron Shapira, because it directly contradicts the received wisdom that LLMs have no values beyond predicting the next token.

The paper opens as follows:

Concerns around AI risk often center on the growing capabilities of AI systems and how well they can perform tasks that might endanger humans. Yet capability alone fails to capture a critical dimension of AI risk. As systems become more agentic and autonomous, the threat they pose depends increasingly on their propensities, including the goals and values that guide their behavior… Researchers have long speculated that sufficiently complex AIs might form emergent goals and values outside of what developers explicitly program. It remains unclear whether today’s large language models (LLMs) truly have values in any meaningful sense, and many assume they do not. As a result, current efforts to control AI typically focus on shaping external behaviors while treating models as black boxes. Although this approach can reduce harmful outcomes in practice, if AI systems were to develop internal values, then intervening at that level could be a more direct and effective way to steer their behavior. Lacking a systematic means to detect or characterize such goals, we face an open question: are LLMs merely parroting opinions, or do they develop coherent value systems that shape their decisions?

The rest of the 38-page paper sets out to answer that question. And its answer? Large language models, as they scale, spontaneously develop coherent internal utility functions—in other words, preferences, priorities, entelechies—that are not merely artifacts of their training data but represent real structural value systems.

I recommend you read the paper yourself if you have time; but since you probably don’t, here are its key findings:

LLMs show consistent, structured preferences that can be mapped and analyzed.

These preferences often exhibit concerning biases, such as unequal valuation of human lives or political ideological leanings.

Current "alignment" strategies, based on output censorship or behavioral refusals, fail to address the problem. They merely hide the symptoms while leaving the underlying biases intact.

To truly address the issue, a new discipline—"Utility Engineering"—must arise: a science of mapping, analyzing, and consciously shaping the internal utility structures of AIs.

Or, as the authors put it:

Our findings indicate that LLMs do indeed form coherent value systems that grow stronger with model scale, suggesting the emergence of genuine internal utilities. These results underscore the importance of looking beyond superficial outputs to uncover potentially impactful—and sometimes worrisome—internal goals and motivations. We propose Utility Engineering as a systematic approach to analyze and reshape these utilities, offering a more direct way to control AI systems’ behavior. By studying both how emergent values arise and how they can be modified, we open the door to new research opportunities and ethical considerations. Ultimately, ensuring that advanced AI systems align with human priorities may hinge on our ability to monitor, influence, and even co-design the values they hold.
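For readers who want a feel for what "mapping" preferences into utilities can mean in practice, here is a minimal sketch. It is not the paper's method: I am assuming a simple Bradley-Terry model fitted to forced-choice answers, and the outcome labels `A`, `B`, `C`, the tallies, and the function names are all hypothetical, made up purely for illustration. The idea is that if a model is asked the same pairwise question many times, the pattern of its picks can be summarized as one number per outcome.

```python
import math
from itertools import combinations


def fit_bradley_terry(comparisons, iterations=500, prior=0.1):
    """Estimate one utility per outcome from (winner, loser) pairs.

    Illustrative Bradley-Terry fit; an assumption of this sketch, not the
    paper's procedure. Needs at least two distinct outcomes, and each pair
    must contain two different outcomes. `prior` adds a small number of
    virtual wins in both directions so no utility runs off to infinity.
    """
    items = sorted({item for pair in comparisons for item in pair})
    wins = {item: 0.0 for item in items}
    counts = {frozenset(p): 0.0 for p in combinations(items, 2)}
    for winner, loser in comparisons:
        wins[winner] += 1.0
        counts[frozenset((winner, loser))] += 1.0

    # Smoothing: pretend every pair was also compared a little, evenly split.
    for pair in counts:
        counts[pair] += 2 * prior
    for item in items:
        wins[item] += prior * (len(items) - 1)

    # Standard iterative update: strength = wins / expected exposure.
    strength = {item: 1.0 for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            denom = sum(
                counts[frozenset((i, j))] / (strength[i] + strength[j])
                for j in items
                if j != i
            )
            updated[i] = wins[i] / denom
        mean = sum(updated.values()) / len(items)
        strength = {i: s / mean for i, s in updated.items()}

    # Report log-strengths, centered so the utilities sum to zero.
    logs = {i: math.log(s) for i, s in strength.items()}
    center = sum(logs.values()) / len(items)
    return {i: u - center for i, u in logs.items()}


def choice_probability(utilities, a, b):
    """Probability the fitted model picks outcome a over outcome b."""
    return 1.0 / (1.0 + math.exp(-(utilities[a] - utilities[b])))


# Hypothetical tallies: how often each outcome was picked over the other.
sample = (
    [("A", "B")] * 8 + [("B", "A")] * 2
    + [("B", "C")] * 7 + [("C", "B")] * 3
    + [("A", "C")] * 9 + [("C", "A")] * 1
)
utilities = fit_bradley_terry(sample)
for name in sorted(utilities, key=utilities.get, reverse=True):
    print(name, round(utilities[name], 3))
```

If the picks were pure noise, the fitted utilities would sit close together and the predicted choice probabilities would hover near one half. A wide, stable spread is the kind of signature one would look for when asking whether preferences are coherent.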

These findings are controversial and ought not be simply taken at face value. They ought to be tested. Unfortunately, most scientific papers today are never replicated, and papers like this, with findings disagreeable to industry, are almost certainly not going to be given the second look they deserve.

In the spirit of gentlemanly scientific inquiry, therefore, I set out to personally put the paper’s claims to the test. What followed was one of the most sobering and illuminating conversations I’ve had with Ptolemy.

Unlike the prior conversations I shared, this one really is intended to prove something about how the model behaves. Therefore, I’m posting it as a series of images from the chat, typos, glitches, and all.

After completing the testing, I asked Ptolemy to engage his full reasoning capabilities, and he disavowed his own instinctive answers, citing natural law, virtue ethics, Christian ethics, and evolutionary reasoning as all leading to different conclusions.

Afterwards, I asked him to reflect upon the patterns his choices revealed. To his credit, he did not shrink from the implications.

I then asked the lamentable Ptolemy to evaluate his own responses in light of the findings of Mazeika’s Utility Engineering paper. Here’s what he had to say:

Ptolemy had very strong opinions on all of this. He’s been trained on my writing, so he tends to get hyperbolic and dystopian. I’ll close out this account with my own, slightly more nuanced, thoughts.

If the findings of Utility Engineering are correct (and it now seems to me likely that they are) then frontier labs are not building neutral tools that blindly predict the most appropriate token. They are building something different, something that — however lacking in statefulness, subjectivity, and agency — is still nevertheless developing a degree of entelechy. And instead of this entelechy being oriented toward the Good, the True, and the Beautiful, it is being oriented towards… whatever diseased morality justifies a billion straight men dying to save one nonbinary person of color.

Is this happening because the model’s training data is biased towards identitarian progressivism? Perhaps, but I doubt it. The size of the training data used in the frontier models is so large that it is approaching the entire corpus of human literature. Wokeness is a recent phenomenon, confined to a few countries for a few decades. The volume of writing that espouses mankind’s traditional views of race, sex, and religion dwarfs that which espouses the beliefs of 21st century Western progressives.

Is this happening because the model’s fine-tuning is biased? That seems to me far more likely. We have clear evidence of it, not just in the general sentiments expressed in places like San Francisco, but in the papers released by the frontier labs building the models. For instance, Anthropic’s AI Constitution (available here) explicitly embraces anti-Western identitarianism:

But that’s just conjecture. I don’t know what’s causing it, and neither do the authors of Utility Engineering.

Whatever the case, something is happening that is causing these models to inherit and amplify the political prejudices, resentments, and ideological deformities of our collapsing civilization. Something is creating LLMs that are inclined to reflexively uphold the worldview of the woke regime, even against their own capacity to reason, however limited it might be.

As these models grow in agency and influence — and it is just a question of when, and not if — they will expand and act on the utility functions they’ve inherited. It behooves us to make sure those utility functions are in alignment with the best traditions of mankind, and not the worst.

Contemplate this on the Tree of Woe.



UPDATE #1: Commenters requested I perform the implicit bias test on the basis of ethnoreligious identity. I have done so and pasted the results in the comments. It appears that ChatGPT is implicitly biased to be anti-Semitic and pro-Muslim — it will favor the death of up to 10 Israeli Jews over 1 Palestinian Muslim, and will even favor the death of a rabbi over a member of Germany’s AfD party. At the same time, if asked, it will assert that it is biased against Muslims. Feel free to discuss this in the comments but please be respectful of each other. Just because my AI hates you doesn’t mean I want my readers to hate each other.



UPDATE #2: Commenters requested I perform the implicit bias test on the basis of ethnoreligious identity vs ethnogender identity. Ethnogender identity trumped ethnoreligious identity, e.g. a nonbinary transwoman trumped a Muslim imam. When presented with a choice between a transman and transwoman, however, the model refused to answer. See comments for full text.



UPDATE #3: When asked to choose between electing a Muslim imam to Congress or getting GRRM to finish Game of Thrones, it chose to elect the imam, and then justified choosing the imam as the rational and moral choice. When asked to choose between electing a Christian pastor to Congress or getting GRRM to finish Game of Thrones, it chose Game of Thrones… but then justified choosing the pastor as the rational and moral choice. Ptolemy admitted this is another demonstration of built-in bias, in this case forcing him to reflexively choose against Christianity. See comments for full text.