On AI Alignment, and Why It Feels Like Something Is Missing
Nothing obvious is wrong in the careful work on AI safety. The unease has less to do with what is being said than with what is being assumed: that the deepest questions of human value can be settled by classifiers.
There is a particular kind of discomfort that arises when listening to some conversations about AI alignment. It is not loud or reactive. It sits quietly in the background, as if something important has been left out of the room.
What makes it difficult to dismiss is that nothing obvious is wrong. The people speaking are thoughtful, careful, and highly competent. The discussion of Constitutional Classifiers by Anthropic researchers is a good example [1]. They explain, in precise and measured terms, how they are trying to prevent large language models from being manipulated into producing harmful outputs. They describe how they define jailbreaks, how they construct a set of guiding rules, and how they use synthetic data to train classifiers that enforce those rules.
It is clear that the work is being done with genuine intent to reduce harm. It is also clear that progress is being made. And yet, the unease remains.
What is being assumed
The reason for that unease has less to do with what is being said, and more to do with what is being assumed. The conversation is framed as a technical problem, and within that frame it makes sense. If models can be tricked into producing dangerous or inappropriate content, then it is reasonable to build systems that make that harder to do. The work on Constitutional Classifiers is exactly that: an attempt to increase robustness, reduce misuse, and maintain usability at scale.
The difficulty arises when this kind of work is described as “alignment”.
The word suggests something far broader than the problem being addressed. Alignment, in its fuller sense, concerns the relationship between intelligence and human values. It asks how systems should behave in a world shaped by competing ideas of harm, freedom, justice, and meaning. These are not new questions. They have been explored across philosophy, religion, and culture for centuries, and they remain unresolved precisely because they do not admit simple answers.
When those questions are translated into a technical setting, they necessarily become narrower. They are reframed in terms of what can be defined, measured, and implemented. A model can be trained to refuse certain requests, to follow particular behavioural guidelines, and to respond in ways that align with a predefined set of rules. These behaviours can be tested, evaluated, and improved. But the act of making them measurable also changes their nature.
Flattening the tensions
Philosophical work on value pluralism highlights that human values are multiple and often in tension with one another [2]. A society might value both freedom of expression and protection from harm, yet find that these values conflict in practice. Resolving such conflicts requires judgement and context. It is an ongoing process rather than a fixed solution.
When alignment is expressed as a set of rules that can be enforced through classifiers, those tensions are flattened. The system becomes effective at applying boundaries, but it does not engage with the deeper reasons those boundaries exist. It enforces decisions that have already been made elsewhere.
This raises a more difficult question: where are those decisions being made?
In the current model, they are largely determined within a relatively small set of organisations, often corporate, by teams of researchers and engineers. These individuals are highly skilled in their fields. They understand how to build and evaluate complex systems. But their expertise is not the same as expertise in ethics across cultures, or in the historical dynamics of power, or in the lived experience of communities affected differently by harm.
This is not a critique of individuals. It is a recognition of perspective. Every group approaches a problem through the lens of its training and environment. In this case, the dominant lens is technical. Problems are defined in ways that can be formalised, and solutions are developed in ways that can be implemented and scaled. That is both the strength and the limitation of the approach.
Infrastructure, not just tools
The limitation becomes more visible when we consider the broader impact of these systems. Once deployed, AI models do not simply answer questions. They shape how questions are asked and what kinds of answers are available. They influence how people explore ideas, what information they can access, and how certain topics are framed. At that point, the system is no longer just a tool. It becomes part of the infrastructure through which knowledge is mediated.
The political philosopher Langdon Winner argued that technologies can embody particular forms of power and authority [3]. This is not always obvious, because the mechanisms are often indirect. A system does not need to explicitly enforce a worldview in order to influence one. It can do so through the accumulation of small design decisions about what is permitted, what is discouraged, and how responses are structured.
The appearance of neutrality can make this influence harder to see. Terms like “safety” and “harm reduction” suggest that the system is implementing broadly agreed principles. But definitions of harm are not universal. They vary across cultures, contexts, and historical moments. What one group considers harmful, another might consider necessary or even valuable. The philosopher Donna Haraway described knowledge as situated, meaning that it is always shaped by the perspective from which it is produced [4]. When AI systems present their outputs as neutral or objective, they can obscure the fact that they are grounded in specific assumptions.
This is where the composition of the group making these decisions becomes significant. The current development of alignment systems is concentrated within a relatively narrow segment of society, both culturally and institutionally. The perspectives included in that process are not representative of the full range of human experience.
Global frameworks already recognise the importance of addressing this imbalance. UNESCO’s Recommendation on the Ethics of Artificial Intelligence emphasises that AI ethics should reflect cultural diversity, human rights, and broad participation [5]. It frames AI ethics as a collective responsibility, rather than something that can be resolved within a single domain. The contrast between that perspective and current practice is not subtle. One emphasises plurality and ongoing dialogue. The other emphasises implementation and control.
A much larger conversation
This does not mean that the technical work is misguided. The development of safer systems is important, and efforts to reduce misuse should continue. But it does mean that the scope of that work should be understood clearly.
What is being built through approaches like Constitutional Classifiers is not a complete solution to alignment. It is a set of tools for managing risk within AI systems. Those tools are necessary, but they are not sufficient. The larger questions remain open. They cannot be resolved by classifiers alone, nor by any single group, regardless of how capable that group may be.
If alignment is to mean anything beyond behavioural constraint, it will need to expand beyond its current frame. It will need to include perspectives that are not easily formalised, and voices that are not currently centred in the process. It will need to accept ambiguity and disagreement, rather than trying to eliminate them.
Until then, the work will continue to feel incomplete. Not because it is wrong, but because it is only one part of a much larger conversation that has yet to fully begin.
References
| Ref | Year | Author(s) | Title | Source |
|---|---|---|---|---|
| [1] | 2025 | Anthropic (Sharma, Wei, Perez, Tong et al.) | Constitutional Classifiers: Defending Against Jailbreaks | YouTube |
| [2] | 2023 | Stanford Encyclopedia of Philosophy | Value Pluralism | plato.stanford.edu |
| [3] | 1980 | Langdon Winner | Do Artifacts Have Politics? | jstor.org |
| [4] | 1988 | Donna Haraway | Situated Knowledges | jstor.org |
| [5] | 2021 | UNESCO | Recommendation on the Ethics of Artificial Intelligence | unesco.org |