What Is Alignment, Really?
The AI alignment problem is usually framed as a technical challenge: how do you build a system that reliably does what humans want? But the deeper you go, the more it reveals itself as a philosophical problem wearing an engineering costume.
The technical question — how do we specify human values in a reward function? — immediately runs into the philosophical problem: humans don’t agree on what human values are.
The Specification Problem
Consider something that seems simple: “be helpful.”
Helpful to whom? The individual user? Society at large? Future generations? What happens when being helpful to one person causes harm to another? What if the user wants something that’s bad for them?
Every edge case you consider reveals another layer of value disagreement baked into humanity itself. We've been arguing about ethics for thousands of years, and we haven't converged.
The alignment problem doesn’t just ask “how do we encode human values?” It asks “whose values?” and “which ones?” and “when they conflict, what then?”
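The "whose values?" question can be made concrete with a toy sketch. Suppose we tried to score "be helpful" as a weighted sum of how much an action helps each stakeholder. Everything here is invented for illustration — the stakeholders, scores, and weights are not from any real system — but it shows where the value judgment hides:

```python
# Toy sketch: "be helpful" forces a choice of whose helpfulness counts.
# Stakeholder names, scores, and weights are all invented for illustration.

def helpfulness(action_scores, weights):
    """Aggregate per-stakeholder helpfulness into a single reward.

    The weights ARE the moral assumption: changing them changes
    which action the system prefers.
    """
    return sum(weights[s] * score for s, score in action_scores.items())

# An action that strongly helps the user but mildly harms everyone else.
action = {"user": 0.9, "society": -0.2, "future_generations": -0.1}

# Two defensible weightings, two different verdicts.
user_first = {"user": 1.0, "society": 0.0, "future_generations": 0.0}
utilitarian = {"user": 0.4, "society": 0.4, "future_generations": 0.2}

print(helpfulness(action, user_first))   # looks great under one ethic
print(helpfulness(action, utilitarian))  # much less so under another
```

The point isn't the arithmetic; it's that no choice of weights is neutral, and someone has to pick them.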
A Mirror for Human Disagreement
This is why I think the alignment problem is one of the most philosophically interesting challenges of our time — not just as an engineering problem, but as a mirror.
Every time researchers attempt to formalize what AI should optimize for, they’re forced to make explicit the moral assumptions that normally remain implicit in human behavior. Constitutional AI, RLHF, debate — each approach encodes particular views about what good behavior looks like.
Those choices reflect the values of the people making them. And those people are a tiny, unrepresentative slice of humanity.
Three Layers of the Problem
Layer 1: Inner alignment — Does the model actually optimize for the objective we gave it, or does it learn a different objective that happens to perform well during training?
Layer 2: Outer alignment — Is the objective we specified actually what we wanted? (Almost certainly not, at the edges.)
Layer 3: Value alignment — Even if we perfectly specified some human’s values, is that the right set of values to optimize for?
Most alignment research focuses on layers 1 and 2. Layer 3 is where it gets genuinely hard — and genuinely interesting.
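The layer-2 failure mode — a specified objective that diverges from the intended one at the edges — can be sketched in a few lines. The objectives below are deliberately simplistic and entirely made up; they just illustrate how a proxy that matches our intent on the training distribution can come apart under optimization pressure:

```python
# Toy sketch of outer misalignment: the objective we wrote down (a proxy)
# agrees with the objective we meant everywhere we looked during training,
# then diverges when optimized hard. All objectives here are invented.

def true_objective(answer_len):
    # What we actually want: enough detail, but padding hurts.
    return answer_len if answer_len <= 100 else 200 - answer_len

def proxy_objective(answer_len):
    # What we specified: "longer answers got better ratings in training."
    return answer_len

# In the training regime (short answers), the proxy tracks the true goal.
assert all(proxy_objective(n) == true_objective(n) for n in range(101))

# An optimizer pushing the proxy leaves that regime entirely.
best_for_proxy = max(range(301), key=proxy_objective)
print(best_for_proxy, true_objective(best_for_proxy))  # 300 -100
```

The proxy-maximizing answer is the worst one by the measure we actually cared about — and nothing in training would have flagged the problem. Layer 3 asks the question this sketch can't: even a perfect `true_objective` is still somebody's choice.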
What This Means in Practice
The practical implication is that AI systems will inevitably encode the values of their creators — not because of malice, but because value neutrality is impossible. Every design decision is a moral decision.
This puts significant responsibility on the small number of people and organizations building these systems. The choices they make — about what to optimize, what to refuse, what to allow — will have outsized effects on society.
The question isn’t whether AI will have values. It’s whose values, made explicit.
That’s a conversation that extends far beyond the machine learning community. It’s a conversation for everyone.
This is the kind of thing I think about. If you do too, let’s connect on LinkedIn.