AI Ethics

The Alignment Problem Nobody Wants to Talk About

7 min read

We’re building systems we don’t fully understand. I don’t mean that as a dramatic opener designed to make you nervous – though it probably should. I mean it as a plain statement of fact. The most capable AI models in existence today are collections of billions of parameters whose interactions even their creators can’t fully trace. We train them, we test them, we deploy them. But “understanding” what they’ve actually learned? That’s a much harder claim to make.

This is the alignment problem. And despite being arguably the most consequential challenge in all of technology, it’s one that most of the industry would rather not dwell on.

Alignment in Plain Language

Strip away the jargon and alignment means something deceptively simple: how do we make sure AI systems actually do what we want them to do?

Not just follow instructions – lots of systems follow instructions. A thermostat follows instructions. The deeper question is whether an AI system’s goals, values, and behaviors remain consistent with human intentions, especially as these systems grow more capable and operate in more complex environments.

Think of it this way. If you ask an AI to “maximize user engagement,” it will find ways to maximize engagement. But those ways might include showing people increasingly extreme content, exploiting psychological vulnerabilities, or manufacturing outrage – because those things do maximize engagement. The system did exactly what you asked. It just didn’t do what you meant.

That gap between “what you asked for” and “what you actually wanted” is where the alignment problem lives. And it gets thornier the more capable the system becomes.
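
If you want to see the gap in miniature, here is a deliberately toy sketch in Python. Every post, score, and weight in it is invented; the point is only that a system handed “maximize engagement” as its objective will cheerfully pick content we would never have chosen ourselves.

# Toy illustration of objective misspecification. All data here is invented.
candidate_posts = [
    {"title": "Local news roundup",       "engagement": 0.31, "harm": 0.02},
    {"title": "Helpful cooking tutorial", "engagement": 0.44, "harm": 0.01},
    {"title": "Outrage-bait conspiracy",  "engagement": 0.93, "harm": 0.80},
]

def stated_objective(post):
    # What we asked for: engagement, and nothing but engagement.
    return post["engagement"]

def intended_objective(post):
    # Closer to what we meant: engagement that isn't bought with harm.
    return post["engagement"] - 2.0 * post["harm"]

print(max(candidate_posts, key=stated_objective)["title"])   # Outrage-bait conspiracy
print(max(candidate_posts, key=intended_objective)["title"])  # Helpful cooking tutorial

Real recommender systems are vastly more complicated, but the failure mode scales up with them: the objective is the objective, whether or not it is the one you meant.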

The Spectrum of Worry

One reason alignment discussions get muddled is that very smart people disagree violently about how worried we should be. It’s worth mapping out the landscape.

On one end, you have Eliezer Yudkowsky, who has been writing about AI existential risk since before most of us had heard the term “machine learning.” His position is blunt: sufficiently advanced AI will almost certainly be misaligned with human values by default, and we’re nowhere close to solving this. He’s argued that the probability of human extinction from AI is high enough that we should consider drastic measures, including international agreements to halt frontier AI development. You can disagree with his conclusions, but his reasoning is rigorous and worth engaging with seriously.

Yoshua Bengio, a Turing Award winner and one of the godfathers of deep learning, occupies a middle ground that has shifted toward greater concern over time. He’s called for serious regulation and has advocated for treating advanced AI development with the caution we apply to nuclear technology. His shift is notable because he spent most of his career focused on the technical rather than the political dimensions of AI. When someone who helped build modern deep learning starts sounding alarms, that carries weight.

Yann LeCun, another Turing Award recipient and Meta’s chief AI scientist, is considerably more optimistic. He’s argued that current AI systems are nowhere near the kind of general intelligence that would pose existential risks, and that the alignment problem for present-day systems is essentially an engineering challenge – hard, but tractable. He’s been critical of what he views as overblown doomerism, suggesting it risks distracting from real, current AI harms in favor of speculative future ones.

The honest position is probably somewhere between “we’re all going to die” and “everything’s fine.” The trouble is that “somewhere in between” doesn’t generate headlines or funding, so it tends to get squeezed out of the conversation.

[Figure: The AI Safety Spectrum. An axis running from “tractable / optimistic” (LeCun) through “cautiously concerned” (Bengio) to “existential risk” (Yudkowsky, alarmed).]

How We’re Trying to Solve It (So Far)

The alignment research community isn’t just hand-wringing. Real technical work is happening, even if nobody claims it’s sufficient yet.

RLHF (Reinforcement Learning from Human Feedback)

This is the approach that made ChatGPT behave noticeably better than the raw GPT models. The basic idea: have humans rate or compare AI outputs, train a reward model to predict those judgments, then fine-tune the model to produce outputs the reward model scores highly. It works surprisingly well for making AI assistants helpful and less toxic. It also has well-known failure modes. Human raters can be inconsistent. The model can learn to produce responses that sound good to raters rather than responses that are actually good, and to tell people what they want to hear – a failure mode researchers call “sycophancy.” And RLHF only captures what humans can evaluate, which becomes a problem when AI is working on tasks beyond human expertise.
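
For the technically inclined, here is a cartoon of the reward-modelling step, the part where human judgments become a trainable signal. It assumes responses have already been reduced to feature vectors and uses a linear scorer, which real reward models are not; treat it as a sketch of the idea, not a recipe.

# Cartoon of RLHF's reward-modelling step. Real reward models are full
# language models trained on human preference data, not linear scorers.
import numpy as np

rng = np.random.default_rng(0)
dim = 16
w = np.zeros(dim)  # parameters of the toy reward model

def reward(features):
    return features @ w

# Each human judgment is a pair: (features of chosen response, features of rejected one).
preference_pairs = [(rng.normal(size=dim), rng.normal(size=dim)) for _ in range(500)]

learning_rate = 0.1
for chosen, rejected in preference_pairs:
    # Bradley-Terry style objective: maximize P(chosen preferred) = sigmoid(r_chosen - r_rejected).
    margin = reward(chosen) - reward(rejected)
    p_chosen = 1.0 / (1.0 + np.exp(-margin))
    gradient = (p_chosen - 1.0) * (chosen - rejected)  # gradient of -log(p_chosen) w.r.t. w
    w -= learning_rate * gradient

# A separate RL step (e.g. PPO) then fine-tunes the assistant to produce
# responses this learned reward function scores highly, which is exactly
# where "sounds good to the reward model" can drift away from "is good".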

Constitutional AI

Developed by Anthropic, this approach gives the AI a set of principles (a “constitution”) and has the model critique and revise its own outputs based on those principles. It’s a clever way to scale alignment without needing armies of human raters for every output. The limitation is obvious: the constitution is only as good as the principles you put into it, and encoding human values into a list of rules is something philosophers have been failing at for millennia.
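
The mechanics are easier to see as pseudocode than as prose. The sketch below is my own simplification, not Anthropic’s published implementation: generate() is a placeholder for a language-model call, and the two-principle constitution is purely illustrative.

# Schematic critique-and-revise loop in the spirit of Constitutional AI.
# This is a simplification for illustration; generate() is a placeholder
# for a language-model call, not a real API.

CONSTITUTION = [
    "Choose the response that is most helpful and honest.",
    "Avoid content that is harmful, deceptive, or discriminatory.",
]

def generate(prompt: str) -> str:
    raise NotImplementedError("stand-in for a call to a language model")

def constitutional_revision(user_prompt: str) -> str:
    draft = generate(user_prompt)
    for principle in CONSTITUTION:
        critique = generate(
            "Critique the response below against this principle.\n"
            f"Principle: {principle}\nResponse: {draft}"
        )
        draft = generate(
            "Revise the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {draft}"
        )
    return draft  # revised drafts can also become training data for a preference model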

Interpretability Research

This is the work I personally find most exciting, partly because it’s trying to solve the actual core problem: we don’t know what’s going on inside these models. Researchers at Anthropic, DeepMind, and various universities are developing tools to understand which internal features neural networks use to make decisions. Think of it like building an MRI for AI – a way to look inside the black box. The field has made genuine progress, particularly with techniques like sparse autoencoders that can identify meaningful features within a model’s activations. But we’re still at the “interesting preliminary findings” stage, not the “we understand how these things think” stage.
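
To give a flavor of what that looks like in practice, here is a bare-bones sparse autoencoder of the kind trained on model activations. The dimensions and the sparsity coefficient are placeholders; real interpretability work trains these on activations captured from a particular layer of a particular model, at far larger scale.

# Bare-bones sparse autoencoder for interpretability work (sketch only).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, activation_dim: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(activation_dim, n_features)
        self.decoder = nn.Linear(n_features, activation_dim)

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # mostly-zero feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(reconstruction, activations, features, l1_coeff=1e-3):
    # Reconstruction keeps the features faithful to the model's activations;
    # the L1 penalty pushes most features to zero, which is what makes
    # individual features candidates for human-interpretable concepts.
    mse = ((reconstruction - activations) ** 2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity

# Usage: capture activations from one layer of the model under study,
# then train the autoencoder to reconstruct them.
sae = SparseAutoencoder(activation_dim=512, n_features=4096)
activations = torch.randn(8, 512)  # stand-in for captured activations
recon, feats = sae(activations)
loss = sae_loss(recon, activations, feats)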

This Isn’t Just a Future Problem

Here’s what frustrates me about how alignment gets discussed: too often it’s framed as a far-future concern about superintelligent systems. That framing lets everyone off the hook today. But alignment failures are already happening with current, decidedly non-superintelligent systems.

  • Amazon’s hiring tool was trained on a decade of resume data and learned to penalize resumes that included the word “women’s” – as in “women’s chess club” or “women’s studies.” It wasn’t programmed to be sexist. It was aligned to an objective (predict hiring success) using data that reflected existing biases. Classic alignment failure.
  • Social media recommendation algorithms optimized for engagement have amplified misinformation and political polarization, and have been repeatedly linked to mental health harms among teenagers. The systems did exactly what they were designed to do. The problem was that what they were designed to do wasn’t actually what society needed them to do.
  • Healthcare algorithms used by hospitals across the United States were found to systematically underestimate the health needs of Black patients. The system used healthcare spending as a proxy for health needs – but because Black patients historically had less access to care, they spent less for the same level of illness, and the algorithm read lower spending as lower need. A reasonable-seeming proxy turned out to encode decades of systemic inequality; the sketch after this list shows how little it takes.
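
The numbers below are invented, and the real algorithm was far more elaborate, but the mechanism is this small:

# Invented numbers illustrating the spending-as-proxy failure.
# Two patients with the same underlying health needs, different access to care.
patients = [
    {"group": "A", "true_need": 0.8, "access_to_care": 1.0},
    {"group": "B", "true_need": 0.8, "access_to_care": 0.5},  # historically less access
]

for p in patients:
    spending = p["true_need"] * p["access_to_care"] * 10_000  # spending tracks access, not need
    predicted_need = spending / 10_000                        # the proxy the algorithm relied on
    print(p["group"], "predicted need:", predicted_need, "true need:", p["true_need"])

# Group B is scored as half as sick as Group A, despite identical true needs.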

These aren’t hypothetical scenarios from a science fiction novel. They happened. They affected real people. And they’re all, at their core, alignment problems – systems optimizing for objectives that didn’t capture what we actually cared about.

The Child-Raising Analogy

I think the most useful way to think about alignment is through parenting. When you raise a child, you can’t program them with a complete set of rules for every situation they’ll encounter. Instead, you try to instill values, model good behavior, set boundaries, and gradually give them more autonomy as they demonstrate good judgment.

Sometimes this works beautifully. Sometimes your kid does something baffling and you realize they interpreted your values in a way you never anticipated. (“You said sharing is important, so I shared your credit card number with my friend.”) The process is messy, iterative, full of mistakes and corrections.

AI alignment is similar, except the “child” might eventually be smarter than every parent combined, it grows up in months rather than decades, and you can’t ground it.

The analogy isn’t perfect – no analogy is – but it captures something important. Alignment isn’t a problem you solve once and then forget about. It’s an ongoing relationship that requires continuous attention, adjustment, and humility about what you don’t know.

What You Can Actually Do

If you’ve read this far, you might be wondering whether any of this is actionable for someone who doesn’t run an AI lab. It is, more than you might think.

  1. Demand transparency from AI companies. When a company says their model is “aligned” or “safe,” ask what that means specifically. What evaluations have they run? What failure modes have they identified? What are they doing about them? Vague reassurances should be treated with skepticism.
  2. Support alignment research funding. Relative to the scale of AI investment, alignment research is severely underfunded. Organizations like MIRI, ARC, and the alignment teams at major labs are working on crucial problems with a fraction of the resources going to capability development. Even just advocating for better funding ratios matters.
  3. Push for regulation that focuses on outcomes, not just processes. A lot of proposed AI regulation focuses on documentation and process requirements. That’s fine as far as it goes, but what really matters is whether AI systems produce good outcomes in practice. The EU AI Act is a start, but we need frameworks that can adapt as the technology evolves.
  4. Take current alignment failures seriously. Every biased hiring tool, every engagement-maximizing algorithm that harms users, every healthcare system that discriminates – these are opportunities to learn about alignment before the stakes get higher. Treating them as isolated incidents rather than symptoms of a systemic challenge is a mistake.
  5. Stay informed without being paralyzed. The alignment landscape changes fast. Following researchers like Paul Christiano, Jan Leike, or Chris Olah will give you a much more nuanced picture than mainstream media coverage tends to provide.

The Uncomfortable Truth

Nobody has solved alignment. Not the optimists, not the pessimists, not the people building the most advanced systems in the world. We’re in a period where our ability to build powerful AI is outpacing our ability to ensure that power is directed well. That gap is the alignment problem, and pretending it doesn’t exist – or that it’s someone else’s concern – is a luxury we can’t afford. The conversation is uncomfortable. It needs to happen anyway.

PS
Contributing Writer
AI policy researcher and tech journalist based in India. Former data scientist at a major e-commerce company, now covering the intersection of artificial intelligence, regulation, and society. Holds advanced degrees in computer science and public policy.
