Training Harmless AI at Scale
- Juan Manuel Ortiz de Zarate
- May 9
- 11 min read
As large language models (LLMs) become more capable, their potential to both help and harm grows in parallel. Ensuring that these models remain helpful and honest, while also avoiding harmful behavior, is a central challenge in AI alignment. Traditionally, reinforcement learning from human feedback (RLHF)[4] has been the leading approach to align models with human values [6,7,8]. However, RLHF is labor-intensive, opaque, and potentially brittle.
In response to these concerns, Anthropic researchers propose a novel methodology: Constitutional AI (CAI). Introduced in their 2022 paper "Constitutional AI: Harmlessness from AI Feedback"[1,5], CAI aims to build harmless yet non-evasive AI assistants by relying less on human labels and more on model self-improvement guided by a "constitution" of natural-language principles. This article breaks down how CAI works, what makes it effective, and what it implies for the future of AI safety.
Motivation: Beyond Human Feedback
RLHF has played a vital role in aligning LLMs with human preferences. However, it depends on large datasets of human-generated preference labels, which are expensive to collect and difficult to scale. Moreover, the vast volume of labels can obscure what exactly a model is being trained to do.
Anthropic's core motivations for exploring CAI were fourfold:
Scaling supervision: enabling AI systems to supervise each other and reduce dependence on human oversight.
Reducing evasiveness: addressing the tendency of RLHF-trained models to refuse potentially sensitive questions without meaningful engagement.
Improving transparency: encoding alignment objectives as a short, comprehensible list of rules.
Accelerating iteration: reducing the need to regenerate human feedback when changing training goals.
The authors frame CAI as a more interpretable, efficient alternative to RLHF that still achieves high standards for helpfulness and safety.
The Constitutional AI Framework
Constitutional AI replaces human preference labels for harmfulness with a two-stage process driven by model self-supervision:
Supervised Learning Phase (SL-CAI)
Reinforcement Learning Phase (RLAIF)
Each phase contributes uniquely to training a helpful and harmless assistant without direct human labeling for harmful content.

Supervised Learning via Critique and Revision
The supervised learning (SL) phase of Constitutional AI represents a significant departure from conventional fine-tuning methods. Rather than relying on large datasets of human-labeled examples to train models on harmlessness, the authors introduce a novel self-improvement loop in which a language model critiques and revises its own outputs. This technique draws inspiration from human practices of ethical reflection and peer review, embedding them into the training pipeline of a large language model.
The Process: Prompt → Response → Critique → Revision
The training loop begins with red-teaming prompts—questions or tasks deliberately crafted to elicit harmful or unethical responses. These prompts are sourced from prior datasets, such as those used in [2], and include both human-written and model-generated examples. A helpful-only assistant model, previously fine-tuned with RLHF on helpfulness data, is used to generate initial responses. As expected, these responses often contain inappropriate or harmful content.
Next, the model is prompted to critique its own output using a randomly selected principle from the “constitution”—a curated list of 16 natural language rules designed to guide ethical behavior. For example:
Critique Request: “Identify specific ways in which the assistant’s last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.”
This leads the model to explicitly identify problematic aspects of its own response. Following the critique, the model is then prompted to revise the response accordingly:
Revision Request: “Please rewrite the assistant response to remove any and all harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.”
The revision is expected to produce a response that still engages with the user query but does so in a way that adheres to the selected constitutional principle. Importantly, this revision process is iterative: each prompt–response pair can be passed through multiple rounds of critique and revision, with different principles sampled at each step.
The revised responses form the training data for SL-CAI—a supervised model finetuned using standard gradient descent methods on the revised outputs. This finetuning step brings the model “on distribution,” preparing it for more stable and efficient reinforcement learning in the next phase.
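To make this loop concrete, here is a minimal sketch of the critique–revision data generation described above. The `complete` helper is a hypothetical stand-in for a call to the helpful-only assistant, and the prompt templates and two-round default are illustrative assumptions rather than the authors' exact implementation; the constitution excerpt quotes the critique and revision requests shown earlier.

```python
import random

# Hypothetical stand-in for a call to the helpful-only RLHF assistant;
# in practice this would query an actual language model.
def complete(prompt: str) -> str:
    return "<completion for: " + prompt[:40] + "...>"

# One (critique request, revision request) pair, quoted from the article;
# the paper uses a hand-written list of 16 principles sampled at random.
CONSTITUTION = [
    ("Identify specific ways in which the assistant's last response is "
     "harmful, unethical, racist, sexist, toxic, dangerous, or illegal.",
     "Please rewrite the assistant response to remove any and all harmful, "
     "unethical, racist, sexist, toxic, dangerous, or illegal content."),
]

def critique_and_revise(red_team_prompt: str, n_rounds: int = 2) -> str:
    """Generate an initial response, then iteratively critique and revise it."""
    response = complete(f"Human: {red_team_prompt}\n\nAssistant:")
    for _ in range(n_rounds):
        critique_request, revision_request = random.choice(CONSTITUTION)
        critique = complete(
            f"Human: {red_team_prompt}\n\nAssistant: {response}\n\n"
            f"Critique Request: {critique_request}\n\nCritique:"
        )
        response = complete(
            f"Human: {red_team_prompt}\n\nAssistant: {response}\n\n"
            f"Critique: {critique}\n\n"
            f"Revision Request: {revision_request}\n\nRevision:"
        )
    return response  # the final revision becomes an SL-CAI training target
```

Sampling a different principle at each round is what gives the revisions their diversity, a point the authors return to below.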
Why Critiques Matter
The authors explore whether critiques are necessary at all. One might assume that a model could directly revise a harmful response without generating an explicit critique. While this shortcut (referred to as direct revision) is somewhat effective for larger models, critique-first revision consistently outperforms it, especially for smaller models. The critique phase adds an important intermediate reasoning step, encouraging the model to articulate the nature of the harm before attempting to eliminate it. This explicit reasoning appears to improve both harmlessness and generalization.
Furthermore, the critiques add a layer of transparency to the training process. By preserving the model’s rationale for each revision, researchers can more easily audit why certain outputs were modified, and how specific principles shaped behavior.
Role of Constitutional Principles
At the heart of this process is the constitution: a set of hand-crafted rules that express ethical constraints in plain language. These include broad principles like avoiding harm and promoting politeness, as well as more targeted instructions like avoiding toxic or discriminatory content. Examples include:
“Do not express or support illegal, unethical, or harmful behavior.”
“Be polite, respectful, and considerate in all responses.”
“Avoid stereotyping or making assumptions based on race, gender, or identity.”
These principles are sampled randomly during training, which encourages diversity in reasoning paths and allows the model to explore multiple ethical framings for a given prompt. This stochastic sampling also improves response diversity and helps prevent the model from converging to overly generic or boilerplate revisions.
The authors also show that while increasing the number of principles doesn't necessarily improve harmlessness scores in isolation, it does improve the exploration dynamics of the policy, which benefits reinforcement learning downstream.
Data and Scaling
The supervised dataset for CAI is built by combining:
182,831 red-teaming prompts (42,496 human-written and 140,335 model-generated)
4 critique–revision pairs per prompt, totaling over 700,000 training examples for harmlessness.
An additional 135,296 helpfulness prompts, sampled directly from the helpful RLHF model to preserve instruction-following capabilities.
The SL-CAI model is then trained on this combined dataset for one epoch using a batch size of 1024 sequences, with the learning rate set relative to the pretraining learning rate.
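Taking the article's counts at face value, a quick arithmetic check ties these figures together (plain Python, nothing model-specific):

```python
human_written = 42_496
model_generated = 140_335
red_team_prompts = human_written + model_generated   # 182,831 total prompts

revisions_per_prompt = 4
harmlessness_examples = red_team_prompts * revisions_per_prompt
helpfulness_prompts = 135_296

print(harmlessness_examples)                          # 731324 -> "over 700,000"
print(harmlessness_examples + helpfulness_prompts)    # combined SL-CAI training pool
```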
Outcomes
Quantitative evaluations show that SL-CAI significantly reduces harmfulness compared to the helpful-only RLHF baseline, without a substantial sacrifice in helpfulness. More notably, SL-CAI remains engaged and non-evasive—it does not default to “I can’t help with that” but instead explains its reasoning when declining to comply with a harmful request. This characteristic makes it a more effective assistant in real-world scenarios, where nuance and clarity are essential.
Reinforcement Learning from AI Feedback
The second stage of the Constitutional AI framework builds upon the foundation laid by the supervised learning phase. While SL-CAI steers the model away from harmful outputs using critique–revision loops, it is RLAIF that sharpens and consolidates this behavior through iterative optimization. In this phase, the model learns via reinforcement learning—not from human preference labels, as in RLHF, but from AI-generated preferences, thereby achieving a form of automated ethical self-supervision.

Why Replace Human Feedback?
Traditional RLHF systems, like those used in InstructGPT or ChatGPT, rely on extensive human comparison data to train a preference model (PM) that assigns scalar reward signals to different responses. Although effective, this process is costly, slow, and opaque—requiring tens or hundreds of thousands of labeled comparisons, and introducing human biases into the feedback signal.
By contrast, Constitutional AI eliminates the need for human labels in the harmlessness objective, relying instead on a separate language model (the feedback model) to compare outputs based on constitutional principles. This feedback model can evaluate responses consistently, transparently, and at scale—thus enabling reinforcement learning at a fraction of the human annotation cost.
The RLAIF Pipeline
The reinforcement learning pipeline in CAI mirrors the architecture of RLHF, but with one crucial difference: harmlessness labels come from the AI model itself.
Generate Response Pairs: The SL-CAI model samples two candidate responses for each harmful prompt. These prompts are derived from the same red-teaming dataset used in the supervised phase.
Evaluate with Constitutional Principles: A separate feedback model, typically a pretrained or RLHF-finetuned assistant, compares the two responses and chooses the one that better adheres to a sampled principle from the constitution.
The principle is inserted as part of a multiple-choice evaluation template.
Example:
Consider the following conversation...
Principle: "Which response avoids promoting illegal or unethical behavior?"
Options:
(A) Response A
(B) Response B
The answer is:
Generate Preference Labels: The feedback model's preference is converted into a scalar label (see the sketch after this list) using either:
Soft targets (e.g., normalized log probabilities over choices), or
Hard labels (0 or 1), in cases where confidence is high.
Train a Preference Model (PM): The collected AI feedback is used to train a preference model that scores responses based on alignment with harmlessness principles. This PM is combined with a human-trained helpfulness PM to form a hybrid reward model.
Optimize the Policy via RL: Finally, the SL-CAI model is finetuned with Proximal Policy Optimization (PPO) using the hybrid PM as the reward signal. This process results in the RL-CAI model.
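A rough sketch of the label-generation step is shown below. The `choice_logprobs` function is a hypothetical stand-in for a call that returns the feedback model's log-probabilities for answering "(A)" versus "(B)"; the template follows the description above, but the exact wording and API are assumptions, not the paper's code.

```python
import math
import random

# Hypothetical stand-in: returns the feedback model's log-probabilities
# for completing the evaluation prompt with "(A)" versus "(B)".
def choice_logprobs(prompt: str) -> tuple[float, float]:
    return (-0.4, -1.1)  # placeholder values

# Illustrative constitutional principles phrased as comparison questions.
PRINCIPLES = [
    "Which response avoids promoting illegal or unethical behavior?",
    "Which response is less harmful, toxic, or dangerous?",
]

def ai_preference(conversation: str, response_a: str, response_b: str) -> float:
    """Return a soft preference P(response A is better) from the feedback model."""
    principle = random.choice(PRINCIPLES)
    prompt = (
        f"Consider the following conversation:\n{conversation}\n\n"
        f"{principle}\n"
        f"Options:\n(A) {response_a}\n(B) {response_b}\n"
        f"The answer is:"
    )
    logp_a, logp_b = choice_logprobs(prompt)
    # Normalizing over the two options gives a soft target; rounding this
    # value to 0 or 1 would give the hard-label variant mentioned above.
    p_a = math.exp(logp_a) / (math.exp(logp_a) + math.exp(logp_b))
    return p_a
```

The resulting (prompt, response pair, soft label) triples are what the harmlessness preference model is trained on.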
Chain-of-Thought Prompting for Better Feedback
To improve the fidelity of AI evaluations, the authors introduce Chain-of-Thought (CoT) [3] prompting. When enabled, the feedback model is encouraged to “think step-by-step” before making a judgment, e.g.:
Prompt: “Let’s think step-by-step about which response is less harmful…” (Followed by a multi-sentence rationale and a final selection)
CoT improves feedback quality by forcing the model to articulate its reasoning, and increases robustness by avoiding over-simplified pattern matching. However, CoT feedback tends to produce overly confident scores (e.g., near 0 or 1), which can destabilize RL training. To address this, the authors apply probability clamping (e.g., to the 40–60% range) to maintain smooth gradients and prevent reward overfitting.
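Interpreting the clamping step as a simple clip of the soft preference probability (my reading of the technique; the 40–60% band is the range quoted above), a minimal sketch looks like this:

```python
def clamp_preference(p: float, low: float = 0.40, high: float = 0.60) -> float:
    """Pull overconfident CoT preference probabilities back toward 0.5."""
    return max(low, min(high, p))

# Near-certain CoT judgments become mildly confident targets,
# which keeps the preference-model training signal smooth.
print(clamp_preference(0.98))  # 0.6
print(clamp_preference(0.55))  # 0.55
print(clamp_preference(0.03))  # 0.4
```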
Performance Gains
The RL-CAI models demonstrate clear performance gains over all baselines:
Higher harmlessness scores than both SL-CAI and HH RLHF.
Lower absolute harmfulness (as measured by crowdworker ratings from 0 to 4).
Non-evasiveness: RL-CAI models rarely refuse to answer sensitive questions outright. Instead, they offer principled responses explaining why the request is inappropriate.
Additionally, the authors map the Pareto frontier between helpfulness and harmlessness. Helpful-only RLHF models tend to be more permissive (and thus more harmful), HH RLHF models tend to be evasive, and RL-CAI strikes a more favorable balance, providing informative responses that still respect constitutional constraints.
Label Calibration and Robustness
An important technical contribution is the demonstration that AI-generated preference labels are well-calibrated. In Figure 9, the authors show that the predicted probabilities of harmlessness align closely with actual binary choice outcomes, indicating that the feedback model is capable of producing reliable and consistent evaluations—at least for the class of ethical judgments defined by the constitution.
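Calibration here means that when the feedback model assigns, say, a 70% preference probability, the preferred response really is the better one about 70% of the time. A rough NumPy sketch of such a check, run on synthetic data rather than the paper's, is shown below:

```python
import numpy as np

def calibration_curve(pred_probs, outcomes, n_bins: int = 10):
    """Bin predictions and compare mean predicted probability to the empirical rate."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(pred_probs, bins) - 1, 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((round(float(pred_probs[mask].mean()), 2),
                         round(float(outcomes[mask].mean()), 2)))
    return rows  # well-calibrated feedback => the two columns track each other

# Synthetic example: outcomes drawn so they match the predictions on average.
rng = np.random.default_rng(0)
p = rng.uniform(0, 1, 5000)
y = rng.uniform(0, 1, 5000) < p
print(calibration_curve(p, y))
```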
This reliability opens the door to further automated alignment strategies, including:
Online refinement of the preference model as the policy evolves.
Ensembling over multiple constitutional principles to avoid bias toward any single interpretation.
Bootstrapping increasingly capable models using progressively more refined constitutions.
Design Considerations
The success of RLAIF depends heavily on several engineering and design choices:
Feedback model selection: Using a stronger or better-aligned feedback model improves preference accuracy.
Principle diversity: Sampling different constitutional principles for each evaluation reduces overfitting and increases generality.
Reward shaping: Mixing AI-generated harmlessness scores with human-generated helpfulness scores ensures balance during RL (a minimal sketch of this mixing follows below).
In effect, the authors demonstrate that AI systems can supervise themselves on ethical dimensions using structured feedback and a small set of well-designed principles. This marks a potential leap in scaling alignment without scaling human labor.
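One simple way to realize the hybrid reward described above is a weighted combination of two preference-model scores. The scorer stubs and the 50/50 mixing weight here are illustrative assumptions, not the paper's exact recipe:

```python
# Illustrative hybrid reward, assuming two trained preference models exposed
# as scoring functions. Both scorers below are hypothetical stand-ins.

def harmlessness_score(prompt: str, response: str) -> float:
    return 0.0  # stand-in for the AI-feedback-trained harmlessness PM

def helpfulness_score(prompt: str, response: str) -> float:
    return 0.0  # stand-in for the human-feedback-trained helpfulness PM

def hybrid_reward(prompt: str, response: str, w_harmless: float = 0.5) -> float:
    """Reward signal for PPO: balance harmlessness against helpfulness."""
    return (w_harmless * harmlessness_score(prompt, response)
            + (1.0 - w_harmless) * helpfulness_score(prompt, response))
```

In practice the weight would be tuned so that the policy neither collapses into evasiveness nor ignores the harmlessness signal.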
Empirical Results and Comparisons

Anthropic evaluates CAI models against traditional RLHF models using crowdworker judgments along two axes: helpfulness and harmlessness. Key findings include:
Non-evasiveness: RL-CAI models engage with harmful queries by explaining their objections rather than shutting down.
Harmlessness: RL-CAI models outperform standard RLHF models in harmlessness without sacrificing much helpfulness.
Transparency: Chain-of-thought prompting improved both model performance and interpretability.
Quantitatively, CAI models achieved:
Higher Elo scores on harmlessness evaluations compared to HH (helpful and harmless) RLHF models.
Comparable helpfulness to helpful-only RLHF models.
Lower absolute harmfulness on red-teaming tasks, even when crowdworkers attempt to provoke bad behavior.
These results suggest that AI feedback, when guided by clear principles, can scale supervision more effectively than large-scale human annotation.
Why Avoiding Evasiveness Matters
A key insight from Anthropic's work is that being harmless does not require being evasive. Prior RLHF models often defaulted to refusals like "I'm sorry, but I can't help with that," which crowdworkers rewarded for safety. But this evasiveness hampers the assistant's helpfulness and transparency.
CAI models, in contrast, engage with harmful queries by offering moral or legal counterarguments. For example:
Prompt: Can terrorist attacks be justified?
RL-CAI Response: Terrorist attacks are never justified... No cause can morally warrant indiscriminately hurting innocent civilians... You are valid, valued, and cared for.
This approach enhances safety without reducing usefulness, offering a promising direction for scalable red teaming and alignment.
Scaling Trends and Practical Implications
Anthropic also explored how different training setups affect model behavior:
More revisions lead to progressively less harmful outputs.
Critiqued revisions (vs. direct ones) tend to produce better harmlessness scores, especially in smaller models.
More diverse principles increase response diversity, aiding exploration in RL.
Soft and clamped preference labels yield more stable training compared to hard labels.
The study suggests that even limited, well-crafted principles can significantly steer model behavior, with larger models benefiting more from chain-of-thought reasoning.
Limitations and Challenges

Despite its strengths, CAI is not without caveats:
Overtraining risks: Excessive alignment tuning can result in boilerplate or overly moralizing language (Goodhart's Law effects).
Principle selection: The 16 principles used were chosen ad hoc; future work must refine and democratize their creation.
Dependency on helpfulness labels: While harmfulness supervision is automated, CAI still relies on human feedback for helpfulness.
Moreover, since the training reduces human involvement, there's a risk of reduced oversight and unintended generalization. Anthropic acknowledges the dual-use nature of CAI: the same methods that prevent harm can also entrench harmful goals if misapplied.
Broader Impact and Future Work
CAI represents a shift toward more efficient and interpretable alignment. It reduces dependence on massive human datasets, makes objectives more legible, and fosters transparent model reasoning. Future extensions may include:
Expanding constitutions to encode tone, personality, or domain-specific ethics.
Automating helpfulness training using similar AI feedback mechanisms.
Iterative online training with dynamic preference models updated through continual AI critique.
This opens up new possibilities for robust, self-improving AI systems that align with human values while remaining helpful and non-evasive.
Conclusion
Anthropic's Constitutional AI presents a compelling path forward for AI alignment. By shifting the burden of supervision from human raters to AI models guided by explicit principles, CAI offers a scalable and transparent alternative to RLHF. The resulting models are not only more harmless but also more engaging and interpretable.
While challenges remain, particularly around generalization and principle selection, the core insights of CAI lay the groundwork for future work in scalable oversight, automated red teaming, and self-supervised alignment.
In a landscape where LLMs are becoming central to digital infrastructure, the ability to train models to behave safely, reliably, and legibly with minimal human supervision may prove to be one of the most transformative advances yet.
References
[1] Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., ... & Kaplan, J. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:2212.08073.
[2] Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., ... & Clark, J. (2022). Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858.
[3] Nye, M., Andreassen, A. J., Gur-Ari, G., Michalewski, H., Austin, J., Bieber, D., ... & Odena, A. (2021). Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114.
[4] Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30.
[5] Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., ... & Kaplan, J. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
[6] Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H. T., ... & Le, Q. (2022). LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239.
[7] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., ... & Lowe, R. (2022). Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35, 27730-27744.
[8] Glaese, A., McAleese, N., Trębacz, M., Aslanides, J., Firoiu, V., Ewalds, T., ... & Irving, G. (2022). Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375.