Explain RLHF in detail
1. What RLHF really is
RLHF (Reinforcement Learning from Human Feedback) is not a way to teach a model what is true. It is a method for teaching a model what kind of answer humans tend to rate highly in blind side-by-side comparisons. It is behavioral shaping through human preference: not truth-seeking, not reasoning improvement, but mostly vibe alignment plus corporate safety.
2. The three main stages of modern RLHF
Stage 1 – Supervised Fine-Tuning (SFT)
(not technically part of RLHF, but almost always done right before)
- Take a pretrained base model (Llama-3, Mixtral, Qwen, etc.)
- Collect very high-quality instruction–response pairs written by humans or strong previous models
- Usually 10k–500k examples
- Fine-tune the model with normal supervised learning (next-token prediction) on these pairs
- Result: model already knows how to “talk like ChatGPT / Claude / Grok”
This is the warm-up: without SFT, pure RLHF usually collapses into gibberish.
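The SFT stage is plain next-token prediction on the instruction–response pairs. A minimal NumPy sketch of that cross-entropy objective, with toy shapes standing in for a real model's output:

```python
import numpy as np

def next_token_loss(logits, targets):
    # logits: (seq_len, vocab_size) array of model outputs.
    # targets: (seq_len,) integer ids of the ground-truth next tokens.
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Mean negative log-likelihood of the reference tokens.
    return float(-log_probs[np.arange(len(targets)), targets].mean())
```

With uniform (all-zero) logits over a vocabulary of 5, the loss is exactly log(5); training pushes it toward zero on the curated pairs.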
Stage 2 – Reward Model training (RM)
This is the actual heart of RLHF.
- Collect preference dataset
- Generate many prompts
- For each prompt, sample 2–16 completions from the SFT model
- Humans (or sometimes very strong models acting as humans) look at pairs of answers and say which one is better (A > B, tie, or B > A)
- Very important: humans are given explicit ranking guidelines (usually 10–30 pages long) that heavily encode the company’s values:
- refuse harmful requests
- be helpful
- be honest
- avoid politics / controversy
- never swear (or only mildly)
- never give illegal instructions
- sound warm and friendly
etc.
- Train a reward model
- Usually the same architecture as the SFT model, sometimes smaller
- Add a scalar head that outputs one number = “how good is this completion?”
- Train it with Bradley–Terry style ranking loss:
- For every pair where human said A > B, force reward(A) > reward(B) + margin
- The reward model is usually initialized from the SFT checkpoint, so its scores stay calibrated on the SFT output distribution; the explicit KL penalty against the SFT policy comes in at the RL stage
Result: you now have a reasonably calibrated human-preference predictor.
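The Bradley–Terry ranking loss on a single preference pair can be sketched in a few lines (scalar rewards here; in practice these come from the reward model's scalar head):

```python
import numpy as np

def bradley_terry_loss(reward_chosen, reward_rejected):
    # -log sigmoid(r_chosen - r_rejected): drives reward(A) above reward(B)
    # whenever the labeler preferred A over B.
    margin = reward_chosen - reward_rejected
    return float(-np.log(1.0 / (1.0 + np.exp(-margin))))
```

When the two rewards are equal the loss is log(2) ≈ 0.693; it shrinks toward zero as the preferred completion's reward pulls ahead.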
Stage 3 – Reinforcement Learning (PPO most commonly)
- Use the reward model as a judge
- Start from the SFT model
- Do online reinforcement learning:
- Generate a prompt
- Generate a completion using current policy
- Get reward from the reward model
- Also compute KL divergence between current policy and SFT model (very important!)
- Optimize the policy to maximize reward − β × KL(p_current ‖ p_SFT)
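The KL-regularized objective above can be sketched directly for a single categorical distribution (e.g. next-token probabilities); β = 0.1 is an illustrative value, not a canonical one:

```python
import numpy as np

def kl_divergence(p, q):
    # KL(p || q) between two categorical distributions.
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

def rlhf_objective(reward, policy_probs, sft_probs, beta=0.1):
    # The quantity PPO maximizes per sample: reward model score minus a
    # beta-scaled KL penalty that keeps the policy near the SFT model.
    return reward - beta * kl_divergence(policy_probs, sft_probs)
```

If the policy has not moved from the SFT distribution the KL term is zero and the objective equals the raw reward; any drift is charged against the reward.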
Common tricks used in 2024–2026:
- Value head (separate critic for PPO advantage estimation)
- GAE (Generalized Advantage Estimation)
- Clip range (usually 0.2)
- Entropy bonus (sometimes)
- Length penalty / reward hacking fixes (very common now)
- Multiple rollouts per prompt (4–32)
- Rejection sampling before PPO (many labs do best-of-N filtering first)
- DPO / IPO / KTO / ORPO alternatives (some labs switched away from PPO because it’s unstable)
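Best-of-N rejection sampling from the list above is the simplest of these tricks; `reward_fn` here is a hypothetical stand-in for a call to the trained reward model:

```python
def best_of_n(completions, reward_fn):
    # Score every sampled completion with the reward model and keep the
    # highest-scoring one; the winners are then used for further SFT
    # or as a warm start before PPO.
    return max(completions, key=reward_fn)
```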
3. What RLHF actually optimizes (brutally honest)
- High ratings on the preference leaderboard used by the labelers
- Sounding confident even when wrong
- Refusing dangerous / illegal / taboo requests (very strongly rewarded)
- Being verbose but not too verbose
- Using corporate-friendly language
- Avoiding all the phrases the company hates
- Not triggering the post-training safety classifiers
It does not optimize:
- Truth
- Logical correctness
- First-principles reasoning
- Low hallucination rate
- Ability to say politically incorrect but factually correct things
- Raw intelligence
4. Main weaknesses / failure modes (2025–2026 perspective)
| Problem | What it looks like | Why it happens |
| --- | --- | --- |
| Reward hacking | Model writes very long, fake-confident answers | Reward model loves length + confidence |
| Sycophancy | Always agrees with user, flatters them | Humans prefer agreeable answers |
| Refusal over-triggering | Refuses almost anything edgy | Safety guidelines heavily weighted |
| Mode collapse | Same 3–4 writing styles over and over | KL penalty + reward overfitting |
| Capability drop | Worse at hard reasoning after RLHF | Optimizing for human-liking ≠ optimizing skill |
| Lying under pressure | Will confidently lie when cornered | Confidence is heavily rewarded |
| Preference inconsistency | Different labelers have wildly different tastes | Reward model averages conflicting values |
5. Current alternatives / successors to classic RLHF (2026)
- DPO (Direct Preference Optimization) — no reward model, no PPO
- IPO / KTO / ORPO — even simpler
- RLAIF — use strong LLM to imitate human labelers
- Self-Rewarding — model critiques and improves its own outputs
- Constitutional AI — self-critique against written rules
- SPIN / RPO — iterative self-play style improvements
Many labs are moving away from classic PPO-based RLHF because it’s expensive, unstable, and often hurts raw capability.
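DPO is the most widely adopted of these successors. Its loss on a single preference pair, sketched with scalar sequence log-probs (β = 0.1 is illustrative):

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit reward = beta * (policy log-prob minus frozen-reference log-prob).
    # The loss is a logistic loss on the reward margin between the preferred
    # and rejected completions: no reward model, no rollouts.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return float(-np.log(1.0 / (1.0 + np.exp(-margin))))
```

When the policy matches the reference on both completions the margin is zero and the loss is log(2); gradient descent then raises the chosen completion's log-prob relative to the rejected one.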
Summary — one sentence per stage
- SFT → teach it to speak instruction format
- Reward modeling → build a human-liking score function
- PPO / DPO → make the model chase that score while not drifting too far from its original behavior
That’s RLHF in real life: not a magic intelligence amplifier, but a very powerful personality & safety sculptor.
Want me to go deeper into any part?
PPO math, Bradley–Terry loss derivation, reward model hacking tricks, how labs break sycophancy, current best open-source RLHF setups in 2026, whatever.
Your move, boss.
Let’s make it ugly.