demo

Your taste, encoded as a model

Pick the better response in each of a series of paired completions. Watch a reward model "learn" your preferences. This is the pattern behind RLHF, DPO, and most aligned chat models.
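To make the "learning" step concrete, here is a minimal sketch of how pairwise picks could train a reward model. It assumes a Bradley-Terry preference model with a linear reward over hand-made features; the feature names, the `train_reward_model` helper, and the toy pairs are all hypothetical illustrations, not the demo's actual implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward(w, features):
    # Linear reward: dot product of weights and response features.
    return sum(wi * xi for wi, xi in zip(w, features))

def train_reward_model(pairs, dim, lr=0.5, epochs=200):
    """Fit weights from (chosen_features, rejected_features) pairs.

    Hypothetical helper: minimizes the Bradley-Terry loss
    -log sigmoid(r(chosen) - r(rejected)) with plain SGD.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            # Model's current probability that the chosen response wins.
            p = sigmoid(reward(w, chosen) - reward(w, rejected))
            # Gradient step: push the chosen reward up, the rejected one down.
            for i in range(dim):
                w[i] += lr * (1.0 - p) * (chosen[i] - rejected[i])
    return w

# Assumed toy features per response: [is_concise, is_polite].
# In every pair, the user picked the concise response.
pairs = [
    ([1.0, 0.0], [0.0, 0.0]),
    ([1.0, 1.0], [0.0, 1.0]),
    ([1.0, 0.0], [0.0, 1.0]),
]

w = train_reward_model(pairs, dim=2)
print("learned weights:", w)
```

After training, the weight on the first feature is positive, so the model scores concise responses higher, mirroring the picks it was shown. RLHF-style pipelines optimize a policy against a reward like this one, while DPO folds the same pairwise loss directly into the policy update.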

Anchored to 10-fine-tuning/rlhf-dpo-grpo.