demo
Your taste, encoded as a model
Pick the better response in a series of paired completions and watch a reward model "learn" your preferences. This is the preference-learning pattern behind RLHF and DPO, methods widely used to align chat models.
Anchored to 10-fine-tuning/rlhf-dpo-grpo.
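To make the idea concrete, here is a minimal sketch of how a reward model can be fit to pairwise picks. It assumes a Bradley-Terry style objective, where the probability that the chosen response wins is the sigmoid of the reward difference, and a linear reward over two made-up features. The feature names, the toy pairs, and the helper names (`reward`, `train_reward_model`) are illustrative assumptions, not the demo's actual implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reward(w, feats):
    """Linear reward: dot product of weights and response features."""
    return sum(wi * fi for wi, fi in zip(w, feats))


def train_reward_model(pairs, dim, lr=0.5, epochs=200):
    """Fit weights by gradient ascent on the pairwise log-likelihood.

    Each pair is (chosen_features, rejected_features).
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            # Modelled probability that the chosen response is preferred.
            p = sigmoid(reward(w, chosen) - reward(w, rejected))
            # Gradient of log p w.r.t. w is (1 - p) * (chosen - rejected).
            for i in range(dim):
                w[i] += lr * (1.0 - p) * (chosen[i] - rejected[i])
    return w


# Hypothetical features per response: [is_concise, cites_source].
# In every pair the first response is the one the user picked.
pairs = [
    ([1.0, 1.0], [0.0, 0.0]),
    ([1.0, 0.0], [0.0, 0.0]),
    ([0.0, 1.0], [0.0, 0.0]),
    ([1.0, 1.0], [0.0, 1.0]),
]

w = train_reward_model(pairs, dim=2)
print("learned weights:", w)
print("concise + cited scores higher than neither:",
      reward(w, [1.0, 1.0]) > reward(w, [0.0, 0.0]))
```

Each pick nudges the weights so that the chosen response scores higher than the rejected one. In this sketch the learned weights are the encoded "taste"; a real reward model would replace the hand-made features with a neural network over the response text.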