diff --git a/hf-skills-training.md b/hf-skills-training.md
index d7f4a663c1..b9aa9fd23e 100644
--- a/hf-skills-training.md
+++ b/hf-skills-training.md
@@ -128,6 +128,9 @@ The coding agent analyzes your request and prepares a training configuration. Fo
 >[!NOTE]
 > The `open-r1/codeforces-cots` dataset is a dataset of codeforces problems and solutions. It is a good dataset for instruction tuning a model to solve hard coding problems.
 
+>[!NOTE]
+> This works for vision language models too! You can simply run "Fine-tune Qwen/Qwen3-VL-2B-Instruct on llava-instruct-mix"
+
 ### Review Before Submitting
 
 Before your coding agent submits anything, you'll see the configuration:
@@ -226,6 +229,9 @@ The dataset has 'chosen' and 'rejected' columns.
 > [!WARNING]
 > DPO is sensitive to dataset format. It requires columns named exactly `chosen` and `rejected`, or a `prompt` column with the input. The agent validates this first and shows you how to map columns if your dataset uses different names.
 
+> [!NOTE]
+> You can run DPO using Skills on vision language models too! Try it out with [openbmb/RLAIF-V-Dataset](http://hf.co/datasets/openbmb/RLAIF-V-Dataset). Claude will apply minor modifications but will succeed in training.
+
 ### Group Relative Policy Optimization (GRPO)
 
 GRPO is a reinforcement learning task that is proven to be effective on verifiable tasks like solving math problems, writing code, or any task with a programmatic success criterion.
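
The DPO warning above mentions mapping columns when a preference dataset does not use the exact `chosen`/`rejected` names. A minimal sketch of that mapping with the `datasets` library is shown below; the dataset identifier and original column names are hypothetical, and in practice the agent handles this step for you.

```python
from datasets import load_dataset

# Hypothetical preference dataset whose columns are not named
# "chosen" / "rejected" as DPO expects (names below are assumptions).
dataset = load_dataset("your-org/your-preference-dataset", split="train")

# Rename the columns so a DPO trainer can find the fields it needs.
dataset = dataset.rename_columns({
    "preferred": "chosen",       # assumed original column name
    "dispreferred": "rejected",  # assumed original column name
})

# The dataset now exposes 'chosen' and 'rejected' alongside its other columns.
print(dataset.column_names)
```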