Training a small model to write better OCaml with RLVR and GRPO
Kiran Gangadharan describes training a small language model to generate better OCaml code using RLVR (reinforcement learning with verifiable rewards) and GRPO (Group Relative Policy Optimization). The experiment trained a 1.5B-parameter model on a dataset derived from public GitHub repositories, focusing on OCaml's unique syntax. Key aspects included defining constraints for local inference and using a graduated reward system to give the model a richer learning signal.
- The model was trained on a small dataset of programming problems adapted to OCaml.
- A single rented GPU was used for training, with LoRA to reduce memory requirements.
- The training loop used Hugging Face's trl library for GRPO integration.
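To make the graduated reward and GRPO ideas above concrete, here is a minimal sketch in plain Python. It is not the author's code. The tier names (`parses`, `type_checks`, `passes_tests`) and their weights are hypothetical, chosen only to illustrate partial credit. The advantage calculation follows the usual GRPO recipe of normalizing each reward against the other completions sampled for the same prompt. Using the population standard deviation is an assumption of this sketch.

```python
from statistics import mean, pstdev
from typing import Mapping, Sequence

# Hypothetical reward tiers (assumption): the summary does not give the
# article's actual stages or weights.
TIERS = (
    ("parses", 0.2),        # the completion is syntactically valid OCaml
    ("type_checks", 0.3),   # it also compiles / type-checks
    ("passes_tests", 0.5),  # it also passes the problem's tests
)


def graduated_reward(checks: Mapping[str, bool]) -> float:
    """Give partial credit for each stage reached, stopping at the first failure."""
    total = 0.0
    for name, weight in TIERS:
        if not checks.get(name, False):
            break
        total += weight
    return total


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> list[float]:
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so a
    completion is scored relative to its siblings, not on an absolute scale.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std: an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt, each with its verification results.
group = [
    {"parses": False},
    {"parses": True, "type_checks": False},
    {"parses": True, "type_checks": True, "passes_tests": False},
    {"parses": True, "type_checks": True, "passes_tests": True},
]
rewards = [graduated_reward(c) for c in group]
advantages = group_relative_advantages(rewards)
```

With graded rewards, a group in which no completion passes the tests can still produce non-zero advantages, because a completion that type-checks outranks one that does not even parse. With a strict pass/fail reward, that same group would contribute no learning signal.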
Opening excerpt (first ~120 words)
Kiran Gangadharan, 18 May 2026

For a while now, I’ve been interested in exploring the capabilities of small language models. When my colleague Atharva introduced me to RLVR and GRPO for doing RL training without a human feedback loop, I wanted to know more. In the previous post, we explored the workings of RLVR and GRPO. In this post, I’ll walk through a code-generation experiment where I trained a small 1.5B model with GRPO, improved its ability to generate correct and valid OCaml code, and share what I learned along the way.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is on the nilenso blog.