Training a small model to write better OCaml with RLVR and GRPO

nilenso · May 20, 2026 · 6:27 PM UTC · 12 min read

TL;DR · WeSearch summary

Kiran Gangadharan describes training a small language model to generate better OCaml code using RLVR and GRPO. The experiment trained a 1.5B-parameter model on a dataset derived from public GitHub repositories, with attention to OCaml's distinctive syntax. Key aspects included defining constraints for local inference and using a graduated reward system to guide the model's learning.
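
The graduated reward mentioned in the summary can be pictured as partial credit for each verification stage a completion clears. The sketch below is a hypothetical illustration, not the article's reward function: the stage names (`type_checks`, `compiles`), the weights, and the test-fraction term are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Outcome of verifying one generated OCaml solution (hypothetical fields)."""
    type_checks: bool   # assumed stage: the code passes the type checker
    compiles: bool      # assumed stage: the code builds successfully
    tests_passed: int   # number of test cases that passed
    tests_total: int    # number of test cases that were run


def graduated_reward(result: CheckResult) -> float:
    """Return a reward in [0, 1] that grows with each stage cleared.

    The weights (0.2, 0.2, 0.6) are illustrative assumptions.
    """
    if not result.type_checks:
        return 0.0
    reward = 0.2  # partial credit for type-checking
    if not result.compiles:
        return reward
    reward += 0.2  # further credit for compiling
    if result.tests_total > 0:
        # remaining credit scales with the fraction of tests passed
        reward += 0.6 * result.tests_passed / result.tests_total
    return reward
```

Compared with an all-or-nothing reward, a scheme of this shape still separates a completion that compiles but fails its tests from one that does not type-check at all.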

Key facts

▪ The model was trained on a small dataset of programming problems adapted to OCaml.
▪ A single rented GPU was used for training, with LoRA to reduce memory requirements.
▪ The training loop used Hugging Face's trl library for GRPO integration.
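
GRPO scores each completion relative to the other completions sampled for the same prompt, rather than against a separately trained value model. A minimal sketch of that normalization step in plain Python follows. It is not trl's implementation; the function name, the `eps` value, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps); eps (assumed
    value) avoids division by zero when every completion scores the same.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal, which is one reason a reward with intermediate values can be useful.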

Original article

nilenso blog · nilenso

Opening excerpt (first ~120 words)

Kiran Gangadharan · 18 May 2026

For a while now, I’ve been interested in exploring the capabilities of small language models. When my colleague Atharva introduced me to RLVR and GRPO for doing RL training without a human feedback loop, I wanted to know more. In the previous post, we explored the workings of RLVR and GRPO. In this post, I’ll walk through a code-generation experiment where I trained a small 1.5B model with GRPO, improved its ability to generate correct and valid OCaml code, and share what I learned along the way.

…

Excerpt limited to ~120 words for fair-use compliance. The full article is at nilenso blog.
