Direct Preference Optimization Beyond Chatbots
The article discusses advancements in Direct Preference Optimization (DPO) for improving text transcription accuracy in OCR models. It highlights the limitations of supervised fine-tuning (SFT) in addressing text degeneration issues and presents DPO as a solution. The findings indicate that DPO significantly reduces degeneration rates across various model families, demonstrating its effectiveness as a training tool.
- DharmaOCR, a specialized structured OCR model, was released to improve document extraction tasks.
- Supervised fine-tuning often fails to reduce text degeneration to acceptable levels, with rates varying widely among models.
- Direct Preference Optimization was shown to reduce text degeneration by an average of 59.4%, with some models achieving reductions of up to 87.6%.
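To make the idea concrete, here is a minimal sketch of how degenerate outputs might be turned into DPO rejection pairs and scored with the standard DPO loss. It is not the authors' implementation. The repeated n-gram heuristic `looks_degenerate`, the record fields (`prompt`, `reference`, `model_output`), the use of a reference transcription as the chosen response, and `beta=0.1` are all assumptions made for illustration.

```python
import math


def looks_degenerate(text, n=4, max_repeats=3):
    """Hypothetical heuristic: flag text in which any word n-gram
    occurs more than max_repeats times (a simple repetition-loop check)."""
    tokens = text.split()
    counts = {}
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
        if counts[gram] > max_repeats:
            return True
    return False


def build_pairs(samples):
    """Keep only samples where the model's own output degenerated.
    Assumption: the reference transcription serves as the chosen response."""
    pairs = []
    for s in samples:
        if looks_degenerate(s["model_output"]):
            pairs.append({
                "prompt": s["prompt"],
                "chosen": s["reference"],
                "rejected": s["model_output"],
            })
    return pairs


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given summed sequence log-probabilities
    under the policy and the frozen reference model."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)) written as a numerically stable softplus.
    x = -beta * margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


samples = [
    {"prompt": "page_1.png", "reference": "Invoice total: 42.00",
     "model_output": "total total total total total total total total"},
    {"prompt": "page_2.png", "reference": "Due on receipt",
     "model_output": "Due on receipt"},
]
pairs = build_pairs(samples)
```

Only the first sample yields a pair, because only its output loops. In practice the log-probabilities passed to `dpo_loss` would come from the policy and reference models; a library such as TRL handles that step, so the function here only illustrates the objective.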
Opening excerpt (first ~120 words)
Published June 3, 2026
Authors: Erick Lachmann (ErickvL, Dharma-AI); Pimenta de Freitas Cardoso (GabrielPimenta99, Dharma-AI)
Sections: Using Rejection Pairs From Your Model's Own Failures; The Loop Survives Fine-Tuning; The Design Decision: Degenerate Outputs as Rejection Pairs; Consistent Across Five Model Families; The Pattern Beyond OCR; Sources

Using Rejection Pairs From Your Model's Own Failures
In April, we released DharmaOCR, our specialized structured OCR model (available on Hugging Face), along with a paper detailing the methodology behind it and a benchmark demonstrating its superior quality and cost efficiency.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at Hugging Face Blog.