Tenacious-Bench: Building a Sales Domain Evaluation Benchmark When No Dataset Exists

May 1, 2026 · 7:13 PM UTC · 4 min read

#machine learning #llm evaluation #sales automation #benchmarking #natural language processing #Tenacious Consulting #τ²-Bench #GPT-4o-mini #Llama-3.1-70B #Qwen2.5-0.5B-Instruct #Google Colab #Li et al. #Lidya Dagnew


⚡ TL;DR · AI summary

Tenacious Consulting developed Tenacious-Bench, an evaluation benchmark for sales-domain language models. It addresses a gap in general-purpose LLM benchmarks, which do not assess whether an outreach email is aligned to its buyer segment. The benchmark was built from programmatic generation, multi-LLM synthesis, and hand-authored adversarial examples that reflect real-world sales pipeline failure modes. A DPO-trained judge model reached 74% accuracy in detecting judgment errors, compared with 48% for rule-based evaluators and 22% for zero-shot prompting.
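The judge in this summary is described as DPO-trained and scored with an implicit reward. As a minimal sketch of that idea, the standard DPO implicit reward is beta times the gap between the policy's and the reference model's log-probability of a response. The class and function names, the beta value, and the log-probabilities below are all hypothetical, and the pairwise comparison is an assumption about how such a judge could be applied, not the article's confirmed setup.

```python
from dataclasses import dataclass


@dataclass
class SequenceLogProbs:
    """Summed token log-probabilities of one candidate email under two models."""
    policy: float     # log pi_theta(y | x), from the DPO-trained judge
    reference: float  # log pi_ref(y | x), from the frozen base model


def implicit_reward(lp: SequenceLogProbs, beta: float = 0.1) -> float:
    """DPO implicit reward: beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    return beta * (lp.policy - lp.reference)


def judge_prefers_first(a: SequenceLogProbs, b: SequenceLogProbs,
                        beta: float = 0.1) -> bool:
    """Return True when candidate `a` gets the higher implicit reward.

    Both candidates must answer the same prompt, so the prompt-dependent
    constant in the DPO reward cancels out of the comparison.
    """
    return implicit_reward(a, beta) > implicit_reward(b, beta)


# Hypothetical log-probabilities, for illustration only.
on_segment = SequenceLogProbs(policy=-42.0, reference=-45.0)
off_segment = SequenceLogProbs(policy=-51.0, reference=-47.0)

print(judge_prefers_first(on_segment, off_segment))
```

In practice the two log-probabilities would come from scoring the same email with the fine-tuned judge and with its frozen starting checkpoint; that step is left out here to keep the sketch dependency-free.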

Key facts

▪ Tenacious-Bench addresses the lack of public benchmarks for evaluating B2B sales outreach emails tailored to specific buyer segments.
▪ The dataset was constructed from scratch using a four-mode pipeline: programmatic expansion, multi-LLM synthesis, hand-authored adversarial cases, and contamination prevention checks.
▪ Eight real-world failure modes were identified, including segment misrouting, tone drift, and AI maturity mismatch, each validated against actual pipeline data.
▪ A DPO-trained Qwen2.5-0.5B-Instruct judge model using implicit reward achieved 74% accuracy, compared to 48% for rule-based evaluators and 22% for zero-shot prompting.
▪ Contamination prevention included n-gram overlap checks, embedding similarity thresholds, and time-shift verification to ensure data integrity.
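The n-gram overlap check named in the last key fact can be sketched as follows. This is an illustrative version only: the summary does not state the n-gram size or the overlap threshold, so `n`, `max_overlap`, the function names, and the sample texts are all assumptions. The embedding-similarity and time-shift checks are not covered.

```python
from __future__ import annotations


def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lowercase, split on whitespace, and collect all word n-grams."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(candidate: str, reference: str, n: int = 8) -> float:
    """Fraction of the candidate's n-grams that also occur in the reference."""
    cand = word_ngrams(candidate, n)
    if not cand:
        return 0.0  # candidate shorter than n words: nothing to compare
    return len(cand & word_ngrams(reference, n)) / len(cand)


def is_contaminated(held_out: str, train_items: list[str],
                    n: int = 8, max_overlap: float = 0.0) -> bool:
    """Flag a held-out item whose overlap with any training item exceeds the limit."""
    return any(ngram_overlap(held_out, t, n) > max_overlap for t in train_items)


# Hypothetical texts; n=3 keeps the example short.
train = ["we help series b fintech teams scale their data platform"]
leaked = "we help series b fintech teams hire faster"
clean = "your team recently opened three ml engineering roles"

print(is_contaminated(leaked, train, n=3, max_overlap=0.5))  # True
print(is_contaminated(clean, train, n=3, max_overlap=0.5))   # False
```

Here `leaked` shares four of its six trigrams with the training item, so it exceeds the 0.5 limit, while `clean` shares none.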

Original article

DEV.to (Top)


Opening excerpt (first ~120 words)

Lidya Dagnew · Posted on May 1

#machinelearning #llm #python

The Gap

General-purpose LLM benchmarks like τ²-Bench evaluate task completion in retail domains: cancelling orders, processing returns, checking inventory.

…

Excerpt limited to ~120 words for fair-use compliance. The full article is at DEV.to (Top).
