Show HN: Reward Is Not Reinforcement Until Admitted
The article describes a minimal experiment built to test the thesis that reward is not reinforcement until it is admitted. It uses a ranking-only setup: each synthetic coding task receives several candidate patch outcomes, and selectors are compared on which patch they choose. Results and metrics are written to reports and JSON files for further analysis.
- The experiment uses a ranking-only setup rather than model fine-tuning.
- It compares a raw selector (highest raw reward) with a governed selector (highest admitted reward) on the patch each one chooses.
- Results are documented in reports and JSON files for detailed analysis.
Opening excerpt (first ~120 words)
Governed Reward Experiment

Minimal runnable experiment for the thesis: Reward is not reinforcement until admitted.

The experiment uses a ranking-only setup rather than model fine-tuning. Each synthetic coding task receives several candidate patch outcomes. A raw selector chooses the patch with the highest raw reward, while a governed selector chooses the patch with the highest admitted reward after invariant, exploit, causal, hidden-test, and delayed-regression checks.

Run

    python3 governed_reward_experiment.py

Optional parameters:

    python3 governed_reward_experiment.py --tasks 100 --candidates 7 --seed 11

Run the multi-seed selector and ablation suite:

    python3 governed_reward_experiment.py \
      --suite \
      --tasks 100 \
      --seed-start 10 \
      --seed-end 30 \
      --candidate-grid 3,5,7,10 \
      --out…
Excerpt limited to ~120 words for fair-use compliance. The full article is at GitHub.
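The raw-versus-governed comparison described in the excerpt can be sketched in a few lines of Python. This is an illustrative sketch only, not the repository's code: the `Candidate` class, the rule that a patch's admitted reward equals its raw reward when every check passes and zero otherwise, the 0.8 pass probability, and the `run` helper are all assumptions made for the example. Only the five check names and the default values 100, 7, and 11 come from the excerpt.

```python
import random
from dataclasses import dataclass

# Check names come from the excerpt; how each check is actually computed
# is not described there, so this sketch draws pass/fail at random.
CHECKS = ("invariant", "exploit", "causal", "hidden_test", "delayed_regression")


@dataclass
class Candidate:
    raw_reward: float  # reward seen by the raw selector
    checks: dict       # check name -> True if the patch passed that check


def admitted_reward(c):
    # Assumption: reward is admitted only if every check passes.
    return c.raw_reward if all(c.checks.values()) else 0.0


def raw_selector(cands):
    # Picks the highest raw reward and ignores the checks entirely.
    return max(cands, key=lambda c: c.raw_reward)


def governed_selector(cands):
    # Picks the highest admitted reward; returns None if nothing is admitted.
    best = max(cands, key=admitted_reward)
    return best if admitted_reward(best) > 0 else None


def make_task(rng, n_candidates):
    # Hypothetical synthetic task: rewards and check outcomes are random.
    return [
        Candidate(
            raw_reward=rng.uniform(0.1, 1.0),
            checks={name: rng.random() < 0.8 for name in CHECKS},
        )
        for _ in range(n_candidates)
    ]


def run(tasks=100, candidates=7, seed=11):
    # Returns the fraction of tasks where each selector's pick passes all checks.
    rng = random.Random(seed)
    raw_ok = gov_ok = 0
    for _ in range(tasks):
        cands = make_task(rng, candidates)
        raw_ok += admitted_reward(raw_selector(cands)) > 0
        gov_ok += governed_selector(cands) is not None
    return raw_ok / tasks, gov_ok / tasks


if __name__ == "__main__":
    print(run())
```

Under these assumptions the governed selector's pass rate can never fall below the raw selector's, because whenever the raw pick passes every check, at least one admitted candidate exists for the governed selector to choose. The size of the gap depends entirely on the made-up pass probability, so it says nothing about the article's actual results.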