WritingPreferenceBench

Beyond Correctness: Evaluating Subjective Writing Preferences Across Cultures

Cross-Cultural
Creative Writing
Preference Learning
RLHF Evaluation

Benchmark Overview

The first systematic evaluation of subjective writing preferences across cultures

1,800
Preference Pairs
21
Models Evaluated
8
Writing Genres
2
Languages (EN/ZH)

Current preference learning methods achieve high accuracy on standard benchmarks but exhibit significant performance degradation when objective quality signals are removed. We introduce WritingPreferenceBench, a dataset of 1,800 human-annotated preference pairs (1,200 English, 600 Chinese) across 8 creative writing genres, where responses are matched for objective correctness, factual accuracy, and length.

On this benchmark, sequence-based reward models—the standard architecture for RLHF—achieve only 52.7% mean accuracy, while zero-shot language model judges perform at 53.9%. In contrast, generative reward models that produce explicit reasoning chains achieve 81.8% accuracy. We observe high within-model variance across genres: individual models range from 18.2% to 81.8% accuracy across different writing categories, with standard deviations averaging 10.1%.
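
To make the evaluation protocol concrete, here is a minimal sketch of how pairwise accuracy is typically computed for a sequence-classifier reward model. The checkpoint name and input formatting are illustrative assumptions, not the benchmark's official harness:

```python
# Minimal sketch: pairwise preference accuracy for a sequence-classifier
# reward model (SC-RM). Checkpoint name and formatting are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "my-org/example-reward-model"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=1)
model.eval()

def reward(prompt: str, response: str) -> float:
    """Scalar reward for one (prompt, response) pair."""
    text = prompt + "\n\n" + response  # simple concatenation; real RMs often use chat templates
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0, 0].item()

def pairwise_accuracy(pairs) -> float:
    """Fraction of pairs where the human-chosen response outscores the rejected one."""
    hits = sum(
        reward(p["prompt"], p["chosen"]) > reward(p["prompt"], p["rejected"])
        for p in pairs
    )
    return hits / len(pairs)
```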

This variance persists regardless of model scale, with 27B parameter models showing no consistent improvement over 8B variants. Our results suggest that current RLHF methods primarily learn to detect objective errors rather than capture subjective quality preferences (e.g., creativity, stylistic flair, and emotional resonance), and that successful preference modeling may require intermediate reasoning representations rather than direct classification.
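
The genre-level instability can be quantified directly. A small sketch, assuming each evaluation record carries a genre label and a boolean outcome (field names are illustrative):

```python
# Sketch: per-genre accuracy and cross-genre standard deviation for one model.
# Assumes each record has a "genre" label and a boolean "correct" flag
# (chosen response scored above rejected); field names are assumptions.
from collections import defaultdict
from statistics import pstdev

def genre_stability(records):
    by_genre = defaultdict(list)
    for r in records:
        by_genre[r["genre"]].append(r["correct"])
    acc = {g: sum(v) / len(v) for g, v in by_genre.items()}
    # A large spread here is the instability the benchmark reports
    # (e.g., 18.2%-81.8% for a single model across genres).
    return acc, pstdev(acc.values())
```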

Key Findings

Revealing the catastrophic failure of current RLHF architectures on subjective tasks

52.7%

Sequence Classifiers

Standard RLHF architectures perform near-randomly on subjective preference tasks when objective signals are removed

81.8%

Generative Reward Models

Models with explicit reasoning chains achieve 29 percentage points higher accuracy than sequence classifiers (a minimal judging loop is sketched after these findings)

10.9%

Genre Instability

Mean standard deviation across genres reveals severe brittleness, with individual models ranging from 18.2% to 92.0% accuracy

53.9%

LLM Judges

Zero-shot language models systematically underperform specialized reward models despite orders of magnitude more parameters
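
One reading of the gap between sequence classifiers and generative reward models is that the explicit critique serves as an intermediate reasoning representation before the verdict. Below is a minimal sketch of such a judging loop; the prompt template, verdict format, and `generate` callable are assumptions, not the benchmark's actual prompts:

```python
# Sketch of a generative reward model (GenRM) judgment: the judge writes an
# explicit critique first, then a final verdict that is parsed from the text.
# JUDGE_TEMPLATE and the VERDICT format are illustrative assumptions.
import re

JUDGE_TEMPLATE = """Compare two responses to the same creative writing prompt.
First reason step by step about creativity, stylistic flair, and emotional
resonance. Then end with exactly one line: VERDICT: A or VERDICT: B.

Prompt: {prompt}

Response A: {a}

Response B: {b}
"""

def genrm_prefers_chosen(generate, prompt, chosen, rejected):
    """True if the parsed verdict picks the human-chosen response.

    `generate` is any text-completion callable (str -> str). In practice the
    A/B order should be randomized across examples to control position bias.
    """
    output = generate(JUDGE_TEMPLATE.format(prompt=prompt, a=chosen, b=rejected))
    match = re.search(r"VERDICT:\s*([AB])", output)
    return match is not None and match.group(1) == "A"
```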

The Core Discovery

WritingPreferenceBench reveals the catastrophic failure of current RLHF architectures

Figure 1: WritingPreferenceBench isolates subjective writing quality by neutralizing objective confounds (grammar, factuality, length). Across 1,800 human-validated preference pairs, standard sequence classifiers (SC-RM) perform near-randomly while generative reward models (GenRM) achieve roughly 29 percentage points higher accuracy, yet both architectures exhibit catastrophic instability across genres, exposing the brittleness of current preference learning.

Model Performance Breakdown

Comprehensive comparison across architectures and model scales

Reward Models

Model                Architecture  Scale  EN Acc  ZH Acc  Std. Dev.  Best Genre    Worst Genre
RM-R1-Qwen2.5-7B     Generative    7B     81.8%   73.3%   10.2%      Poetry (92%)  Promo (56%)
RM-R1-DeepSeek-14B   Generative    14B    62.5%   62.6%   5.5%       Poetry (90%)  Promo (46%)
RM-Mistral-7B        Sequence      7B     62.6%   55.6%   9.6%       Script (78%)  RP (45%)
Skywork-Gemma-27B    Sequence      27B    46.8%   51.2%   11.9%      Poetry (82%)  Script (22%)
Nvidia/AceMath-7B    Sequence      7B     46.8%   53.5%   11.8%      RP (62%)      Poetry (18%)

LLM Judges

Model                   Type       EN Acc  ZH Acc  Std. Dev.  Best Genre     Worst Genre
Doubao-1.5-Pro          Standard   68.7%   62.5%   9.0%       Promo (75%)    Promo (48%)
Gemini-2.5-Pro          Standard   65.7%   62.7%   11.9%      Poetry (80%)   Script (35%)
Claude-4-Opus-thinking  Reasoning  61.0%   56.0%   9.7%       Fiction (73%)  Promo (36%)
DeepSeek-R1             Reasoning  49.3%   52.0%   12.1%      Poetry (72%)   Script (17%)
OpenAI-o3-high          Reasoning  48.1%   42.0%   12.5%      Poetry (72%)   Script (22%)

Methodology

Rigorous human-in-the-loop pipeline ensures reliable subjective preference isolation

Signal Neutralization

Systematic removal of objective quality signals (grammar, factuality, length) through automated filtering and human validation to isolate pure subjective preference
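
As one concrete example of this filtering, a length screen might look like the sketch below; the 10% tolerance is an illustrative assumption, and grammar and factuality screens would run similarly before human validation:

```python
# Sketch of one neutralization step: drop candidate pairs whose length gap
# could act as an objective signal. The 10% tolerance is an assumption.
def length_matched(chosen: str, rejected: str, tol: float = 0.10) -> bool:
    la, lb = len(chosen), len(rejected)
    return abs(la - lb) / max(la, lb, 1) <= tol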

Expert Annotation

11 expert annotators with professional writing backgrounds, using a calibrated 4-point creativity scale across universal and genre-specific quality criteria
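
A sketch of how such per-response ratings could be converted into preference pairs, keeping only clear preferences; the minimum score gap of 2 is an illustrative assumption:

```python
# Sketch: building preference pairs from 4-point creativity ratings.
# min_gap=2 is an illustrative threshold for a clear preference.
from itertools import combinations

def pairs_from_ratings(responses, min_gap: int = 2):
    """responses: list of (text, score) tuples with score in {1, 2, 3, 4}."""
    for (a, sa), (b, sb) in combinations(responses, 2):
        if abs(sa - sb) >= min_gap:
            chosen, rejected = (a, b) if sa > sb else (b, a)
            yield {"chosen": chosen, "rejected": rejected}
```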

Cross-Cultural Validation

Bilingual dataset construction with equivalent methodological standards applied across English and Chinese literary contexts

Comprehensive Evaluation

Systematic assessment of 21 state-of-the-art models, spanning reward models, LLM judges, and reasoning-enhanced variants

Interactive Demo

Explore sample preference pairs from our dataset


Citation

Reference this work in your research

BibTeX
@misc{ying2025writingpreferencebench,
  title={Beyond Correctness: Evaluating Subjective Writing Preferences Across Cultures},
  author={Shuangshuang Ying and Yunwen Li and Xingwei Qu and Xin Li and Sheng Jin and Minghao Liu and Zhoufutu Wen and Xeron Du and Tianyu Zheng and Yichi Zhang and Letian Ni and Yuyang Cheng and Qiguang Chen and Jingzhe Ding and Shengda Long and Wangchunshu Zhou and Jiazhan Feng and Wanjun Zhong and Libo Qin and Wenhao Huang and Wanxiang Che and Chenghua Lin and Ge Zhang},
  year={2025},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={}
}