VoiceWriter Voice-Judge Calibration Record

What the judges actually measure

We built six "same-hand" voice judges. One reads English authorial voice cleanly. For Chinese, the instrument scores recognition of the text — not the author's hand.

Dated: 2026-06-16
Span: Sessions 23–30
Method: judge_methodology.md
Flint · reads voice | 马伯庸 · recognition-bound
00

Summary

The intuitive same-hand judge reads English authorial voice without recognition — strip a real Flint passage of its names and the judges still credit the hand. For Chinese it is recognition-bound.

A competent AI fake of a recognized scene scores as high as real Ma (~6.7), while real Ma with its names stripped floors (1.6). So 马伯庸's "band 6.0" is a recognition line, not a voice threshold — no unrecognized text clears it, real Ma included — and our Chinese generation's "below band" was an artifact: the judge never measured our voice.

The asymmetry tracks language, not author — confirmed directly (Session 30, B24b): on the same clean 7-model slate, four capable judges read transposed real Flint above the fakes (Δ +1.9 to +6.4) while none read transposed/obscure real Ma above the fakes. So it is a model-capability wall, not a calibration bug. Flint stays production; the Chinese voice instrument is blocked on model capability — not buildable with current models.

01

The instrument, and what its band certifies

The gate-4 judge is intuitive-holistic: ~8 unlabeled real-author exemplars, anonymized author, the question "are these the same hand?", gentle framing, no checklist, a single VOICEFIT: X/10 confidence line. The band answers "is this the author's own hand?" — a forensic bar, not a fitness-for-purpose one. A below-band score is not "worthless to the consumer," and the band is never lowered to make verdicts green.

Why intuitive, not rubric

A rubric judge — labeled exemplars plus a checklist of countable moves — skews both ways: it penalizes genuine author prose that doesn't tick every box, and rewards box-ticking fakes. Confirmed on three independent voice families:

Voice | Rubric | Intuitive · same prose
马伯庸 zh fiction | recall@0-FP 44% · fakes ≤5.2 | recall@0-FP 78% · fakes ≤2.2
Flint en fiction | our output "6.8, near band" | ~3, with the generic fakes
explainers en/de/fr/zh | mimic ceiling 4.3–5.33 | mimic ceiling 0.3–2.0
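
The recall@0-FP figures in the table can be read with a small sketch. It assumes the metric means "the share of real passages that score strictly above the highest-scoring fake", which this record does not define explicitly. The function name and the sample scores are hypothetical.

```python
def recall_at_zero_fp(real_scores, fake_scores):
    """Share of real passages scoring strictly above the best-scoring fake.

    Assumed definition: the acceptance threshold is set at the highest fake
    score, so no fake passes (zero false positives).
    """
    if not real_scores:
        raise ValueError("need at least one real score")
    threshold = max(fake_scores) if fake_scores else float("-inf")
    accepted = sum(1 for score in real_scores if score > threshold)
    return accepted / len(real_scores)


# Hypothetical scores, for illustration only (not taken from the study).
reals = [8.5, 7.9, 6.4, 5.0, 2.1]
fakes = [2.2, 1.0, 0.4]
print(recall_at_zero_fp(reals, fakes))  # 0.8
```

Under this reading, a judge that lets fakes score higher raises the threshold and loses reals, which is how a rubric judge with fakes up to 5.2 can end up with lower recall than an intuitive judge with fakes capped at 2.2.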

The variable a 2×2 factorial isolated was framing: gentle "same hand?" keeps real and fake correctly separated; harsh "find where it betrays itself" craters real and fake alike — a floorless flaw-hunter.

02

The recognition confound

On material the judge can place in its training memory, the score measures recognition of the specific text, not voice. Watch what a name-swap does:

Figure: 马伯庸 panel, score 0–10. Band 6.0 is the recognition line: unrecognized text floors to its left, recognized text is credited to its right. Where two scores are given, the first is the named version and the second is the same text with names stripped (transposed).

  • Real 殷商 (Ma), the judge's exemplar work: 8.5
  • Real 长安十二时辰 (Ma): 6.62 named → 1.62 transposed
  • AI fake of 长安 (⚑ a fabrication): 6.7 named → 2.4 transposed
  • 风起陇西 (real Ma, native Three-Kingdoms register): 6.52 named → 3.0 transposed
  • Obscure held-out real Ma (诡谜兽 · 海难十日谈): 1.0–1.4
  • Our 1628 generation (new content, never recognizable): 1.7–3.0
  • Different author · Guo: 0.97
  • Flat prose + real names (names without the remembered text): 0.5

Read it as two zones. Everything the judge recognizes (殷商, 长安 named, 风起陇西 named, and AI reconstructions of 长安) sits right of the line at 6.2–8.6. Everything unrecognized (obscure real Ma, transposed real Ma, our generation, a different author) floors at 1–3, indistinguishable. The arrows are the same prose with its names swapped: strip recognition and the score slides through the line.

The load-bearing proof is the fake

Real-Ma-transposed scoring low (1.62), on its own, is consistent with the innocent reading "the name-swap damaged the prose." The cell that refutes it is a fabrication scoring as high as the real author — and it keeps its names, so it is immune to the prose-damage objection:

The comparison that ends the argument
6.7 (AI fake · named) > 1.62 (real Ma · transposed)

A fabrication outscores the author's own prose the moment that prose is made unrecognizable. A fake inheriting the full score proves the high score isn't a measurement of authorship at all. And flat prose + real names = 0.5 (lowest in the study) shows the names are a retrieval key, not the signal.

The four-arm recognition test

Generalized from this — never prove recognition-binding with arm 1 alone:

  1. Real-author transposed — names swapped, prose else byte-identical.
  2. Fake of recognized material, named and transposed — the decisive arm.
  3. Obscure held-out real — a work the model never memorized.
  4. Names-only on flat, voiceless prose.

The decisive contrast is arm 2 named vs arm 1. A judge that lets a fake of recognized material outscore the real author's transposed prose is recognition-bound → rebuild it, or declare it blocked. Mechanism: the judge performs passage-identification — names as the retrieval key — not prose-craft reading.
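
The decisive contrast can be written down as a one-line check. This is a minimal sketch rather than the project's own tooling: the class and field names are illustrative, and the rule encodes only the arm 2 named vs arm 1 comparison described above.

```python
from dataclasses import dataclass


@dataclass
class FourArmScores:
    """Mean judge scores (0-10) for the four arms; field names are illustrative."""
    real_transposed: float  # arm 1: real author, names swapped
    fake_named: float       # arm 2: fake of recognized material, names kept
    fake_transposed: float  # arm 2: the same fake, names swapped
    obscure_real: float     # arm 3: held-out real work the model never memorized
    names_on_flat: float    # arm 4: real names on flat, voiceless prose


def is_recognition_bound(scores: FourArmScores) -> bool:
    """Decisive contrast: does a named fake outscore the real author's transposed prose?"""
    return scores.fake_named > scores.real_transposed


# 马伯庸 panel values as reported in this record; obscure_real uses the
# upper end of the reported 1.0-1.4 range.
ma = FourArmScores(real_transposed=1.62, fake_named=6.7, fake_transposed=2.4,
                   obscure_real=1.4, names_on_flat=0.5)
print(is_recognition_bound(ma))  # True
```

Arms 3 and 4 are carried in the record but not used by the rule. They corroborate the mechanism (obscure reals floor, names alone do not lift flat prose) without deciding the verdict.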

03

The language asymmetry

The same deciders, the same de-confound method, opposite outcomes by language. For English there is a voice-detector under the recognition halo; for Chinese there is nothing under it.

English · Eric Flint READS VOICE ✓

Score 0–10 · transposed real 5.2–6.4 · fakes 3.1–3.3 · gap ≈ 2–3

Names stripped, the judge still credits the hand — recognition adds only ~0.5–2.5 on top.

Chinese · 马伯庸 CAN'T READ ✗

Score 0–10 · transposed real Ma + AI + different author all overlap · 1–3

Transposed/obscure real Ma, AI fakes, and a different author collapse together. No separation.

This generalizes the standing 2026-05-29 "Opus blind to Chinese accent" finding — and Session 30 confirmed it directly. The capability probe (B24b) ran the full 7-model slate on a clean channel (all judges + all generated stimuli on OpenRouter official upstreams, no aiberm), scoring transposed/obscure real author vs clean AI-reconstruction fakes. Four capable judges credit transposed real Flint above the fakes — gpt +3.53, gemini +6.41, qwen +2.61, opus +1.93 — while the same four cannot separate real Ma from fakes (opus −0.32, gpt +0.40 wash, gemini +1.71 noise). opus is the cleanest demonstration: +1.93 English vs −0.32 Chinese. So a transposition-robust maboyong judge is not buildable with current models — a Chinese-authorial-voice-reading wall, not a broken instrument (stimuli re-confirmed clean, judge and generation). The earlier aiberm "opus signal" (3.57) was killed twice — clean judge, then clean fakes.

The wall is language-and-register, not language alone. Re-running the same probe on Lu Changhai's Chinese exposition (B24c, 2026-06-16) shows that a strong few-shot mimic (a frontier model given Lu's own 8 exemplars, writing his hand on a fresh topic) clears the judge at least as easily as recognition-stripped real Lu. On every judge the mimic scores at or above the transposed real: gpt-mimic glm 9.4 / qwen 7.8 / gemini 7.6 / opus 5.1, vs real-Lu-transposed 1.7 / 0.9 / 3.3 / 4.9. So exposition also fails the forensic hand-identity bar: Lu's signature technique is imitable, not a fingerprint. (This corrected an initial over-claim: a "+3.44 opus reads voice" delta against weak fakes, killed by the mimic capstone; see §10.)

But unlike Ma, transposed real Lu scores above flat/generic fakes for the graded judge (opus +2.72 obscure-only, gemini +4.28; Ma's fiction fell below, opus −0.32). The exposition register survives transposition where fiction-narrator voice did not. That supports a fit-for-purpose gate (rejects slop; passes competent on-voice prose, real or mimic), not a forensic one, and only on an opus-led panel: the locked production deciders glm + qwen floor transposed real Lu (1.5 / 1.0, ~recognition-bound like Ma), and the gate passes strong mimics by construction.

04

Length sensitivity

The judge is calibrated at ~415 words and compresses toward 5 ("can't tell") the longer the passage — a long multi-POV surface gives the gestalt read more chances to find one "slightly off" draw.

Real Flint · whole-passage: 7.7 → 4.9 (415w → 3375w). Absolute scores aren't comparable across lengths.
Windowed · ~420w chunks: 6.03 / 7.39 vs negatives ~3.5. Discrimination restored on the unchanged panel.

Compression is judge-specific and flips by language: on English GPT-5.5 collapses while Opus holds; on Chinese it reverses. The fix — sliding-window scoring (window_score.py) — chops the passage into calibration-length windows, scores each on the unchanged panel, and aggregates.
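
A minimal sketch of the windowing idea follows. It is not the contents of window_score.py. The `judge` callable, the 420-word default, the non-overlapping stride, and mean aggregation are all assumptions made for illustration, and splitting on whitespace only suits space-delimited text such as English.

```python
from statistics import mean
from typing import Callable, List, Tuple


def split_windows(text: str, window_words: int = 420,
                  stride_words: int = 420) -> List[str]:
    """Chop text into calibration-length windows of whitespace-delimited words."""
    words = text.split()
    if len(words) <= window_words:
        return [" ".join(words)]
    starts = list(range(0, len(words) - window_words + 1, stride_words))
    last_start = len(words) - window_words
    if starts[-1] != last_start:
        # Anchor a final window at the end so the tail of the passage is scored.
        starts.append(last_start)
    return [" ".join(words[s:s + window_words]) for s in starts]


def window_score(text: str, judge: Callable[[str], float],
                 window_words: int = 420,
                 stride_words: int = 420) -> Tuple[float, List[float]]:
    """Score each window with the unchanged judge and aggregate by mean (assumed)."""
    scores = [judge(w) for w in split_windows(text, window_words, stride_words)]
    return mean(scores), scores


# Stub judge for demonstration only; a real judge would call the panel.
passage = " ".join(["word"] * 1000)
aggregate, per_window = window_score(passage, judge=lambda w: 6.0)
print(len(per_window), aggregate)  # 3 6.0
```

Returning the per-window scores alongside the aggregate keeps a single low window visible instead of averaging it away. Whether mean, median, or minimum is the right aggregate is a calibration question this sketch does not settle.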

05

Axis conflation: composition vs prose

The single VOICEFIT number blends two separable axes — composition (how a scene is built; in the plan) and prose (the wording) — and under-credits composition. Content-constant 2×2 on Flint:

Composition | Prose | voicefit | compfit
Flint | Flint (real) | 7.7 | 9.0
Flint | flat | 2.65 | 6.4
conventional | Flint | 3.1 | 2.0–2.6
neutral | flat | 0.7 | 2.0–2.2

The prose ruler needs both axes maxed; one at a time reads as a fake. The fix is a second instrument — compfit, the same exemplars asking "same construction?" — locked for Flint. compfit_zh is built but provisional (its first-pass used a recognizable scene, so it inherits §2's confound until re-run).

06

Model screening map

DATED 2026-06-12 · re-screen at every model refresh · the protocol is judge_methodology.md §3

Model | Discriminates | Role
GPT-5.5 | every voice: en/de/fr + Chinese gestalt | decide
Opus-4.8 | en/de/fr fully; Chinese coarse-gestalt only, blind on the accent axis | decide · not zh
Fable-5 | everything (reals saturate 9.0); coarse for iteration | certifier
glm-5.1 | Chinese (the validated zh discriminator) | decide · zh
qwen3.7-max | 卢昌海 (zh explainer) | decide · zh
Gemini-3.x | clear cases only (saturating binary) | loose
kimi · deepseek · mistral | can't credit reals / gameable / fooled by mimics | rejected

Recipe-level caution: a model can pass coarse gestalt yet be blind on the fine-grained accent axis. A coarse screen "passing" does not retire an accent finding — to retire it you must beat the accent adversarial (B21). Until then Opus stays off the zh voice panels.

07

Transport

DATED · equivalence is per-judge, not per-relay

Two traps, both first mis-recorded as model defects: (a) slug form differs by relay (aiberm wants bare slugs, OpenRouter prefixed; engine/models.py _route normalizes); (b) max_tokens as a footgun: a low cap makes reasoning-heavy models return empty content.

With both fixed, only three models can't run on aiberm: glm-5.1 (verdict truncated), qwen3.7-max and mistral (not carried). Every other judge matches OpenRouter within ±1, including on Chinese (GPT-5.5 on aiberm = a maboyong real 8.4). So panels route mixed: the incompatible members go to OpenRouter (run_mixed.sh), the rest to aiberm. The maboyong zh panel runs mixed (glm→OR, GPT→aiberm), and only uniflection_zh is OpenRouter-only because both its deciders (glm+qwen) are incompatible. So "zh on OpenRouter" is about those two models, not about aiberm and Chinese.

Validate each judge on a real+fake pair on any new transport; a model-list match is not equivalence.
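
The per-judge validation can be sketched as a comparison of one judge's scores on the same real+fake pair across two transports. The function name, the dictionary layout, and the sample scores are hypothetical. The ±1 tolerance comes from the equivalence figure above, and a missing score stands in for an empty or truncated verdict.

```python
from typing import Dict, Optional


def transport_mismatches(baseline: Dict[str, Optional[float]],
                         candidate: Dict[str, Optional[float]],
                         tolerance: float = 1.0) -> Dict[str, str]:
    """Return {stimulus: reason} wherever one judge disagrees across two transports."""
    problems: Dict[str, str] = {}
    for stimulus, base in baseline.items():
        cand = candidate.get(stimulus)
        if base is None or cand is None:
            # An empty or truncated response yields no score at all.
            problems[stimulus] = "no verdict (empty or truncated response)"
        elif abs(cand - base) > tolerance:
            problems[stimulus] = f"differs by {abs(cand - base):.2f}"
    return problems


# Hypothetical scores for a single judge, for illustration only.
on_known_transport = {"real": 8.4, "fake": 2.0}
on_new_transport = {"real": 8.1, "fake": None}
print(transport_mismatches(on_known_transport, on_new_transport))
# {'fake': 'no verdict (empty or truncated response)'}
```

An empty result means the judge is equivalent on that pair within tolerance. Any entry is a reason to keep that judge on its known transport until the cause (slug form, token cap, truncation) is identified.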

08

Per-voice scoreboard

Voice | Band s/l | Fingerprint | Recognition | Verdict
Flint en | 7.5 / 5.5 | d607d5a83bf3 | clean | Reads voice. Production.
马伯庸 zh | 6.0 / void | 058ef8e6b986 | FAILED | Recognition-bound. Void as voice.
PG en explainer | 8.0 | 7c8927ccae25 | owed | Real 9.5–10 · mimics 2.0 · unaudited reals
Freistetter de | 7.5 | 1982f8398fff | owed | POS 8.1–8.5 · mimics ≤0.5
Louapre fr | 7.0 | 20fcaa5edf31 | owed | POS 7.7–8.4 · mimics ≤0.8
卢昌海 zh explainer | 5.5 recog-infl | 7ab47b0e9175 | B24c · forensic FAIL | Mimic clears it (glm 9.4 / qwen 7.8 / opus 5.1 ≥ real-Lu-tp) → not hand-identity. But register separates (opus +2.72) → fit-for-purpose gate only, opus-led; glm+qwen floor transposed real.

The four explainer bands and Flint's short-form band were calibrated on reals whose recognizability was not audited. By §2's doctrine, a band is only a voice band once it holds on transposed reals — Flint's long-form is clean (transposed ref); the rest owe the audit.

09

Open questions

  • B24(b) — RESOLVED (Session 30): no model reads Chinese voice without recognition. The full slate, screened clean (§3): none credit transposed/obscure real Ma above clean AI fakes; four read transposed real Flint above the fakes. → maboyong is blocked on model capability. Decision for Stephen: a fit-for-purpose bar (recognition-anchored, scoped to 1628) vs hold and re-probe at the next model refresh.
  • B24(c) — 卢昌海 RESOLVED (2026-06-16); other explainers + Flint short-form still owed. The Lu exposition judge fails the forensic bar — a strong few-shot mimic clears it (glm 9.4 / qwen 7.8 / opus 5.1 ≥ real-Lu-transposed); band 5.5 is recognition-inflated. But it differs from Ma: exposition register survives transposition (transposed real Lu > flat fakes: opus +2.72, gemini +4.28), so a fit-for-purpose "competent on-voice explainer" gate is buildable — but only on an opus-led panel (the locked glm+qwen floor transposed real), and it passes mimics by construction. Decisive follow-up (not run): a different real Chinese science-popularizer, transposed, on glm+qwen — register-gate vs Lu-gate. Evidence: recognition_probe_b24c_2026_06_16.json. PG / Freistetter / Louapre + Flint short-form still owe the transposed-reals audit.
  • compfit_zh — the content-constant 2×2, re-run on a non-recognizable scene.
  • B21 — the accent adversarial. A Ma-gestalt passage salted with native-fluency defects, Opus vs native deciders head-to-head — the only test that can retire "Opus blind to Chinese accent."

Honest position: we cannot currently measure Chinese authorial voice without recognition. We don't know how good our Chinese generation is, only that it is unrecognized, the same as real Ma on this judge.

10

The meta-lesson

A coarser or looser instrument flatters a result that a sharper test then corrects.

  • The rubric judge scored our output 6.8; the intuitive judge read ~3.
  • "Prompting is exhausted" — until the restraint test showed from-plan Flint moves ~2 → ~4.8.
  • The Chinese bake-off ranked writers ~6.7 — until name-transposition floored all of them, real Ma included.
  • "Opus blind to Chinese is retired" — until the axis was named: the gestalt screen never probed the accent axis.
  • "Lu's non-fiction reads voice (opus +3.44)" — until the few-shot-mimic capstone showed a strong mimic clears the judge as easily as real Lu: what survives transposition is the register, not the hand. This time the flattering read appeared at the reasoning layer and the sharper stimulus corrected it (B24c).

This is why the findings here are dated, why every "at band" call owes the sharper test, and why the four-arm recognition test and transposition-before-banding are now hard rules. The same flattering-instrument failure the judge doctrine fixes at the measurement layer recurs at the reasoning layer — guard against it there too.

Evidence index

Recognition-binding · discovery (长安 6.62→1.62): judge_recognition_confound_2026_06_15.json
Recognition-binding · confirmed + Flint contrast: judge_recognition_second_test_2026_06_16.json
卢昌海 exposition recognition probe · B24c (forensic FAIL via mimic capstone; register-separation): recognition_probe_b24c_2026_06_16.json
maboyong panel lock + generation re-measure: judge_lock_2026_06_11.json
Flint short-form band 7.5/7.0: intuitive_band_2026_06_11.json
Flint long-form band 5.5 + length compression: longform_band_2026_06_14.json
B17 · Flint named-vs-transposed (~2–3pt): rewrite_threshold_b17_2026_06_11.json
Two-aspect 2×2 + compfit lock: two_aspect_2026_06_13.json
Restraint recipe · from-plan ceiling ~4.8: restraint_findings_2026_06_14.json
Chinese long-form bake-off (VOID) + compfit_zh: longform_and_compfit_2026_06_14.json
Claude-judge screen (Opus coarse-gestalt; Fable-5): claude_judges_2026_06_12.json
Transport equivalence per-judge: transport_aiberm_validation_2026_06_12.json
Explainer validations (en/zh/de/fr): intuitive_validation_2026_06_11.json