2026-07-04 prose
LLM benchmarks, and writing one · outline
what it takes to write an LLM benchmark, worked out on tropes
- a benchmark: fixed tasks, known answers, models get scored on them
- the domain: storytelling tropes, because I enjoy them
step 1: create a vocabulary
- 100 to 300 tropes from TVTropes, hand curated, with clear rules
- the test of a rule: two readers applying it independently reach the same answer
- drop what is too vague to grade, e.g., Foreshadowing
- the taxonomy is swappable; the skill tested stays the same
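
A minimal sketch of how the vocabulary could be stored, assuming each trope is a name plus one written rule. The `Trope` class, the `load_vocabulary` helper, and the size check are hypothetical names for illustration; the outline does not fix a format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trope:
    name: str  # label exactly as it should appear in answers
    rule: str  # written rule a reader applies to decide present / absent


def load_vocabulary(entries, min_size=100, max_size=300):
    """Build a name -> Trope mapping from (name, rule) pairs.

    Rejects duplicate names, empty rules, and vocabularies outside the
    size range; the defaults mirror the 100-to-300 target in the outline.
    """
    vocab = {}
    for name, rule in entries:
        if not rule.strip():
            raise ValueError(f"trope {name!r} has no rule")
        if name in vocab:
            raise ValueError(f"duplicate trope {name!r}")
        vocab[name] = Trope(name=name, rule=rule)
    if not min_size <= len(vocab) <= max_size:
        raise ValueError(
            f"expected {min_size} to {max_size} tropes, got {len(vocab)}"
        )
    return vocab
```

Because the vocabulary is plain data handed to the loader, swapping the taxonomy would mean passing a different list of pairs; nothing downstream needs to know where the names came from.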
step 2: find the text
- the problem: models have already seen many passages with their trope labels attached
- that rules out any work with a TVTropes page, and any popular story
- sources: my own synopses, undiscussed works, mid-range r/WritingPrompts
step 3: label the text
- two people, independently, with agreement measured
- I don't know yet who the second person is
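
One way "agreement measured" could be made concrete is Cohen's kappa over every (passage, trope) yes/no decision. The outline does not name a metric, so the choice of kappa, the function name `cohens_kappa`, and the input shape (a dict from passage id to a set of trope names per labeler) are all assumptions of this sketch.

```python
def cohens_kappa(labels_a, labels_b, vocabulary):
    """Cohen's kappa for two labelers over binary (passage, trope) decisions.

    labels_a, labels_b: dict mapping passage id -> set of trope names.
    vocabulary: iterable of every trope name a labeler could have chosen.
    """
    if labels_a.keys() != labels_b.keys():
        raise ValueError("both labelers must cover the same passages")
    tropes = list(vocabulary)
    n = agree = yes_a = yes_b = 0
    for pid in labels_a:
        for trope in tropes:
            a = trope in labels_a[pid]
            b = trope in labels_b[pid]
            n += 1
            agree += a == b
            yes_a += a
            yes_b += b
    if n == 0:
        raise ValueError("no decisions to compare")
    p_observed = agree / n
    p_a, p_b = yes_a / n, yes_b / n
    # Agreement expected by chance if each labeler said "yes" at their own rate.
    p_expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if p_expected == 1:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (p_observed - p_expected) / (1 - p_expected)
```

With a vocabulary of a few hundred tropes most decisions are "absent", which is why raw percent agreement would look flattering here; kappa discounts the agreement expected from that imbalance.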
step 4: grade the answers
- list against list, no partial credit, no judge to persuade
- read 50 transcripts before trusting a score
- probe each model for whether it recognizes the passages
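
A sketch of list-against-list grading under one reading of "no partial credit": a passage scores 1 only when the predicted set of trope names equals the labeled set exactly, with no string normalization. The function names `score_passage` and `score_run` are hypothetical, and a per-trope reading of the rule would need a different scorer.

```python
def score_passage(predicted, gold):
    """Return 1 if the predicted tropes match the gold tropes exactly, else 0.

    Order and repeats are ignored; names are compared as exact strings.
    """
    return 1 if set(predicted) == set(gold) else 0


def score_run(predictions, gold_labels):
    """Fraction of passages answered exactly right.

    predictions, gold_labels: dict mapping passage id -> list of trope names.
    A passage with no prediction is scored as an empty answer.
    """
    if not gold_labels:
        raise ValueError("no gold labels to score against")
    total = sum(
        score_passage(predictions.get(pid, []), gold)
        for pid, gold in gold_labels.items()
    )
    return total / len(gold_labels)
```

For example, with gold labels `{"p1": ["A", "B"], "p2": ["C"]}` and predictions `{"p1": ["B", "A"], "p2": ["C", "D"]}`, the first passage matches regardless of order and the second fails on the extra trope, so `score_run` returns 0.5.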
step 5: publish
- harness and demo set public, scored labels private
- private labels alone don't stop someone labeling the test set on purpose; only private passages do
- later: also label whether each trope is played straight, subverted, or inverted