2026-07-04 prose

LLM benchmarks, and writing one · outline

what it takes to write an LLM benchmark, worked out on tropes
- a benchmark: fixed tasks, known answers, models get scored on them
- the domain: storytelling tropes, because I enjoy them
- step 1: create a vocabulary
  - 100 to 300 tropes from TVTropes, hand curated, with clear rules
  - the test of a rule: two readers applying it independently agree
  - drop what is too vague to grade, e.g., Foreshadowing
  - the taxonomy is swappable; the skill tested stays the same
- step 2: find the text
  - the problem: models have seen many passages with their trope labels attached
  - this rules out popular stories and anything with a TVTropes page
  - sources: my own synopses, undiscussed works, mid-range r/WritingPrompts
- step 3: label the text
  - two people, independently, with agreement measured
  - I don't know yet who the second person is
- step 4: grade the answers
  - list against list, no partial credit, no judge to persuade
  - read 50 transcripts before trusting a score
  - probe each model for whether it recognizes the passages
- step 5: publish
  - harness and demo set public, scored labels private
  - only private passages hold up against someone deliberately labeling the test set
- later: grade whether each trope is played straight, subverted, or inverted
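
A minimal sketch of steps 3 and 4, under stated assumptions. The outline says only "agreement measured" and "list against list, no partial credit"; it does not name a metric or a data format. Here agreement is taken to be the Jaccard overlap of the two labelers' trope sets, and grading is taken to be order-insensitive exact match after case folding. The function names, the dict layout keyed by passage id, and the trope names in the demo are all hypothetical.

```python
def normalize(labels):
    """Case-fold and trim labels; a set, so order and duplicates are ignored (assumption)."""
    return frozenset(label.strip().lower() for label in labels)


def grade(predicted, gold):
    """1 if the predicted trope set equals the gold set exactly, else 0. No partial credit."""
    return int(normalize(predicted) == normalize(gold))


def score(predictions, gold):
    """Fraction of gold passages answered exactly right.

    A passage missing from `predictions` is graded as an empty answer.
    """
    if not gold:
        raise ValueError("gold set is empty")
    correct = sum(grade(predictions.get(pid, []), labels) for pid, labels in gold.items())
    return correct / len(gold)


def agreement(labels_a, labels_b):
    """Jaccard overlap of two labelers' trope sets for one passage (assumed metric)."""
    a, b = normalize(labels_a), normalize(labels_b)
    if not a and not b:
        return 1.0  # both labelers found nothing: treated as full agreement
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Illustrative data only; not taken from any real label set.
    gold = {"p1": ["Chekhov's Gun", "Red Herring"], "p2": ["MacGuffin"]}
    preds = {"p1": ["red herring", "Chekhov's Gun"], "p2": ["MacGuffin", "Red Herring"]}
    print(score(preds, gold))
    print(agreement(["Chekhov's Gun", "Red Herring"], ["Red Herring", "MacGuffin"]))
```

Exact set match keeps the grader mechanical: an extra trope or a missing one both score zero, so there is nothing for a model to argue with. Jaccard is not chance-corrected; a chance-corrected statistic computed per trope may be a better fit once the second labeler exists.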