LLM benchmarks, and writing one
I want to know what it takes to build a benchmark for language models, so I'm writing one about storytelling tropes.
I’ve been curious about what it takes to build a benchmark for language models: a fixed set of tasks with known answers that models get scored on. So I’m writing one.
I picked a domain I enjoy: storytelling tropes. The task is for a model to read a passage of fiction and name the tropes in it. New models keep coming, and this gives me my own way of measuring them.
The rough plan is below. Nothing is built yet.
The plan
Step 1: create a vocabulary
The trope names come from TVTropes. The site is a folksonomy, meaning a community built it with no enforced structure, so entries vary a lot in how broad or narrow they are. I’ll curate 100 to 300 tropes by hand into a fixed vocabulary, and I already have a few favorites in mind. Each trope gets a clear rule for when it counts as present. The test of a good rule is that two people reading the same passage agree on whether the trope is there. Tropes too vague to grade get dropped. Foreshadowing, for example, is present in almost any story, so a model could earn credit just by naming it everywhere. The vocabulary could be swapped for a folklore index like Aarne-Thompson-Uther and the benchmark would test the same skill.
Here is what one item could look like. The passage, which I made up: “The old fisherman teaches the girl to read the tides. On the night of the storm he does not come back from the water, and in the morning she takes his boat out alone.” The right label is Mentor Occupational Hazard, the trope where the teacher dies and the student steps up. Revenge is a plausible wrong answer: the story turns on a death, but nobody in the passage seeks payback. The Mentor is a near miss, true of the passage but broader than the most specific trope it supports.
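One way to store an item like this is sketched below in Python. Nothing is built yet, so the `Item` class and its field names (`present`, `distractors`, `near_misses`) are placeholders I made up for illustration, not a settled format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One benchmark item. Every field name here is hypothetical."""
    passage: str
    present: frozenset                    # tropes the passage supports
    distractors: frozenset                # plausible tropes that are absent
    near_misses: frozenset = frozenset()  # true but too broad; scored as wrong

    def __post_init__(self):
        # A trope cannot be both a right answer and a wrong one.
        clash = self.present & (self.distractors | self.near_misses)
        if clash:
            raise ValueError(f"labels conflict: {sorted(clash)}")


fisherman = Item(
    passage=(
        "The old fisherman teaches the girl to read the tides. "
        "On the night of the storm he does not come back from the water, "
        "and in the morning she takes his boat out alone."
    ),
    present=frozenset({"Mentor Occupational Hazard"}),
    distractors=frozenset({"Revenge"}),
    near_misses=frozenset({"The Mentor"}),
)
```

Keeping near misses in their own field is a guess at what the separate near-miss analysis would need. The grader itself would treat them like any other wrong answer.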
Step 2: find the text
The hard part is finding passages the models can’t cheat on. The benchmark fails when a passage and its trope labels appear together in training data, because then a model can recall the answer instead of reasoning. That rules out anything with a TVTropes page, and popular stories in general, because people discuss popular stories and the discussion names the tropes.
The safer sources are synopses I write, obscure works nobody has discussed, and recent posts from writing subreddits like r/WritingPrompts, taken from the middle of the upvote range. Two details on the Reddit option. First, the prompt often names the trope outright, so the model gets the story without the prompt, though the annotators can use the prompt as a hint. Second, the stories belong to their authors, so nothing gets republished without asking. At a few hundred passages, asking is feasible.
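A small sketch of what "the middle of the upvote range" could mean in practice. I am assuming posts arrive as dictionaries with a `score` key, and the 25% and 75% cutoffs are arbitrary placeholders. This version cuts by rank, which is only one reading of "range".

```python
def middle_of_range(posts, lower=0.25, upper=0.75):
    """Keep posts whose rank by score falls between two fractions.

    `posts` is a list of dicts with a "score" key (a hypothetical shape).
    The default cutoffs are placeholders, not tuned values.
    """
    if not 0.0 <= lower < upper <= 1.0:
        raise ValueError("need 0 <= lower < upper <= 1")
    ranked = sorted(posts, key=lambda post: post["score"])
    start = int(len(ranked) * lower)
    end = int(len(ranked) * upper)
    return ranked[start:end]


# Eight made-up posts with scores 1 through 8.
sample = [{"id": n, "score": n} for n in range(1, 9)]
kept = middle_of_range(sample)
```

With eight posts and the default cutoffs, the two lowest-scored and two highest-scored posts are dropped and the four in between are kept.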
Step 3: label the text
Two people label each passage independently, and I measure how often they agree. Cohen’s kappa is the standard number for this. Only labels with decent agreement survive. Where the two disagree, the trope gets cut or its rule gets sharpened. Each passage also gets a few tropes that sound plausible but are absent, to check that wrong guesses get punished.
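For one trope, Cohen’s kappa compares how often the two annotators agree with how often they would agree by chance, given how frequently each one says "present". A minimal sketch, assuming each annotator’s calls for that trope are stored as a list of booleans, one per passage. The function name and the sample data are made up.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same passages."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Share of passages where the two annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        # Both annotators used one identical label throughout. Kappa is
        # undefined here; returning 1.0 is a convention I am assuming.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up calls for one trope across ten passages.
annotator_a = [True, True, False, False, True, False, True, False, False, False]
annotator_b = [True, False, False, False, True, False, True, True, False, False]
kappa = cohens_kappa(annotator_a, annotator_b)
```

In this example the annotators agree on 8 of 10 passages, chance agreement is 0.4 × 0.4 + 0.6 × 0.6 = 0.52, and kappa comes out to (0.8 − 0.52) / (1 − 0.52), about 0.58. Where to draw the line for "decent agreement" is still an open choice.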
If I label everything myself, the benchmark is a quiz about my own opinions. I don’t know yet who the second person is. That is the first real problem to solve.
Step 4: grade the answers
The grading is simple on purpose. The model returns a list of trope names from the vocabulary, and the grader checks that list against the labels. No language model judges anything, so there is no judge for a model to persuade.
The score is F1, which balances precision (what share of the guesses were right) against recall (what share of the true tropes were found). Plain accuracy fails here, because most tropes are absent from any passage, so a model that answers nothing looks near perfect. Precision punishes the cheapest trick, which is guessing everything plausible. A near miss scores zero: answering The Mentor when Mentor Occupational Hazard is correct earns nothing, and the miss goes into a separate analysis instead. The reason is that every partial-credit rule gives a model a way to raise its score without getting better at the task.
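A sketch of the scoring for a single passage, with plain accuracy alongside to show why it misleads. The function names are mine, the 200-trope vocabulary is padded with placeholder names, and returning 0.0 when there is nothing to score is an assumption. How per-passage scores get combined into one number is also undecided.

```python
def f1_score(predicted, gold):
    """Exact-match F1 between a model's trope list and the labels."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0  # assumed convention when nothing matches
    return 2 * precision * recall / (precision + recall)


def accuracy(predicted, gold, vocabulary):
    """Share of vocabulary tropes whose present/absent call is right."""
    predicted, gold = set(predicted), set(gold)
    right = sum((t in predicted) == (t in gold) for t in vocabulary)
    return right / len(vocabulary)


# A 200-trope vocabulary: three real names plus hypothetical filler.
vocabulary = {"Mentor Occupational Hazard", "The Mentor", "Revenge"}
vocabulary |= {f"Placeholder Trope {i}" for i in range(197)}
gold = {"Mentor Occupational Hazard"}

silent_accuracy = accuracy([], gold, vocabulary)  # 199 of 200 calls right
silent_f1 = f1_score([], gold)                    # no guesses, so 0.0
near_miss_f1 = f1_score(["The Mentor"], gold)     # near miss earns 0.0
padded_f1 = f1_score(["Mentor Occupational Hazard", "Revenge"], gold)
```

The model that answers nothing gets 99.5% accuracy and an F1 of zero. Adding Revenge to the right answer halves precision and drops F1 from 1.0 to about 0.67, which is the penalty for padding the list.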
Two checks before trusting a score. First, read about 50 transcripts sorted by score, because some high scores will be garbage the grader missed and some low scores will be right answers it mishandled. Fix the grader or the labels, run again, and expect several rounds. Second, probe each model for contamination. Give it the first half of a passage and see whether it can complete the text or name the source. A model that recognizes a passage can’t be credited with reasoning about it, so the probe results get reported next to the scores.
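The completion half of the contamination probe could look something like this. It is a sketch under assumptions: `complete` stands in for whatever function sends text to a model and returns its continuation, and word-level similarity from Python’s `difflib` is one possible measure, not a validated one. Asking the model to name the source would be a separate check.

```python
from difflib import SequenceMatcher


def contamination_probe(passage, complete):
    """Show the model the first half; score its guess at the second half.

    `complete` is a hypothetical callable: prompt text in, continuation out.
    Returns a similarity between 0.0 and 1.0 over word sequences.
    """
    words = passage.split()
    half = len(words) // 2
    prefix = " ".join(words[:half])
    held_out = words[half:]
    # Compare only as many words as were held back.
    guess = complete(prefix).split()[: len(held_out)]
    return SequenceMatcher(None, held_out, guess, autojunk=False).ratio()


PASSAGE = (
    "The old fisherman teaches the girl to read the tides. "
    "On the night of the storm he does not come back from the water, "
    "and in the morning she takes his boat out alone."
)


def memorized(prefix):
    # Fake model that has seen the passage and recites the rest.
    return PASSAGE[len(prefix):]


def unseen(prefix):
    # Fake model that has never seen the passage.
    return "Years later a letter arrives for her with no return address."


recited = contamination_probe(PASSAGE, memorized)
fresh = contamination_probe(PASSAGE, unseen)
```

A verbatim recitation scores 1.0 and an unrelated continuation scores near zero. Real models will land in between, so where the cutoff for "recognized" sits is a judgment call to settle by reading transcripts.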
Step 5: publish
The harness, a small labeled demo set, and results for five to ten current models go in a public repo. The labels for the scored set stay private, because anything published gets trained on within a cycle or two. Public passages still pick up discussion over time, so the plan is to replace them on a schedule. The only passages that stay clean are the ones I write and never publish.
One caveat. A competent annotator can derive the labels from any public passage, so private labels protect against accidental scraping, not against someone labeling the public set on purpose. At this scale it is unlikely anyone would bother, but the writeup should say so.
Later
If the basic version works, the harder question is whether a story plays a trope straight, subverts it, or inverts it. That asks the model to work out what expectation the story sets up and then violates.
The vocabulary is step 1.