How preference tuning ruined a perfectly good punctuation mark
Once you notice em dashes in AI writing, you can’t unsee them. Chatbot replies, LinkedIn posts, blog drafts, cold emails. The internet’s folk wisdom says: too many dashes, probably AI.
Fair enough. But I kept wondering: why? Not “oh, it learned from training data” why. I mean mechanically. What actually happens inside the pipeline that turns a punctuation mark into a nervous habit?
So I went and looked. Traced the whole thing from raw training text through tokenisation into the preference tuning loop. And honestly, it’s more interesting than the punctuation. It reveals something fundamental about how these models learn to write, and why they’re so bad at knowing when to stop.
Let’s walk through it.
The numbers first
Before we get into the machinery, look at the actual data. Because the interesting part isn’t that models overuse em dashes. It’s that some do and some don’t.
Palmyra X5 leads at 2.17 per 100 words. DeepSeek R1 sits at 1.46. Claude’s own family spans the full range: Haiku 4.5 at 1.13, Sonnet 4.6 at 0.79, Opus 4.6 at 1.00. Meanwhile every Llama model from 3.1 through 4 Maverick scores a flat zero. Sean Goedecke found that within OpenAI’s own family, GPT-4o used roughly 10x more em dashes than GPT-3.5. Same architecture lineage. Wildly different punctuation habits.
If this were some inherent property of language modelling, every model would do it. They don’t. So what gives?
It’s a fingerprint of training decisions. Three of them, specifically. Let’s trace each one.
Step 1: The training data is biased toward polish
Every large language model starts by reading a massive chunk of the internet. But not a random chunk. Labs are very deliberate about what gets in.
The GPT-3 paper said it plainly. OpenAI trained a binary classifier on a reference set of “high-quality” documents (think: books, Wikipedia, well-edited web pages), then ran that classifier over Common Crawl to filter out the noise. Documents that looked like published prose got kept. Documents that looked like forum rants or SEO spam got tossed. On top of that, they upsampled curated datasets, so books and Wikipedia appeared far more often in training than their raw size would justify.
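To make the filtering concrete: the GPT-3 paper's appendix describes keeping a Common Crawl document when `np.random.pareto(9) > 1 - document_score`, so high-scoring documents almost always survive while low-scoring ones occasionally slip through, preserving some diversity. Here's a toy standard-library sketch of that rule (`random.paretovariate` samples the shifted variant, hence the `- 1`); the scores and document names are invented for illustration.

```python
import random

def keep_document(quality_score: float, alpha: float = 9.0) -> bool:
    """Stochastic quality filter in the style of the GPT-3 paper.

    quality_score: the classifier's estimate that the document looks
    like the curated reference set, in [0, 1]. Higher scores survive
    far more often, but low scorers still get in occasionally.
    """
    # random.paretovariate(alpha) samples a Pareto variate >= 1;
    # subtracting 1 matches the numpy convention in the paper's rule:
    # keep if pareto(9) > 1 - document_score.
    return random.paretovariate(alpha) - 1.0 > 1.0 - quality_score

random.seed(0)
docs = {"wikipedia-style": 0.95, "seo-spam": 0.05}
for name, score in docs.items():
    kept = sum(keep_document(score) for _ in range(10_000))
    print(f"{name}: kept {kept}/10000")
```

Run it and the asymmetry is stark: the polished-looking documents get through at a rate orders of magnitude higher than the spammy-looking ones, which is exactly the distribution the model then reads.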
Every major lab does a version of this. Llama 4 used heuristic quality filters on its dataset. Google’s Gemini family uses quality-based data curation. The specific corpora differ, but the principle is the same: train on text that looks like it was written by someone who cares.
Here’s where the em dash sneaks in. “Text written by someone who cares” tends to be edited, essay-like, published prose. And that prose loves em dashes. It also loves balanced clauses, confident transitions, and all the little signals we associate with having an editor in the loop.
Goedecke’s hypothesis is that when labs started digitising print books for training, they pulled in huge amounts of 19th-century literature, which uses around 30% more em dashes than contemporary writing. So the model isn’t just learning em dashes. It’s learning them as a core feature of what “good prose” sounds like, because it’s been reading a lot of Dickens.
The takeaway: the model doesn’t learn “use em dashes when they’re tasteful.” It learns something much blunter: em dashes appear very frequently in text that humans preserved, curated, and upweighted as good. That’s a very different lesson. It’s the difference between learning to cook and learning that Michelin-starred restaurants use a lot of butter.
Step 2: The tokeniser makes it cheap
Before the model can learn anything from text, it has to break it into tokens. This is the step people love to overexplain, so I’ll keep it tight.
Most modern models use Byte Pair Encoding (BPE). You start with individual bytes or characters, then iteratively merge the most common adjacent pairs until you’ve built up a vocabulary of roughly 100,000 tokens. Common words like “the” get their own token. Rare words get split into subword pieces. And the em dash character (U+2014, the long one) appears frequently enough in training data that it earns its own token in most BPE vocabularies.
In OpenAI’s tiktoken library, you can check this yourself. The em dash is a single token. One decoding step. One probability to compute. Same cost as generating a comma, a full stop, or the word “the.”
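The merge loop described above fits in a few lines. This is a toy sketch, not tiktoken's actual implementation: it works on raw UTF-8 bytes, where the em dash is the three-byte sequence E2 80 94, and shows frequency alone fusing those bytes into a single symbol.

```python
from collections import Counter

def bpe_merges(corpus: bytes, num_merges: int):
    """Toy byte-level BPE: repeatedly fuse the most frequent adjacent
    pair of symbols into one new symbol."""
    seq = [bytes([b]) for b in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)
        out, i = [], 0
        while i < len(seq):  # apply the merge left to right
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return merges, seq

# In a dash-heavy corpus, the dash's three bytes are the most frequent
# adjacent pair, so they fuse before anything else does.
merges, seq = bpe_merges("aa—bb—cc—dd—ee".encode("utf-8"), 2)
print("—".encode("utf-8") in merges)  # True: the dash earned its own symbol
```

Two merges is all it takes here. Scale the same dynamic up to a trillion tokens and a 100,000-slot vocabulary, and the em dash comfortably earns a permanent seat.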
This matters, but not the way people think. The tokeniser doesn’t cause the em dash habit. It makes it cheap. If the model has already learned that em dashes correlate with “good writing” (step 1), the tokeniser makes acting on that signal basically free. There’s no extra cost. No multi-token sequence to assemble. Just one token, slotted in wherever the model thinks polish is needed.
But here’s the strongest evidence that tokenisation isn’t the cause: Llama 4 uses BPE too. Same type of tokeniser. Zero em dashes in the output. The difference is in what happens after.
Step 3: Preference tuning is where it actually breaks
This is the part I find most interesting. And it’s where the “just a punctuation quirk” story falls apart into something much more revealing about how these models actually work.
After pre-training (reading all that text) and supervised fine-tuning (learning to follow instructions), there’s a third step that fundamentally shapes how a model writes. The goal: make the model produce outputs that humans prefer.
How this actually works
The original method, Reinforcement Learning from Human Feedback (RLHF), has a straightforward loop:
- Generate. The model writes multiple responses to the same prompt.
- Compare. A human labeller looks at pairs of responses and picks the “better” one.
- Learn. A separate neural network (the “reward model”) trains on thousands of these comparisons until it can predict what humans prefer, assigning a scalar score to any output.
- Optimise. The language model gets tuned (via PPO, Proximal Policy Optimisation) to produce responses that score higher on the reward model.
Then repeat. Each cycle sharpens the signal. The model learns, with increasing precision, what “winning” looks like.
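To make that loop concrete, here's a deliberately tiny Bradley-Terry reward model in pure Python. Real reward models are neural networks over full token sequences; this sketch uses two hand-picked surface features (em-dash count and length, both invented for illustration) to show how consistent labeller preferences turn into positive weights on style signals.

```python
import math

def features(text: str) -> list[float]:
    # Toy surface features a reward model might latch onto.
    return [text.count("—"), len(text.split()) / 10.0, 1.0]

def reward(w: list[float], text: str) -> float:
    return sum(wi * xi for wi, xi in zip(w, features(text)))

def train_reward_model(pairs, lr=0.1, epochs=200):
    """Bradley-Terry fit: P(winner beats loser) = sigmoid(r_win - r_lose).
    Gradient ascent on the log-likelihood of the labeller's choices."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for win, lose in pairs:
            margin = reward(w, win) - reward(w, lose)
            p = 1.0 / (1.0 + math.exp(-margin))
            grad_scale = 1.0 - p  # derivative of log sigmoid(margin)
            fw, fl = features(win), features(lose)
            for i in range(len(w)):
                w[i] += lr * grad_scale * (fw[i] - fl[i])
    return w

# Labellers consistently prefer the dash-heavy, longer answer.
pairs = [
    ("The result—surprisingly—holds in every case we tested.",
     "The result holds."),
    ("Yes—and that matters more than it seems.",
     "Yes."),
]
w = train_reward_model(pairs)
print(w[0] > 0)  # the em-dash feature ends up with positive weight
```

Once that first weight goes positive, every dash in a candidate response raises its score, whether or not the dash was earned. That's the whole trap in three numbers.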
The field has moved past vanilla RLHF. DPO (Direct Preference Optimisation) cuts out the reward model entirely and optimises the language model directly on preference pairs. Fewer moving parts, more stable training. GRPO, DeepSeek’s variant (introduced for DeepSeekMath and later used to train R1), skips the separate critic model and ranks groups of responses against each other. Anthropic’s Constitutional AI uses AI feedback instead of human feedback. By 2026, roughly 85% of models use some form of preference-based alignment.
But the core dynamic is identical across all of them. Compare outputs. Learn what wins. Push the model toward those patterns.
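DPO makes that shared dynamic especially easy to see, because its whole objective is one formula over preference pairs. A sketch, with made-up log-probabilities standing in for real model outputs:

```python
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """DPO pairwise loss: -log sigmoid(beta * (policy margin - ref margin)).

    logp_*     : policy log-prob of the preferred / rejected response
    ref_logp_* : the same quantities under the frozen reference model
    """
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy already favours the preferred answer more than the reference does:
low = dpo_loss(-10.0, -30.0, -20.0, -25.0)
# Policy favours the rejected answer instead: the loss is much higher.
high = dpo_loss(-30.0, -10.0, -25.0, -20.0)
print(low < high)  # True
```

No reward model anywhere, yet the gradient still says the same thing: whatever the winners have in common, produce more of it.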
Where the bias enters
Think about the Compare step in that loop. You’re a data labeller. You’ve got two model outputs in front of you. One is shorter, a bit rough, maybe even more accurate. The other is longer, flows better, uses confident punctuation, sounds like a well-edited essay. You’re under time pressure. You’re doing hundreds of these comparisons a day. Which one feels better?
The polished one wins. Almost every time. And the model takes notes.
OpenAI’s own ChatGPT launch post (2022) admitted the model was “often excessively verbose and overuses certain phrases,” partly because trainers preferred answers that looked more comprehensive. That admission is still accurate three model generations later.
Em dashes are a very efficient signal for “sounds like a strong, considered answer.” They let a sentence pivot, qualify, or dramatise itself in one move. Used lightly by a human writer, they work because they’re selective. Used by a system optimising for “this response should sound confident and complete,” they become a crutch. The model doesn’t learn when a dash is earned. It learns that dashes correlate with winning.
The real lesson: preference tuning doesn’t just optimise for correctness. It optimises for what correctness sounds like to the people doing the judging.
It’s not just dashes
The em dash gets all the press, but it’s not even the most dramatic example of this pattern.
When Kobak et al. analysed over 15 million biomedical abstracts, they found at least 13.5% of 2024 abstracts showed LLM processing markers. “Delves” appeared at 28x its expected frequency. “Showcasing” at 10.7x. “Underscores” at 13.8x.
A Carnegie Mellon study (PNAS, February 2025) found that instruction-tuned LLMs produce a “noun-heavy, informationally dense” writing style. Present participial clauses at 2-5x the human rate. Nominalisations at 1.5-2x. This persists even when you prompt the model to write informally. The study identifies instruction tuning, not pretraining, as the primary cause. Same mechanism I described above, just measured from a different angle.
The tells rotate. “Delve” peaked in 2023-24, then dropped as labs adjusted their training. But new ones keep showing up because the underlying mechanism doesn’t change. It just finds new words to overweight. Wikipedia maintains a running list: “tapestry,” “realm,” “ever-evolving landscape,” “embark on a journey,” and the fondness for triples, which the page drily calls “a habit AI adores, cherishes, and delights in.”
A stylistic fingerprinting study (March 2025) found these linguistic markers are so consistent that their ensemble approach achieved 0.9988 precision classifying text by model family. Even when the model tries to sound different, it can’t fully escape its own training. The fingerprint is structural.
Why OpenAI’s toggle doesn’t fix it
OpenAI acknowledged the em dash problem and shipped a toggle in GPT-5.1 (November 2025). You can now disable em dashes via custom instructions. Nice gesture. Doesn’t solve the problem.
If you’re a developer, think about what that toggle actually does. Functionally, it’s post-processing: it suppresses one token (or a small set of tokens) in the output distribution. The model still wants to produce an em dash at that position. The toggle just stops it from doing so.
But the optimisation target hasn’t changed. The model still writes in that same noun-heavy, balanced, relentlessly polished way. It just uses different punctuation. You’ve taken the tie off. It’s still wearing the suit.
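Here's that suppression in miniature, with hypothetical next-token logits at a "pivot" position in a sentence. Banning the dash token zeroes its probability, but the rest of the distribution (the suit, in the metaphor above) is untouched; the runner-up pivot token simply inherits the win.

```python
import math

def softmax(logits: dict[str, float]) -> dict[str, float]:
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

# Hypothetical next-token logits; the model has learned the dash "wins".
logits = {"—": 4.0, ",": 3.0, ".": 2.0, " which": 1.0}
before = softmax(logits)

# A ban-list toggle: push the dash's logit to -inf and renormalise.
banned = dict(logits, **{"—": float("-inf")})
after = softmax(banned)

print(max(before, key=before.get))  # the dash wins on the raw distribution
print(max(after, key=after.get))    # the comma inherits its probability mass
```

Nothing about the logits for the other tokens changed. The preference for a polished pivot at that position is exactly as strong as before; it just exits through a different character.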
That’s also why good human writers are hard to imitate. They vary their moves. One sentence gets a dash. The next gets chopped short because a full stop hits harder. The next gets rewritten entirely because the aside was doing no real work. Models are bad at that kind of restraint. Once they learn a move scores well, reuse is the safe option.
McSweeney’s published a piece where the em dash defends itself, citing usage by Mary Shelley, Emily Dickinson, and David Foster Wallace: “I am not new here.” It was their most-read article of 2025. Writers are now self-censoring punctuation because readers assume anything with a dash was generated. NBC News reported that heavy LLM users submit writing with 50% fewer pronouns, shifting toward the same impersonal register the models produce. The influence runs both ways. That’s the genuinely frustrating part.
What this means if you build things
I don’t think models will always love em dashes. The numbers shift with each generation (GPT-5.4, Claude 4.6, Gemini 3.1, DeepSeek V3.2 are all different from their predecessors). But the underlying pattern will stay, because it’s structural.
If you’re building on top of LLMs:
- The style is baked in at the training level. You can’t prompt it away. Temperature, top-p, and presence penalties help at the margins, but the preference for polish was reinforced over thousands of training steps. The model genuinely believes polish equals quality.
- Post-processing treats symptoms. Regex-replacing em dashes, filtering vocabulary, OpenAI’s toggle: same approach, same limitation. You’re masking the output without changing the model.
- The real lever is the training pipeline. Llama 4 and Gemini 3 produce zero em dashes. Not because someone added a toggle, but because their training data curation and alignment choices avoided this trap from the start. That’s the actual fix.
- Users notice. The polish that preference tuning creates can reduce trust rather than increase it. When everything sounds like a well-edited essay, it signals “generated, not considered.” Your AI-generated docs, PR descriptions, and chat responses all carry that same cadence. Your users can feel it.
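For completeness, the symptom-level fix from the post-processing bullet above looks something like this. A naive regex sketch, not a recommendation:

```python
import re

def strip_dashes(text: str) -> str:
    """Symptom-level fix: rewrite em dashes as commas.
    The cadence underneath survives untouched."""
    # Mid-sentence pivot: "word—word" becomes "word, word".
    return re.sub(r"\s*—\s*", ", ", text)

sample = "The fix—such as it is—treats the symptom."
print(strip_dashes(sample))  # The fix, such as it is, treats the symptom.
```

The output passes the folk-wisdom dash test and fails every other test in this article: same balanced clauses, same pivot, same relentless polish, now wearing a comma.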
Edit the thought, not the punctuation
Don’t cargo-cult a banned punctuation list. Edit for judgment.
- Does this sentence actually need the aside?
- Is the pivot earned, or is the model just trying to sound supple?
- Would a plain full stop hit harder?
- Is there a real observation here, or just “polished writing” noise?
That’s the habit worth building. Not “delete every em dash,” but “spot the places where style is pretending to be thought.”
The em dash is not the crime. The crime is a training pipeline that optimises for what good writing sounds like instead of what good writing is.