ChatGPT's Em Dash Habit: A Training Artifact or Design Choice?

I believe it comes directly from training: so much of the training data probably contains em dashes. I don’t think the bias comes from instructions. It comes from the whole canon of data and its heavy use of em dashes. Remember, too, that the types of sources used for training tend to have a ton of them!

  1. Em Dashes in Formal and Literary Writing
  • Many books, articles, and essays, especially formal or literary sources, favor em dashes for emphasis and parenthetical asides.
  • Since a significant portion of training data probably comes from well-edited writing (think published books, Wikipedia, journalism, and academic sources), em dashes are likely very common in it.
  2. Style Patterns in Training Data
  • If the sources ChatGPT is trained on use em dashes frequently, the model learns that they are a common and valid punctuation choice.
  • Many professional writers and journalists use em dashes liberally, and this stylistic preference carries over into AI-generated text.
  3. Overuse Due to Pattern Recognition
  • Because ChatGPT generates text probabilistically, it sometimes “over-indexes” on high-frequency structures. This could be happening with em dashes, colons, and other stylistic choices that appear disproportionately often in well-written training data.
  • Since the model doesn’t have an innate sense of when variety is better, it sometimes leans on these structures more than a human would.
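
To make the "over-indexing" idea concrete, here is a toy sketch of one way it could happen. It is only an illustration under stated assumptions: the shares in `corpus_share` are invented numbers, `sharpen` is a hypothetical helper, and temperature scaling is just one plausible mechanism, not something known about how ChatGPT is actually configured. It shows that when a distribution is sampled at a temperature below 1, the most frequent option ends up even more frequent than it was in the data.

```python
import math

def sharpen(probs, temperature):
    """Rescale a probability distribution by a sampling temperature.

    Temperatures below 1 concentrate probability on the most likely option.
    """
    logits = {tok: math.log(p) / temperature for tok, p in probs.items()}
    peak = max(logits.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(l - peak) for tok, l in logits.items()}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}

# Invented shares (assumption): how often each way of setting off an aside
# appears in some hypothetical well-edited training text.
corpus_share = {"em dash": 0.40, "comma": 0.35, "parentheses": 0.25}

for t in (1.0, 0.7, 0.3):
    scaled = sharpen(corpus_share, t)
    print(t, {tok: round(p, 2) for tok, p in scaled.items()})
```

At temperature 1.0 the output matches the corpus shares. As the temperature drops, the em dash share grows past its share in the data, which is the kind of amplification the third point describes.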