I believe it is directly related to training: so much of the training data probably contains em dashes. I don’t think the bias comes from instruction tuning; it comes from the total canon of training data and its liberal use of em dashes. Remember, too, that the types of sources used for training tend to contain a ton of them!
- Em Dashes in Formal and Literary Writing
  - Many books, articles, and essays, especially formal or literary sources, favor em dashes for emphasis and parenthetical asides.
  - Since a significant portion of training data comes from well-edited writing (think published books, Wikipedia, journalism, and academic sources), the prevalence of em dashes is high.
- Style Patterns in Training Data
  - If the sources ChatGPT is trained on use em dashes frequently, the model learns that they are a common and valid punctuation choice.
  - Many professional writers and journalists use em dashes liberally, and this stylistic preference carries over into AI-generated text.
- Overuse Due to Pattern Recognition
  - Because ChatGPT generates text probabilistically, it sometimes “over-indexes” on high-frequency structures. This could be happening with em dashes, colons, and certain stylistic choices that appear disproportionately in well-written training data.
  - Since the model doesn’t have an innate sense of when variety is better, it sometimes leans on these structures more than a human would.
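The over-indexing point can be illustrated with a toy sketch. This is not how any real language model works; the punctuation options and their weights below are invented purely to show that sampling which mirrors training-data frequencies will keep reproducing the most common choice, with no built-in pressure toward variety:

```python
import random
from collections import Counter

# Hypothetical "learned" frequencies for punctuation choices,
# loosely mimicking the idea that well-edited training text makes
# the em dash a high-probability option. The numbers are invented.
PUNCTUATION = ["em dash", "comma", "parentheses", "semicolon"]
WEIGHTS = [0.40, 0.35, 0.15, 0.10]

def sample_punctuation(n, seed=0):
    """Sample n punctuation choices from the fixed distribution."""
    rng = random.Random(seed)
    return Counter(rng.choices(PUNCTUATION, weights=WEIGHTS, k=n))

counts = sample_punctuation(10_000)
# Because sampling mirrors the (invented) training frequencies, the
# highest-weighted option dominates the output every time we generate.
print(counts.most_common(1)[0][0])
```

The model never decides "I've used enough em dashes"; each draw just follows the same skewed distribution, which is all the bullet above is claiming.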