I have seen that when using ChatGPT, it uses the "em dash" (—) quite frequently. Is it a quirk from the training data? I noticed it often and wanted to see if anyone else has seen this characteristic. It's also an easy way to check whether a text is AI-generated; at least, that's something I use it for.
Looking forward to your insights and perspectives!
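For anyone who wants to try that heuristic, it's trivial to script. Here's a minimal Python sketch; the per-1,000-character rate and the cutoff are arbitrary illustrations on my part, not a validated detector:

```python
def em_dash_rate(text: str) -> float:
    """Em dashes per 1,000 characters: a crude stylistic signal, not a real detector."""
    if not text:
        return 0.0
    return text.count("\u2014") / len(text) * 1000

sample = "The model paused\u2014briefly\u2014before continuing\u2014again."
rate = em_dash_rate(sample)
# The 1.0 cutoff below is an arbitrary illustration, not a validated threshold.
print(f"{rate:.1f} em dashes per 1,000 chars;", "suspicious" if rate > 1.0 else "unremarkable")
```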
I like to use the em dash - especially for asides such as this - particularly because it’s less jarring than a parenthetical (which can often go off topic and interrupt the flow of the sentence).
I'm too lazy to use actual em dashes myself, since they're not on my keyboard, but it would generally be a good idea to run your text through a spell checker before adding it to your training data. I believe Word and Outlook fix them automatically as well.
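For what it's worth, that auto-correction is just a text substitution. A rough Python imitation of the behavior; the exact rules Word and Outlook apply are an assumption on my part:

```python
import re

def smart_dashes(text: str) -> str:
    """Crude imitation of word-processor smart punctuation:
    replace a spaced hyphen or a double hyphen with an em dash.
    The exact rules Word and Outlook apply are an assumption here."""
    return re.sub(r"\s-\s|--", "\u2014", text)

print(smart_dashes("I like the em dash - especially for asides -- like this."))
# -> I like the em dash—especially for asides—like this.
```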
I believe it is directly related to training, in the sense that so much of the training data probably contains em dashes. I don't think it's a bias from instruction; it's a bias from the total canon of data and that canon's striking use of em dashes. Remember, too, that the types of sources used for training tend to have a ton of them!
Em Dashes in Formal and Literary Writing
Many books, articles, and essays, especially formal or literary sources, favor em dashes for emphasis and parenthetical asides.
Since a significant portion of training data comes from well-edited writing (think published books, Wikipedia, journalism, and academic sources), the prevalence of em dashes is high.
Style Patterns in Training Data
If the sources ChatGPT is trained on use em dashes frequently, the model learns that they are a common and valid punctuation choice.
Many professional writers and journalists use em dashes liberally, and this stylistic preference carries over into AI-generated text.
Overuse Due to Pattern Recognition
Because ChatGPT generates text probabilistically, it sometimes “over-indexes” on high-frequency structures. This could be happening with em dashes, colons, and certain stylistic choices that appear disproportionately in well-written training data.
Since the model doesn’t have an innate sense of when variety is better, it sometimes leans on these structures more than a human would.