Stopwords and training of current models

Does somebody have a source for me to clear the question:

  • are current models (current means not only by OpenAI) trained with or without stopwords?

Chat models have stop sequences - just a single token used for the end of a message or end of a function that, if produced, terminates the output.

Otherwise the AI would produce text forever, because the whole way it works is one-directional iterative next-token generation.

I mean stopwords, as used for NLTP preparing: a, and, in…

In that language AI model are able to write English and others properly, from the corpus they were pretrained on, it would be apparent that particular parts of language were not removed.

that was my question: are they indeed trained on data containing stopwords? Do you probably have a source for me?

Im not sure but i think you are refering to an old approach that we used to remove stop words. As far as i know we train the models with phrases that are the devisions of sentences into phrases . They might still have stop words in them. The raw data contains everything comtainkng stopwords

1 Like