Does anyone know whether it is possible to determine whether a specific sequence is present in GPT-3's training data?

I am interested in performing some experiments to understand reasoning and generalization in GPT-3, and it would be useful to know whether a particular test sequence appears in the training set. Does anyone know whether this is possible, and if so how?


@m-a.schenk is right. Given the huge amount of text GPT-3 was trained on, there’s a very high chance that any given publicly available sequence was in the training dataset.

That being said, there are some ways to rule data out. Each model has a training data cutoff date (it is listed in the model documentation; the completion response itself only carries the model name and the time the completion was created), so any information that pertains to events that happened after the cutoff was definitely not in the training dataset.

Take COVID-19, for example: GPT-3 davinci was trained before the pandemic, so it didn’t have any data about it during training.
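
If it helps, the date-based reasoning above can be sketched as a small check. This is only an illustration: `ASSUMED_CUTOFF` is an assumed value that you should replace with the cutoff documented for your model, and `could_be_in_training_data` is a hypothetical helper, not part of any API. Note that it can only rule data out; a `True` result means "not ruled out", not "present".

```python
from datetime import date

# Assumed training-data cutoff, for illustration only.
# Replace it with the cutoff documented for the model you are using.
ASSUMED_CUTOFF = date(2019, 10, 31)


def could_be_in_training_data(event_date: date, cutoff: date = ASSUMED_CUTOFF) -> bool:
    """Return False when the event postdates the cutoff.

    False means text about the event cannot have been in the training data.
    True only means the date does not rule it out.
    """
    return event_date <= cutoff


# The name "COVID-19" was announced by the WHO on 11 February 2020,
# which is after the assumed cutoff, so it is ruled out.
covid_named = date(2020, 2, 11)
print(could_be_in_training_data(covid_named))
```

For a test sequence that predates the cutoff, this check tells you nothing either way, so it is mostly useful for building test sets from material you know is recent.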