Hello. We are using the GPT-4 API to generate a large, comprehensive list of data (vocabulary terms), but a large amount of the data we receive is repeated multiple times. Is there a way to upload our existing data so that GPT won’t repeat terms it has already generated?
You can certainly do that.
In the chat format, there are two main ways I can see:
1. Include the assistant’s prior replies in the conversation, just as if you had been using a chatbot:
user: provide vocabulary words
assistant: blah blah
user: provide more vocabulary words
assistant: menehune blalah
2. Amend your user input:
user:
// existing vocabulary words:
blah menehune blalah
// instruction: provide even more vocabulary words not listed
assistant: kine broke da mouf ono grinds
Such continuation avoids overlap, but you can’t trust ChatGPT not to forget its chat once the conversation outgrows the context window.
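If you’d rather script the second approach than paste into a chat window, here’s a minimal sketch against the chat completions endpoint. It assumes the requests library and an OPENAI_API_KEY environment variable; the more_terms name is just illustrative:

import os
import requests

API_URL = "https://api.openai.com/v1/chat/completions"

def more_terms(existing_terms, topic="astronomy"):
    # Put the already-generated terms in the prompt and ask for new ones only.
    prompt = (
        "// existing vocabulary words:\n"
        + ", ".join(existing_terms)
        + f"\n// instruction: provide more {topic} vocabulary terms"
        " not listed above, one per line"
    )
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={
            "model": "gpt-4",
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.7,
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"].splitlines()

Keep the caveat above in mind: once existing_terms no longer fits comfortably in the context window, the model will start to repeat itself anyway, so de-duplicate on your side as well.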
Better might be to define particular topical areas of data that would naturally have no overlap.
You can ask the AI to write you a de-duplicating Python script just to make sure.
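Something like this is enough (a minimal sketch; it assumes your terms are collected one per line in a terms.txt file, and the filenames are just placeholders):

seen = set()
unique = []
with open("terms.txt") as f:
    for line in f:
        term = line.strip()
        key = term.lower()  # compare case-insensitively
        if term and key not in seen:
            seen.add(key)
            unique.append(term)

with open("terms_deduped.txt", "w") as f:
    f.write("\n".join(unique) + "\n")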
If you’re using the API, you can use “frequency_penalty” and “presence_penalty”.
Both of these parameters are useful for controlling the repetitiveness and diversity of the content generated by the model. By adjusting them, you can fine-tune the output to avoid redundancy and encourage the generation of unique text, which is especially valuable in tasks like creating extensive lists, brainstorming, or writing content where variety is key.
Here’s an example:
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "gpt-4",
    "messages": [
      {"role": "user", "content": "Generate a comprehensive list of vocabulary terms related to astronomy:"}
    ],
    "temperature": 0.7,
    "max_tokens": 150,
    "top_p": 1,
    "frequency_penalty": 0.5,
    "presence_penalty": 0.5
  }'
The frequency and presence penalties are not word-banishers, though: they operate on tokens and can have unforeseen effects, such as the cumulative penalty eventually suppressing the quotation-mark token you need for valid JSON.
That’s correct 😁.
To control the results, Kevin, use the Playground and combine both of our responses: try a better prompt, feed back your past responses, run a Python de-duplication pass, and use these parameters.