While uploading a conversation file (containing 100 short conversations) as a “Knowledge” file (size 170 KB), GPT only recognizes 37 conversations, which is incorrect. The file is a valid JSON format, but I don’t understand why it’s not reading the entire content. Can someone help me resolve this issue?
Sharing new findings. As it seems, there is an optimization on OpenAI’s side and they only looking on the entire file with specific requests (Such as python clustering instructions)
Chatgpt used to index your files and use a search method to pull relevant passages out. I don’t think that’s changed all that much. It rarely if ever pulls everything into the LLM’s working context, unless your file is very small.
(even if it did, it probably wouldn’t be able to “grok” your entire file as such - how much the model maintains in “immediate working memory” (for lack of better terms) is also limited - that’s why techniques such as chain of thought retrieval aggregation are necessary).
If you want to do operations like counting in a file, the python interpreter is a good tool.
If you were hoping to do multi shot with chatgpt, that probably won’t really work all that well because you have no control over what gets loaded into the context (unless you can fit it all into your system prompt).
If you want more control over what the model does, you probably won’t get around using the APIs.
ChatGPT is pretty limited, but I do think that most common LLM use-cases can be implemented with it if you tweak your approach a bit. (e.g.: don’t treat chatgpt as a machine learning model that can be trained)
The other thing I’d add is that you’ll get more millage if you convert your file from JSON to something like markdown. This will result in the file containing way less tokens and GPT will be able to fit more of the file into its context window. Even just switching to YAML will likely result in a significant token reduction.
I’ll third @Diet and say the ChatGPT UI probably isn’t the level of control you’re after. You can do a lot more with the Assitants API.
That said, as a default ChatGPT only summarizes information without explicit and frequent instructions otherwise, whether input or output.
This is because of the massive amount of context it needs when making a search for terms. The larger the file it looks through, the larger each individual search will be. The larger each individual search, the more likely that the research will be lost to the context window with each new item added to the list.
You can help mitigate this by descriptively titling your conversational knowledge base files by topic or keyword, and making sure your Instructions make it clear which file to use when. The model can definitely make far better searches when it has a keyword and a guess on location.
After you have well-titled files in well structured Instructions, have a guess on the location, keyword on the topic, and explicitly asked it not to summarize once or twice, then, paradoxically, it’s helpful to ask it to summarize the information you’re looking for and quiz it on the answers.
If you know your data, you can catch when it’s guessing and have it correct it’s gaps. It’s not efficient, but it kinda-mostly works for long conversations.
Sounds like grouping and adding keywords might help. Try to define a number of abstract topics e.g. programming, gardening, marketing, etc. Then categorize each conversation and use a two step agent system where you let the model decide which topic or keywords are relevant and then decide which model with only a handful conversations should take over the job to create an answer…
Welcome to knowledge graphs.