Hello, I am using a GPT-3.5 model to analyse a long, complex JSON document. Here is the workflow:
The user writes a prompt describing the analysis they want performed on the document
and attaches the document alongside the prompt. The document is a Wireshark capture containing multiple interactions between IP addresses, which can be IPv4 or IPv6. The capture can be up to 50 MB. It will be converted to a JSON file that is given to the LLM to generate a response.
I'll be using a RAG architecture for the interaction with GPT-3.5.
Can you give me some tips? I am new to using LLMs and I want to know how reliable GPT-3.5 is at analysing lengthy and heavily nested JSON files.
I would appreciate any other tips you think might be helpful.
Thank you for your attention.
Hello @louay. The first problem you'll face is context size. The GPT-3.5 models only allow a maximum of 16k tokens, which is far less than 50 MB of JSON data (at a rough 4 characters per token, that is on the order of 12 million tokens). This means that you have two options:
- Be creative in how you pass this data to the model, maybe breaking it up into smaller, more manageable chunks.
- Use a larger, more expensive model. This could be an OpenAI model like GPT-4, which goes up to 128k tokens, but it can get pretty expensive working at the limits of that size. You could also try a different company. Some of Google's models go up to 1 million tokens but, again, it can get quite expensive.
I'd go with option 1. It's cheaper and more scalable.
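To make option 1 a bit more concrete, here is a minimal sketch of grouping packets into chunks that fit a token budget. It assumes the capture has already been parsed into a list of packet dicts, and it uses a rough 4-characters-per-token heuristic rather than a real tokenizer. The names `chunk_packets`, `estimate_tokens` and the `max_tokens` default are made up for illustration.

```python
import json

# Assumption: roughly 4 characters per token. Swap in a real tokenizer
# if you need accurate counts.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Cheap upper-ish estimate of how many tokens a string will use."""
    return len(text) // CHARS_PER_TOKEN + 1


def chunk_packets(packets, max_tokens=3000):
    """Group packets into lists whose serialized size stays under max_tokens.

    A single packet larger than max_tokens still gets a chunk of its own,
    so oversized packets need separate handling (e.g. dropping payload fields).
    """
    chunks, current, used = [], [], 0
    for packet in packets:
        cost = estimate_tokens(json.dumps(packet, separators=(",", ":")))
        if current and used + cost > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(packet)
        used += cost
    if current:
        chunks.append(current)
    return chunks
```

Splitting on packet boundaries, rather than cutting the raw JSON text every N characters, keeps every chunk valid JSON on its own.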
This sounds like a difficult use case. RAG is already relatively finicky: LLMs tend to make up answers if the RAG results don't contain the answer. And since you'll need to break down the input somehow (like @David_Blair said, 50 MB is way too long for a single message), most of your RAG calls will come back empty: "nothing in JSON wireshark chunk 1," "nothing in JSON wireshark chunk 2," etc. Depending on how the JSON is structured, you will probably also need to reconstruct the JSON hierarchy across chunks, so that chunk 14, for example, indicates where in the JSON hierarchy it sits. Sounds tough, would love to hear if you get the project off the ground!
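One possible way to carry the hierarchy along with each chunk is to walk the JSON once and emit every scalar together with its full path, so any slice of the output is self-describing. This is only a sketch under that assumption; `iter_leaves`, `leaf_chunks` and the path notation are illustrative choices, not an established format.

```python
def iter_leaves(node, path=""):
    """Yield (path, value) for every scalar in a nested dict/list structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_leaves(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_leaves(value, f"{path}[{index}]")
    else:
        yield path, node


def leaf_chunks(document, leaves_per_chunk=50):
    """Render leaves as 'path = value' lines and group them into text chunks."""
    lines = [f"{path} = {value!r}" for path, value in iter_leaves(document)]
    return [
        "\n".join(lines[i:i + leaves_per_chunk])
        for i in range(0, len(lines), leaves_per_chunk)
    ]
```

Because every line repeats its own path, a chunk retrieved in isolation still says which packet and which layer each value came from, at the cost of some extra tokens per line.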
On a different note, GPT-3.5 is very unreliable at outputting JSON, but I haven't used it for analyzing JSON input. You might try GPT-4o mini, which is quite cheap and handles JSON better.
Hi @louay and welcome to the forums!
(Disclaimer: I used to work with telecom and networks in my previous life, and actually developed more traditional classification models for anomaly detection in Wireshark traces).
Irrespective of the use case, here are some things to keep in mind:
- Any nested or deeply hierarchical structure won't play nice with any LLM (even the latest GPT-4o checkpoints), because it wreaks havoc with the token logits, i.e. the model gets super confused about what to attend to and what logit it should assign to the next token.
- With that in mind, it's a great idea to do some pre-processing of your pcap before feeding it to an LLM. When feeding it to an LLM, I find that a flat structure with rows of multiple key-value pairs works best.
- As others have pointed out, GPT-3.5 is probably the worst performer here, first in terms of general model performance, and second in terms of its significantly limited context size. That said, if you can do significant pre-processing before feeding it to GPT-3.5, it may actually perform well (e.g. it performs well in "similarity" tasks).
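As a rough illustration of the "flat rows of key-value pairs" idea, the sketch below turns one nested packet into a single `key=value` line. The field names in `FIELDS` are an assumption about what the JSON export of the capture looks like (a `layers` dict with entries such as `ip.src`), so adjust them to whatever your converter actually produces. Each column lists several candidate paths so that IPv4 and IPv6 packets end up in the same columns.

```python
# Hypothetical mapping: output column -> candidate paths inside one packet's
# "layers" dict. The first path that exists wins.
FIELDS = {
    "time": [("frame", "frame.time_relative")],
    "src": [("ip", "ip.src"), ("ipv6", "ipv6.src")],
    "dst": [("ip", "ip.dst"), ("ipv6", "ipv6.dst")],
    "proto": [("frame", "frame.protocols")],
    "length": [("frame", "frame.len")],
}


def dig(node, path):
    """Follow a tuple of keys into nested dicts; return None if any is missing."""
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def packet_to_row(layers, fields=FIELDS):
    """Flatten one nested packet into a single 'key=value key=value' line."""
    parts = []
    for column, candidates in fields.items():
        for path in candidates:
            value = dig(layers, path)
            if value is not None:
                parts.append(f"{column}={value}")
                break
    return " ".join(parts)
```

Missing fields are simply skipped, so a packet without an IP layer still produces a shorter but valid row.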
I had a similar issue when passing emails from the Gmail API to the LLM. There's a lot of JSON data in the headers (some relevant, some very detailed technical stuff that doesn't add value for the LLM), and I believe it was the primary cause of the confusion I was seeing.
I now do pre-processing on the API responses and create something that is much more human readable. Responses from the LLM are drastically better.
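Applied to the capture from the original question, one form this kind of pre-processing could take is collapsing packets into one readable line per source/destination pair. This is a sketch rather than what I actually run: it assumes each packet has already been reduced to a dict with `src`, `dst` and `length` keys, and `summarize_conversations` is a made-up helper name.

```python
from collections import defaultdict


def summarize_conversations(rows):
    """Collapse per-packet rows into one readable line per (src, dst) pair.

    rows: iterable of dicts with 'src', 'dst' and 'length' keys (assumed shape).
    Lines are ordered by total bytes, largest first.
    """
    stats = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for row in rows:
        entry = stats[(row["src"], row["dst"])]
        entry["packets"] += 1
        entry["bytes"] += int(row["length"])
    ordered = sorted(stats.items(), key=lambda item: item[1]["bytes"], reverse=True)
    return [
        f"{src} -> {dst}: {s['packets']} packets, {s['bytes']} bytes"
        for (src, dst), s in ordered
    ]
```

A summary like this is far smaller than the raw capture, though it obviously throws away per-packet detail, so it only suits questions about who talked to whom and how much.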
LLMs are pretty amazing, but they aren’t fully magical. They still need help and perform better when spoonfed the right info.