My question is pretty simple: what tool, if any, does OpenAI provide for ingesting PDFs of (almost) arbitrary size, and then carrying out a conversation based on them?
My concrete case is feeding it all the relevant law docs from my country and building a SaaS on top of that, where lawyers can each have their own virtual assistant helping them on each case. They'd start from a pre-trained model, and could continuously upload new docs and ask for help, you know, as they'd do with a human assistant.
At first glance, it seems I'd have to look into embeddings, but something tells me I should explore the Assistants API more. The problem is, I don't see any clear costs and limits.
For example, what limits apply to a PDF I'd upload to the Assistants API for training? And for inference? File size, token count, etc.
How does that compare to going the “generate your own embeddings” route, financially and in terms of reliability?
Thanks a lot. I'm sure I'm not the first to ask this, but I didn't find anything similar enough to my case.
The tools are there for exactly that, but results vary. Embeddings on your side would be powerful. With embeddings or other organisational strategies, plus some 'hand-holding' or 'spoon-feeding', an assistant could probably do pretty well, but there is always a need for error tolerance.
For embeddings, if it's a one-off, OpenAI's costs aren't bad. If it's a truly huge set of documents (it sounds like yours is) and the cost is prohibitive, you can run high-quality embedding models locally, and they don't need a lot of resources.
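For illustration, here's a minimal local-embedding sketch. The sentence-transformers library and the all-MiniLM-L6-v2 model are just one common combination I'm assuming here, not a specific recommendation:

```python
# Minimal local embedding sketch using sentence-transformers.
# Model choice is only an example; pick one that fits your language/domain.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, runs fine on CPU

docs = [
    "Article 12: The tenant must be notified in writing.",
    "Article 13: Notice periods depend on the contract duration.",
]
embeddings = model.encode(docs)  # one vector per document
print(embeddings.shape)          # e.g. (2, 384)
```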
There are great frameworks for this kind of thing (LangChain, etc.). I use ChromaDB as my embedding database and it works great.
Things are moving very fast. It's possible that the Assistants API's ability to deal with large documents will improve considerably in the near term.
If by pre-trained model you mean a fine-tuned model: I've heard it said you can teach it new skills, but not new knowledge, because of the way LLMs and fine-tuning work in practice. For knowledge, it's RAG.
Assistants costs are a big unknown for me, very opaque.
Thanks a lot! So transforming the data into embeddings locally (I still have to learn more about them) and then feeding them to GPT-3.5 Turbo / GPT-4 via the OpenAI API would be the best path for me to explore at the moment, right?
The data set can vary from 500 to 5,000 pages, so yeah, not exactly small.
And at the inference step, it can also be several hundred pages (though under 100 in most cases).
In one sentence: embeddings are like arrows of meaning. You can compare the arrows and see if they point in the same direction; if they do, the texts are more than likely relevant.
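As a toy illustration, "pointing in the same direction" is usually measured with cosine similarity. The vectors below are made-up 3-d stand-ins for real embeddings, which have hundreds or thousands of dimensions:

```python
# "Arrows of meaning": two texts are likely related when their
# embedding vectors point in roughly the same direction.
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d vectors standing in for real high-dimensional embeddings.
law_text = [0.9, 0.1, 0.2]
question = [0.8, 0.2, 0.1]
recipe   = [0.1, 0.9, 0.7]

print(cosine_similarity(law_text, question))  # close to 1 -> likely related
print(cosine_similarity(law_text, recipe))    # much lower  -> likely unrelated
```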
Getting up to speed with embeddings is a lot of work, but worth it.
Do a simple ChromaDB tutorial and it should all make sense.
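Something like this is the whole round trip, using Chroma's built-in default embedding function (you can plug in OpenAI's instead):

```python
# Minimal ChromaDB round trip: add a few chunks, then query.
import chromadb

client = chromadb.Client()  # in-memory; use PersistentClient for disk storage
collection = client.create_collection("law_docs")

collection.add(
    ids=["art-12", "art-13"],
    documents=[
        "Article 12: The tenant must be notified in writing.",
        "Article 13: Notice periods depend on the contract duration.",
    ],
)

results = collection.query(
    query_texts=["How must a tenant be notified?"],
    n_results=1,
)
print(results["documents"])  # the closest matching chunk(s)
```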
You can run tiktoken on your docs to see how much OpenAI embeddings would cost. I think they are the best out there and straightforward to use. At around 100 docs per call, with roughly 5k pages for the main knowledge base, it might not be too bad; it depends on your pricing structure.
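A rough sketch of that estimate. The price constant below is a placeholder I made up for illustration, so check OpenAI's current pricing page rather than trusting the number here:

```python
# Rough embedding-cost estimate with tiktoken.
# PRICE_PER_1K_TOKENS is a placeholder, NOT a quoted price.
import tiktoken

PRICE_PER_1K_TOKENS = 0.0001  # example figure only; check current pricing

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by OpenAI embedding models

docs = ["...your chunked documents here..."]
total_tokens = sum(len(enc.encode(d)) for d in docs)
print(f"{total_tokens} tokens ~ ${total_tokens / 1000 * PRICE_PER_1K_TOKENS:.4f}")
```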
I think there may be a tiny bit of misunderstanding on how the whole thing works, but you’re on the right track.
You are at a dead end, though: this will not work for anything except trivial lookup cases, because the AI is not going to do analysis. If you want that, you are either in research territory, or at least implementing a BASHR loop yourself, at a minimum, possibly with an agent swarm that discusses a topic to arrive at a possible legal theory. GPT-4 with 128k context CAN handle this, but the costs will be "lawyer level". The other problem is that embeddings alone may not serve you well: besides the parts that matter for law not being captured in the embeddings (like reference numbers to laws and cases), you end up with a lot of "similar enough" documents, so you have to do summarization and filtering in the AI for anything more than a simple lookup.
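To make that "summarization and filtering in the AI" step concrete, here is a hypothetical second-pass filter. The model name, prompt, and helper function are illustrative assumptions, not a recommendation:

```python
# Hypothetical retrieve-then-filter step: embeddings give you "similar
# enough" chunks; a second LLM pass decides which are actually on point.
from openai import OpenAI

client = OpenAI()

def filter_relevant(question: str, chunks: list[str]) -> str:
    """Ask the model which retrieved passages actually answer the question."""
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks))
    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # model name is an example; pick what fits your budget
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\n\nCandidate passages:\n{numbered}\n\n"
                "List only the passage numbers that actually help answer the question."
            ),
        }],
    )
    return resp.choices[0].message.content
```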
Btw, do not use the term "training" for a RAG approach. Training, tuning (more precisely: fine-tuning), and RAG (retrieval-augmented generation) have specific technical meanings.
Thanks for pointing that out. I was looking more at a RAG approach; that'd also be cheaper.
I'm not following the "embeddings can't encode article numbers" idea. Doesn't ChatGPT know the exact dates of most events, and, I assume, that's based on embeddings?
Think for a moment. An embedding is essentially a position in a multi-dimensional space of 1,536 dimensions. Things that are close together have similar coordinates. How do you put things close together when they are case numbers? By what measurement? Also, think like a lawyer: the case does not matter unless you know it has not been revoked or overturned, which means access to CURRENT legal databases.
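A quick way to see the problem for yourself (the model name is just an example; any text embedding model shows the same effect):

```python
# Two unrelated cases with adjacent reference numbers embed almost
# identically, so cosine similarity cannot tell them apart.
from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

a = embed("Case No. 2021/345")
b = embed("Case No. 2021/346")  # a completely different matter in reality
sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(sim)  # very high, despite the cases having nothing in common
```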
You really need to read up on the basics of RAG and embeddings, and think through the edge cases.