Is 17k tokens for information retrieval from a file too much?

Hello, I’m building an MVP with Assistants.
I created a voice assistant for a specific purpose.

I have one text file uploaded, which is 77k characters (11,400 words) long and is stored in the vector store.
My input (instruction) prompt is around 500 tokens, but whenever the Assistant needs to search this text file for information it spends 17k to 20k tokens.
Is this normal, or do I have a terrible leak somewhere?
P.S. Using GPT-4 Turbo

Thank you in advance for your help.

The built-in retrieval mechanism is very greedy and it will grab anything and everything it thinks might be even remotely relevant.

At this point, this behaviour should be expected. There really isn’t anything you can do to mitigate it short of going through and trimming the fat from your uploaded document, in the hope that there will be fewer tangentially related chunks available for it to ingest.
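
If you want to confirm that it really is the retrieval step consuming those tokens rather than something else in your setup, you can inspect the usage reported for a run and its individual steps. A minimal sketch, assuming a recent version of the openai Python SDK that exposes usage on runs and run steps; the thread and run IDs below are placeholders:

```python
from openai import OpenAI

client = OpenAI()

# Placeholder IDs from one of your assistant's completed runs
thread_id = "thread_abc123"
run_id = "run_abc123"

# Total tokens billed for the whole run (prompt + completion)
run = client.beta.threads.runs.retrieve(thread_id=thread_id, run_id=run_id)
print("run usage:", run.usage)

# Per-step breakdown: the retrieval tool-call step shows how many
# prompt tokens the injected document chunks consumed
steps = client.beta.threads.runs.steps.list(thread_id=thread_id, run_id=run_id)
for step in steps:
    print(step.type, step.usage)
```

If the tool-call step accounts for nearly all of the prompt tokens, it’s the retrieved chunks doing it, not a leak in your instructions.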

What worked really well for me: I converted the text file into a PDF, and now the same file is eating about 1k tokens.

It’s a bit counter-intuitive, and going against the grain of embedding documents, to go from text → PDF, especially if you just saved it without performing any work on it. I would actually guess that there’s something wrong here.

Do you mind sharing this document? From 17k → 1k tokens is quite an accomplishment if the results are just as accurate.

Or, at the least, what kind of document was it? I could maybe see a table document performing better if the text isn’t baked into the document and is better read row by row.

Idk if my math is right here, but 1 word is usually around 1.33 tokens. If your document is 11,400 words, then the complete document would be ~15,000 tokens.
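
If you want the exact number rather than an estimate from word count, you can tokenize the file locally with OpenAI’s tiktoken library. A quick sketch, assuming the document is plain text at a hypothetical path hotel.txt; cl100k_base is the encoding GPT-4 Turbo uses:

```python
import tiktoken

# cl100k_base is the tokenizer used by GPT-4 Turbo
enc = tiktoken.get_encoding("cl100k_base")

# Hypothetical path to the uploaded document
with open("hotel.txt", encoding="utf-8") as f:
    text = f.read()

tokens = enc.encode(text)
print(f"{len(text):,} characters -> {len(tokens):,} tokens")
```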

Sure, it’s:

  • Characters: 44,400
  • Words: 5,976
  • Sentences: 564
  • Paragraphs: 189
  • Spaces: 5,792

Simple text, a description of a hotel and its amenities.