Optimizing unstructured text data to be used with OpenAI Retrieval?

Hello all, I am using an OpenAI Assistant along with the Retrieval feature. I am scraping website data, converting the HTML to raw text, and then merging all of the text into a single text file to be used by my OpenAI Assistant. I am seeing good results, but the response time is slow at times (8s-10s). I have up to 7,000 lines and close to 300,000 characters in a single text file. I’m curious if there are any AI tools available that could help optimize my text data for use with the OpenAI Assistant and improve response time? Any other tips are welcomed and appreciated.

I’m not sure about actually optimizing the text data, but I’d recommend using your own retrieval. Adding a function to your assistant and storing your text in a vector database that the function queries would speed up retrieval considerably.

Are you attaching the vector stores to Threads or the Assistant?

Also, if you are scraping the data, make sure to keep only data which is meaningful, i.e. the main content of the page, not including header menu items, footer items, etc. This can easily be achieved with the BS4 (BeautifulSoup) Python library.
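
For reference, here’s a minimal sketch of that idea with BeautifulSoup; the list of stripped tags and the page_to_text helper are just illustrative, adjust them to your site’s structure:

```python
# Minimal sketch: strip navigation/header/footer noise and keep only the
# meaningful page content with BeautifulSoup (bs4).
import requests
from bs4 import BeautifulSoup

def page_to_text(url: str) -> str:
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    # Drop elements that are almost never useful for retrieval.
    for tag in soup(["script", "style", "nav", "header", "footer", "aside", "form"]):
        tag.decompose()

    # Prefer <main> or <article> if the page has one, otherwise fall back to <body>.
    content = soup.find("main") or soup.find("article") or soup.body
    text = content.get_text(separator="\n", strip=True) if content else ""

    # Collapse the blank lines left behind by the removed elements.
    return "\n".join(line for line in text.splitlines() if line.strip())

# Example:
# print(page_to_text("https://example.com/docs/page"))
```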

Not sure what you mean by your own retrieval. OpenAI Assistant V2 automatically converts the files into vectors and stores them in Vector Storage.

By my own retrieval I mean doing it yourself in your code. I have a function for my assistant called retrieval; when the tool call for retrieval comes in, I go into my vector database and retrieve the relevant information. I’m not sure how good V2 assistants are with retrieval, but V1 was not nearly good enough for me, so I implemented my own solution. Doing the retrieval yourself allows for A LOT more customization.
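
In rough shape it looks something like this (a sketch only, using the Assistants API function-calling flow; search_vector_db is a placeholder for whatever vector lookup you implement):

```python
# A sketch of "do-it-yourself retrieval" as an Assistants API function tool.
# search_vector_db() is a hypothetical helper: it should query your own
# vector database and return the relevant text for the query.
import json
from openai import OpenAI

client = OpenAI()

# Tool definition passed when creating/updating the assistant.
retrieval_tool = {
    "type": "function",
    "function": {
        "name": "retrieval",
        "description": "Look up relevant passages from the knowledge base.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}

def handle_required_action(run, thread_id, search_vector_db):
    """When the model calls `retrieval`, do the lookup and hand results back."""
    outputs = []
    for call in run.required_action.submit_tool_outputs.tool_calls:
        if call.function.name == "retrieval":
            query = json.loads(call.function.arguments)["query"]
            outputs.append({"tool_call_id": call.id, "output": search_vector_db(query)})
    return client.beta.threads.runs.submit_tool_outputs(
        thread_id=thread_id, run_id=run.id, tool_outputs=outputs
    )
```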

How does adding a webhook function help with optimizing my text data for File Search? Also, apologies, I used the term retrieval as I have been using OpenAI since v1. In v2 it’s now known as File Search.

I’m currently using ConvertAPI https://convertapi.com; I feed it a sitemap and it converts each webpage into a text file, then I merge all the results into a single text file. Although yes, you are right, I’m not omitting any headers, menu items, footers, etc… I need to find a way to do that. Thank you.

Ah, I see, you made a custom retrieval function. Understood, makes sense. I’m trying to use the File Search in V2 with Vector Stores.

Yea, sorry, I can’t help with File Search. I was just offering a different solution to speed up retrieval in case File Search doesn’t end up being adequate.

I would suggest you give V2 a try. It has improved and now gives better results.

I’ll certainly give it a test, but for my use it needs to be almost perfect, which I find unlikely. I’ve already almost finished my custom solution, which works consistently well and for cheaper than what I’ve heard about the default retrieval.

Ok, an update from a few quick tests:

  1. File Search performed better than I expected, however it’s still not on par with a custom solution. It didn’t provide any better answers and sometimes missed details the custom solution did not.
  2. The token usage is between ~30% and ~50% higher for File Search.

Do you have some details on your custom solution? I’d like to learn how we can do custom retrieval. Any resources or information you can point me to?

Yea! I just wrote a function called retrieval and instructed my agent when to use it. I’m using RAG-Fusion along with some other methods to make sure all the relevant resources are pulled. Here are some good articles on some of the different techniques: “Query Transformations” and “Advanced RAG: Query Augmentation for Next-Level Search using LlamaIndex🦙” by Akash Mathur on Medium.
When I have my queries I just embed them and pull from my vector database (I’m using Pinecone) using cosine similarity. Once I have all the potentially relevant results I use Cohere’s reranker and return the top-ranked resources.
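
To make the query side concrete, roughly this shape (a sketch; the index name, embedding model, and rerank model are assumptions, swap in whatever you actually use):

```python
# Sketch of the query side: embed the query, pull candidates from Pinecone
# by cosine similarity, then rerank with Cohere and keep the best few.
from openai import OpenAI
from pinecone import Pinecone
import cohere

openai_client = OpenAI()
index = Pinecone(api_key="...").Index("my-docs")   # assumed index, cosine metric
co = cohere.Client("...")

def retrieve(query: str, top_k: int = 20, final_k: int = 5) -> list[str]:
    # 1. Embed the query with the same model used for the document chunks.
    vector = openai_client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding

    # 2. Pull candidate chunks from Pinecone (metadata carries the raw text).
    res = index.query(vector=vector, top_k=top_k, include_metadata=True)
    candidates = [m.metadata["text"] for m in res.matches]

    # 3. Rerank the candidates against the query and return the top results.
    reranked = co.rerank(model="rerank-english-v3.0", query=query,
                         documents=candidates, top_n=final_k)
    return [candidates[r.index] for r in reranked.results]
```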

All my documents are structured into chapters and sections, so chunking is pretty easy, and I just chunk by section rather than by tokens. These chunks are embedded in the same way the query will be, and those embeddings (aka vectors) are uploaded to my Pinecone index. Make sure the metadata for each embedding has a text field with the original text of the embedding.
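
And the indexing side, sketched under the same assumptions (how you split sections and name the IDs is up to you):

```python
# Sketch of the indexing side: embed each section chunk and upsert it to
# Pinecone with the original text stored in the metadata, so the retrieval
# step above can return it verbatim.
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()
index = Pinecone(api_key="...").Index("my-docs")

def index_document(doc_id: str, sections: dict[str, str]) -> None:
    vectors = []
    for name, text in sections.items():
        embedding = openai_client.embeddings.create(
            model="text-embedding-3-small", input=text
        ).data[0].embedding
        vectors.append({
            "id": f"{doc_id}-{name}",
            "values": embedding,
            # Keep the raw text so cosine matches can be returned directly.
            "metadata": {"text": text, "section": name},
        })
    index.upsert(vectors=vectors)
```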

You don’t even need an expensive vector database. Text files work just fine and are fast. See free code here: Processing Large Amounts of Text in LLM Models
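
For the general idea (this is not the code from the linked post), a flat-file version can be as simple as storing each chunk’s text and embedding as a line of JSONL and scoring with NumPy:

```python
# One way to do retrieval without a vector database: keep chunk text and
# embeddings in a local JSONL file and score them with cosine similarity.
import json
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def search(query: str, path: str = "chunks.jsonl", top_k: int = 5) -> list[str]:
    # Each line of chunks.jsonl: {"text": "...", "embedding": [...]}
    with open(path) as f:
        chunks = [json.loads(line) for line in f]
    matrix = np.array([c["embedding"] for c in chunks])
    q = embed(query)
    # Cosine similarity against every stored chunk, then keep the best few.
    scores = matrix @ q / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q))
    best = np.argsort(scores)[::-1][:top_k]
    return [chunks[i]["text"] for i in best]
```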

Thank you so much, your work looks interesting. Cheers.

No problem! If you’ve got any questions, lmk and I’ll do my best to answer them 🙂