Questions about development, embeddings, and handling large datasets

Hi guys.

I’m trying to develop a conversational chatbot using the API, but I’ve hit a dead end now that I’ve started working with large datasets, on the order of 40k–150k rows.
I have a lot of questions about how to work with data at this scale, and whether it’s necessary to use embeddings to relate the user’s question to my dataset, which comes from SQL query results, PDFs, and Excel spreadsheets.

I’d like to know how you handle this. Thanks in advance, because I’ve run out of ideas.

Is your data structured or loose?

It varies, but in about 90% of cases it’s structured.

You can start with function calling, or even embeddings for specific categories, and iteratively narrow down to a very specific sub-section of the loose information.

IMO, function calling should be preferred wherever possible.
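For the "embeddings for specific categories" part, here is a minimal sketch in Python, assuming the `openai` v1 SDK and `numpy`; the category names and embedding model are just placeholders. The idea is to embed a small, static list of categories once, then match the user’s question to the nearest one before narrowing further:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

# Hypothetical static categories; these rarely change, so their
# embeddings can be computed once and cached.
categories = ["invoices", "inventory", "customer support", "shipping"]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

category_vectors = embed(categories)

def nearest_category(question: str) -> str:
    q = embed([question])[0]
    # Cosine similarity between the question and each category.
    sims = category_vectors @ q / (
        np.linalg.norm(category_vectors, axis=1) * np.linalg.norm(q)
    )
    return categories[int(np.argmax(sims))]

print(nearest_category("How many boxes do we still have in the warehouse?"))
# -> likely "inventory"; from there you narrow into that sub-section only.
```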

I’ve already tried using embeddings to narrow down the results, but I didn’t like what they returned. As you said, loose data works very well, but when I do the same with the results of an Excel query/spreadsheet (structured data with rows and columns), it doesn’t work well.

Do you recommend learning the Assistants API for this, or continuing with Completions + embeddings?

Transform the spreadsheet into an object and then use function calling to narrow it down.
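For example, a rough sketch of that idea with `pandas`, assuming a hypothetical `sales.xlsx` with `region` and `year` columns; the filter function is what you would expose to the model as a tool:

```python
import pandas as pd

# Load the spreadsheet into a DataFrame ("the object").
df = pd.read_excel("sales.xlsx")  # hypothetical file

def filter_rows(region: str | None = None, year: int | None = None) -> str:
    """Narrow the spreadsheet down to the rows the user asked about."""
    result = df
    if region is not None:
        result = result[result["region"] == region]
    if year is not None:
        result = result[result["year"] == year]
    # Return a small, model-friendly slice instead of the whole sheet.
    return result.head(20).to_json(orient="records")

# JSON schema you would register as a tool/function with the model,
# so it can translate the user's question into these parameters.
filter_rows_tool = {
    "name": "filter_rows",
    "description": "Filter the sales spreadsheet by region and/or year.",
    "parameters": {
        "type": "object",
        "properties": {
            "region": {"type": "string"},
            "year": {"type": "integer"},
        },
    },
}
```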

Weaviate is a good option here. It’s a knowledge graph that stores both the content and its embedding equivalent. You can use GraphQL to query by the structure and by the loose semantics.

I have a private GPT I use, and it handles the database fairly well. It fails on inline fragments but usually fixes them on the second attempt.
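As an illustration, a minimal hybrid query with the Weaviate v3 Python client, assuming a local instance and a hypothetical `Product` class whose property names are made up:

```python
import weaviate

# Assumes a local Weaviate instance and a "Product" class whose
# objects were imported along with their embeddings.
client = weaviate.Client("http://localhost:8080")

response = (
    client.query
    .get("Product", ["name", "category", "price"])
    # Structured part of the query: exact filter on a property.
    .with_where({
        "path": ["category"],
        "operator": "Equal",
        "valueText": "electronics",
    })
    # Loose-semantics part: nearest neighbours of the user's phrasing.
    .with_near_text({"concepts": ["affordable wireless headphones"]})
    .with_limit(5)
    .do()
)

print(response["data"]["Get"]["Product"])
```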

Both are suitable. Assistants is essentially a layer on top of Completions. For prototyping, Assistants would probably be better; you can then move to Completions once you’re ready for more control.

One thing to consider is that with Assistants you are locked into their proprietary services. Whatever you build will be tightly intertwined with their service and will become increasingly difficult to pull out if more control is required.


So, I was able to do a lot of this, as I said, while I was dealing with small amounts of data, but once I moved to 40k rows or more, these problems started to appear.

I’m going to research the tools you mentioned, but do you have any other learning material that could help me with this project?

Sorry for asking so many questions and thank you for your attention.

The real problem here is that I need live information, and I don’t know how to do that without it costing too much.


Then figure out a way to reduce the large dataset to a small one. Try to find levels of granularity in your content.

I don’t, sorry. I would recommend tinkering and asking lots of questions :raised_hands:

If you’re dealing with data that changes a lot, you can still use function calling. GPT models are great adapters from natural language to a database query. That way you’re not required to constantly re-embed everything; you can reserve embeddings for static content like categories.
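A minimal sketch of that adapter pattern with the `openai` v1 Python SDK; the `search_orders` tool, its parameters, and the model name are all placeholders for your own schema:

```python
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical tool: the model fills in the query parameters,
# your own code runs the actual (live) SQL query.
tools = [{
    "type": "function",
    "function": {
        "name": "search_orders",
        "description": "Search the orders table for matching rows.",
        "parameters": {
            "type": "object",
            "properties": {
                "customer": {"type": "string"},
                "status": {"type": "string", "enum": ["open", "shipped", "cancelled"]},
                "since": {"type": "string", "description": "ISO date, e.g. 2024-01-01"},
            },
            "required": ["customer"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any tool-capable model
    messages=[{"role": "user", "content": "Which of ACME's orders are still open since January?"}],
    tools=tools,
)

tool_call = response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)
# args now holds something like {"customer": "ACME", "status": "open", ...}
# -> plug these into a parameterised SQL query against the live database,
#    so only the small result set ever goes back to the model.
```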


Actually, last I recall (it’s been a while), Pinecone was a leader in documentation, so it’d be worth checking them out. After a quick glance it looks like they’ve stagnated and focused on niche business areas, but there’s still some good older content there.

