We are a small community of experts looking to implement AI to help answer questions on a fairly large dataset. Our data consists of approximately 30 GB spread across about 40,000 files.
We’re currently deciding whether to subscribe to the OpenAI Team Plan or develop our own custom AI solution. Our main challenge is the size of the dataset and the associated load times. With a custom solution, we know the initial processing time will be longer, but we expect response times to be manageable afterward.
My question is: would the Team Plan be suitable for this use case? Has anyone had experience working with large datasets on the Team Plan?
Thank you in advance for any insights and shared experiences!
I doubt you’re going to find any turnkey solution that can handle that much data out of the box. You’ll probably need to start with a search engine like Elasticsearch and then layer AI on top of that. Do you already have a search engine?
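To illustrate that layering, here’s a minimal sketch, assuming an existing Elasticsearch index called `docs` with a `content` field (both names are placeholders) plus the official `elasticsearch` and `openai` Python clients. The search engine narrows the 40,000 files down to a handful of passages, and the model only ever sees those:

```python
# Sketch: let Elasticsearch do the retrieval, then have an LLM answer
# from only the retrieved passages. Index name "docs" and field
# "content" are illustrative assumptions.
from elasticsearch import Elasticsearch
from openai import OpenAI

es = Elasticsearch("http://localhost:9200")
llm = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer(question: str, top_k: int = 5) -> str:
    # 1) Keyword search does the heavy lifting across the whole corpus.
    hits = es.search(
        index="docs",
        query={"match": {"content": question}},
        size=top_k,
    )["hits"]["hits"]
    context = "\n\n".join(h["_source"]["content"][:2000] for h in hits)

    # 2) The model only sees the few retrieved passages, not 30 GB.
    resp = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```

The point of this split is that indexing the 30 GB happens once, inside the search engine; the per-question cost stays small because the LLM never touches the full corpus.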
It’s likely we’ll have to commission an external party for the implementation. In that case, we were advised to build our own AI. My post was aimed at finding out whether a ready-made solution already exists that can solve this problem.
Or can LlamaIndex be set up by a reasonably experienced user?
You’re probably going to need to roll your own solution. There’s LangGraph and LlamaIndex. Of the two I personally think LlamaIndex is better designed but I don’t use either so I’ll defer to others for suggestions on the pros & cons of each.
It really depends on how much experience we’re talking about:
Can the person code comfortably in Python, and have they used libraries before?
Does the person know something about AI, or at least know how to efficiently find the information they’re lacking?
LlamaIndex makes it possible to implement a knowledge base in very few lines of code - for “newer” coders - but also offers a more customisable, in-depth solution for experienced developers.
LlamaIndex provides tools for both beginner users and advanced users. Our high-level API allows beginner users to use LlamaIndex to ingest and query their data in 5 lines of code. Our lower-level APIs allow advanced users to customize and extend any module (data connectors, indices, retrievers, query engines, reranking modules), to fit their needs.
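For reference, the “5 lines of code” the README mentions is the high-level starter pattern. A minimal sketch, assuming your files are readable as text and sitting in a `data/` folder:

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Ingest every readable file under ./data and build a vector index over it.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Ask natural-language questions against the indexed data.
query_engine = index.as_query_engine()
print(query_engine.query("What do our documents say about topic X?"))
```

At 30 GB you wouldn’t want to rebuild this on every run; the index can be saved with `index.storage_context.persist()` and reloaded later, or backed by an external vector store.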
I highly encourage you to have a quick read of their README on GitHub; it will explain it better than I ever could.
Of course, this would only be suitable for the custom AI option.
I can really only recommend this approach, because YOU control everything and mostly don’t have to rely on OpenAI.
Yeah, I have a turnkey solution for reasoning over large document collections, but even I would struggle with 40,000 documents (my current cap is around 5,000). My system doesn’t do RAG, though, and it’s really not a good fit for what I’m assuming your scenario to be. You probably want something that uses mostly search at the core with a dash of RAG.
Could this work? Imagine blending GPT’s conversational style with X1’s search abilities. You could ask questions naturally and instantly pull up, sort, or summarize your desktop data in a quick, secure way. It’s like having an assistant that doesn’t just find what you need but gives you the insights without the hassle.
You could also set up a searchable web archive, build an API to index and access it, and then link your API to the GPT through Actions, but that would cost both membership and API fees.
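As a rough sketch of what such an API could look like (the `/search` endpoint, the `archive/` folder, and the naive in-memory index are all my own illustrative assumptions, not a finished design), using FastAPI:

```python
# Hypothetical sketch: a tiny search API a GPT Action could call.
# Assumes documents were already extracted to plain text under ./archive.
from pathlib import Path

from fastapi import FastAPI, Query

app = FastAPI(title="Archive Search API")

# Naive in-memory index {path: text}; a real deployment would swap in a
# proper engine (Elasticsearch, SQLite FTS5, Whoosh, ...) instead.
INDEX = {
    str(p): p.read_text(errors="ignore")
    for p in Path("archive").glob("**/*.txt")
}

@app.get("/search")
def search(q: str = Query(..., description="Search terms"), limit: int = 5):
    """Return the best-matching documents with a short snippet each."""
    terms = q.lower().split()
    scored = []
    for path, text in INDEX.items():
        lower = text.lower()
        score = sum(lower.count(t) for t in terms)
        if score:
            first = min((lower.find(t) for t in terms if t in lower), default=0)
            scored.append({"path": path, "score": score,
                           "snippet": text[first:first + 300]})
    scored.sort(key=lambda d: d["score"], reverse=True)
    return {"results": scored[:limit]}
```

FastAPI serves its own OpenAPI schema at `/openapi.json`, which is what the GPT Actions configuration consumes, so wiring it up is mostly a matter of hosting the service and pointing the Action at that schema.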