We are a small community of experts looking to implement AI to help answer questions on a fairly large dataset. Our data consists of approximately 30 GB spread across about 40,000 files.
We’re currently deciding whether to subscribe to the OpenAI Team Plan or develop our own custom AI solution. Our main challenge is the size of the dataset and the associated load times. With a custom solution, we know the initial processing time will be longer, but we expect response times to be manageable afterward.
My question is: would the Team Plan be suitable for this use case? Has anyone had experience working with large datasets on the Team Plan?
Thank you in advance for any insights and shared experiences!
I doubt you’re going to find any turnkey solution that can handle that much data out of the box. You’ll probably need to start with a search engine like Elasticsearch and then layer AI on top of that. Do you already have a search engine?
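To illustrate that layering, here’s a minimal sketch, assuming an existing Elasticsearch index called `docs` with a `content` field (both names are placeholders) plus the official `elasticsearch` and `openai` Python clients. The search engine narrows the 40,000 files down to a handful of passages, and the model only ever sees those:

```python
# Sketch: let Elasticsearch do the retrieval, then have an LLM answer
# from only the retrieved passages. Index name "docs" and field
# "content" are illustrative assumptions.
from elasticsearch import Elasticsearch
from openai import OpenAI

es = Elasticsearch("http://localhost:9200")
llm = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer(question: str, top_k: int = 5) -> str:
    # 1) Keyword search does the heavy lifting across the whole corpus.
    hits = es.search(
        index="docs",
        query={"match": {"content": question}},
        size=top_k,
    )["hits"]["hits"]
    context = "\n\n".join(h["_source"]["content"][:2000] for h in hits)

    # 2) The model only sees the few retrieved passages, not 30 GB.
    resp = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```

The point of this split is that indexing the 30 GB happens once, inside the search engine; the per-question cost stays small because the LLM never touches the full corpus.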
It’s likely we’ll have to commission an external party for the implementation. In that case, we were advised to build our own AI. My post was aimed at finding out whether a ready-made solution already exists that can solve this problem.
Or can LlamaIndex be set up by a reasonably experienced user?
You’re probably going to need to roll your own solution. There’s LangGraph and LlamaIndex. Of the two I personally think LlamaIndex is better designed but I don’t use either so I’ll defer to others for suggestions on the pros & cons of each.
It really depends on how much experience we’re talking about:
Can the person code comfortably in Python, and have they used libraries before?
Does the person know something about AI, or at least know how to efficiently find the information they’re lacking?
LlamaIndex makes it possible to implement a knowledge base in very few lines of code - for “newer” coders - but also offers a more customisable, in-depth solution for experienced developers.
LlamaIndex provides tools for both beginner users and advanced users. Our high-level API allows beginner users to use LlamaIndex to ingest and query their data in 5 lines of code. Our lower-level APIs allow advanced users to customize and extend any module (data connectors, indices, retrievers, query engines, reranking modules), to fit their needs.
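For reference, the “5 lines of code” the README mentions is the high-level starter pattern. A minimal sketch, assuming your files are readable as text and sitting in a `data/` folder:

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Ingest every readable file under ./data and build a vector index over it.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Ask natural-language questions against the indexed data.
query_engine = index.as_query_engine()
print(query_engine.query("What do our documents say about topic X?"))
```

At 30 GB you wouldn’t want to rebuild this on every run; the index can be saved with `index.storage_context.persist()` and reloaded later, or backed by an external vector store.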
I highly encourage you to have a quick read of their README on GitHub; it will explain it better than I ever could.
Of course, this would only be suitable for the custom AI option.
I can really only recommend this approach, because YOU control everything and mostly don’t have to rely on OpenAI.
Yeah, I have a turnkey solution for reasoning over large document collections, but even I would struggle with 40,000 documents (my current cap is around 5,000). My system doesn’t do RAG, though, and it’s really not a good fit for what I’m assuming your scenario to be. You probably want something that uses mostly search at the core with a dash of RAG.
Could this work? Imagine blending GPT’s conversational style with X1’s search abilities. You could ask questions naturally and instantly pull up, sort, or summarize your desktop data in a quick, secure way. It’s like having an assistant that doesn’t just find what you need but gives you the insights without the hassle.
You could also set up a searchable web archive, build an API to index and access it, and then link your API to the GPT through Actions, but that would cost both membership and API fees.
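As a rough sketch of what such an API could look like (the `/search` endpoint, the `archive/` folder, and the naive in-memory index are all my own illustrative assumptions, not a finished design), using FastAPI:

```python
# Hypothetical sketch: a tiny search API a GPT Action could call.
# Assumes documents were already extracted to plain text under ./archive.
from pathlib import Path

from fastapi import FastAPI, Query

app = FastAPI(title="Archive Search API")

# Naive in-memory index {path: text}; a real deployment would swap in a
# proper engine (Elasticsearch, SQLite FTS5, Whoosh, ...) instead.
INDEX = {
    str(p): p.read_text(errors="ignore")
    for p in Path("archive").glob("**/*.txt")
}

@app.get("/search")
def search(q: str = Query(..., description="Search terms"), limit: int = 5):
    """Return the best-matching documents with a short snippet each."""
    terms = q.lower().split()
    scored = []
    for path, text in INDEX.items():
        lower = text.lower()
        score = sum(lower.count(t) for t in terms)
        if score:
            first = min((lower.find(t) for t in terms if t in lower), default=0)
            scored.append({"path": path, "score": score,
                           "snippet": text[first:first + 300]})
    scored.sort(key=lambda d: d["score"], reverse=True)
    return {"results": scored[:limit]}
```

FastAPI serves its own OpenAPI schema at `/openapi.json`, which is what the GPT Actions configuration consumes, so wiring it up is mostly a matter of hosting the service and pointing the Action at that schema.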