Building My Own Knowledge Base with an LLM

Hello

I want to build my own knowledge base using a large language model (LLM), drawing on over 40 GB of data, including books and research papers. I'm eager to hear your suggestions and insights on how to approach this endeavor.

Specifically, I’m seeking guidance on:

  1. What methodologies or frameworks would you recommend for building a robust LLM using my dataset?
  2. Data preprocessing techniques: How should I preprocess the data to ensure optimal performance and efficiency in training the model? Any specific tools or libraries you suggest for this task?
  3. Fine-tuning vs. RAG: Would fine-tuning an existing model or implementing Retrieval-Augmented Generation (RAG) be more beneficial for this project, and what are some best practices or resources to consider for each approach?

Your expertise and advice would be immensely valuable in guiding me through this journey.

First, note that a knowledge base and a language model are different things: a knowledge base provides structured information, while a language model focuses on understanding and generating text.
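To make that distinction concrete, here is a purely illustrative sketch; the `generate` stub stands in for any real LLM call:

```python
# Illustrative contrast only, not a real system.

# A knowledge base: structured facts you query exactly.
kb = {("transformer", "introduced_in"): 2017}
print(kb[("transformer", "introduced_in")])  # deterministic lookup -> 2017

# A language model: free-form text in, generated text out.
def generate(prompt: str) -> str:
    # Stub standing in for any real LLM call.
    return "Transformers were introduced in 2017."

print(generate("When were transformers introduced?"))  # generated answer
```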

I think what you actually want is the language-model side: understanding context and generating text over your documents.

To get started, I'd recommend learning the AI fundamentals and exploring a few model architectures to understand how language models work and how they are trained on different kinds of datasets.


Hello Somesh,

Developing a large language model (LLM) is complex and demands significant resources. I’m curious about your motivation for building an LLM from scratch. Could you share the specific use case you have in mind for this model? Have you considered fine-tuning an existing model instead of creating a new one from the ground up?

Are you looking to use your own data with LLMs? If that’s the case, you might want to explore using Retrieval-Augmented Generation (RAG). This approach allows you to use an existing LLM and enhance it with your data, eliminating the need to build a new model entirely.
To get started with that, you can look at LangChain or LlamaIndex.
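To make the RAG idea concrete, here is a minimal, framework-agnostic sketch of the retrieve-then-generate loop. The `embed` function is a toy stand-in (a hashed bag of words) so the example runs without dependencies; in practice you'd swap in a real embedding model, and LangChain or LlamaIndex wrap this whole pattern for you:

```python
# Minimal sketch of the RAG retrieval loop (framework-agnostic).
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    # Toy hashed bag-of-words "embedding" so the sketch runs without
    # dependencies; replace with a real semantic embedding model.
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already normalized, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

# 1. Index: chunk your corpus and embed each chunk once, up front.
chunks = [
    "Transformers use self-attention over token sequences.",
    "A knowledge base stores structured facts for lookup.",
]
index = [(c, embed(c)) for c in chunks]

# 2. Retrieve: embed the query, take the top-k most similar chunks.
query = "How do transformers process text?"
q = embed(query)
top = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)[:1]

# 3. Generate: put the retrieved chunks into the LLM prompt.
prompt = f"Context:\n{top[0][0]}\n\nQuestion: {query}\nAnswer:"
print(prompt)  # pass this prompt to any LLM of your choice
```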

That said, for 40 GB of data, RAG is not a good option in my opinion. I'd also like to know more about your dataset: does it consist of structured, labeled data or unstructured raw data? If you have 40 GB of structured, labeled data (which I believe you would have to create), I'd recommend fine-tuning. But be careful with fine-tuning: start on a very small dataset first and test how it works.
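A minimal first fine-tuning experiment might look like the sketch below, using the Hugging Face `transformers` Trainer. The model name and the `train.txt` path are placeholders, and this is a plain causal-LM fine-tune on a deliberately tiny text file, exactly in the spirit of "start small and test":

```python
# Sketch of a small-scale causal-LM fine-tune with Hugging Face transformers.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "gpt2"  # placeholder: any small causal LM works for a first test
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Start tiny: a few hundred lines of text, not the full 40 GB.
dataset = load_dataset("text", data_files={"train": "train.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="out",
        num_train_epochs=1,
        per_device_train_batch_size=2,
    ),
    train_dataset=tokenized,
    # mlm=False selects the causal language-modeling objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```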


Hello @Innovatix

Thank you for your response.

Yes, I'm using my own data with an LLM via RAG. The dataset comprises semi-structured and unstructured data: PDFs of research papers and books, some of which include images.
Could you clarify what specific preprocessing steps you’re considering for this diverse dataset?
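For reference, the first pass I had in mind is plain text extraction plus naive chunking, roughly like the sketch below (using pypdf; the file name is a placeholder, and the embedded images would still need a separate OCR or vision-model pass):

```python
# Sketch of a first-pass extraction step for PDF papers, using pypdf.
from pypdf import PdfReader

# "paper.pdf" is a placeholder; embedded images are ignored here.
reader = PdfReader("paper.pdf")
pages = [page.extract_text() or "" for page in reader.pages]
text = "\n".join(pages)

# Naive fixed-size chunking with overlap, as a starting point for RAG.
def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    step = size - overlap
    return [text[i : i + size] for i in range(0, max(len(text) - overlap, 1), step)]

for c in chunk(text)[:3]:
    print(c[:80], "...")
```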