Building My Own Knowledge Base LLM


I want to build my own knowledge base using a large language model (LLM), drawing on over 40 GB of data, including books and research papers. I'm eager to hear your suggestions and insights on how to approach this.

Specifically, I’m seeking guidance on:

  1. What methodologies or frameworks would you recommend for building a robust LLM using my dataset?
  2. Data preprocessing techniques: How should I preprocess the data to ensure optimal performance and efficiency in training the model? Any specific tools or libraries you suggest for this task?
  3. Fine-tuning or RAG models: Would fine-tuning existing models or implementing RAG (Retrieval-Augmented Generation) models be beneficial for this project? If so, what are some best practices or resources to consider?

Your expertise and advice would be immensely valuable in guiding me through this journey.

First, I think a knowledge base and a language model are two different things.
A knowledge base provides structured information, while a language model focuses on understanding and generating text.

I think what you want is pretty much understanding context and generating text.

To get started I would recommend learning about AI fundamentals and exploring various model architectures to understand how language models work and how they can be trained using a range of datasets.


Hello Somesh,

Developing a large language model (LLM) is complex and demands significant resources. I’m curious about your motivation for building an LLM from scratch. Could you share the specific use case you have in mind for this model? Have you considered fine-tuning an existing model instead of creating a new one from the ground up?

Are you looking to use your own data with LLMs? If that’s the case, you might want to explore using Retrieval-Augmented Generation (RAG). This approach allows you to use an existing LLM and enhance it with your data, eliminating the need to build a new model entirely.
To get started with RAG, you can look at LangChain or LlamaIndex.
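To make the RAG idea concrete, here is a toy sketch of the retrieval step in plain Python: split documents into chunks, score each chunk by keyword overlap with the question, and assemble the top chunks into a prompt for an LLM. This is purely illustrative; a real pipeline (LangChain, LlamaIndex) would use embeddings and a vector store instead of word overlap.

```python
# Toy retrieval step for RAG, assuming documents are already plain text.
# A production system would replace score() with embedding similarity.

def chunk_text(text, chunk_size=50):
    """Split text into word-based chunks of roughly chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def score(chunk, question):
    """Crude relevance score: count of chunk words shared with the question."""
    q_words = set(question.lower().split())
    return sum(1 for w in chunk.lower().split() if w in q_words)

def retrieve(chunks, question, k=2):
    """Return the k chunks with the highest overlap scores."""
    return sorted(chunks, key=lambda c: score(c, question), reverse=True)[:k]

def build_prompt(chunks, question):
    """Combine retrieved context and the question into one prompt string."""
    context = "\n---\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

The point of the sketch is the flow: retrieve first, then let the LLM generate from the retrieved context, so your 40 GB of documents never has to fit inside the model itself.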

That said, for 40 GB of data, plain RAG may not be a good option. I'd also like to know more about your dataset: does it consist of structured, labeled data or unstructured raw data? In my opinion, if you have 40 GB of structured, labeled data (which I believe you would have to create), I recommend fine-tuning. But be careful with fine-tuning: start by fine-tuning on a very small dataset and test how it works before scaling up.
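The "start small" advice above can be sketched as a reproducible subsetting step: carve a small random sample out of your training file, validate its format, and fine-tune on that first. The JSONL layout and the "prompt"/"completion" field names here are hypothetical; adapt them to whatever format your fine-tuning framework expects.

```python
import json
import random

def load_jsonl(path):
    """Load one JSON object per line (a common fine-tuning data format)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def sample_subset(records, n=100, seed=42):
    """Return a reproducible random subset of at most n records."""
    rng = random.Random(seed)
    if len(records) <= n:
        return list(records)
    return rng.sample(records, n)

def validate(record, required=("prompt", "completion")):
    """Check that a record has the expected string fields before training."""
    return all(k in record and isinstance(record[k], str) for k in required)
```

Running a cheap fine-tune on a 100-example subset surfaces formatting bugs and loss-curve problems long before you commit GPU hours to the full 40 GB.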


Hello @Innovatix

Thank you for your response.

Yes, I’m utilizing my own data with LLMs (RAG). As for the dataset, it comprises semi-structured and unstructured data, such as PDFs of research papers and books, including images.
Could you clarify what specific preprocessing steps you’re considering for this diverse dataset?
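For the text side of that preprocessing, a typical first pass is cleaning whatever an extractor (e.g. pypdf or pdfplumber; extraction itself is out of scope here) pulls out of the PDFs: rejoining words hyphenated across line breaks, joining stray line breaks inside paragraphs, and normalizing whitespace. Images embedded in the PDFs would need a separate OCR or captioning step. A minimal sketch, assuming you already have raw extracted text:

```python
import re

def clean_pdf_text(text):
    """Clean raw text extracted from a PDF before chunking/indexing."""
    # Rejoin words split across lines: "imple-\nmentation" -> "implementation"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Join single line breaks into spaces, but keep blank-line paragraph breaks
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```

Cleaning like this matters for both RAG and fine-tuning: hyphen fragments and broken lines degrade retrieval matches and add noise to training examples.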