Customized GPT4 - How to designate files for general knowledge/training?

Hi. I’m a non-programmer trying to build a customized GPT4 as an educational tool for students (I am not a developer and don’t have API access).

I curated a number of specific training files (PDFs and datasets) that I’d like to use to train the customized GPT. I’ve uploaded all of these as “knowledge” files (topic1, topic2, topic3, etc.).

I’d like the customized GPT to have already been trained on certain files prior to a user’s first interaction. Currently, the GPT will interact with a user, and then read a knowledge file if it is relevant to the user’s query (or maybe it tries to read all the files for every query, not sure). This ends up taking a lot of time.

Is this possible?



Welcome to the community!

So, what exactly are you trying to do, and what is it doing that is insufficient?

If you’re asking whether you can fine-tune a model with custom documents (further training a pre-trained model), then unfortunately you will need API access both to do the fine-tuning and to call the resulting fine-tuned model.

You can store knowledge using GPT builder (which is what I’m assuming you’re using) to upload files and retrieve documents, as you described. Depending on the documents and the prompt, it should also be able to work with that knowledge to generate responses relevant to that knowledge base.

The main difference between the two is that fine-tuning bakes the documents into the model permanently (like a cake!), but knowledge in GPT builder (to the best of my understanding; it’s a new tool for all of us right now) likely uses a vector database to retrieve data, like a search function. This is called RAG, or retrieval-augmented generation. RAG does not fine-tune the language model; instead, it gives the model a set of really, really long arrays of numbers (embeddings) from which it can more easily retrieve information.
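To make the retrieval idea concrete, here is a toy sketch in plain Python. The vectors and file names are entirely made up for illustration (a real RAG system would get them from an embedding model), but the core mechanic is the same: score every stored document against the query vector and pull back the closest match.

```python
import math

# Made-up embeddings: in a real RAG system these vectors come from an
# embedding model run over each knowledge file; here they are invented
# purely to illustrate the search mechanic.
doc_vectors = {
    "topic1.pdf": [0.9, 0.1, 0.0],
    "topic2.pdf": [0.1, 0.8, 0.2],
    "topic3.pdf": [0.0, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Angle-based similarity between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def retrieve(query_vector, top_k=1):
    """Return the top_k document names closest to the query vector."""
    ranked = sorted(
        doc_vectors,
        key=lambda name: cosine_similarity(query_vector, doc_vectors[name]),
        reverse=True,
    )
    return ranked[:top_k]

# A query whose (made-up) embedding points in topic2's direction:
print(retrieve([0.2, 0.9, 0.1]))  # → ['topic2.pdf']
```

The point of the sketch is that nothing here "reads" the documents at query time; the search is just arithmetic over pre-computed vectors, which is why retrieval can stay fast even with many files.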

If the model is producing responses that are desirable and intended, then keep going with it. Unfortunately, the speed can’t be helped much until these tools evolve, but to be honest, if it’s generating the results you want, you’re already doing really well, as that’s the “hard” part with these models for most people.

Fine-tuning is more for helping the model perform specific or niche chat-based use cases more easily.

So, in summary, if speed is the only issue here, sadly there’s not much that can be done about it.


Wow, thanks! Very clarifying and much appreciated. So I’m a researcher toying with the idea of creating a tool to help students locate resources for college admissions and provide person-specific feedback on application components.

Based on your response, I’m not sure whether I should be “fine-tuning” the curated information via an API or whether I can get away with presenting the info as searchable knowledge in the builder. Will try to get away with the latter for now since I don’t know how to do the former :slight_smile:

One follow-up question if you don’t mind – if I stick with the builder and knowledge docs, can I specify which user queries should prompt which docs? For example, suppose I have 50 knowledge docs (not even sure that’s possible). Can I tag each doc in the instructions with detailed keywords and descriptors so that a particular user query on Topic A doesn’t call all 50 docs (but only the relevant 1-2)? I appreciate your point about speed, but I imagine it would be prohibitively slow for the builder to read 50 docs for each question (unless the model searched them all just once per user interaction; I suppose that could work?).

Anyway, thanks again - I’ll pass it along when it’s done!


So, your follow up question touches upon why RAG is so popular and useful (but also imperfect).

Short answer: Yes. The language models themselves pick which docs to use and retrieve from based on the text within the documents. You can tinker with different instructions to gain greater control over how it picks what it retrieves, but in general, this is why these models are so powerful.

Long answer: So, splitting the doc into different files (like one page as one file, or one chapter as one file) is all you need to do in that case to enhance speed. Basically, the LLM should be able to take the user’s input, do its best to infer which doc to pull based on the input prompt alone, and then generate an output using data from the doc it retrieved. If “Topic A” is about cats, for example, and there’s a doc/file about cats, and the user asks about cats, it will pull the relevant doc based on the natural language of the user’s query and the content inside the doc alone.
It’s pretty neat stuff.

That being said, keep in mind this is a very new tool, with little to no documentation yet. There’s not much clarity here yet, so I can only provide this explanation under the assumption it functions identically to the vector databases we in this community are already used to building. There may be subtle nuances or differences that will be revealed later, but the gist should still be the same.

You have the right ideas and the right mindsets, and what you want to achieve is a great example of what these features are intended to be built for. I would start experimenting and seeing what you can do with what you now know. It’s going to take some trial and error, and some learning on how to prompt for what you want the builder to implement for you, but you should have all you need to start trying things out!

I hope this helps, and good luck!


Thanks again, I feel a lot better about this after this feedback and encouragement. Pretty inspiring what a non-programmer can accomplish on this platform. With a few hours of trial and error, I managed to nudge the GPT to open the right documents and develop an understanding of the user through an intake process. It’s still super slow and encounters occasional errors but I’ll try to reduce the file sizes, as you suggested. Seems like one of the keys here, at least without the benefit of fine-tuning, is to invest in creating sufficiently curated documents so the LLM has a well-defined task. Anyway, that’s my current strategy.

It’s also interesting to think about how to evaluate progress – I’ve settled on comparing responses from the customized GPT with regular GPT4.

In any case, thanks again for all your generous help! Will send it to you when done if you’re curious.