How can i efficiently train a gpt on hundreds of thousands words of video transcripts?

Hello everyone,

I’m working on a project involving training a GPT model, and I’m looking for advice on handling a significant volume of data derived from video transcripts. My goal is to develop a model that can effectively mimic a certain style of communication, reflecting nuanced personality traits and empathetic responses. I have a few questions and would greatly appreciate any insights or experiences you could share:

  1. Data Preparation and Management: What are the best practices for preparing and managing hundreds of thousands of words from video transcripts for GPT training? How do you ensure the data is clean and organized for optimal training efficiency?
  2. Training Process: Considering the large volume of text, what strategies have you found effective for training a GPT model? Are there specific techniques to enhance the learning process, particularly in capturing the nuances of personality and empathy in communication?
  3. Ethical and Legal Considerations: How do you navigate the ethical and legal aspects of using video transcripts for training purposes? What steps can be taken to ensure compliance with intellectual property laws and ethical guidelines?
  4. Resource Allocation: Training a model on such a large dataset requires significant computational resources. What are your recommendations for managing these resources effectively? Any tips on balancing cost and performance would be especially helpful.
  5. Quality Assurance: After training, how do you evaluate the model’s performance in accurately reflecting the desired communication style? What metrics or methods do you use to assess the quality and reliability of the outputs?
1 Like

Hey there and welcome to the community!

To start, how much data are we talking about here file-size wise?

I’m guessing your goal is to fine-tune a GPT model, correct? As in, take a model that’s already pre-trained, like GPT 3.5 turbo, and then add more data to make it act like what you want?

Since you’re asking about mimicking a certain level of style and empathic responses, are we positive fine-tuning is the way to go here? Have you tried simply creating highly specialized system prompts or prompts to elicit this particular style?

Fine tuning for this is trickier than meets the eye. And yes, it is also time consuming and can become expensive quite quickly. I would guess there would be a large chunk of pre-processing the data involved in order to prevent unintended consequences. Or, in your words, data preparation and management. Because of this, and the inherent unpredictability that comes along after fine-tuning a model, it’s typically recommended to try seeing if prompting can lead to success first before you go this route.


Thanks for your query. Initially, we’re starting with a relatively small dataset, around 1MB or approximately 150,000 words. However, we’re prepared to scale up significantly. I anticipate that we might eventually work with datasets comprising millions of words, especially considering the upload limit is a substantial 512MB.

We’ve already experimented with prompt engineering but found that it didn’t quite meet our expectations. To get a better sense of what’s possible, we uploaded about 10,000 transcribed words and were quite pleased with the results. This success has made us optimistic about the potential improvements in quality as we increase our dataset size into the millions.

For the initial phase of our project, we’re focusing on training a GPT model. This is essentially our testing ground. Once we’ve refined our approach and are satisfied with the results, we plan to transition to training an assistant model. This step will be crucial for ensuring the model’s effectiveness in more dynamic, interactive scenarios.

Our objective extends beyond just emulating a particular communication style or personality traits. Our ambition is to fine-tune the GPT model to become deeply knowledgeable in a specific niche. This involves training the model comprehensively on data extracted from various videos. The key is not just for the model to learn the data but to understand how to effectively utilize and apply this information. So, it’s not just about mimicking a style; it’s about developing an in-depth understanding and operational capability within a particular domain.

To do this you need to first create a GPT neutralizer, normally through prompting.

So your transcript text gets processed by the neutralizer and outputs neutral text.

Your fine-tune training data then has this neutral text as the input “prompt” and the original transcript text as the output “completion”.

Then you create a JSONL file with this set of prompt/completion pairs, and train away.

After training, you just input whatever text into your fine-tuned model, and it will come out as stylized text based on your transcripts.

To inject new knowledge, you may have to still use RAG in conjunction with the fine-tuned model.

The goal of your fine-tune is stylization. So it acts as a filter that transforms any text into text that mimics the style of the transcripts.

If there is knowledge in the transcripts you want to capture. Then capture these chunks, preferably as neutral chunks, and then re-feed them through the fine-tune to add back in the correct style. But these neutral chunks are retrieved with your RAG system, and injected into the fine-tune that gives style.

So you have one system that specializes in the knowledge of the transcripts (neutral RAG) and another model that specializes in transforming this data to have the correct style (your fine-tune).

But don’t expect knowledge and style to be captured in the same fine-tune … that is the mistake many make.


Gotchu! I’ll give that a try and see how it goes. Thanks for the guidance!

Also, I’m curious about the process of transitioning from a GPT model to an assistant model. How complex is this replication or translation? Is it a matter of simply uploading the JSONL file to the assistant model, or are there additional steps and considerations involved? Any insights or advice on this would be really helpful!

Good :+1:

As for Assistants or GPTs … I only use the API with the “raw” models (GPT-4/3.5), without any thin-wrappers from GPTs or Assistants. Maybe others can chime in on that transition, if there is confusion after reading the docs.

But you need to transition to fine-tunes, which is like using raw API models without the thin-wrap.

So to keep everything on the same level, just use this “raw” model family IMO.

The only exception is maybe the RAG part, where you might be inclined to use Assistants. So maybe do that. But I would quick spin up my own RAG if I were you, to gain more control of the overall system.