How to build summaries of 2-hour-long transcripts without using the API?

I understand the general approach is to create a summary of each chapter, then have the model summarize all the chapter summaries. However, I have zero know-how when it comes to using the API for this. How do I do it through the regular layman interface without it being tedious and requiring multiple prompt turns? I have hundreds of transcripts to summarize.

What kind of file format is that?

It is in plain .txt format…

So you have a folder of n documents and want a corresponding folder with a summary of each?

Because a simple Python script can do the trick…

summarizer_project/
├── src_files/            # Directory containing the source files to summarize
│   ├── file1.txt
│   ├── file2.txt
│   └── ...
├── dst_files/            # Directory where the summary files will be saved
├── summarizer.py         # Your Python script for summarization
├── requirements.txt      # File listing the Python dependencies
└── README.md             # Optional, for explaining the project

You would just have to install Python on your computer (ask ChatGPT how to do that), then open a folder and create that structure…

Like this: create a folder called summarizer_project,

then go inside the folder, create a src_files folder, and put all your .txt files into it,

and so on…
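You would also want requirements.txt to contain the single line openai, then run pip install -r requirements.txt (or simply pip install openai). The script below assumes your API key is set in the OPENAI_API_KEY environment variable, which the OpenAI client reads automatically.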

The summarizer.py script could look something like this:

from openai import OpenAI
import os

# Initialize the OpenAI client (reads your API key from the OPENAI_API_KEY environment variable)
client = OpenAI()

# A simple in-memory cache for demonstration purposes
cache = {}

def check_cache(prompt):
    """Checks if the response for the given prompt is cached."""
    if prompt in cache:
        print("Cache hit")
        return cache[prompt]
    print("Cache miss")
    return None

def call_openai_api(prompt):
    """Calls the OpenAI API to get a response for the given prompt."""
    # Check if the response is cached
    cached_response = check_cache(prompt)
    if cached_response is not None:
        return cached_response

    # If not cached, make an API call
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=1,        # Lower this (e.g. 0.2) for more deterministic summaries
        max_tokens=300,       # Caps the length of each chunk summary
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )

    # Extract the response text
    response_text = response.choices[0].message.content.strip()

    # Cache the response for future use
    cache[prompt] = response_text

    return response_text

def chunk_text(text, chunk_size, overlap):
    """Splits text into overlapping chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break  # Reached the end of the text; avoid emitting a duplicate tail chunk
        start += chunk_size - overlap  # Move the window forward, keeping an overlap
    return chunks

def summarize_chunks(chunks):
    """Summarizes each chunk and combines the results."""
    summaries = []
    for chunk in chunks:
        prompt = f"Summarize the following text:\n\n{chunk}"
        summary = call_openai_api(prompt)
        summaries.append(summary)
    return summaries

def process_files(src_dir, dst_dir, chunk_size=1000, overlap=200):
    """Processes files in the source directory and saves summaries in the destination directory.

    Note: chunk_size and overlap are measured in characters, not tokens.
    """
    for filename in os.listdir(src_dir):
        if not filename.endswith(".txt"):
            continue  # Skip non-text files

        src_path = os.path.join(src_dir, filename)
        dst_path = os.path.join(dst_dir, filename)

        # Read the source file
        with open(src_path, "r", encoding="utf-8") as src_file:
            text = src_file.read()

        # Chunk the text
        chunks = chunk_text(text, chunk_size, overlap)

        # Summarize each chunk
        summaries = summarize_chunks(chunks)

        # Combine the summaries (simple concatenation; see the reduce step sketched
        # below for a true "summary of summaries")
        final_summary = "\n".join(summaries)

        # Save the summary to the destination file
        with open(dst_path, "w", encoding="utf-8") as dst_file:
            dst_file.write(final_summary)

if __name__ == "__main__":
    # Define source and destination directories
    src_dir = "src_files"
    dst_dir = "dst_files"

    # Ensure the destination directory exists
    os.makedirs(dst_dir, exist_ok=True)

    # Process files and generate summaries
    process_files(src_dir, dst_dir)
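As written, the final output is just the chunk summaries concatenated. To get the hierarchical "summary of summaries" you described, you could bolt a reduce step onto the script above - this is only a sketch: summarize_summaries is a name I made up, and it reuses the call_openai_api function defined earlier.

def summarize_summaries(summaries):
    """Condenses the per-chunk summaries into a single final summary (the reduce step)."""
    combined = "\n".join(summaries)
    prompt = f"Combine the following partial summaries into one coherent summary:\n\n{combined}"
    return call_openai_api(prompt)

In process_files you would then replace the join line with final_summary = summarize_summaries(summaries). If the combined summaries are themselves too long for the model's context window, chunk them and reduce again, recursively. Once everything is in place, run python summarizer.py from the project folder.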

This code was generated by ChatGPT - I just want to show the logic… This is the developer community, which means we help developers build stuff…

If you are looking for a ready-made service, I would ask ChatGPT who provides one - or use any other search engine.

Thanks for the beginner’s guide! Just want to confirm a few things:

  1. The output token limit is 8192 (approx 6K words), regardless of which model I use, correct?

  2. I’m curious why, when I use the ChatGPT frontend (paid), the summaries are always only 500+ words, even when explicitly told to go much longer than that. On a related note, when attaching 5 files, it just gives five ~100-word summaries, one per document, instead.

  3. When going through the API, is the quality similar to or better than what regular users get through the frontend?

  4. To make it stay grounded in the source text (RAG-style) and not hallucinate, what else can I do other than lowering the temperature?