I understand the general way is to create a summary of each topic, then have it summarize all the chapters' summaries. However, I have zero know-how on using the API to do this. How can I do it through the regular ChatGPT interface without the tedium of prompting over multiple turns? I have hundreds of transcripts to summarize.
what kind of file format is that?
it is in pure .txt format…
So you have a folder of n documents and want a corresponding folder with a summary of each?
Because a simple Python script can do that trick…
summarizer_project/
├── src_files/ # Directory containing the source files to summarize
│ ├── file1.txt
│ ├── file2.txt
│ └── ...
├── dst_files/ # Directory where the summary files will be saved
├── summarizer.py # Your Python script for summarization
├── requirements.txt # File listing the Python dependencies
└── README.md # Optional, for explaining the project
You would just have to install Python on your computer (ask ChatGPT how to do that), then create that folder structure…
Like create a folder called summarizer_project,
then go inside the folder, create a src_files folder, and put all your .txt files into it,
and so on…
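If clicking around in a file manager is tedious, the same skeleton can be created with a few lines of Python (folder and file names taken from the layout above; writing `openai` into requirements.txt is an assumption — that is the only package the script below imports besides the standard library):

```python
import os

# Create the project skeleton from the layout above
for folder in ("summarizer_project/src_files", "summarizer_project/dst_files"):
    os.makedirs(folder, exist_ok=True)

# requirements.txt: the script only depends on the openai package (assumption)
with open("summarizer_project/requirements.txt", "w", encoding="utf-8") as f:
    f.write("openai\n")
```

After that, drop your transcript .txt files into src_files and save the script below as summarizer.py in the project folder.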
The source of summarizer.py could look like this:
from openai import OpenAI
import os

# Initialize OpenAI client (reads OPENAI_API_KEY from the environment)
client = OpenAI()

# A simple in-memory cache for demonstration purposes
cache = {}


def check_cache(prompt):
    """Checks if the response for the given prompt is cached."""
    if prompt in cache:
        print("Cache hit")
        return cache[prompt]
    print("Cache miss")
    return None


def call_openai_api(prompt):
    """Calls the OpenAI API to get a response for the given prompt."""
    # Check if the response is cached
    cached_response = check_cache(prompt)
    if cached_response is not None:
        return cached_response
    # If not cached, make an API call
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=1,
        max_tokens=300,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    # Extract the response text
    response_text = response.choices[0].message.content.strip()
    # Cache the response for future use
    cache[prompt] = response_text
    return response_text


def chunk_text(text, chunk_size, overlap):
    """Splits text into overlapping chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        start += chunk_size - overlap  # Move the window forward with overlap
    return chunks


def summarize_chunks(chunks):
    """Summarizes each chunk and combines the results."""
    summaries = []
    for chunk in chunks:
        prompt = f"Summarize the following text:\n\n{chunk}"
        summary = call_openai_api(prompt)
        summaries.append(summary)
    return summaries


def process_files(src_dir, dst_dir, chunk_size=1000, overlap=200):
    """Processes files in the source directory and saves summaries in the destination directory."""
    for filename in os.listdir(src_dir):
        if not filename.endswith(".txt"):
            continue  # Skip non-text files
        src_path = os.path.join(src_dir, filename)
        dst_path = os.path.join(dst_dir, filename)
        # Read the source file
        with open(src_path, "r", encoding="utf-8") as src_file:
            text = src_file.read()
        # Chunk the text
        chunks = chunk_text(text, chunk_size, overlap)
        # Summarize each chunk
        summaries = summarize_chunks(chunks)
        # Combine the summaries
        final_summary = "\n".join(summaries)
        # Save the summary to the destination file
        with open(dst_path, "w", encoding="utf-8") as dst_file:
            dst_file.write(final_summary)


if __name__ == "__main__":
    # Define source and destination directories
    src_dir = "src_files"
    dst_dir = "dst_files"
    # Ensure the destination directory exists
    os.makedirs(dst_dir, exist_ok=True)
    # Process files and generate summaries
    process_files(src_dir, dst_dir)
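To see what the overlapping chunking actually produces, here is a minimal standalone sketch of the same chunk_text logic run on a short string — the tiny chunk_size and overlap values are purely for illustration (the script defaults are 1000 and 200, so neighbouring chunks share some context):

```python
def chunk_text(text, chunk_size, overlap):
    """Splits text into overlapping chunks (same logic as in summarizer.py)."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        start += chunk_size - overlap  # each step advances by chunk_size - overlap
    return chunks

# 10 characters, 4-character chunks, 1 character of overlap
print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
# → ['abcd', 'defg', 'ghij', 'j']
```

Note the short tail chunk at the end: the window simply stops when it runs past the text, which is harmless for summarization but worth knowing about.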
This code was generated by ChatGPT — I just want to show the logic… This is the developer community, which means we help developers build stuff…
If you are looking for a ready-made service, I would ask ChatGPT who provides one — or use any other search engine.
Thanks for the beginner’s guide! Just want to confirm a few things:
- The output token limit is 8192 (approx. 6K words), regardless of which model I use, correct?
- I'm curious why, when I use the frontend ChatGPT (paid), the summaries are always only 500+ words even when explicitly told to go much longer than that. On a related note, when attaching 5 files, it just gives 5 × 100-word summaries of each doc instead.
- When going through the API, is the quality similar to or better than the frontend that regular users use?
- To make it focus on RAG and not hallucinate, what else can I do other than lowering the temperature?