How to summarize a large transcript file

Hi everyone, i am using Laravel

I have a large .txt file containing a transcript that I want to summarize using the OpenAI API. Due to token limits, directly sending the entire text as a prompt can be expensive and isn’t practical for large files.

I’m looking for a solution to upload the entire file and have OpenAI summarize it without manually splitting the text into smaller chunks. Are there any strategies or best practices for handling large files with the OpenAI API?

Has anyone encountered a similar issue or have suggestions on effectively processing and summarizing large text files using OpenAI?

Thanks in advance for your help!

Welcome to the Community!

We’ve got quite a bit of material here on the Forum discussing summarization strategies.

Note that if your input file exceeds the context window, then you really don’t have much of a choice but to look into chunking. Chunking is also critical if you’d like to achieve a certain granularity in your summaries. There’s different ways you can go about chunking.

I’m sharing in the following links to a few threads / posts discussing the topic:

3 Likes

Hi @sohailahmad0757 , did you consider batching your .txt file? https://platform.openai.com/docs/guides/batch/getting-started

You can chunk it almost randomly and feed it through batch api.
Not to mention, it is substantially cheaper and with higher rate limit.

Compared to using standard endpoints directly, Batch API has:

  1. Better cost efficiency: 50% cost discount compared to synchronous APIs
  2. Higher rate limits: Substantially more headroom compared to the synchronous APIs
  3. Fast completion times: Each batch completes within 24 hours (and often more quickly)
2 Likes

For this particular use case, summarizing (with custom prompt) I have found Gemini 1.5 to be very impressive. (And it has a 2m token window).
Overall I am not impressed with Gemini, but it is very good (and fast) with that.

1 Like

Hi sir i tried batches but i cant make sense of them can you help me with so i upload a batch input file

{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o", "messages": [{"role": "system", "content": "you need to create a summary of the conversation between the student and AI (Teacher) keep the previous summery in memory "},{"role": "assistant", "content": "Can you tell me what the biggest planet in our solar system is?"},{"role": "user", "content": "Umm, I think it's Jupiter!"}],"max_tokens": 1000}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o", "messages": [{"role": "system", "content": "you need to create a summary of the conversation between the student and AI (Teacher) keep the previous summery in memory "},{"role": "assistant", "content": "That's correct! Jupiter is the largest planet in our solar system. Now, do you remember which planet is closest to the Sun?"},{"role": "user", "content": "Oh, that's Mercury!"}],"max_tokens": 1000}}
{"custom_id": "request-3", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o", "messages": [{"role": "system", "content": "you need to create a summary of the conversation between the student and AI (Teacher) keep the previous summery in memory "},{"role": "assistant", "content": "Great job! Mercury is the closest planet to the Sun. Here's a tricky one: which planet is known as the 'Red Planet' because of its color?"},{"role": "user", "content": "That's Mars! It looks all red and dusty."}],"max_tokens": 1000}}
{"custom_id": "request-4", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o", "messages": [{"role": "system", "content": "you need to create a summary of the conversation between the student and AI (Teacher) keep the previous summery in memory "},{"role": "assistant", "content": "Yes, that's right! Mars is often called the 'Red Planet.' And can you tell me why Earth is special compared to the other planets?"},{"role": "user", "content": "Because it has water and people live here!"}],"max_tokens": 1000}}
{"custom_id": "request-5", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o", "messages": [{"role": "system", "content": "you need to create a summary of the conversation between the student and AI (Teacher) keep the previous summery in memory "},{"role": "assistant", "content": "Exactly! Earth is the only planet we know of that has liquid water and life. You're doing a fantastic job! Last question: what do we call the path that planets take around the Sun?"},{"role": "user", "content": "That's called an orbit, I think!"}],"max_tokens": 1000}}
{"custom_id": "request-6", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o", "messages": [{"role": "system", "content": "you need to create a summary of the conversation between the student and AI (Teacher) keep the previous summery in memory "},{"role": "assistant", "content": "Yes, it’s called an orbit! Well done, you did great today. Keep up the good work!"}],"max_tokens": 1000}}

and i got this output file but i does not contain the summary in the response

{"id": "batch_req_OrkuupO2PeibMOPumwfIxZxe", "custom_id": "request-1", "response": {"status_code": 200, "request_id": "7ff671736759f33a1942c6be03fcdc7b", "body": {"id": "chatcmpl-A1YJezqNCy9rbRHvpwTGBlfaMKiWS", "object": "chat.completion", "created": 1724933602, "model": "gpt-4o-2024-05-13", "choices": [{"index": 0, "message": {"role": "assistant", "content": "That's correct! Jupiter is the biggest planet in our solar system. It is known for its massive size and its famous Great Red Spot, which is a giant storm.", "refusal": null}, "logprobs": null, "finish_reason": "stop"}], "usage": {"prompt_tokens": 62, "completion_tokens": 33, "total_tokens": 95}, "system_fingerprint": "fp_157b3831f5"}}, "error": null}
{"id": "batch_req_KBSKfOSHeQlKXTw5z0YjhidY", "custom_id": "request-2", "response": {"status_code": 200, "request_id": "3487fd0b0e8c84857a2044f8f51d5ce1", "body": {"id": "chatcmpl-A1YJeaVv0GYzaQtyn7KCsZCeQ27gr", "object": "chat.completion", "created": 1724933602, "model": "gpt-4o-2024-05-13", "choices": [{"index": 0, "message": {"role": "assistant", "content": "Exactly, Mercury is the closest planet to the Sun. Well done! Do you have any other questions about the solar system or any specific planet you're curious about?", "refusal": null}, "logprobs": null, "finish_reason": "stop"}], "usage": {"prompt_tokens": 71, "completion_tokens": 32, "total_tokens": 103}, "system_fingerprint": "fp_157b3831f5"}}, "error": null}
{"id": "batch_req_tkuDWvE4FjzQkhTE46zcSkk7", "custom_id": "request-3", "response": {"status_code": 200, "request_id": "40deb5890dd6ca4f01dddb028d83c732", "body": {"id": "chatcmpl-A1YJernNzwZVJjaDEqK1o3FnW9pUk", "object": "chat.completion", "created": 1724933602, "model": "gpt-4o-2024-05-13", "choices": [{"index": 0, "message": {"role": "assistant", "content": "Absolutely right! Mars is known as the 'Red Planet' because of its reddish appearance, which comes from iron oxide (rust) on its surface. Keep up the good work!", "refusal": null}, "logprobs": null, "finish_reason": "stop"}], "usage": {"prompt_tokens": 82, "completion_tokens": 36, "total_tokens": 118}, "system_fingerprint": "fp_157b3831f5"}}, "error": null}
{"id": "batch_req_tBVX0TZHPMpSyefpQN3LX8Im", "custom_id": "request-4", "response": {"status_code": 200, "request_id": "203c6acfc1852e68457fee4c1989575f", "body": {"id": "chatcmpl-A1YJfTU69DuSIGZnRcD3no1si0zY3", "object": "chat.completion", "created": 1724933603, "model": "gpt-4o-2024-05-13", "choices": [{"index": 0, "message": {"role": "assistant", "content": "Exactly! Earth is unique because it has liquid water, which is essential for life, and it supports a diverse range of living organisms, including humans.", "refusal": null}, "logprobs": null, "finish_reason": "stop"}], "usage": {"prompt_tokens": 78, "completion_tokens": 30, "total_tokens": 108}, "system_fingerprint": "fp_157b3831f5"}}, "error": null}
{"id": "batch_req_ed9uVurVZ5rvGuOBHcA6SUik", "custom_id": "request-5", "response": {"status_code": 200, "request_id": "db29e08ff505df22c4f3268cb66e83d1", "body": {"id": "chatcmpl-A1YJhEJSG9IhLXUBooiBYSx0Doohm", "object": "chat.completion", "created": 1724933605, "model": "gpt-4o-2024-05-13", "choices": [{"index": 0, "message": {"role": "assistant", "content": "Perfect! You're right; it's called an orbit. Planets follow a curved path around the Sun due to the gravitational pull. Great work today! We've covered a lot of interesting facts about the Earth's uniqueness, the solar system, and planetary orbits. Is there anything else you want to ask or explore?", "refusal": null}, "logprobs": null, "finish_reason": "stop"}], "usage": {"prompt_tokens": 87, "completion_tokens": 61, "total_tokens": 148}, "system_fingerprint": "fp_157b3831f5"}}, "error": null}
{"id": "batch_req_4H4BtMZHK0mrbVVDeqF40dZ0", "custom_id": "request-6", "response": {"status_code": 200, "request_id": "0addc0c3726d2c8f6eabfac2015fbf3b", "body": {"id": "chatcmpl-A1YJe1JzW4czT1Y8goMQwmPmnSdw6", "object": "chat.completion", "created": 1724933602, "model": "gpt-4o-2024-05-13", "choices": [{"index": 0, "message": {"role": "assistant", "content": "Sure! Please provide the conversation details between the student and the AI (Teacher) that you would like summarized.", "refusal": null}, "logprobs": null, "finish_reason": "stop"}], "usage": {"prompt_tokens": 58, "completion_tokens": 22, "total_tokens": 80}, "system_fingerprint": "fp_157b3831f5"}}, "error": null}
1 Like

Hi @sohailahmad0757 , did you begin the file preparation from this guideline topic onwards? I’m also finding it difficult to process what you have shared.

2 Likes

Hi @sohailahmad0757!

Is your goal to create a single summary of the full conversation flow, i.e.:

AI Teacher: “Can you tell me what the biggest planet in our solar system is?”
Student: “Umm, I think it’s Jupiter!”
AI Teacher: “That’s correct! Jupiter is the largest planet in our solar system. Now, do you remember which planet is closest to the Sun?”
Student: “Oh, that’s Mercury!”
AI Teacher: “Great job! Mercury is the closest planet to the Sun. Here’s a tricky one: which planet is known as the ‘Red Planet’ because of its color?”

If yes, then basically you need to include the full conversation flow in one line. This could look as follows (illustrative only - the user message is abbreviated in this example):

{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o", "messages": [{"role": "system", "content": "you need to create a summary of the conversation between the student and AI (Teacher)"},{"role": "user", "content": "Conversation: AI Teacher: Can you tell me what the biggest planet in our solar system is?, Student: Umm, I think it's Jupiter!, AI Teacher: That's correct! Jupiter is the largest planet in our solar system. Now, do you remember which planet is closest to the Sun?, Student: Oh, that's Mercury!, AI Teacher: Great job! Mercury is the closest planet to the Sun. Here's a tricky one: which planet is known as the 'Red Planet' because of its color? ... "}],"max_tokens": 1000}}

Now you said at the beginning that you have a large transcript file that exceeds the token limit. So this is where chunking comes into play. You need to chunk your conversation flow. Then each chunk of the conversation is included in one line of the batch request with a similar approach as above.

Note that each line is executed as a separate API call. There is no dependency between two lines and the content from the conversation in one line is not considered in the next line of your batch.

Once the batch job has been completed, you can concatenate the individual summaries into a complete summary. This may still require another API call to ensure the summaries are well connected, language aligned etc.

3 Likes