How can a newbie use the o1 or 4o-mini API to analyze and summarize a large text doc (over 40k words)?

Hi guys! I have a large text file with over 40k English words to process. I tried using o1 and 4o to help me write Python code that analyzes and summarizes it via the API. But as a coding newbie with little experience, the code ChatGPT produces usually tries to use the GPT-4 API and split the text to make it work. If the text is split, the summary and output become inaccurate because of the incomplete information. Are there any ways to turn a long text file into a summary report with a fixed format (including a list of subtopics and corresponding excerpts from the original text)?

Hi!
Can you share a little more on your desired outcome? Do you just need to analyze and summarize the doc for yourself, or do you want to create a little program using the API to do that?
Here are some general pointers though:
GPT-4o and o1-mini have a context window of 128k tokens, which should be able to take your text doc as input and summarize it (a token is generally about 3/4 of an average-length word, so 40k words is roughly 53k tokens). However, the output tokens are limited, most of the time to 8k tokens. That means you won't get a long and very detailed summary. As a test, you can just put the document in your prompt and ask whichever model you choose to summarize it. If you also need to do more analysis, that would most likely require some more work.
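For that quick test, something like the following should work. (A minimal sketch, assuming the openai Python package v1+ is installed and OPENAI_API_KEY is set in the environment; the filename is a placeholder.)

```python
# Minimal sketch: send the whole document in one prompt.
# Assumes the openai package (v1+) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# ~40k words is roughly 53k tokens, well within the 128k-token context window.
with open("transcript.txt", encoding="utf-8") as f:
    document = f.read()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful summarization assistant."},
        {"role": "user", "content": f"Summarize the following document:\n\n{document}"},
    ],
)
print(response.choices[0].message.content)
```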

2 Likes

I've developed a tool to perform long-text summaries. Send me the file and I'll give you the summary, no matter how long the text is. I can transform a 200-page file into a 20-page file.

1 Like

Welcome to the Community!

You might find the following two threads helpful to get some ideas on how to approach summarization of longer documents:

2 Likes

Hi! Actually, what I want is the former. I want to analyze the doc and get the fixed-format file for myself on my local device. Considering the limited output tokens, I need an efficient way to do it step by step.

Like this:
Step 1: use the API to recognize, split, and mark the text by subtopic.
Step 2: use the API to generate a subtopic list at the beginning of the report.
Step 3: use the API to summarize three subtopic parts at a time and select some key original content from each part. Generating three at a time like this avoids the output restriction, and because the analysis is done on individual segments, the overall context is not lost.
Step 4: considering the output-per-minute limit, a time rule may be needed to pace the process (see the sketch after this list).
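To make steps 3 and 4 concrete, here is a rough sketch under some assumptions: step 1 has already produced a list of subtopic texts, the prompt wording is just an example, and a simple sleep stands in for proper rate limiting. A sketch, not a definitive implementation.

```python
# Rough sketch of steps 3 and 4, assuming the openai package (v1+)
# and that step 1 has already produced `chunks`, a list of subtopic texts.
import time
from openai import OpenAI

client = OpenAI()
chunks: list[str] = []  # fill with the subtopic texts from step 1

def summarize_batch(batch: list[str]) -> str:
    """Summarize up to three subtopic chunks in a single request (step 3)."""
    joined = "\n\n---\n\n".join(batch)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": "For each of the following subtopic sections, write a "
                       "short summary and quote two or three key excerpts "
                       "verbatim:\n\n" + joined,
        }],
    )
    return response.choices[0].message.content

report_parts = []
for i in range(0, len(chunks), 3):   # three subtopics per request
    report_parts.append(summarize_batch(chunks[i : i + 3]))
    time.sleep(20)                   # crude pacing to respect rate limits (step 4)

report = "\n\n".join(report_parts)
```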

It sounds like what you are trying to do is similar to semantic chunking of a document.

We have an extensive thread on the topic here including techniques and code examples on how to accomplish that:

If your document is structured into different sections that are demarcated by section and subsection headers, then it is fairly straightforward to achieve that.
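For illustration, if the document happened to use markdown-style headers, a simple regex split would be enough. This is only a sketch under that assumption, not the exact technique from the linked thread; a raw transcript without headers would need a smarter splitter.

```python
# Sketch of header-based chunking, assuming markdown-style "#"/"##" headers.
import re

def chunk_by_headers(text: str) -> list[tuple[str, str]]:
    """Return (header, body) pairs, splitting on lines starting with 1-3 '#'."""
    parts = re.split(r"(?m)^(#{1,3} .+)$", text)
    # With a capturing group, re.split keeps each header at an odd index,
    # followed by the body text that runs up to the next header.
    return [(parts[i].strip(), parts[i + 1].strip())
            for i in range(1, len(parts), 2)]
```

From there, each (header, body) pair can be fed through a batched summarization loop like the one sketched earlier in the thread.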

I have created my own summarization tool grounded in semantic chunking and can attest that you can generate fairly detailed summaries with this technique while ensuring that the logical flow of the original document is maintained.

1 Like

It's a podcast transcript file, so it should be processed by subtopics.

Alright, thanks for the additional context. However, I think you just answered your own question ; ). That is also pretty much how I would do it. What is it that you are struggling with? Is it the programming itself?