How do I summarise a block of text larger than the token limit?

(Please excuse my massive ignorance on this topic, I’m very new here and I don’t know what I don’t know.)

How do I get summaries of large blocks of text from the OpenAI API? It seems the answer used to be /answers but that is deprecated as of a couple of weeks ago. It seems to have something to do with “embeddings” but I don’t see how: I send a chunk of text to the /embeddings endpoint, I get some vectors (?) back and then I do what with them exactly?

(Our specific use case is taking dozens of one-paragraph customer reviews and then obtaining a few bullet points of the themes of the whole set of reviews in aggregate.)

I would probably use text-davinci-003 and give it an example or two…

You could make a prompt like “Summarize the text below…”

Then have a long paragraph and underneath it the summary. Hope this helps! Let me know if you have more specific questions.

(And welcome to the forum!)

That’s the issue—when “the text below” is too many tokens to include in the prompt in one go, what do I do?

I would summarize in chunks then.

Maybe 1000 tokens for the prompt which would leave around 1000 tokens for the summary… or maybe 1500 / 500…
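To make the budgeting idea concrete, here is a minimal sketch of grouping paragraphs so that each request stays under a prompt budget. It assumes a rough rule of thumb of about four characters per token, which is only an approximation and not the model's real tokenizer. The names `estimate_tokens` and `pack_into_batches` are hypothetical helpers, not part of any API.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def pack_into_batches(paragraphs: list[str], prompt_budget: int = 1000) -> list[list[str]]:
    """Greedily group paragraphs so each batch stays within prompt_budget tokens."""
    batches: list[list[str]] = []
    current: list[str] = []
    used = 0
    for paragraph in paragraphs:
        cost = estimate_tokens(paragraph)
        # Start a new batch when adding this paragraph would exceed the budget.
        if current and used + cost > prompt_budget:
            batches.append(current)
            current, used = [], 0
        current.append(paragraph)
        used += cost
    if current:
        batches.append(current)
    return batches
```

A paragraph that is larger than the budget on its own still ends up in a batch by itself, so it would need to be split separately before being sent.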

How long is the text you’re trying to summarize?

That’s not what I want to do though. It’s not one huge slab of text like a magazine article, it’s many single paragraphs (written by different people, like reviews) that I want a summary of in aggregate.

That would make it even easier to batch process them, I think. Include XX number of paragraphs (depending on length) then have it summarize…

Then maybe have it summarize the summaries?
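As a rough sketch of the "summarize, then summarize the summaries" idea: the function below takes batches of paragraphs and a `summarize` callable. The callable is a placeholder for whatever completion request you end up making, so nothing here depends on a particular API. `summarize_in_two_passes` is a hypothetical name.

```python
from typing import Callable


def summarize_in_two_passes(
    batches: list[list[str]],
    summarize: Callable[[str], str],
) -> str:
    """Summarize each batch, then summarize the batch summaries together."""
    # First pass: one summary per batch of paragraphs.
    partial_summaries = [summarize("\n\n".join(batch)) for batch in batches]
    # With a single batch there is nothing left to combine.
    if len(partial_summaries) == 1:
        return partial_summaries[0]
    # Second pass: one summary over all the first-pass summaries.
    return summarize("\n\n".join(partial_summaries))
```

In practice `summarize` would wrap a prompt such as “Summarize the text below…” followed by the text it is given.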

Do you have an example of a prompt you have tried already?

It doesn’t make intuitive sense to me that this would work.

Seems to me you could end up “gerrymandering” your results and losing information, depending on how things were grouped. For example, if you summarised nine reviews in batches of three which said (for simplicity):

Red
Blue → Summary: Blue
Blue

Red
Blue → Summary: Blue
Blue

Red
Blue → Summary: Blue
Blue

… your final summary will be “Blue”, but a third of your reviews said “Red” and you’ve lost that information.

This is an oversimplification obviously, but do you understand what I mean? Small signals in each sub-batch will get “averaged out” of successive summaries, despite being significant in aggregate.
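The effect can be reproduced with a toy stand-in for the summarizer. This assumes, purely for illustration, that a “summary” keeps only the most common item in its input; a real model is not that crude, but the failure mode is the same in kind.

```python
from collections import Counter


def majority(labels: list[str]) -> str:
    """Toy 'summary': keep only the most common label (an assumption)."""
    return Counter(labels).most_common(1)[0][0]


reviews = ["Red", "Blue", "Blue"] * 3

# Summarize in batches of three, then summarize the summaries.
batches = [reviews[i:i + 3] for i in range(0, len(reviews), 3)]
batch_summaries = [majority(batch) for batch in batches]
final_summary = majority(batch_summaries)

# Share of reviews that said "Red" before any summarizing happened.
red_share = reviews.count("Red") / len(reviews)

print(batch_summaries, final_summary, round(red_share, 2))
# prints: ['Blue', 'Blue', 'Blue'] Blue 0.33
```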

You can use this Google Sheet here which is integrated with GPT-3

What I would do is break the piece into random chunks, summarize each random chunk, and then perform a summary of summaries. This process could even be repeated a few times, and since splitting is random, it will give a few different results, which itself can then be integrated. But that might be expensive computationally.
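A minimal sketch of that process, assuming a `summarize` callable that takes a list of texts and returns one text (a placeholder for the real model call). The names `random_chunks` and `repeated_random_summaries` are hypothetical.

```python
import random
from typing import Callable


def random_chunks(items: list[str], chunk_size: int, rng: random.Random) -> list[list[str]]:
    """Shuffle a copy of the items and cut it into chunks of chunk_size."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return [shuffled[i:i + chunk_size] for i in range(0, len(shuffled), chunk_size)]


def repeated_random_summaries(
    items: list[str],
    chunk_size: int,
    runs: int,
    summarize: Callable[[list[str]], str],
    seed: int = 0,
) -> list[str]:
    """Run the chunk -> summarize -> summarize-the-summaries process several times."""
    rng = random.Random(seed)
    finals = []
    for _ in range(runs):
        chunks = random_chunks(items, chunk_size, rng)
        partial = [summarize(chunk) for chunk in chunks]
        finals.append(summarize(partial))
    return finals
```

Each run makes one call per chunk plus one more to combine them, so the cost grows linearly with the number of runs.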

Doing it that way kind of implies there needs to be a human in the loop to assess the quality of the summary, but that defeats the purpose of our use case.

In any case, the size of our input is such that we will probably only need one recursive step so that isn’t too bad. A bit of prompt engineering to pull out the “repeating” and “unique” elements of the first-level summaries might help.

The comparison between segments could be handled by OpenAI itself. Basically, tell the system to take the different summaries and create one final summary from them. Of course, you can test it a few times and see if just one iteration tends to give a good summary, and if it does, scrap the need to do multiple runs.

Personally, I would use embeddings. Those number vectors you’ve seen are used to compare texts with each other, which means you can first cluster or classify your paragraphs before you summarize. If I wanted to know whether the majority of my customers had something positive to say, I’d create a “positive review” class and a “negative review” class, classify each paragraph as either positive or negative, and then calculate the percentage of positives. You could also cluster in an unsupervised way, creating 10 classes that group your paragraphs by type. You might find that everybody either complaining about or praising the colour of the product goes in one class, everybody who said their kid loves it in another, etc. I would then summarize a random set of 20 per class, as that would give you the gist of each class, or I’d just summarize the dominant class, as that would be what the majority of people said.
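To show what “comparing texts” with those vectors can look like, here is a minimal sketch of the positive/negative classification. It assumes you already have one embedding per review and one reference embedding per class (for example, the embedding of a typical positive and a typical negative review). The tiny two-dimensional vectors in the usage note are made up for illustration; real embeddings have far more dimensions. `classify` and `share_of` are hypothetical names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(embedding: list[float], class_embeddings: dict[str, list[float]]) -> str:
    """Return the class whose reference embedding is most similar."""
    return max(
        class_embeddings,
        key=lambda label: cosine_similarity(embedding, class_embeddings[label]),
    )


def share_of(label: str, embeddings: list[list[float]],
             class_embeddings: dict[str, list[float]]) -> float:
    """Fraction of embeddings assigned to the given class."""
    labels = [classify(e, class_embeddings) for e in embeddings]
    return labels.count(label) / len(labels)
```

For example, with `class_embeddings = {"positive": [1.0, 0.0], "negative": [0.0, 1.0]}` and review embeddings `[[0.9, 0.1], [0.8, 0.3], [0.2, 0.7]]`, two of the three reviews are classified as positive.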

How did you previously use the answers endpoint? Did you merely send all paragraphs and ask a question? What was your query/question? I wouldn’t be surprised if that deprecated answers endpoint used to first embed the question and your passages, find the paragraphs that most closely relate to the question, and generate your answer based on only a few of the top matching passages. If that’s the solution you want to replicate, then also experiment with questions rather than asking for a summary. Randomize your paragraphs, select any 20, and ask text-davinci-003: “What did the majority of customers think of this product?”
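If you want to try the retrieval approach, a sketch of the ranking step might look like this. It only illustrates the guess above about how such an endpoint could work; it is not a description of what /answers actually did. It assumes you already have embeddings for the question and for each passage, and `top_matches` is a hypothetical name.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_matches(question_embedding: list[float],
                passage_embeddings: dict[str, list[float]],
                k: int = 3) -> list[str]:
    """Return the IDs of the k passages most similar to the question."""
    ranked = sorted(
        passage_embeddings.items(),
        key=lambda item: cosine_similarity(question_embedding, item[1]),
        reverse=True,
    )
    return [passage_id for passage_id, _ in ranked[:k]]
```

The passages returned would then be pasted into the prompt together with the question, so only the few most relevant paragraphs count against the token limit.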

Hi. If you are dealing with fewer than 500-1000 reviews, it’s probably faster and easier for a human to summarize them, rather than setting up an NLP pipeline. Assuming you are dealing with a large volume of reviews, here is a brainstormed NLP pipeline that you could try:

  1. Assign a unique ID to each review.

  2. Perform two different, separate types of classification of all the reviews:
    a) Set the classes as “positive”, “negative” and “neutral.”
    b) Set the classes as “detailed” and “generic.”
    (I haven’t done classification myself but I think you can just provide 2 or 3 examples of each class for the API to learn.)

  3. Using the reviews’ unique IDs, find the ones that, based on the two classifications, are BOTH “generic” AND “neutral.” These are reviews that probably won’t be very useful for writing a summary. So drop those reviews from your dataset.

  4. However, do make a note of what proportion of reviews these “useless” ones represent, since that is rich information in itself. Is it half your dataset? Or just 10%? Or 90%? Depending on the proportion of “useless” reviews, your summary could include a sentence like “Approximately one third of customers’ reviews were generic and neutral, so not very useful for business decision-making. For the remaining 2/3 of customer reviews, here is a summary of them:…”

  5. Get vector embeddings for each remaining “useful” review. (Since they are only one paragraph each, no need to splice or summarize them before getting the embeddings.) Associate each review’s unique ID to its corresponding embedding.

  6. As @carla suggested, cluster the embeddings by semantic similarity in an unsupervised manner. For the steps below, let’s just assume there are 15 clusters.

  7. Each of the 15 clusters will contain n semantically similar reviews. For each cluster, get the IDs for each review, and divide them using the classes that you created earlier. Those classes will be “positive”, “negative” and “neutral and detailed” (because you dropped “neutral and generic” from the dataset).

  8. So for 15 clusters, each will be divided three ways, for a total of 45 groupings (unless some clusters don’t have all 3 categories in them, that’s ok). The reviews in each grouping are semantically similar AND share the feature of being either positive, negative, OR neutral/detailed.

  9. From each of the 15 “positive” groupings and 15 “negative” groupings, it’s probably safe to drop any reviews that were classed earlier as “generic,” since those won’t be very useful for the summary.

  10. Now your dataset has 3 categories of reviews:
    i) semantically similar, detailed negative reviews
    ii) semantically similar, detailed positive reviews
    iii) semantically similar, detailed neutral reviews.

  11. Check the token count of the 3 categories. Is each category < approximately 4000 tokens? If so, summarize each category separately using text-davinci-003. Each summary can be one paragraph in your final summary, plus you can add the sentence noting the proportion of “useless” reviews mentioned above, if you think it’s helpful.

  12. But what if one or more of the 3 categories is > 4000 tokens? I’ll brainstorm about that and try to write a follow-up post…my brain is tired right now.
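The bookkeeping in steps 3, 4 and 7 to 9 can be sketched as below. This assumes the classification and clustering have already been done elsewhere and their labels are stored on each review; the `Review` fields and both function names are hypothetical. For simplicity the cluster index is stored on every review, even though in the steps above clustering only happens after the “useless” reviews are dropped.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Review:
    review_id: int
    sentiment: str  # "positive", "negative" or "neutral" (step 2a)
    detail: str     # "detailed" or "generic" (step 2b)
    cluster: int    # cluster index from step 6


def split_useful(reviews: list[Review]) -> tuple[list[Review], float]:
    """Steps 3-4: drop generic AND neutral reviews, and report their share."""
    useful = [r for r in reviews
              if not (r.sentiment == "neutral" and r.detail == "generic")]
    dropped_share = 1 - len(useful) / len(reviews) if reviews else 0.0
    return useful, dropped_share


def group_by_cluster_and_sentiment(reviews: list[Review]) -> dict[tuple[int, str], list[int]]:
    """Steps 7-9: group review IDs by (cluster, sentiment), skipping generic ones."""
    groups: dict[tuple[int, str], list[int]] = defaultdict(list)
    for review in reviews:
        if review.detail == "generic":
            continue
        groups[(review.cluster, review.sentiment)].append(review.review_id)
    return dict(groups)
```

Each resulting group is a candidate for one summarization request, provided its text fits within the token limit from step 11.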

I hope the above is helpful. Leslie
