Document Cutting

malakarnikheil · August 1, 2021, 10:49am

Suppose I have a paragraph like this:

Hi, My name is Nik, I really like to walk. I walk litterally everywhere.
Walking gives me a break from everything and clear my mind.
Everyone should take at least 2 hours of walk a day.
A lot of famous do too like world leaders.

How should I divide this paragraph?
What happens if a document is broken up into too many pieces?
What if the document is one whole piece?
What is recommended?

Piece = {"text": "This is a piece :)"}

carla · August 1, 2021, 12:20pm

malakarnikheil:

Hi, My name is Nik, I really like to walk. I walk litterally everywhere.
Walking gives me a break from everything and clear my mind.
Everyone should take at least 2 hours of walk a day.
A lot of famous do too like world leaders.

This piece of text is 57 tokens, which you can calculate here: Token estimator

Keep in mind that if you’re calling the completion API endpoint, or training your own fine-tuned model, each call or each fine-tune example is limited to 2048 tokens, which is the sum of both your prompt and the model’s completion text.

Documents may contain multiple paragraphs, and you can also supply multiple paragraphs in a single prompt or as the expected generated text in a completion (in training examples, for instance). In a string, each paragraph might be separated with a newline character “\n”. As long as the sum of all paragraphs’ tokens is still under 2048, the combined text can still be used in a single API call or as a single fine-tune training example.
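As a rough sketch of the budgeting described above, the snippet below packs paragraphs into newline-joined chunks that stay under a token budget. The helper `estimate_tokens` is a hypothetical stand-in that uses the common rule of thumb of roughly four characters per token for English text; for real counts, use the Token estimator or an actual tokenizer. The `reserve_for_completion` parameter is also an assumption, added only to illustrate that the 2048 limit covers the prompt plus the completion.

```python
import math

MAX_TOKENS = 2048  # limit mentioned above: prompt + completion combined


def estimate_tokens(text: str) -> int:
    """Hypothetical rough estimate: about 4 characters per token (assumption)."""
    return math.ceil(len(text) / 4)


def pack_paragraphs(paragraphs, reserve_for_completion=256, max_tokens=MAX_TOKENS):
    """Group paragraphs into newline-joined chunks that fit the prompt budget.

    A single paragraph larger than the budget is still emitted as its own
    chunk; splitting it further is left to the caller.
    """
    budget = max_tokens - reserve_for_completion
    chunks, current, used = [], [], 0
    for para in paragraphs:
        # +1 is a rough allowance for the "\n" separator (assumption)
        cost = estimate_tokens(para) + 1
        if current and used + cost > budget:
            chunks.append("\n".join(current))
            current, used = [], 0
        current.append(para)
        used += cost
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Each returned chunk can then be sent as one prompt or used as one training example, with the reserved tokens left over for the model's completion.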

The paragraph you’ve given as an example is very small in comparison to the 2048 limit, but what output would you expect from the model?

I notice you’re using {“text”: as opposed to {“prompt”:, so are you planning on calling the classifications endpoint? In that case, the entire paragraph you’ve specified can be used as the “text” string, and you would then specify “label”: “Health” next to it, as can be seen here: https://beta.openai.com/docs/guides/classifications

Note that the same can be achieved (in my experience, with better accuracy) by fine-tuning, although in that case your fine-tuning training file will consist of prompt-completion pairs rather than text-label pairs.
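To make the difference between the two file shapes concrete, here is a minimal sketch that turns text-label pairs into prompt-completion pairs, one JSON object per line. The `SEPARATOR` value and the leading space in the completion are illustrative conventions assumed here, not something stated in this thread, and the function names are hypothetical.

```python
import json

# Assumed marker showing the model where the prompt ends (illustrative choice)
SEPARATOR = "\n\n###\n\n"


def to_prompt_completion(example: dict) -> dict:
    """Convert one {"text": ..., "label": ...} pair to a prompt-completion pair."""
    return {
        "prompt": example["text"] + SEPARATOR,
        "completion": " " + example["label"],  # leading space is an assumption
    }


def to_jsonl(examples) -> str:
    """Serialize the converted examples as JSON lines, one example per line."""
    return "\n".join(json.dumps(to_prompt_completion(e)) for e in examples)


if __name__ == "__main__":
    sample = [{"text": "Everyone should take at least 2 hours of walk a day.",
               "label": "Health"}]
    print(to_jsonl(sample))
```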

The benefit of shorter pieces of text is that they’re more specific and easier for the model to learn from, so if you’re classifying one sentence at a time, you can potentially achieve higher accuracy. If you’re using full paragraphs, on the other hand, the model will have greater context, improving its “understanding” of the text. I personally would recommend feeding in an entire paragraph at a time, for the contextual benefit.

Perhaps explain your use case; you’ll get more responses from the community.

malakarnikheil · August 3, 2021, 4:17am

Thanks Carla,

My use case is simple: answering questions from a book of about 100 to 200 pages.
The only reason I was using {“text”: is because I would upload the file and use it for answering the questions I have about the book (completion).

I have got the answer to my question so thank you very much and I will upload it paragraph by paragraph.

Have a good day!

malakarnikheil · August 3, 2021, 1:49pm

Another question here.

If I have a text like this:

Title
text
text

Title 
text
text
text

Should I separate it into two pieces, one per title, or split every line into its own piece?
Does the order of the text and the title matter at all?

Thanks

carla · August 3, 2021, 5:16pm

I deal with this problem too. In my case, I write software that analyzes insurance policies that might state, for example:
Coverages C - Personal Property
We cover:
xxx
We do not cover:
xxx
xxx
8. Animals

…and somewhere else in the same document, I have:

PERILS INSURED AGAINST
We cover the following causes of loss or damage:
xxxx
xxx
xxxx
3. Animals

So, the first says the insurance policy doesn’t cover Animals (as in livestock) and the second says that the policy covers Animals (as in animal damage), meaning the titles are very important. I overcome this issue by building “text-branches” that look like this:

Coverages C - Personal Property
    We do not cover:
        8. Animals

and

PERILS INSURED AGAINST
    We cover the following causes of loss or damage:
        3.  Animals

and now that we’re starting to use GPT-3 for processing these texts, I feed text into the model as text branches.

I don’t know if this will work for your use case, but you can think of text-branches as a tree structure of your document, where any paragraph in your document can have multiple titles and subtitles, all of which add context to that paragraph.
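The text-branch idea can be sketched roughly as follows. This is not the actual software described in this post, only a minimal illustration under stated assumptions: the document has already been parsed into a hypothetical list of `(depth, text)` pairs in document order, and depth increases by at most one level at a time. Each leaf line is emitted together with all of its titles and subtitles.

```python
def render(path, indent="    "):
    """Render a list of headings (outermost first) as an indented branch."""
    return "\n".join(indent * level + text for level, text in enumerate(path))


def build_text_branches(outline):
    """Return one text-branch per leaf of an outline.

    outline: list of (depth, text) pairs in document order (assumed input
    format). A line is a leaf when the next line is not nested deeper.
    """
    branches = []
    stack = []  # stack[i] is the current heading at depth i
    for i, (depth, text) in enumerate(outline):
        del stack[depth:]  # drop headings at this depth or deeper
        stack.append(text)
        next_depth = outline[i + 1][0] if i + 1 < len(outline) else -1
        if next_depth <= depth:  # nothing nested underneath: a leaf
            branches.append(render(stack))
    return branches


if __name__ == "__main__":
    outline = [
        (0, "Coverages C - Personal Property"),
        (1, "We cover:"),
        (2, "xxx"),
        (1, "We do not cover:"),
        (2, "xxx"),
        (2, "8. Animals"),
    ]
    for branch in build_text_branches(outline):
        print(branch)
        print()
```

Because every branch repeats its own titles, "8. Animals" and "3. Animals" end up in different branches that can no longer be confused with each other.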

It is my understanding that, when fine-tuning or when supplying texts for search or classification, each example or JSON line is treated in isolation. One JSON line does not carry context from a previous JSON line in a training file, as text in a normal document typically would. So if a piece of text is not included in a given JSON line, the model does not directly relate it to that line. Therefore, the order of your JSON lines makes no difference to training or language interpretation at all. On the other hand, I have noticed that “learnings” from one JSON line definitely do influence the interpretation of other lines, for example when a label is defined in one JSON line and identified in another. Again, it’s my understanding that the definition might as well be right at the bottom of the training file and its usage right at the start; it makes no difference.
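If each JSON line really is treated in isolation, as described above, then every line has to carry its own titles, and the order of the lines should not matter. The sketch below illustrates that under those assumptions: it writes each piece of text as its own JSON line using the “text” key from earlier in the thread, and optionally shuffles the lines. The function name and the `seed` parameter are hypothetical.

```python
import json
import random


def pieces_to_jsonl(pieces, seed=None):
    """Write each self-contained piece of text as one JSON line.

    If a seed is given, the lines are shuffled to illustrate that, per the
    understanding above, their order carries no meaning.
    """
    records = [{"text": piece} for piece in pieces]
    if seed is not None:
        random.Random(seed).shuffle(records)
    return "\n".join(json.dumps(record) for record in records)


if __name__ == "__main__":
    pieces = [
        "Coverages C - Personal Property\n    We do not cover:\n        8. Animals",
        "PERILS INSURED AGAINST\n    We cover the following causes of loss or damage:\n        3. Animals",
    ]
    print(pieces_to_jsonl(pieces, seed=0))
```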
