Yes that helps. Will implement and follow up if I have any additional questions.
Hi Ray. It is generating very few questions, and the typical user questions are not covered at all. Is there any way we could train the model? I wanted to train it by taking consecutive sentences as question-and-answer pairs, but my colleague insists it is a futile exercise because user queries will not match the sentences in the text, so the similarity would be minimal.
The web page data can be segregated into headings and content, which is not the case with the PDF documents we push: their format is not standard, and the processed text has nothing that really counts as a heading. What do you suggest?
You may find that you are better off looking at embedding instead of fine-tuning
Check your private chat section for more info
I hear there is a really good course on the topic on Udemy
Can somebody please post the link to the Udemy course, or send it to me via PM?
Also check your private message for something a little bit extra
The embedding videos are all free on that course link. They are set up as “preview” videos, but they have the full content.
Dear Raymond, thank you very much for the support extended. It has been a nice learning experience e-interacting with you.
I think this is my favorite article on all the community boards!
I have a related question, though.
This thread started with the recommendation for fine-tuning long text using (empty) prompts and completions of 1000 tokens. Is that still the best-practice?
Along those lines, @raymonddavey has completely convinced me that embedding is the much better option for this general use-case (adding additional information to GPT-3’s knowledgebase). Is there is a recommended optimal size for the chunks of large text that gets embedded? For example, if I have a document of 50,000 tokens, is the optimal size for embedding 2000 tokens, 1000 tokens, 500 tokens or one sentence?
The trade-offs that I see are:
Embed large chunks (e.g., whole books of the Bible), which will lower the overall cosine similarity between the query and the large text.
In my experiments, cosine similarity is sensitive to the relative size of the two pieces of text: all else being equal, the score tends to be lower when one piece of text is much smaller than the other. And in my case, I’m comparing one question embedding (e.g., “What is the purpose of prayer?”) against perhaps the entire Bible, looking for semantic matches on purpose and prayer.
Break the text into chapters of a few pages, which should increase the cosine similarity, but may not be granular enough to be useful (and still expensive if I have to send a chapter per query).
My concern is that if I measure the cosine similarity between one sentence and one chapter, the similarity score will still be markedly lower, at least in part because of the size difference between the two texts. In response,
I would think we should embed at the sentence or verse level, but that seems expensive, and likely to lose a lot of the context.
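For reference, the ranking step I keep describing is just nearest-neighbour search over pre-computed chunk embeddings. A minimal sketch in plain Python (the vectors are assumed to come from an embeddings API; note that cosine similarity itself is length-normalised at the vector level, so any dilution with large chunks would come from one embedding averaging many topics rather than from raw vector magnitude):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity compares direction, not magnitude:
    # both vectors are divided by their own norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_chunks(query_vec, chunk_vecs, k=3):
    # Rank pre-embedded chunks against the query and keep the best k indices.
    scored = sorted(
        ((cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)),
        reverse=True,
    )
    return [i for _, i in scored[:k]]
```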
What is the best-practice for embedding size?
I just rewatched @raymonddavey’s awesome videos, and I think he might have answered my question for me. In the em007 video (Intro to CSV and Semantic Search) at approximately 1:35, Raymond mentioned that breaking up text into 1500 - 2000 words is normally a good choice. Just out of curiosity, where did that recommendation come from? What are the tradeoffs compared to breaking up text into e.g., 500 words? (Of course more embeddings and less context, but would that be offset by more targeted results from the semantic search)?
Thanks for the kind words about the course.
After a lot of experimentation, the best range appears to be about 350 to 500 tokens per block.
(This equates to 8.5% to 12% of the max_tokens for the model I ask the final question with - not the embedding model. If the model size increases, I would probably increase by a similar factor)
We combine paragraphs together by searching for headings and follow-on text. If we can combine two or more paragraphs into one block, we do. We always restart when we hit a major heading, even if the previous block is not full.
By doing this, we can include between 4 and 6 contexts when we ask the final question and still leave enough room for the completion. The blocks are normally (but not always) the top hits from a semantic search. Sometimes we can fit more or fewer; it depends on the number of tokens you decide to use to provide context. We used between 30 and 50% (purely a cost decision by the user, back when Davinci was still the only expensive option).
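A rough sketch of that packing step (not our actual code; count_tokens stands in for whatever tokenizer you use, and the budget would be your chosen 30-50% of the model's max tokens):

```python
def pack_contexts(ranked_blocks, count_tokens, budget):
    # Greedily add the highest-ranked blocks until the token budget
    # is exhausted; blocks that no longer fit are skipped so a
    # smaller, lower-ranked block can still squeeze in.
    chosen, used = [], 0
    for block in ranked_blocks:
        cost = count_tokens(block)
        if used + cost > budget:
            continue
        chosen.append(block)
        used += cost
    return chosen
```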
By including more contexts, we managed to get information from different parts of a single document - or (better yet) parts from multiple document sources. This really helped the AI provide a strong answer that was on topic.
Let me know if you need more info on what we did. Others may have done something different.
This is very helpful!
Part of the reason I ask is, as you know, I’m working on my thesis, which includes a comparison of fine-tuning versus embedding. My hope was that I could break up the text into same-sized chunks. That way I can remove (differing) size as a factor, and use the same set of 500-token blocks to feed into fine-tuning, and then again directly into embedding.
“We combine paragraphs together by searching for headings and follow-on text. If we can combine two or more paragraphs into one block, we do. We always restart when we hit a major heading, even if the previous block is not full.” <— This makes a lot of sense. In my case, it’s all conversational data, which has no textual, logical, or grammatical breaks, so I just fill the blocks until the next sentence won’t fit.
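In code, my block-filling loop looks roughly like this (a sketch only; count_tokens is a stand-in for whatever tokenizer is in use):

```python
def chunk_sentences(sentences, count_tokens, block_limit):
    # Fill each block with consecutive sentences until the next one
    # would push it past the limit, then start a new block.
    blocks, current, used = [], [], 0
    for sentence in sentences:
        cost = count_tokens(sentence)
        if current and used + cost > block_limit:
            blocks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += cost
    if current:
        blocks.append(" ".join(current))
    return blocks
```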
This stuff is pretty fun!
Just out of curiosity, when would you submit a block of context that is not at top of the semantic search results?
Sometimes you get a strong hit with a bibliography page or table of contents, index or similar. You want to ignore these as they don’t contain useful data.
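A crude way to drop those hits before packing the context (the patterns here are just examples; you would tune them to your own corpus):

```python
import re

# Example patterns for non-content pages; extend as needed for your corpus.
BOILERPLATE = re.compile(
    r"\b(bibliography|table of contents|index|references)\b", re.IGNORECASE
)

def filter_hits(ranked_blocks):
    # Discard high-scoring hits that are navigational pages rather
    # than content (bibliographies, tables of contents, indexes).
    return [b for b in ranked_blocks if not BOILERPLATE.search(b)]
```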
I’ve gotten a lot of really good advice from these forums, and I appreciate all your advice. I have yet another follow-up question (worry not; I’m sure it won’t be my last).
The consensus seems to be that the best completion size for fine-tuning on long text is 1000 tokens (as mentioned by @PaulBellow), and @raymonddavey mentioned that his experimentation put the optimal size for blocks of semantic text at 350 - 500 tokens. I’m doing this for my thesis, and you know the motto of academia: citations or it didn’t happen.
Does anyone know of any studies, research or white papers that suggest the optimal size for fine-tuning and context-injection blocks?
I think the 500 tokens comes from the observation that an idea is encapsulated in 1 to 3 paragraphs, and 500 tokens is up to 3 average paragraphs. The 1000-token completion just makes sense, since it gives the model some room to breathe while keeping its output to at most 5-10 ideas. You probably don’t want the model to create many more output ideas than input ideas; otherwise it can start to drift. At least that is my theory.
More discussion in this thread:
Have another chat with Ricardo. He has tried long contexts (5000 tokens) and short contexts (250 and 400 tokens) on GPT-4 and their massive corpus.
He found the long embedding gave better results
But, a HUGE caveat: His use case is not normal. He is chaining queries.
Because each context is so long, he has to pass the output from the first run (that had a single context) as the input with the next context in the list. He asks GPT to improve the output from the first pass, with the new context in the second pass.
They are running 10 iterations or passes. (Or stop when they hit a minimum dot product value)
It means they are processing 50,000 tokens for a single query. Their corpus is huge, though - so this is just a small portion of the overall knowledge.
They did the same thing with smaller embeddings, using multiple snippets to fill up the 5000 tokens of context, and the results were not as good, even though the context covered more data sources.
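In rough outline, the chaining looks like this (my paraphrase, not Ricardo's actual code; call_model stands in for the completion call, and scores are the dot products from the semantic search):

```python
def refine_answer(question, contexts, call_model, scores, min_score, max_passes=10):
    # First pass: answer from the best context alone. Each later pass
    # asks the model to improve the previous answer using the next
    # context. Stop after max_passes, or when a context's semantic
    # score falls below the minimum dot-product threshold.
    answer = None
    for i, (context, score) in enumerate(zip(contexts, scores)):
        if i >= max_passes or score < min_score:
            break
        if answer is None:
            prompt = f"Context:\n{context}\n\nQuestion: {question}"
        else:
            prompt = (
                f"Context:\n{context}\n\nPrevious answer:\n{answer}\n\n"
                f"Improve the previous answer using the new context. "
                f"Question: {question}"
            )
        answer = call_model(prompt)
    return answer
```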
But it would be best if you chatted with Ricardo for more detailed info.
I use these functions to trim the text down to fit within the desired token count. I concatenate the text from the top five embedding matches, then run it through this function to make it fit. (I wish I could use tiktoken, but I’m on Python 3.7 and it needs 3.8.)
from textblob import TextBlob

def count_tokens(text):
    # Tokenize the input string and count the tokens
    blob = TextBlob(text)
    return len(blob.words)

def fit_within_token_limit(text, token_limit):
    shortened_text = text
    while count_tokens(shortened_text) > token_limit:
        # Reduce the length of the text by 10% and try again
        shortened_length = int(len(shortened_text) * 0.9)
        shortened_text = shortened_text[:shortened_length]
    return shortened_text
Thank you for the insight and the suggestion! I reached out to Ricardo and look forward to receiving his advice.
Interestingly, along these lines, I’ve been researching ‘narrative segmentation’ which seeks to chop long texts into idea units, and found these two promising leads:
Paper (1) describes an interesting mechanism for using our semantic similarity measurement to find topic changes in texts. Paper (2) uses GPT-3 (of course!!) for segmentation, which would be very easy, but fairly expensive at scale.
I was curious if anyone had any experience semantically chopping long texts into smaller units.
You can try using this tool to convert HTML text to plain text: https://totheweb.com/learning_center/tools-convert-html-text-to-plain-text-for-content-review/