Thank you for all the knowledge you shared. I have read your embeddings course with attention, but I can't find the answers I'm looking for. I'm sure you have them, though.
The course is all about text comparison and proximity… but never about "text generation" based on a specific corpus.
I use ChatGPT to ask questions or summarize articles/book abstracts that I push directly into the prompt, saying "Please summarize this:" or "according to this text + my question". And it works pretty well. But how do I do this for 10 to 50k words of context?
My aim is to prepare multiple datasets about specific subjects (10 to 50k words) and then use GPT for text generation/summarization. Is there a way to do that ?
Hi Ray. It is generating very few questions, and the typical user questions are not covered at all. Is there any way we could train the model? I wanted to train it by taking consecutive sentences as questions and answers; however, my colleague insists it is a futile exercise because user queries will not match the sentences in the text, so the similarity will be minimal.
The web page data can be segregated into headings and content, which is not the case with the PDF documents we push: the format is not standard, and the processed text doesn't really contain anything you could call a heading. What do you suggest?
I think this is my favorite article on all the community boards!
I have a related question, though.
This thread started with the recommendation for fine-tuning long text using (empty) prompts and completions of 1000 tokens. Is that still the best practice?
Along those lines, @raymonddavey has completely convinced me that embedding is the much better option for this general use case (adding additional information to GPT-3's knowledge base). Is there a recommended optimal size for the chunks of large text that get embedded? For example, if I have a document of 50,000 tokens, is the optimal size for embedding 2000 tokens, 1000 tokens, 500 tokens, or one sentence?
The trade-offs that I see are:
Embed large chunks (e.g., whole books of the Bible), which will lower the overall cosine similarity between the query and the large text.
Cosine similarity takes the relative size of the two pieces of text into account, and so (in my experiments) the cosine similarity between two pieces of text will be lower, all else being equal, if one piece is much smaller than the other. And in my case, I'm comparing one question embedding (e.g., "What is the purpose of prayer?") against perhaps the entire Bible, looking for semantic matches on purpose and prayer (I've put a small sketch of this comparison after the list).
Break the text into chapters of a few pages, which should increase the cosine similarity, but may not be granular enough to be useful (and still expensive if I have to send a chapter per query).
My concern is that if I measure the cosine similarity between one sentence and one chapter, the similarity score will still be markedly lower, due at least in part to the size difference between the two texts. In response, I would think we should embed at the sentence or verse level, but that seems expensive and likely to lose a lot of the context.
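Here is the small sketch I mentioned above of the kind of comparison I've been running. It assumes the pre-1.0 `openai` Python library and `text-embedding-ada-002`; the question and chunk texts are just placeholders, so substitute your own.

```python
import numpy as np
import openai  # assumes the pre-1.0 openai library and an API key in OPENAI_API_KEY

def embed(texts, model="text-embedding-ada-002"):
    """Return one embedding vector per input string."""
    resp = openai.Embedding.create(model=model, input=texts)
    return [np.array(item["embedding"]) for item in resp["data"]]

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

question = "What is the purpose of prayer?"
chunks = {
    "verse-sized": "Pray without ceasing.",
    "chapter-sized": "(paste a few pages of text here)",
}

q_vec = embed([question])[0]
for label, text in chunks.items():
    score = cosine_similarity(q_vec, embed([text])[0])
    print(f"{label}: {score:.4f}")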
I just rewatched @raymonddavey's awesome videos, and I think he might have answered my question for me. In the em007 video (Intro to CSV and Semantic Search), at approximately 1:35, Raymond mentioned that breaking up text into 1500-2000 words is normally a good choice. Just out of curiosity, where did that recommendation come from? What are the trade-offs compared to breaking up the text into, e.g., 500 words? (Of course, more embeddings and less context per chunk, but would that be offset by more targeted results from the semantic search?)
After a lot of experimentation, the best range appears to be about 350 to 500 tokens per block.
(This equates to 8.5% to 12% of the max_tokens for the model I ask the final question with - not the embedding model. If the model size increases, I would probably increase by a similar factor)
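For concreteness, the arithmetic behind those percentages, assuming a 4,096-token window for the question-answering model (my assumption for illustration; substitute your own model's limit):

```python
max_model_tokens = 4096          # assumed context window for the question-answering model
for block_size in (350, 500):
    print(block_size, f"{block_size / max_model_tokens:.1%}")
# 350 -> 8.5%, 500 -> 12.2%
```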
We combine paragraphs together by searching for headings and follow-on text. If we can combine two or more paragraphs into one block, we do. We always restart when we hit a major heading, even if the previous block is not full.
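This isn't our exact code, but the block-building logic is roughly the sketch below. It assumes the document has already been split into paragraphs tagged with whether they sit under a major heading, and it uses `tiktoken` purely for token counting.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # any tokenizer works; we only need counts

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

def build_blocks(paragraphs, max_tokens: int = 500):
    """paragraphs: list of (starts_major_heading, text) tuples in document order."""
    blocks, current = [], ""
    for starts_major_heading, text in paragraphs:
        # Always restart at a major heading, even if the current block is not full.
        if starts_major_heading and current:
            blocks.append(current)
            current = ""
        candidate = (current + "\n\n" + text).strip()
        if count_tokens(candidate) <= max_tokens:
            current = candidate       # combine two or more paragraphs into one block
        else:
            if current:
                blocks.append(current)
            current = text            # paragraph starts a fresh block
    if current:
        blocks.append(current)
    return blocks
```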
By doing this, we can include between 4 and 6 contexts when we ask the final question and still leave enough room for the completion. The blocks are normally (but not always) the top hits from a semantic search. Sometimes we can fit more or fewer - it depends on the number of tokens you decide to use for context. We used between 30 and 50% (purely based on a cost decision by the user, back when Davinci was still the only expensive option).
By including more contexts, we managed to get information from different parts of a single document - or (better yet) parts from multiple document sources. This really helped the AI provide a strong answer that was on topic.
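The budgeting for the final question looks roughly like this - again a sketch, not our production code, reusing the `count_tokens` helper from the sketch above; the 40% share and the prompt wording are placeholders:

```python
def assemble_prompt(question, ranked_blocks, model_max_tokens=4096, context_share=0.4):
    """ranked_blocks: candidate blocks sorted by semantic-search score, best first."""
    budget = int(model_max_tokens * context_share)   # we used 30-50% for context
    chosen, used = [], 0
    for block in ranked_blocks:
        cost = count_tokens(block)
        if used + cost > budget:
            continue                                  # skip blocks that won't fit
        chosen.append(block)
        used += cost
    context = "\n\n---\n\n".join(chosen)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```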
Let me know if you need more info on what we did. Others may have done something different.
This is very helpful!
Part of the reason I ask is, as you know, I'm working on my thesis, which includes a comparison of fine-tuning versus embedding. My hope was that I could break up the text into chunks of the same size. That way I can remove (differing) sizes as a factor and use the same set of 500-token blocks to feed into the fine-tuning, and then again directly into the embedding.
"We combine paragraphs together by searching for headings and follow-on text. If we can combine two or more paragraphs into one block, we do. We always restart when we hit a major heading, even if the previous block is not full." <- This makes a lot of sense. In my case, it's all conversational data, which has no textual, logical, or grammatical breaks, so I just fill the blocks until the next sentence won't fit.
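In code, that sentence-filling is the same idea as the block builder above, just without the heading restarts - a sketch only, where `sentences` is whatever your sentence splitter produces and `count_tokens` is the same kind of tiktoken-based helper as in Raymond's sketch:

```python
def fill_blocks(sentences, max_tokens: int = 500):
    """Greedily pack consecutive sentences into blocks until the next one won't fit."""
    blocks, current = [], ""
    for sentence in sentences:
        candidate = (current + " " + sentence).strip()
        if count_tokens(candidate) <= max_tokens:
            current = candidate
        else:
            if current:
                blocks.append(current)
            current = sentence
    if current:
        blocks.append(current)
    return blocks
```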
Sometimes you get a strong hit on a bibliography page, table of contents, index, or similar. You want to ignore these, as they don't contain useful data.
I've gotten a lot of really good advice from these forums, and I appreciate all of it. I have yet another follow-up question (worry not; I'm sure it won't be my last).
The consensus seems to be that the best completion size for fine-tuning on large text is 1000 tokens (as mentioned by @PaulBellow), and @raymonddavey mentioned that his experimentation revealed the optimal size for blocks of semantic text to be 350-500 tokens. I'm doing this for my thesis, and you know the motto of academia: citations or it didn't happen.
Does anyone know of any studies, research or white papers that suggest the optimal size for fine-tuning and context-injection blocks?
I think the 500 tokens comes from the observation that an idea is encapsulated in 1 to 3 paragraphs, and 500 tokens is up to 3 average paragraphs. The 1000-token completion just makes sense, since it gives the model some room to breathe and keeps its output to no more than 5-10 ideas. You probably don't want the model to create many more output ideas than input ideas; otherwise it can start to drift. At least that is my theory.
Have another chat with Ricardo. He has tried long contexts (5000 tokens) and short contexts (250 and 400 tokens) on GPT-4 and their massive corpus.
He found the long embeddings gave better results.
But, a HUGE caveat: His use case is not normal. He is chaining queries.
Because each context is so long, he has to pass the output from the first run (which used a single context) back in as the input along with the next context in the list. He asks GPT to improve the output from the first pass, using the new context, in the second pass.
They run 10 iterations or passes (or stop when they hit a minimum dot-product value).
It means they are processing 50,000 tokens for a single query. Their corpus is huge, though - so this is just a small portion of the overall knowledge.
They did the same thing with smaller embeddings, using multiple snippets to fill up the 5000 tokens of context, and the results were not as good - even though the context covered more data sources.
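My rough reconstruction of that chaining loop (not Ricardo's actual code; `ask_llm` is a placeholder for whatever completion call you use, and the stopping threshold is made up):

```python
def chained_answer(question, ranked_contexts, ask_llm, max_passes=10, min_score=0.80):
    """ranked_contexts: (dot_product_score, context_text) pairs, best first.
    ask_llm(prompt) -> completion text."""
    answer = ""
    for i, (score, context) in enumerate(ranked_contexts[:max_passes]):
        if score < min_score:
            break                     # stop once the remaining hits are too weak
        if i == 0:
            prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
        else:
            prompt = (
                f"Draft answer:\n{answer}\n\n"
                f"Additional context:\n{context}\n\n"
                f"Improve the draft answer to the question '{question}' "
                "using the additional context.\nImproved answer:"
            )
        answer = ask_llm(prompt)
    return answer
```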
But it would be best if you chatted with Ricardo for more detailed info.