The length of the embedding contents

@raymonddavey

Genius! Since I plan on using ada-002 for embedding, I can just pull the token count from that and store it in the database next to the embedding vector. GPT-4 probably even uses the same tokenizer!

3 Likes

I can confirm they are the same tokenizer

And that is what I do too

But now my C# tokenizer gets me within a few tokens. There seems to be an overhead for each message in the new chat protocol.
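
For anyone counting tokens in Python, here is a minimal sketch with tiktoken (cl100k_base is the encoding shared by ada-002 and the chat models; the per-message overhead figures below are approximate assumptions, not exact values for every model version):

import tiktoken

# cl100k_base is the encoding used by text-embedding-ada-002, gpt-3.5-turbo and gpt-4
enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

def count_chat_tokens(messages, tokens_per_message: int = 4, reply_primer: int = 3) -> int:
    # The chat format adds a fixed overhead per message (role/name framing)
    # plus a few tokens that prime the assistant's reply.
    total = reply_primer
    for message in messages:
        total += tokens_per_message
        for value in message.values():
            total += count_tokens(value)
    return total

print(count_tokens("The length of the embedding contents"))
print(count_chat_tokens([{"role": "user", "content": "Hello!"}]))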

2 Likes

This looks really amazing!
You saying this:

“I have a classifier that determines if the question is a “general” or a “specific” question.”

Is there any document or sample you can suggest for building a classifier like this? I use the title and subtitle when indexing embeddings into the vector database; do you think this makes sense? Are indexes important for semantic search? If not, giving each entry a unique random ID may be the solution.
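
For reference, one possible way to build such a classifier (a sketch only, assuming the pre-1.0 openai Python library, not the original poster’s implementation) is simply to ask a chat model to label each question:

import openai  # assumes the pre-1.0 openai Python library

def classify_question(question: str) -> str:
    # Label a question as 'general' or 'specific'
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Classify the user's question as 'general' (about a document as a whole) "
                        "or 'specific' (about a particular detail). Answer with a single word."},
            {"role": "user", "content": question},
        ],
    )
    label = response["choices"][0]["message"]["content"].strip().lower()
    return "general" if "general" in label else "specific"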

1 Like

It may be necessary to use metadata in the prompt, but when creating embeddings, if the metadata contains important information that helps match the correct embedding to the user's question, I think you should include that metadata in the embedded text itself. For the question to match the embedding, that metadata must have been part of the embedding in the first place.
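
A minimal sketch of that idea, assuming a title/subtitle pair and the pre-1.0 openai Python library (the field names are just placeholders):

import openai  # assumes the pre-1.0 openai Python library

def embed_with_metadata(text: str, title: str, subtitle: str) -> list:
    # Prepend the metadata to the chunk so it becomes part of the vector itself
    enriched = f"Title: {title}\nSubtitle: {subtitle}\n\n{text}"
    response = openai.Embedding.create(model="text-embedding-ada-002", input=enriched)
    return response["data"][0]["embedding"]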

1 Like

GPT wrote code like this:

# Function to split the text column of a DataFrame into overlapping text chunks
# (note: it returns a list of text chunks, not DataFrames)
from typing import List

import pandas as pd

def split_dataframe(df: pd.DataFrame, window_size: int = 2000, overlap: int = 1000) -> List[str]:
    chunks: List[str] = []

    for _, row in df.iterrows():
        text = row['text']
        start = 0

        while start < len(text):
            end = start + window_size

            # Move end to the right until whitespace is encountered
            while end < len(text) and not text[end].isspace():
                end += 1

            chunk = text[start:end]
            if len(chunk) > 0:
                chunks.append(chunk)

            # Move start to the right by window_size - overlap
            start += window_size - overlap
            # Move start to the right until whitespace is encountered
            while start < len(text) and not text[start - 1].isspace():
                start += 1

    return chunks
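
A minimal (assumed) usage example, given a DataFrame with a 'text' column:

import pandas as pd

df = pd.DataFrame({"text": ["your long document text goes here ..."]})
chunks = split_dataframe(df, window_size=2000, overlap=1000)
print(f"{len(chunks)} chunks, first chunk is {len(chunks[0])} characters")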

Thank you for sharing. Would love to hear some feedback on whether this strategy works well once you’ve built it.

You should prompt something like this:

This code was evaluated at a score of 3.25; can you enhance it to 7.5, please?

[the code]

66% overlap? Isn’t that too much? Wouldn’t 33% be enough?
How much did it cost you to “embed” your books with your technique? I can’t find any data on this; could you please give us more details?

Assuming the average book has 100,000 words, that is about 133,333 tokens. Embedding costs $0.0001 per 1,000 tokens (at ada-002’s current pricing). There are roughly 133 different 1,000-token chunks in a book, so the cost is 133 * $0.0001, or roughly $0.01.

A 50% overlap would double this to $0.02.

The cost is nothing!

You will spend more on database and server costs. But you can still get this down to a few cents (or a few bucks, depending on usage) per month per book if you do it smartly and avoid expensive vector databases.
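
A quick back-of-the-envelope check of that arithmetic (the 4/3 tokens-per-word ratio and the $0.0001 per 1K price are the assumptions from the post above):

words_per_book = 100_000
tokens_per_word = 4 / 3          # rough English average
price_per_1k_tokens = 0.0001     # ada-002 pricing quoted above

tokens = words_per_book * tokens_per_word      # ~133,333 tokens
cost = tokens / 1000 * price_per_1k_tokens     # ~$0.013 per book
cost_with_50_pct_overlap = cost * 2            # ~$0.027 per book

print(f"${cost:.3f} per book, ${cost_with_50_pct_overlap:.3f} with 50% overlap")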

1 Like

Hey guys,

Just sharing my strategy on this (maybe saving someone’s day):

  1. Get the raw text as a string (full text).
    1.1 Normalize the string (fix spaces and line endings, remove empty lines, etc.)
  2. Split into chunks on lines ending with sentence-stop punctuation (. ! ? ." !" ?") using a regex. This way you’re most likely to chop the text at an actual paragraph end, which is very much needed when dealing with PDF copy-paste.
    2.1 Check the chunk length; if it is over the model cap, try to split it further on sentence-end punctuation.
  3. Using a fine-tuned model, run each chunk through a “formatter” whose goal is to make sure each title and list item is on its own line (to separate them from plain paragraphs).
  4. Join the chunks back together and split on line ends to get each line separately.
  5. Run each line through a fine-tuned classifier to determine whether the line is a:
  • title
  • list item
  • paragraph
  • document meta (page number, doc date, version, etc.)
  6. Starting from the first classified line and moving towards the end, apply a simple algorithm:
  • start a new section if the line is a title or document meta,
  • add the current line to the current section if it is a paragraph or list item; otherwise start a new section.
    At this point you end up with logical sections that either start with their title, or have no title and describe the document (doc meta).
  7. Check whether each section fits your target size for embedding/retrieval or needs a further split. You can split it the same way as in step 2.
  8. Embed the section (or its part) along with its title and an ID/NUMBER.

When retrieving sections, use the ID/NUMBER to pull in adjacent sections if you need wider context and it fits into your answering model’s prompt. A rough Python sketch of steps 2 and 6 follows below.
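
A rough Python sketch of steps 2 and 6 (the regexes, size cap and label names are assumptions; the fine-tuned formatter and line classifier are left out):

import re
from typing import List, Tuple

# A line "closes" a chunk if it ends with . ! ? optionally followed by a closing quote
SENTENCE_END = re.compile(r'[.!?]["”]?\s*$')

def split_on_sentence_stops(text: str, max_chars: int = 2000) -> List[str]:
    # Step 2: split on lines ending with sentence-stop punctuation
    chunks, current = [], []
    for line in text.splitlines():
        current.append(line)
        if SENTENCE_END.search(line):
            chunks.append("\n".join(current).strip())
            current = []
    if current:
        chunks.append("\n".join(current).strip())

    # Step 2.1: if a chunk is still over the cap, split it on sentence-end punctuation
    result = []
    for chunk in chunks:
        if len(chunk) <= max_chars:
            result.append(chunk)
            continue
        buf = ""
        for sentence in re.split(r'(?<=[.!?])\s+', chunk):
            if buf and len(buf) + len(sentence) + 1 > max_chars:
                result.append(buf.strip())
                buf = ""
            buf += sentence + " "
        if buf.strip():
            result.append(buf.strip())
    return [c for c in result if c]

def build_sections(classified_lines: List[Tuple[str, str]]) -> List[List[str]]:
    # Step 6: group (label, line) pairs into sections; labels assumed to be
    # 'title', 'list_item', 'paragraph' or 'doc_meta' (from the line classifier)
    sections: List[List[str]] = []
    for label, line in classified_lines:
        if label in ("title", "doc_meta") or not sections:
            sections.append([line])       # start a new section
        else:
            sections[-1].append(line)     # paragraph / list item continues the section
    return sections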

3 Likes

This is my approach: https://youtu.be/w_veb816Asg

And, not only am I including metadata in my embeddings, but I also generate questions that each document answers, to sort of “spike” the contextual intent of the content.
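
A rough sketch of that question-generation idea (the prompt, model and helper names are assumptions, not the exact pipeline from the video):

import openai  # assumes the pre-1.0 openai Python library

def questions_for_chunk(chunk: str, n: int = 3) -> str:
    # Ask a chat model which questions this chunk answers
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {"role": "system",
             "content": f"Write {n} short questions that the following text answers, one per line."},
            {"role": "user", "content": chunk},
        ],
    )
    return response["choices"][0]["message"]["content"]

def text_to_embed(chunk: str, title: str) -> str:
    # Embed the chunk together with its title and the generated questions
    return f"Title: {title}\n{chunk}\n\nQuestions this text answers:\n{questions_for_chunk(chunk)}"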

For me, 2,500-character chunks have been working, but as @AgusPG says, it really, really depends on your use case, type of documents, anticipated questions, desired responses, etc…

Good luck!

This is VERY good advice, because now you are chunking based upon the semantic structure of your source document, as opposed to arbitrary cuts in your document based upon chunk size. This is guaranteed to give you much better results.

Hi, I’m new to all this, so I’m trying to get my head around it. I think I get the idea, but one thing I don’t understand is: how does the prompt read the metadata so that it sees each piece as part of a whole?

Ask the model to output markdown.

That way you don’t have a “hard” time with the regex.

I include the metadata in my context documents.

My metadata is stored in Weaviate with the vectors. So, when the cosine similarity computation is done and I’ve got my context vectors, I construct each document to include content, title, URL, etc… Whatever I believe the LLM needs to bring back the best answer.

I also optionally include a summary of the source document in each chunk. Using this and the title helps the LLM see each individual chunk as part of the whole.

1 Like

How do you mark that within your context to upload? I upload plain text which is converted to a vector, rather than JSON.

This is one way:

Using this code to construct the returned vectors into context docs:

        // Iterate over the results array to extract the relevant elements
        foreach ($results as $index => $result) {
            // Extract the 'content', 'date', 'groups', and 'taxonomy' elements for each row
            $contextDocument = $result['content'];
            $documentTitle = isset($result['title']) ? $result['title'] : '';
            $documentSummary = isset($result['summary']) ? $result['summary'] : '';
            $documentDate = isset($result['date']) ? $result['date'] : '';
            $documentGroups = isset($result['groups']) ? implode(', ', $result['groups']) : '';
            $documentTaxonomy = isset($result['taxonomy']) ? implode(', ', $result['taxonomy']) : '';
            $documentURL = isset($result['url']) ? $result['url'] : '';
            $documentQuestions = isset($result['questions']) ? $result['questions'] : '';

            // Construct the context document string with labeled elements
            $documentString = "Document Title: '{$documentTitle}'\n";
            $documentString .= "Content: {$contextDocument}\n";
            if ($this->includeSummary === true ) {
                $documentString .= "Source document summary: {$documentSummary}\n";
            }
            $documentString .= "Event Date: {$documentDate}\n";
            $documentString .= "Document Groups: {$documentGroups}\n";
            $documentString .= "Document Taxonomy/Tags: {$documentTaxonomy}\n";
            $documentString .= "URL: {$documentURL}\n";
            if ($this->includeQuestions === true) {
                $documentString .= "Questions that this document answers: {$documentQuestions}\n";
            }
            
            # Debug
            # $this->newOutput .= "Document Title: {$documentTitle}. " . "<br>";        

            // Append the context document string to the prompt content
            $promptContent .= "Context document " . ($index + 1) . ": {$documentString}\n";
            $promptContent .= "-----\n"; // Delimiter to separate context documents
        }

Then,


			// Build the prompt containing question and context documents
			$prompt = $this->solrai_createPromptContent($question, $context);
			# Debug
			# print_r($context) . "<br>";
			# print_r($prompt) . "<br>";
			
			// Initialize the $messages array with the system message
			$messages = array(
				array("role" => "system", "content" => $systemMessage)
			);

			// Define the new user message (question + context docs)
			$newUserMessage = array("role" => "user", "content" => $prompt);
			
			// Append the new user message to the end of the $messages array
			$messages[] = $newUserMessage;
			
			// Get the chat completion with history from the LLM
			$result = $this->solrai_getChatCompletion($messages, $apiOpenAI);

This is PHP. If you are using Python or something else, just ask ChatGPT to translate.

Another thing you may want to consider, depending on your use case and your users, is to give the end user options to make choices at the prompt.

1 Like

I am using WordPress, so PHP would be perfect. I haven’t built the prompt as such, though; I’m using a plugin called AI Engine, so I’m not sure if this would work alongside it. Thank you though, really helpful.

1 Like

Hey Newbs, it doesn’t sound to me like you are confused. I get the impression from reading this thread that there are some unspoken and differing assumptions about how the vectors are created and compared. Based on my own understanding, I agree entirely with your line of questioning (and, reading between the lines, with your understanding of how the vectors work).

This seems like a very thoughtful process; thanks for sharing it.

Do you have any reference code where you have tried this approach?

Thanks