How to do text crawling and text embeddings on huge websites

Hi everyone, I'm still new to ChatGPT.

I am trying to create a chatbot that can answer questions about and summarize the content of a website.
I'm following the instructions here:

However, as I crawled and embedded more pages of the website, the accuracy dropped.
I also checked the cosine similarity of the context chunks for the full crawl, and the retrieved context is wrong even though the cosine similarity scores are high.

It seems that the more pages I embed, the lower the accuracy gets. Does anyone know how to fix this?

This should not be happening, assuming you did everything correctly on the programming side. Is everything split into roughly equal chunks? If possible, provide a link to your code so others can try to help you.
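
For example, once you have your list of chunks, you can check whether they come out at roughly similar token counts with something like this (a rough sketch, assuming chunks is your list of chunk strings and you use the cl100k_base encoding that text-embedding-ada-002 uses):

import tiktoken

tokenizer = tiktoken.get_encoding("cl100k_base")

# Token count for every chunk
chunk_tokens = [len(tokenizer.encode(c)) for c in chunks]

print("number of chunks:", len(chunk_tokens))
print("min / mean / max tokens:",
      min(chunk_tokens),
      sum(chunk_tokens) // len(chunk_tokens),
      max(chunk_tokens))

# Chunks that are suspiciously short or over the limit are worth inspecting
print("over 500 tokens:", sum(t > 500 for t in chunk_tokens))
print("under 20 tokens:", sum(t < 20 for t in chunk_tokens))

If the sizes are very uneven, retrieval ends up comparing the question against chunks that carry very different amounts of content, which can hurt accuracy.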

We are still following the tutorial on the OpenAI website (OpenAI API) with some modifications. Below is the code:

###CHUNKING

import re
import openai
import pandas as pd
import tiktoken

# Same tokenizer as in the OpenAI tutorial
tokenizer = tiktoken.get_encoding("cl100k_base")

max_tokens = 500

def split_into_many(text, max_tokens=max_tokens):
    # Split the text into sentences on newlines and sentence-ending punctuation
    pattern = r'(?:[\n.!?;])'
    sentences = re.split(pattern, text)
    sentences = [sentence.strip() for sentence in sentences if sentence.strip()]

    # Number of tokens in each sentence
    n_tokens = [len(tokenizer.encode(" " + sentence)) for sentence in sentences]

    chunks = []
    tokens_so_far = 0
    chunk = []

    # Loop through the sentences and token counts joined together in a tuple
    for sentence, token in zip(sentences, n_tokens):

        # If adding this sentence would exceed the limit, close the current chunk
        if tokens_so_far + token > max_tokens:
            chunks.append(" ".join(chunk) + ".")
            chunk = []
            tokens_so_far = 0

        # If a single sentence is longer than the limit, skip it
        if token > max_tokens:
            continue

        chunk.append(sentence)
        tokens_so_far += token + 1

    # Keep the last, partially filled chunk
    if chunk:
        chunks.append(" ".join(chunk) + ".")

    return chunks

# df holds the crawled page texts in a 'text' column (built earlier in the crawl step)
df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))

shortened = []

# Loop through the dataframe and split any text that is too long
for row in df.iterrows():

    # If the text is None, go to the next row
    if row[1]['text'] is None:
        continue

    # If the number of tokens is greater than the max number of tokens, split the text into chunks
    if row[1]['n_tokens'] > max_tokens:
        shortened += split_into_many(row[1]['text'])

    # Otherwise, add the text to the list of shortened texts
    else:
        shortened.append(row[1]['text'])

# Rebuild the dataframe from the chunked texts
df = pd.DataFrame(shortened, columns=['text'])
df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))

###END_OF_CHUNKING
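
A quick way to see what split_into_many produces is to run it on a small artificial example and inspect the resulting chunk sizes (a toy text just for illustration, not part of the real data):

toy_text = "First sentence. Second sentence! Third sentence? " * 50
toy_chunks = split_into_many(toy_text, max_tokens=50)

print(len(toy_chunks))
for c in toy_chunks[:3]:
    print(len(tokenizer.encode(c)), repr(c[:80]))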

###EMBEDDING

df['embeddings'] = df.text.apply(lambda x: openai.Embedding.create(input=x, engine='text-embedding-ada-002')['data'][0]['embedding'])

df.to_csv('processed/embeddings.csv')
df.head()

###END_OF_EMBEDDING
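
Note that to_csv stores each embedding as a string, so if the embeddings are reloaded from the CSV in a later session they have to be parsed back into numeric arrays before any distance calculation (the tutorial does this with eval; a sketch using literal_eval instead):

import ast

import numpy as np
import pandas as pd

df = pd.read_csv('processed/embeddings.csv', index_col=0)
df['embeddings'] = df['embeddings'].apply(ast.literal_eval).apply(np.array)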

###CONTEXT_GENERATION
from openai.embeddings_utils import distances_from_embeddings

def create_context(question, df, max_len=1800, size="ada"):
    """
    Create a context for a question by finding the most similar context from the dataframe
    """

    # Get the embeddings for the question
    q_embeddings = openai.Embedding.create(input=question, engine='text-embedding-ada-002')['data'][0]['embedding']

    # Get the cosine distances between the question and each chunk embedding
    df['distances'] = distances_from_embeddings(q_embeddings, df['embeddings'].values, distance_metric='cosine')

    returns = []
    cur_len = 0

for i, row in df.sort_values('distances', ascending=True).iterrows():
    
    # Add the length of the text to the current length
    cur_len += row['n_tokens'] + 4
    
    # If the context is too long, break
    if cur_len > max_len:
        break
    
    # Else add it to the text that is being returned
    returns.append(row["text"])

# Return the context
return "\n\n###\n\n".join(returns)

###END_OF_CONTEXT_GENERATION
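
The context string from create_context then goes into the completion prompt, roughly like the tutorial's answer_question function (a sketch; the model and prompt wording may differ from what we actually run):

def answer_question(question, df, max_len=1800):
    # Build the context from the most similar chunks
    context = create_context(question, df, max_len=max_len)

    # Ask the model to answer using only that context
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=f"Answer the question based on the context below, and if the question can't be answered based on the context, say \"I don't know\"\n\nContext: {context}\n\n---\n\nQuestion: {question}\nAnswer:",
        temperature=0,
        max_tokens=300,
    )
    return response["choices"][0]["text"].strip()

When the answers are wrong, printing create_context(question, df) on its own for a few test questions makes it easy to see whether the top-ranked chunks actually relate to the question, which separates a retrieval problem from a prompting problem.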