Document Library Pre-Processing

Hello all,

I’ve been trying to complete the question-answering tutorial Question_answering_using_embeddings.ipynb using my own dataset. Even though I downloaded the example CSV file, copied my own data into it, and re-saved it, I cannot get the code to run on my dataset. The code runs perfectly fine with the sample dataset, but when I replace the sample data with my own (always making sure it is saved as a CSV file), it errors on anything past line 48. I have tried changing the column data types using Python, I’ve tried removing any special characters, and I’ve tokenized by hand using the OpenAI website. I’ve tried, I kid you not, about three days’ worth of fixes with no luck. ChatGPT is now repeating recommendations without any success, unfortunately.

I continually receive this error:

`ValueError                                Traceback (most recent call last)
Cell In [74], line 1
----> 1 prompt = construct_prompt(
      2     "What is a WOC Nurse?",
      3     document_embeddings,
      4     df
      5 )
      7 print("===\n", prompt)

Cell In [73], line 16, in construct_prompt(question, context_embeddings, df)
     13 document_section = df.loc[section_index]
     15 chosen_sections_len += document_section.tokens + separator_len
---> 16 if chosen_sections_len > MAX_SECTION_LEN:
     17     break
     19 chosen_sections.append(SEPARATOR + document_section.content.replace("\n", " "))

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/generic.py:1442, in NDFrame.__nonzero__(self)
   1440 @final
   1441 def __nonzero__(self):
-> 1442     raise ValueError(
   1443         f"The truth value of a {type(self).__name__} is ambiguous. "
   1444         "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   1445     )

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
`

You can see the dataset here: https://docs.google.com/spreadsheets/d/e/2PACX-1vSs9Ok5FUrhAOu_BnpLwV63bwpLylRtUWBDE7onAX1zrZW0Sz4gBEtBN-KtsBiC1DhKyhhZjNXfNf0i/pub?output=csv

If you only use the first chapter, there is no issue. However, once anything past line 48 is read (it took a lot of trial and error to determine this), it no longer works, and I get either the error noted above or an error stating that the system cannot read the JSON content (I’ve also tried converting the CSV file to JSON, but still get the error above).

Unfortunately, I am still quite new to Python, so any recommendations or assistance with this issue would be much appreciated.

Thank you so much for your assistance!

Hi, if I’m not mistaken, it’s because there are rows with the same title and heading.

I have the same bug. The strange thing is that when I split the CSV file into several small ones and run the same script on them, it’s able to answer questions without issue. But when I combine all of the rows back into one file, it fails with the “ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().” error again.

I also made sure nothing in title matches anything in heading.

Any thoughts on how to fix?

I had the same error. The fix is to make sure no two rows have the same value in the column(s) you use to identify and look up the sections for embedding.

For more detail, I’ll explain in the context of the OpenAI Cookbook example “Question answering using embeddings-based search”.

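If the same value appears twice in the ‘heading’ column for one ‘title’, the lookup becomes ambiguous. The code uses df.loc[section_index] to pull out the row it needs ‘content’ from, and with a duplicated title-heading pair that lookup returns more than one row, so pandas doesn’t know which one to choose. The ‘tokens’ value is then a Series instead of a single number, which is why the comparison against MAX_SECTION_LEN raises the error.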
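Here is a minimal sketch that should reproduce the error on a tiny made-up table. It assumes, as in the Cookbook notebook, that the DataFrame is indexed by the (title, heading) pair; the rows and the numbers 3 and 500 are placeholders standing in for separator_len and MAX_SECTION_LEN, not values from the real dataset.

```python
import pandas as pd

# Made-up rows; the last two share the same (title, heading) pair.
df = pd.DataFrame(
    {
        "title": ["Chapter 1", "Chapter 2", "Chapter 2"],
        "heading": ["Intro", "Scope", "Scope"],
        "content": ["first", "second", "third"],
        "tokens": [10, 20, 30],
    }
).set_index(["title", "heading"])

# The duplicated key selects two rows, so the result is a DataFrame ...
document_section = df.loc[("Chapter 2", "Scope")]
# ... and its "tokens" column is a Series rather than one number.
chosen_sections_len = document_section["tokens"] + 3  # 3 is a placeholder for separator_len

error_message = None
try:
    # Same kind of comparison as line 16 in the traceback; 500 is a placeholder limit.
    if chosen_sections_len > 500:
        pass
except ValueError as err:
    error_message = str(err)

print(type(document_section).__name__)
print(error_message)
```

The comparison produces one True/False value per matching row, and `if` cannot reduce that to a single answer, hence the ValueError.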

So just check for duplicates in the ‘heading’ column within each ‘title’ (that is, repeated title-heading pairs) before running the program and you should be good.
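One way to do that check in pandas, as a rough sketch: the table below is a made-up stand-in for the CSV (in practice it would come from pd.read_csv), and it assumes the columns are named title, heading, content and tokens as in the Cookbook example. Keeping the first row of each pair is only one possible cleanup; renaming or merging the duplicated sections may suit your data better.

```python
import pandas as pd

# Made-up stand-in for the CSV contents.
df = pd.DataFrame(
    {
        "title": ["Chapter 1", "Chapter 2", "Chapter 2"],
        "heading": ["Intro", "Scope", "Scope"],
        "content": ["first", "second", "third"],
        "tokens": [10, 20, 30],
    }
)

key_columns = ["title", "heading"]

# keep=False flags every row that belongs to a repeated pair, not just the later copies.
duplicates = df[df.duplicated(subset=key_columns, keep=False)]
print(duplicates)

# One possible cleanup: keep only the first row of each (title, heading) pair,
# then build the index the lookup relies on.
deduped = df.drop_duplicates(subset=key_columns, keep="first").set_index(key_columns)

# With a unique index, the lookup returns a single row (a Series),
# so "tokens" is one number again.
document_section = deduped.loc[("Chapter 2", "Scope")]
print(deduped.index.is_unique)
print(document_section["tokens"])
```

If `duplicates` comes back empty, the title-heading pairs are already unique and the cause is probably something else.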


Ah, I thought I had deduped, but there were two rows with the same title-heading pair.

That fixed it. Thank you!
