Document Library Pre-Processing

heerschap.corey · January 18, 2023, 7:19pm

Hello all,

I’ve been trying to complete the question answering tutorial Question_answering_using_embeddings.ipynb using my own dataset. Even though I downloaded the example csv file, copied my own data into it, and re-saved it, I cannot get the code to run on my dataset. The code runs perfectly fine with the sample dataset, but when I replace the sample data with my own (always making sure it is saved as a csv file), it errors on anything past line 48. I have tried changing the data types of the columns using python, I’ve tried removing any special characters, I’ve tokenized by hand using the OpenAI website, I’ve tried, I kid you not, about three days’ worth of fixes with no luck. ChatGPT is now repeating recommendations without any success, unfortunately.

I continually receive this error:

`ValueError                                Traceback (most recent call last)
Cell In [74], line 1
----> 1 prompt = construct_prompt(
      2     "What is a WOC Nurse?",
      3     document_embeddings,
      4     df
      5 )
      7 print("===\n", prompt)

Cell In [73], line 16, in construct_prompt(question, context_embeddings, df)
     13 document_section = df.loc[section_index]
     15 chosen_sections_len += document_section.tokens + separator_len
---> 16 if chosen_sections_len > MAX_SECTION_LEN:
     17     break
     19 chosen_sections.append(SEPARATOR + document_section.content.replace("\n", " "))

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/generic.py:1442, in NDFrame.__nonzero__(self)
   1440 @final
   1441 def __nonzero__(self):
-> 1442     raise ValueError(
   1443         f"The truth value of a {type(self).__name__} is ambiguous. "
   1444         "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   1445     )

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
`

You can see the dataset here: https://docs.google.com/spreadsheets/d/e/2PACX-1vSs9Ok5FUrhAOu_BnpLwV63bwpLylRtUWBDE7onAX1zrZW0Sz4gBEtBN-KtsBiC1DhKyhhZjNXfNf0i/pub?output=csv

If you only use the first chapter, there is no issue. However, once anything past line 48 is read (it took a lot of trial and error to determine this), it no longer works: I get either the error noted above or an error stating that the system cannot read the JSON content. (I’ve also tried converting the csv file to json, but I still get the error above.)

Unfortunately, I am still quite new to python, so any recommendations or assistance with this issue would be much appreciated.

Thank you so much for your assistance!

ronnysetiawandev · January 30, 2023, 7:18am

Hi… if I’m not mistaken, it’s because there are rows with the same title & heading…

cottoncandy · February 9, 2023, 5:00am

I have the same bug. Strange thing is, when I split the csv file into several small ones and run the same script on each, it’s able to answer questions without issue. But when I combine all of the rows back into the same file, it fails with the “ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().” error again.

I also made sure nothing in title matches anything in heading.

Any thoughts on how to fix?

evanmcfarland · February 9, 2023, 5:04pm

I had the same error. The fix is to make sure no two rows have the same value in the column you use to identify and pull rows for embedding.

For more detail, I’ll explain in the context of the OpenAI Cookbook example: Question answering using embeddings-based search | OpenAI Cookbook

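If the same value appears twice in the ‘heading’ column for one ‘title’, the lookup that pulls ‘content’ (df.loc[section_index] in the traceback above, so .loc rather than iloc()) matches more than one row and doesn’t know which one it’s supposed to choose. It then hands back several rows at once, so document_section.tokens is a Series instead of a single number, and comparing that against MAX_SECTION_LEN is what raises the “truth value of a Series is ambiguous” error.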
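
Here is a minimal sketch of that failure. It assumes the DataFrame is indexed by (title, heading), which is my understanding of how the cookbook notebook sets it up; the rows, token counts, and MAX_SECTION_LEN value are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration; the ("Ch2", "Intro") pair appears twice.
df = pd.DataFrame(
    {
        "title": ["Ch1", "Ch2", "Ch2"],
        "heading": ["Intro", "Intro", "Intro"],
        "content": ["first", "second", "third"],
        "tokens": [10, 20, 30],
    }
).set_index(["title", "heading"])  # assumed to match the notebook's index

MAX_SECTION_LEN = 500  # illustrative value

# Two rows share this key, so .loc returns a DataFrame, not a single row.
document_section = df.loc[("Ch2", "Intro")]
print(type(document_section).__name__)         # DataFrame
print(type(document_section.tokens).__name__)  # Series

error_message = None
try:
    # Same kind of comparison as in construct_prompt: a Series of booleans
    # has no single truth value, so the `if` raises ValueError.
    if document_section.tokens > MAX_SECTION_LEN:
        pass
except ValueError as err:
    error_message = str(err)

print(error_message)
```

This would also explain why the small split files work: each one on its own probably doesn’t contain both copies of the pair.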

So just check for duplicate ‘title’/‘heading’ pairs before running the program and you should be good.
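
A rough way to do that check with pandas, as a sketch rather than the one right way: the rows below are made up, and in practice df would be whatever you load from your csv before the index is set. Keeping the first row of each pair is just one option; renaming a heading or merging the contents may suit your data better.

```python
import pandas as pd

# Made-up rows; in practice this would be the frame loaded from your csv.
df = pd.DataFrame(
    {
        "title": ["Ch1", "Ch2", "Ch2"],
        "heading": ["Intro", "Intro", "Intro"],
        "content": ["first", "second", "third"],
        "tokens": [10, 20, 30],
    }
)

key_columns = ["title", "heading"]

# keep=False marks every member of a duplicated group, not just the later ones,
# so you can inspect all of the clashing rows.
duplicates = df[df.duplicated(subset=key_columns, keep=False)]
print(duplicates)

# One possible fix: keep only the first row of each title/heading pair.
deduped = df.drop_duplicates(subset=key_columns, keep="first")
indexed = deduped.set_index(key_columns)

# With a unique index, .loc on a full key returns a single row again.
print(indexed.index.is_unique)
print(indexed.loc[("Ch2", "Intro")].tokens)
```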

cottoncandy · February 9, 2023, 10:34pm

Ah, I thought I had deduped, but there were 2 rows that had the same title-heading pair.

That fixed it. Thank you!
