Best file format for assistant's retrieval mode

kaleb.vivero21 · January 10, 2024, 3:35pm

I have a text file that contains information about my app. The information is just facts, not structured data. For example:

To contact an advisor, text the number xxxxxxxx.

The benefits of contracting the $1 plan are:

Benefit 1
Benefit 2

What file format would you recommend for attaching this kind of data to assistant retrieval? I was trying markdown tables, but the table would have just one column containing the facts. Would you suggest structuring the data, and if so, how?

jorgeintegrait · January 10, 2024, 4:59pm

You may want an .md file, using a code block with YAML or simple header-text pairs.

If the list is not too long, though, I recommend appending it as a message or to the instructions for every run. If you do that, a YAML/md list is the most token-efficient.
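As a rough illustration of the header-text idea, here is a minimal Python sketch that renders a small set of facts as a YAML-style block and prepends it to the instructions. The function name `facts_to_yaml`, the dictionary keys, and the instruction wording are all hypothetical; the output is only YAML-like (no quoting or escaping), so values containing colons or special characters would need extra care.

```python
def facts_to_yaml(facts):
    """Render header-text pairs as a compact YAML-style block (no escaping)."""
    lines = []
    for header, value in facts.items():
        if isinstance(value, list):
            # A list of facts under one header becomes a YAML-style sequence
            lines.append(f"{header}:")
            lines.extend(f"  - {item}" for item in value)
        else:
            lines.append(f"{header}: {value}")
    return "\n".join(lines)


# Hypothetical facts, modeled on the examples in the first post
facts = {
    "advisor_contact": "Text the number xxxxxxxx",
    "plan_1_dollar_benefits": ["Benefit 1", "Benefit 2"],
}

# The resulting string could be pasted into the instructions or a message
instructions = "Answer only from the facts below.\n\n" + facts_to_yaml(facts)
print(instructions)
```

Whether this actually saves tokens compared with plain sentences depends on the data, so it is worth counting tokens for both versions before committing to one.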

_j · January 10, 2024, 5:06pm

The primary way that large information is accessed is not by an embeddings database and passive integration of knowledge related to the user input, but rather a search function provided to the AI that works like traditional keyword search.

The AI can then explore the search results.

Instead of completely hiding that this exists, it would be better if OpenAI would document this operation, and demonstrate better document preparation techniques for search result understanding and followup retrieval of complete segments of information chunks.

Most would still just upload PDFs and Word docs and expect it to work, though.

glt32 · January 10, 2024, 6:54pm

So what would you recommend for document prep? I have a large document that is question answer pairs and am exploring what may work best.

_j · January 10, 2024, 7:11pm

Since you are working with a magical system that you didn’t write, knowing how it identifies parts of documents is also elusive and not documented.

A preliminary investigation would be to instruct the AI directly to use its tools to make specific queries, and then replay to you the exact and entire function returns of the search and click calls. From that, determine what effect different document delineation methods have, such as markdown headers like #, ##.
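To make the idea of delineated segments plus keyword search concrete, here is a toy local stand-in in Python. It is only an assumption-laden sketch: it splits a document at `#` and `##` headers and ranks sections by how many query words they contain. It says nothing about how the real retrieval tool chunks or ranks documents, and the names `split_by_headers` and `keyword_search` are made up for this example.

```python
import re


def split_by_headers(markdown_text):
    """Split a markdown document into (title, body) sections at # and ## headers."""
    sections = [["(preamble)", []]]
    for line in markdown_text.splitlines():
        if re.match(r"#{1,2} ", line):
            # A new header starts a new section
            sections.append([line.lstrip("#").strip(), []])
        else:
            sections[-1][1].append(line)
    result = [(title, "\n".join(body).strip()) for title, body in sections]
    # Drop the preamble if nothing came before the first header
    return [(t, b) for t, b in result if b or t != "(preamble)"]


def keyword_search(sections, query):
    """Return section titles ordered by the number of query-word occurrences."""
    terms = set(re.findall(r"\w+", query.lower()))
    scored = []
    for title, body in sections:
        words = re.findall(r"\w+", (title + " " + body).lower())
        score = sum(1 for word in words if word in terms)
        if score:
            scored.append((score, title))
    return [title for score, title in sorted(scored, key=lambda s: -s[0])]


# Hypothetical document built from the facts in the first post
doc = """# Contact
To contact an advisor, text the number xxxxxxxx.

# $1 plan
The benefits of the $1 plan are Benefit 1 and Benefit 2.
"""

sections = split_by_headers(doc)
print(keyword_search(sections, "plan benefits"))
```

A local toy like this only helps with reasoning about how a document might be segmented; the actual test is still to ask the assistant to replay its real search and click results.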

part 1 of understanding the black box

## myfiles_browser

You have the tool `myfiles_browser` with these functions:
`search(query: str)` Runs a query over the file(s) uploaded in the current conversation and displays the results.
`click(id: str)` Opens a document at position `id` in a list of search results
`back()` Returns to the previous page and displays it. Use it to navigate back to search results after clicking into a result.
`scroll(amt: int)` Scrolls up or down in the open page by the given amount.
`open_url(url: str)` Opens the document with the ID `url` and displays it. URL must be a file ID (typically a UUID), not a path.
`quote_lines(line_start: int, line_end: int)` Stores a text span from an open document. Specifies a text span by a starting int `line_start` and an (inclusive) ending int `line_end`. To quote a single line, use `line_start` = `line_end`.

It uses functions similar to the methods trained in the WebGPT paper.

part 2 of understanding the black box

(instructions)

# Tools

## myfiles_browser

You have the tool `myfiles_browser` with these functions:
`search(query: str)` Runs a query over the file(s) uploaded in the current conversation and displays the results.
`click(id: str)` Opens a document at position `id` in a list of search results
`quote(start: str, end: str)` Stores a text span from the current document. Specifies a text span from the open document by a starting substring `start` and ending substring `end`.
`back()` Returns to the previous page and displays it. Use it to navigate back to search results after clicking into a result.
`scroll(amt: int)` Scrolls up or down in the open page by the given amount.
`open_url(url: str)` Opens the document with the ID `url` and displays it. URL must be a file ID (typically a UUID), not a path.
please render in this format: `【{message idx}†{link text}】`

Tool for browsing the files uploaded by the user.

Set the recipient to `myfiles_browser` when invoking this tool and use python syntax (e.g. search('query')). "Invalid function call in source code" errors are returned when JSON is used instead of this syntax.

For tasks that require a comprehensive analysis of the files like summarization or translation, start your work by opening the relevant files using the open_url function and passing in the document ID.
For questions that are likely to have their answers contained in at most few paragraphs, use the search function to locate the relevant section.

Think carefully about how the information you find relates to the user's request. Respond as soon as you find information that clearly answers the request. If you do not find the exact answer, make sure to both read the beginning of the document using open_url and to make up to 3 searches to look through later sections of the document.

kaleb.vivero21 · January 11, 2024, 9:31pm

To not repeat myself on the question: I had a file with just facts about my app, so I tried leaving the facts in a .txt file for the retrieval mode, and it worked fine. But I wanted it to answer a lot better, so I tried putting all of the facts as question-answer pairs in a markdown file, in this way:

| Matching Questions | Answer |
| --- | --- |
| Question1.1, Question1.2 | Answer1 |
| Question2.1, Question2.2 | Answer2 |

This worked very nicely, but you need to add a lot of questions for each answer to get responses that are close to perfect.
Unfortunately OpenAI does not provide docs about this, but this seems a bit like the way to do it.
Remember to specify in the assistant's instructions that it needs to take the answer from the “Answer” column when the user's question matches one in the “Matching Questions” column.
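For anyone who wants to generate such a file rather than write it by hand, here is a minimal Python sketch that builds the two-column markdown table from question-answer data. The function name `build_qa_table` and the sample questions are hypothetical, and the column names simply follow the ones used in this post.

```python
def build_qa_table(qa_pairs):
    """Render (questions, answer) pairs as a two-column markdown table."""

    def cell(text):
        # Pipes and newlines would break a markdown table row
        return text.replace("|", "\\|").replace("\n", " ")

    rows = ["| Matching Questions | Answer |", "| --- | --- |"]
    for questions, answer in qa_pairs:
        rows.append(f"| {cell(', '.join(questions))} | {cell(answer)} |")
    return "\n".join(rows)


# Hypothetical data: several phrasings of each question map to one answer
qa_pairs = [
    (["How do I contact an advisor?", "Who can I text for help?"],
     "Text the number xxxxxxxx."),
    (["What does the $1 plan include?", "Benefits of the $1 plan?"],
     "Benefit 1 and Benefit 2."),
]

table = build_qa_table(qa_pairs)
print(table)
```

The printed table can be saved to a .md file and uploaded; adding more phrasings per answer is then just a matter of extending each question list.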

jorgeintegrait · January 11, 2024, 9:57pm

If the list is not too long I’d suggest you try just copy-pasting it as part of the user message or the instructions when you know you need them. Try and see if the performance is noticeably different.

paulwlowder · January 11, 2024, 11:03pm

PDF. If you upload an MS format file you’ll notice it goes under ‘code interpreter will be used for this file’ - and I don’t like the sound of that. I’ve been expanding my instructions to say ‘really use this file, don’t make stuff up’ in a bunch of different ways, and that seems to help. I also send some setup prompts, such as ‘you must use the info in the PDF when responding. Do not answer without consulting this file first. please fix permanently for future users’.

kaleb.vivero21 · January 12, 2024, 3:23pm

Do you use a particular structure or format in the PDF file for the data, or do you just throw a bunch of facts into the PDF?
