API Support for Retrieving Most Similar Embeddings from Preprocessed PDF Files

Here is the typical three-step process:

Preprocessing (Step 1): I’ve uploaded PDF files to an OpenAI assistant, which has automatically chunked the PDFs into extracts and computed embedding vectors for each chunk.

Query Processing (Step 2): When I issue a prompt to OpenAI, it computes the embedding of my query, searches through the previously computed embeddings from the PDF files, and returns the 20 most similar embeddings along with their corresponding extracts.

Response Generation (Step 3): Based on the results from Step 2, OpenAI constructs an answer using the relevant extracts and my query.

My focus is specifically on Step 2. Does OpenAI provide an API function that, given my query, returns the 20 most similar embeddings and their corresponding extracts from the PDF files preprocessed in Step 1?

Note: I do not wish to extract the text from the PDFs and divide them into chunks manually, as this preprocessing step has already been completed by OpenAI.

I have searched online but could not find a definitive answer and would appreciate any guidance or references to relevant documentation or examples.

Thank you,
David

There is no direct OpenAI API endpoint that achieves that.


Independent of the OpenAI Assistants API, there are a number of resources in OpenAI’s cookbook collection that you can look up to understand how to achieve this programmatically. Here are a couple of them:

Apologies if I am misunderstanding the angle of your question, given that you are asking in the context of the Assistants API.
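
For what it's worth, the retrieval in Step 2 can be approximated outside the Assistants API along the lines below. This is only a minimal sketch and not how OpenAI implements it internally. It assumes you already hold an embedding vector for each chunk and for the query (for example from an embeddings endpoint), which unfortunately means doing the chunking yourself. The function names `cosine_similarity` and `top_k_chunks` are hypothetical, and the toy vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, treat it as dissimilar
    return dot / (norm_a * norm_b)


def top_k_chunks(query_embedding, chunk_embeddings, chunk_texts, k=20):
    """Return up to k (score, text) pairs, most similar chunk first."""
    scored = [
        (cosine_similarity(query_embedding, emb), text)
        for emb, text in zip(chunk_embeddings, chunk_texts)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy example: 2-dimensional stand-ins for real embedding vectors.
chunk_texts = ["chunk A", "chunk B", "chunk C"]
chunk_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_embedding = [1.0, 0.0]

for score, text in top_k_chunks(query_embedding, chunk_embeddings, chunk_texts, k=2):
    print(f"{score:.3f}  {text}")
```

With real data you would pass `k=20` and then feed the returned extracts, together with the query, into the prompt for Step 3.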