OpenAI Embeddings - use case

Hi There,

I am working on a use case where I use ChatGPT (gpt-3.5-turbo) together with embeddings to answer questions from supplied PDF data. I am facing two issues:

  1. When there is more than one match in the embeddings, the response uses the first item in the list. Instead, I am looking for a solution where the user is prompted with the options and then given an accurate response.

  2. Assume there is an admission brochure for a school that has all the information about its courses on different pages, including duration, fee, requirements, etc. A simple query like “What are the courses in your school that are below $4000 tuition fee?” does not return the correct data.

Am I doing something wrong while implementing embeddings?



Are you using someone else’s embeddings library/service?

I am using OpenAI embeddings with my own vector database layer.

In my app, I have embeddings for a catalogue of furniture. If I ask the AI to give me all furniture made of oak, it gives me all the items made of oak.

If I understand your problem, your PDF contains all the courses, several of them apply to your query, but you only get the first one as a result? Does the embedding search already give you one result, or all the results? Or does the chat completion give you only one result even though you send it all the items?

Here is some sample data

Page 10

Undergraduate course requirements for course 1

Duration : 20 hrs
Tuition : $200 per hour

Page 20

Graduate course requirement for course 1

Duration : 60 hrs
Tuition : $300 per hour

Simple queries like the ones below do not get consistent responses:

  1. What are the requirements for course 1?
    Expectation: it should ask “Are you looking for graduate or undergraduate?” and then give the respective answer

  2. What is the total tuition fee?
    Expectation: simple multiplication of fee x duration

But it doesn’t give results

Nice 💪
You could decide an absolute score, above which all matches get evaluated, or say anything within [x amount] of the top match also gets evaluated.
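
A minimal sketch of both cut-off rules, assuming each match is an object with a similarity `score`. The field names and the `minScore` / `maxGap` options are hypothetical, and the threshold values would need tuning on real data:

```javascript
// Keep every match that passes either rule, instead of only the top hit.
// `matches` is assumed to look like [{ id: "page-10", score: 0.83 }, ...].
// If neither option is set, nothing passes and the result is empty.
function selectMatches(matches, { minScore = null, maxGap = null } = {}) {
  if (matches.length === 0) return [];
  const sorted = [...matches].sort((a, b) => b.score - a.score);
  const top = sorted[0].score;
  return sorted.filter((m) => {
    // Rule 1: absolute score threshold.
    const passesAbsolute = minScore !== null && m.score >= minScore;
    // Rule 2: within a fixed distance of the top match.
    const passesRelative = maxGap !== null && top - m.score <= maxGap;
    return passesAbsolute || passesRelative;
  });
}

const matches = [
  { id: "page-10", score: 0.83 },
  { id: "page-20", score: 0.81 },
  { id: "page-35", score: 0.62 },
];

const selected = selectMatches(matches, { maxGap: 0.05 });
console.log(selected.map((m) => m.id)); // [ 'page-10', 'page-20' ]
```

If more than one match survives, that is probably the point to ask the user which one they meant.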


I tested your sample data and populated it with sample course items. I used Computer Science as the course for testing (BSCS and MSCS, respectively).

If I ask what the requirements for Computer Science are, it gives me both.
If I ask specifically about BS Computer Science or MS Computer Science, it gives me the correct one.

As for your expectation, I added an instruction to the system prompt: if the course in the query is vague and it is not clear whether it is an undergraduate or graduate program, the model should ask the user to clarify. I tested the first query again (“What are the requirements for Computer Science?”) and it replies as you expected, asking me which one I want.

For the second query, it tells me that there is no info. You can probably add this info to the PDF or do more processing (e.g. use function calling to check whether the user wants the total tuition fee, run an embeddings search for all course requirements, compute the total manually, send the result to chat completion, etc.).
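
The "compute manually" step could look roughly like this. It is only a sketch: it assumes the retrieved chunk uses the same `Duration : N hrs` and `Tuition : $N per hour` labels as the sample data, and `totalTuition` is a made-up helper name:

```javascript
// Parse "Duration : 20 hrs" and "Tuition : $200 per hour" out of a text chunk
// and multiply them. The field labels follow the sample data in this thread.
function totalTuition(chunk) {
  const duration = chunk.match(/Duration\s*:\s*(\d+(?:\.\d+)?)\s*hrs/i);
  const rate = chunk.match(/Tuition\s*:\s*\$(\d+(?:\.\d+)?)\s*per hour/i);
  if (!duration || !rate) return null; // not enough information in this chunk
  return Number(duration[1]) * Number(rate[1]);
}

const undergraduate = "Duration : 20 hrs\nTuition : $200 per hour";
const graduate = "Duration : 60 hrs\nTuition : $300 per hour";

console.log(totalTuition(undergraduate)); // 4000
console.log(totalTuition(graduate)); // 18000
```

The computed total can then be added to the context sent to the chat completion, so the model only has to report the number rather than do the arithmetic.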

Ok great !!

Could you please share the prompt you created? I will tweak mine accordingly.

Here’s the system prompt I used:

const system_prompt = `You are a helpful assistant. Try to answer the question from the user using the content of the file extracts below, and if you cannot answer, or find a relevant file, just output "I couldn't find the answer to that question.".\n\n` +
            `If the answer is not contained in the files or if there are no file extracts, respond with "I couldn't find the answer to that question." If the question is not actually a question, respond with "That's not a valid question."\n\n` +
            `Do not mention the name of any files or the existence of the files in your answer.\n\n` +
            `If the course mentioned is not clear whether it is for Undergraduate or Graduate program, ask the user which one they want.\n\n`;

Here is the sample data I used based on the one you gave:

Undergraduate course requirements for Computer Science (BSCS)

Name: Foundation Course in Mathematics
Units: 5
Duration : 20 hrs
Tuition : $200 per hour

Name: Foundation Course in Statistics
Units: 3
Duration : 10 hrs
Tuition : $150 per hour

Code: CCPROG1 
Name: Logic Formulation and Introductory Programming
Units: 3
Duration : 12 hrs
Tuition : $250 per hour

Code: CCPROG2 
Name: Programming with Structured Data Types
Units: 3
Duration : 15 hrs
Tuition : $230 per hour

Graduate course requirements for Computer Science (MSCS)

Code: CEDISP1 
Name: Digital Signal Processing 1
Units: 4
Duration : 60 hrs
Tuition : $300 per hour

Name: Microprocessor Interfacing
Units: 4
Duration : 50 hrs
Tuition : $320 per hour

Name: Multiprocessing and Parallel Computing
Units: 3
Duration : 55 hrs
Tuition : $250 per hour


Hey @Klassdev ,

Just to get some more context: do you expect the information to be correct in the first answer, or is it possible for users/you to ask follow-up questions?

You can also chain multiple prompts to get to the answers you want. Feel free to share the PDF you have so we can experiment some more. You can also use (webapp) to maybe chain some responses together for quick prototyping.

Let me know if you want to share some more context.

That is correct. I also wasn’t giving enough context and simply wasn’t understood :(
Besides this, I must note that I still prefer to use because here I can hire writers for essays. This seems to me a better way, because at the moment AI still seems quite poorly developed and cannot give the full answer I would have needed.

Hi Green,

Thanks for your response.

I am OK with the AI bot asking follow-up questions if multiple matches are found for the same query.

I do not expect the bot to answer on the first go. Accuracy is the goal here, even if it takes follow-up questions from the bot.

Hope this helps.


Cool, you got an example PDF? Would love to check if we are getting better results when following up on the results.

Here is the sample PDF @Green

The following questions are not being answered properly:

  1. What are the age requirements for admission?
    Expectation: it should ask “Which curriculum are you looking for?”

  2. What is the fee for Grade XI?
    Expectation: it should ask “Please select from the following options”

Grade XI (Commerce and Humanities Stream)
Grade XI (Science Stream without Comp. Science)
Grade XI (Science Stream with Comp. Science)

Hope this helps
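
For what it is worth, the expected "please select" behaviour could also be produced in code before the model is called. A rough sketch, assuming the section headings have already been extracted from the PDF (the `disambiguate` helper and its return shape are hypothetical):

```javascript
// If several section headings match the query term, return a clarifying
// question listing them; if exactly one matches, return that heading.
// Note: this is a plain substring match, so "Grade XI" would also match a
// heading such as "Grade XII" if one existed.
function disambiguate(term, headings) {
  const needle = term.toLowerCase();
  const options = headings.filter((h) => h.toLowerCase().includes(needle));
  if (options.length === 0) return { type: "none" };
  if (options.length === 1) return { type: "match", heading: options[0] };
  const list = options.map((o, i) => `${i + 1}. ${o}`).join("\n");
  return {
    type: "clarify",
    message: `Please select from the following options:\n${list}`,
  };
}

const headings = [
  "Grade X",
  "Grade XI (Commerce and Humanities Stream)",
  "Grade XI (Science Stream without Comp. Science)",
  "Grade XI (Science Stream with Comp. Science)",
];

const outcome = disambiguate("Grade XI", headings);
console.log(outcome.message);
```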


Hi, may I ask how you are parsing the pdf in the first place? Is the whole pdf converted into a single vector embedding, or do you divide the pdf by page or by paragraph first?

It is parsed page by page, and I store one vector per page in a custom vector DB.
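
Roughly, the lookup over the page-wise vectors works like the sketch below. This is a simplified illustration with toy 3-dimensional vectors, not the actual implementation; `searchPages` and the store layout are placeholder names:

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank stored pages against a query vector, best match first.
function searchPages(store, queryVector, k = 2) {
  return store
    .map((p) => ({ page: p.page, score: cosineSimilarity(p.vector, queryVector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// Toy vectors; real embeddings have far more dimensions.
const store = [
  { page: 10, vector: [1, 0, 0] },
  { page: 20, vector: [0.9, 0.1, 0] },
  { page: 30, vector: [0, 0, 1] },
];

const hits = searchPages(store, [1, 0, 0]);
console.log(hits.map((h) => h.page)); // [ 10, 20 ]
```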

Hey Klass,

I’m getting pretty good results with this prompt. Could you check if you agree?

You are an assistant that helps a student pick an education.

There are multiple courses, so you may want to ask what kind of study or grade the user is looking for, but ((not at the start of the conversation)). You will only provide choices that are available within the given information below the '---' line. If you are unsure or the question is too broad, you will ask for clarification.

1. Follow this prompt closely.
2. The student has a very low cognitive load; don't overload the user with information.

When you are ready, (((only))) say: "Hello, what can I help you with today?".


Thanks @Green : I will give it a try

Embeddings are notoriously bad at exact values.
Trying to use an embedding to “query for classes costing less than $4000” is exactly what they’re not good at. A regular database is much better at that.
Even keyword search is hit-or-miss with embeddings.

One solution to this problem is to combine keyword-based retrieval and embedding-based retrieval.
You could do both, generate a number of information chunks based on the question, and then provide all the chunks to the model, and ask the model to summarize the results in context of the user’s question.
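
A small sketch of that merge step, under stated assumptions: the keyword scorer here is a deliberately naive term-overlap count, `embeddingHits` stands in for whatever your vector search returns (best first), and all function and field names are made up for illustration:

```javascript
// Naive keyword score: fraction of query terms that occur in the chunk text.
function keywordScore(query, text) {
  const terms = query.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  if (terms.length === 0) return 0;
  const words = new Set(text.toLowerCase().match(/[a-z0-9]+/g) ?? []);
  const hits = terms.filter((t) => words.has(t)).length;
  return hits / terms.length;
}

// Merge the top-k chunks from both retrievers, de-duplicated by id.
function hybridRetrieve(query, chunks, embeddingHits, k = 3) {
  const keywordHits = chunks
    .map((c) => ({ ...c, kw: keywordScore(query, c.text) }))
    .filter((c) => c.kw > 0)
    .sort((a, b) => b.kw - a.kw)
    .slice(0, k);
  const merged = new Map();
  for (const c of [...embeddingHits.slice(0, k), ...keywordHits]) {
    if (!merged.has(c.id)) merged.set(c.id, { id: c.id, text: c.text });
  }
  return [...merged.values()];
}

const chunks = [
  { id: "p10", text: "Undergraduate course requirements. Tuition : $200 per hour" },
  { id: "p20", text: "Graduate course requirements. Tuition : $300 per hour" },
  { id: "p30", text: "Campus map and parking" },
];
const embeddingHits = [chunks[1]]; // pretend the vector search returned only p20

const context = hybridRetrieve("undergraduate tuition", chunks, embeddingHits);
console.log(context.map((c) => c.id)); // [ 'p20', 'p10' ]
```

The merged chunks would then all go into the prompt, with an instruction to summarize them in the context of the user's question.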

Another solution is to use the new function call support. You can describe functions for “list classes with cost in range” and “list classes based on topic” and so on, and then let the chat with the model fill in the parameters, until the model is confident that it wants to invoke a function, and then you evaluate that function on your own side to retrieve the list.
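
As a sketch of the "evaluate on your own side" part: the function description below follows the JSON-schema shape used for chat-completion function calling as I understand it, but the function name, its parameters, and the course table are all hypothetical, and the model's reply is simulated rather than fetched from the API:

```javascript
// Function description to send along with the chat request.
const listCoursesByCost = {
  name: "list_courses_by_cost",
  description: "List courses whose total tuition falls within a range (USD).",
  parameters: {
    type: "object",
    properties: {
      min_cost: { type: "number", description: "Lower bound, inclusive" },
      max_cost: { type: "number", description: "Upper bound, inclusive" },
    },
    required: ["max_cost"],
  },
};

// Structured course data, extracted from the PDF ahead of time (sample values).
const courses = [
  { name: "Foundation Course in Mathematics", hours: 20, ratePerHour: 200 },
  { name: "Foundation Course in Statistics", hours: 10, ratePerHour: 150 },
  { name: "Digital Signal Processing 1", hours: 60, ratePerHour: 300 },
];

// Runs locally once the model returns a function call.
// `call` is assumed to look like { name, arguments: "<JSON string>" }.
function handleFunctionCall(call) {
  if (call.name !== listCoursesByCost.name) {
    throw new Error(`Unknown function: ${call.name}`);
  }
  const { min_cost = 0, max_cost } = JSON.parse(call.arguments);
  return courses
    .map((c) => ({ name: c.name, total: c.hours * c.ratePerHour }))
    .filter((c) => c.total >= min_cost && c.total <= max_cost);
}

// Simulated model output for "which courses cost below $4000?"
const result = handleFunctionCall({
  name: "list_courses_by_cost",
  arguments: '{"max_cost": 3999}',
});
console.log(result); // [ { name: 'Foundation Course in Statistics', total: 1500 } ]
```

The exact filtering happens in ordinary code over structured data, so the "below $4000" comparison is deterministic; the model only has to pick the function and fill in the bounds.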


I agree with @jwatte’s take on this. When you go from a traditional database format to vector embeddings, you are swapping determinism for statistics. That statistical system can be very accurate, and it is ground-breaking in its ability to translate what people “meant to say” into what they actually want, but it is also going to suffer from bluetoothism: not everything needs Bluetooth.