What am I doing wrong with my semantic search over embedded JSON?

My JSON file contains over 500 rows, and each row looks like this:
```
"content": "{'Name': '', 'Category': '', 'Sub-Category': '', 'City': '', 'Latitude': '', 'Longitude': '', 'Address': '', 'filename': [''], 'text': ''}",
"vector": "
```
Is there a way to search for keywords, or to combine the categories, while doing a semantic search? Or should I use a different method for OpenAI to detect the content?


When searching for short phrases and keywords, vector-based semantic searches yield poor results.

You will get better results with traditional DB keyword searches and full-text searches for keywords, short phrases, and short strings of text.
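For example, a keyword search over POI fields like the ones above can be done entirely with SQLite's FTS5 full-text index, with no embeddings involved. A minimal sketch with made-up rows:

```python
import sqlite3

# Hypothetical POI rows shaped like the JSON structure in the question.
rows = [
    ("Sagrada Familia", "Attraction", "Barcelona", "Gaudi's unfinished basilica."),
    ("Park Guell", "Park", "Barcelona", "A public park with mosaics by Gaudi."),
    ("Louvre", "Museum", "Paris", "The world's largest art museum."),
]

conn = sqlite3.connect(":memory:")
# FTS5 builds an inverted index over every column for exact keyword matching.
conn.execute("CREATE VIRTUAL TABLE poi USING fts5(name, category, city, text)")
conn.executemany("INSERT INTO poi VALUES (?, ?, ?, ?)", rows)

# Keyword search combining two terms across all columns.
hits = [r[0] for r in conn.execute(
    "SELECT name FROM poi WHERE poi MATCH ?", ("Gaudi AND Barcelona",))]
```

This kind of query answers "which rows contain both terms" exactly, which is precisely where short-keyword vector search tends to fall down.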

HTH

:slight_smile:


That makes sense!
I made this JSON from a 3-million-character text corpus that contained only prose and mentioned the name.
I would guess that, in that case, embeddings would have been helpful for extracting the other parameters.

I would use a traditional DB search and parse the results to get the exact key-values.
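Note that the `content` strings in the sample above use single quotes, so they are not valid JSON; one way to parse such a row back into exact key/value pairs after a DB query is `ast.literal_eval`. A sketch with a made-up row:

```python
import ast

# A row in the shape shown earlier: "content" holds a dict serialized with
# single quotes, so json.loads() would reject it; ast.literal_eval accepts it.
row = {
    "content": "{'Name': 'Louvre', 'Category': 'Museum', 'City': 'Paris', "
               "'text': 'The largest art museum in the world.'}",
}

record = ast.literal_eval(row["content"])  # plain Python dict with exact keys
```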

HTH

:slight_smile:

So what would be the use of embeddings? Should I even use them, considering I am focused on POIs for tourism?

For what you have described so far @Gsc and also my understanding of your project, you do not need OpenAI API technology to accomplish your task.

:slight_smile:

> For what you have described so far @Gsc and also my understanding of your project, you do not need OpenAI API technology to accomplish your task.

Ok :'(.
Do you mind if I DM you to give you a better idea of the project?
Maybe you can tell me whether it’s worth digging into embeddings or not.

Hi @Gsc

Thanks for your confidence in my technical skills.

To be candid, I’m “fairly busy” with billable, commercial software development projects; but I do volunteer my time here to help the public community of software developers who use the OpenAI API and who have less experience than I do, so I prefer that you ask questions in public, so we create a knowledge base for others (current and future visitors).

Is there some reason you do not want to continue in public?

Personally, I think it is “unfair” to the community to “take discussions private” when others may be following the discussion with interest or have similar issues, both now or in the future.

From what I have read in what you have posted here, you do not need OpenAI technology; you can perform the task you have described, based on my understanding of your posts, with traditional database search-and-retrieval tech and some parsing of the results from the DB query.

Maybe consider posting more details publicly in this topic?

:slight_smile:


Thanks for the candid answer, @ruby_coder, and I understand.
As for your question: it makes no difference to me to continue publicly in the DB section; it’s just that this is my first post and, given our chat above, it might not be significantly helpful for the community :smile:
I might be wrong on that last part, so here goes my process and reasoning on this project I have been working on for the last few months.

The objective is to get articles written about the attractions, or POIs. The basic articles are easily done using my DB and OpenAI, but the more complex ones are still a work in progress…

Mainly I have taken a text corpus, extracted the names, and added the other information using APIs or manual input.
The issue I am finding is that, now that I have a DB and keep appending information under each ‘Name’, I am not able to retrieve it and get a satisfactory result from text-davinci-003.
My prompt for Davinci is around 1.5k tokens (I tried shortening it, but the results are best this way), and each ‘text’ inside a ‘Name’ is around 800-1.2k tokens. When I try to use more than one ‘text’, I do not leave enough room for Davinci to fulfil the prompt in the desired way.
I believe fine-tuning my model would partially fix the token issue (I might do it in the future), but adding 4 different ‘texts’ from different ‘Names’ would still cause the same problem. I tried summarization and keyword extraction, but the information ends up being lost.
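The arithmetic above can be sketched as a simple budget check. This assumes text-davinci-003's 4,097-token context window and a crude characters/4 heuristic (a real tokenizer such as tiktoken would give exact counts); the 1.5k prompt and the reserve for the completion are the figures described in this post:

```python
# Rough token budgeting for a davinci-class model with a 4,097-token window.
CONTEXT = 4097
PROMPT_TOKENS = 1500       # the fixed instruction prompt described above
COMPLETION_RESERVE = 800   # assumed room kept free for the generated article

def texts_that_fit(texts):
    """Greedily pick 'text' chunks until the context budget runs out."""
    budget = CONTEXT - PROMPT_TOKENS - COMPLETION_RESERVE
    chosen = []
    for t in texts:
        est = len(t) // 4  # crude heuristic: ~4 characters per token
        if est > budget:
            break
        budget -= est
        chosen.append(t)
    return chosen
```

With 800-1.2k-token texts, this budget admits roughly one text per call, which matches the squeeze described above.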

This becomes an issue when I start using the GPS coordinates for nearby attractions: all I can do is chain from one ‘Name’ to another, generating separate basic articles, in order to avoid articles being cut off in the middle for lack of tokens.

This is the reason I thought embeddings could be a solution: linking the ‘Name’, ‘text’, and all the other “columns” so the information can be processed in a more accurate and concise way, without feeding every ‘text’ to Davinci each time.
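One common pattern for that kind of linking is to embed only the free-text field and keep the other columns as plain metadata: filter on exact metadata first, then rank the survivors by vector similarity. A toy sketch, where the bag-of-words `toy_embed` is only a stand-in for text-embedding-ada-002 and all rows are made up:

```python
import math
from collections import Counter

def toy_embed(text):
    # Stand-in for a real embedding model: bag-of-words over a tiny fixed vocab.
    vocab = ["beach", "museum", "park", "art", "sand"]
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Metadata stays as plain columns; only the free-text field is embedded.
pois = [
    {"Name": "Louvre", "City": "Paris", "text": "art museum"},
    {"Name": "Orsay", "City": "Paris", "text": "impressionist paintings museum"},
    {"Name": "Bondi", "City": "Sydney", "text": "sand beach"},
]
for p in pois:
    p["vector"] = toy_embed(p["text"])

def search(query, city=None):
    # Exact metadata filter first, then semantic ranking of the survivors.
    qv = toy_embed(query)
    candidates = [p for p in pois if city is None or p["City"] == city]
    return sorted(candidates, key=lambda p: cosine(qv, p["vector"]), reverse=True)
```

The same shape works with GPS coordinates: replace the `City` filter with a distance check, so only nearby POIs are scored semantically.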

The problem for me is that you are describing your “issue” in general terms instead of technical terms via data and code.

What are the exact API parameters you are using when you fine-tune with this data?

What is an example prompt you are having issues with, what is the exact completion you are getting (which is wrong), and what is the expected output?

As a technical person, I find it easier, faster, and more enjoyable to discuss technical issues and solve technical problems by focusing on the actual code, the parameters, and the data (in, out, expected).

So, to be crystal clear:

  1. You are currently only using the embeddings API?
  2. You have not performed any fine-tuning or used any model completions?

:question:

Please post the technical details of whatever you are doing above.

Thanks

:slight_smile:

> 1. You are currently only using the embeddings API?

The APIs I am most familiar with inside OpenAI are text-davinci-003 and text-curie-001.
text-embedding-ada-002 is where I have spent some time over the last weeks, trying to build a semantic search that runs inside a Python script with several steps.
The issue I had was an error in my Python code when accessing the different fields in order to “narrow” the results; I tried several approaches over the last weeks but could not get my desired output.
The funny thing is that when I ran the embedding search without “limiting” it to a specific field such as ‘Name’, it worked against the embedded JSON without any issue.
I was expecting to be able to use embeddings, instead of continuing with the traditional DB, to get the information from step A to B, B to C, and F to G, in order to save time and space.
At the same time, my file 0_File.json grows from 1.6 MB to 30 MB once embedded.
And I am not sure it will optimize time, resources, and output.
(Sorry for the big doubts.)
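One way to get that field-limited search working is to store one vector per field rather than one per row, then score only the chosen field's vectors. A toy sketch (the character-bigram `toy_embed` is only a stand-in for text-embedding-ada-002; the rows are made up):

```python
import math

def toy_embed(text):
    # Stand-in for a real embedding model: character-bigram counts.
    vec = {}
    t = text.lower()
    for i in range(len(t) - 1):
        vec[t[i:i + 2]] = vec.get(t[i:i + 2], 0) + 1
    return vec

def cosine(a, b):
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# One vector per field, so a query can be limited to e.g. 'Name' only.
rows = [
    {"Name": "Sagrada Familia", "text": "unfinished basilica in Barcelona"},
    {"Name": "Louvre", "text": "largest art museum in the world"},
]
index = [{f: {"value": v, "vector": toy_embed(v)} for f, v in r.items()}
         for r in rows]

def search(query, field):
    # Score only the vectors belonging to the requested field.
    qv = toy_embed(query)
    best = max(index, key=lambda r: cosine(qv, r[field]["vector"]))
    return best["Name"]["value"]
```

The trade-off is N vectors per row instead of one, which is part of why the embedded file grows so much.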

> 2. You have not performed any fine-tuning or used any model completions?

As of today, although I am interested, I have not performed any fine-tuning with OpenAI or other models.
I managed to get the desired output using text-curie-001 and text-davinci-003 by giving them a set of instructions to follow. I have only used them to collect information for the DB and to create articles that follow the same structure.

I do think that is the next step for getting more context into my articles. What do you suggest?

Hello,
I am sending a JSON data string directly as input for embedding creation.
There are around 20 to 30 key/value pairs in each JSON document.
I have applied a RAG architecture on top of this.
But when a user asks about one key/value pair, the results are not promising: with 1M vectors in the vector database, I am unable to get the optimal result within the top 50 matches.
Please help me increase the accuracy.
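A common fix, assuming you control the indexing step, is to embed one small chunk per key/value pair (each tagged with its parent document id) rather than one vector for the whole 20-30-key JSON blob, so a question about a single key competes against chunks that are mostly about that key. A minimal sketch with a made-up document:

```python
# Split one JSON document into per-key chunks before embedding, so a query
# about a single key/value pair lands on a similarly focused chunk.
doc = {"id": "emp-42", "name": "Ada", "department": "Research", "city": "London"}

def to_chunks(doc):
    chunks = []
    for key, value in doc.items():
        if key == "id":
            continue  # keep the id as metadata, not as searchable text
        chunks.append({"parent_id": doc["id"], "text": f"{key}: {value}"})
    return chunks
```

At retrieval time, the `parent_id` lets you pull the full document back out of the DB once a chunk matches.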


I am also trying to do the same thing you are doing. Did you find any answers? If yes, please let me know.

Hi John, I did manage to get it working!
I did it by chopping the project into sections.
Once I had the sections (14 different steps), I identified that I only needed the OpenAI API in 4 of them. This made organizing the requests much simpler, the database (simple CSV files) more detailed, and the results more accurate.

If you are building something that generates articles only from a database, identifying the sections where you can use OpenAI (or don’t need to) can speed up the whole process!!

Hope this helps you!

Hi @Gsc, thanks for replying. What I am trying to do is implement RAG. I have lots of MongoDB documents, with 20 keys on average, which are internal documents, and I am also using the OpenAI API for generative AI. When I ask the LLM a question, it first has to get the relevant documents from MongoDB, and those documents are fed to the LLM as context for an appropriate answer. The issue I am facing is with the embeddings of these MongoDB documents with several keys: if I embed the whole document at once and insert it into the vector database for querying, I get very bad results when I query. I hope you can give some inputs.
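One thing worth trying, assuming the bad results come partly from embedding raw JSON syntax, is to render each document as a short natural-language passage before embedding it; embedding models generally handle prose better than brace-and-quote JSON. A sketch with a made-up document:

```python
# Turn a structured MongoDB-style document into prose before embedding it,
# instead of embedding the raw JSON string with its braces and quotes.
def render(doc):
    parts = [f"{key.replace('_', ' ')} is {value}" for key, value in doc.items()]
    return ". ".join(parts) + "."

doc = {"employee_name": "Ada", "department": "Research", "office_city": "London"}
passage = render(doc)
```

The rendered `passage` is what goes to the embeddings endpoint; the original document stays in MongoDB and is fetched by id once a passage matches.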
