Best practices for searching for identifiers


I’m developing a solution for editing a website where I need to identify which page of the website a person is referring to in the chatbot.

For example, a user asks “Add a phone number to contact page.” and I need the model to identify the specific page on the website the user is referring to.

What are the current best practices to solve such a problem?

The ones that come to mind are:

  • send all pages and their IDs in the prompt (for example as JSON) and let the model find the appropriate page; while this one is very enticing and, I would guess, effective, it won’t be a good solution when there are hundreds or even thousands of pages
  • create a vector DB of page URLs and titles and use RAG; I’m not sure how well RAG works when dealing with identifiers (title and URL) rather than the information itself (the content of the page)
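To make the first option concrete, here is a minimal sketch of what the prompt construction might look like. The page inventory and IDs are hypothetical placeholders; in practice they would come from your CMS or sitemap:

```python
import json

# Hypothetical page inventory; in practice this comes from your CMS or sitemap.
pages = [
    {"id": "p1", "url": "/contact", "title": "Contact Us"},
    {"id": "p2", "url": "/about", "title": "About"},
    {"id": "p3", "url": "/blog", "title": "Blog"},
]

def build_prompt(user_request: str) -> str:
    """Embed the full page list as JSON and ask the model to answer with an id."""
    return (
        "Here is the site's page inventory as JSON:\n"
        + json.dumps(pages, indent=2)
        + f"\n\nUser request: {user_request!r}\n"
        + "Reply with only the id of the page the user is referring to."
    )

print(build_prompt("Add a phone number to contact page."))
```

The token cost of this prompt grows linearly with the number of pages, which is exactly the scaling concern raised above.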

What do you think the pros and cons of these two approaches are, and have you heard of or used any other approaches for similar problems?


A website is usually well structured because of SEO. If you can implement a web crawler to follow the links, you can probably create a visual sitemap and pinpoint which page is which. You can then refine the result further using AI.

Not sure I’m following. Could you elaborate?

Do you not have an API spec for each page? One option would be to provide the spec to the LLM as context, but the spec needs to be well documented.

My first idea is to create a mapping of the pages and actions, plus a complementary function for the model to call to look up the correct action. This could also be done via a knowledge graph search.
In a first step, have the model decide whether the request calls for an action, and then pass the request to a model instance responsible for executing the query. Actually performing the action can then be done by another instance or by the requesting model instance.

One advantage would be that the requests can be modeled as events, which lets you decouple the background steps from other user interactions.

It’s possible that you may end up with many different possible actions, making it more challenging for the model to select the best course of action, but the intent classification and invoking a separate instance should help mitigate potential issues.
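A rough sketch of this two-step flow, assuming a hypothetical page/action mapping and a lookup function exposed to the model as a tool (the schema follows the common chat-completions function-calling format, but all names here are illustrative):

```python
# Hypothetical mapping of page keywords to ids and permitted actions.
PAGE_ACTIONS = {
    "contact": {"page_id": "p1", "actions": ["edit_text", "add_phone"]},
    "about": {"page_id": "p2", "actions": ["edit_text"]},
}

def lookup_action(page_keyword: str) -> dict:
    """Tool the executing model instance calls to resolve a page and its actions."""
    return PAGE_ACTIONS.get(page_keyword, {})

# Tool schema to hand to the model, in the usual function-calling format.
tools = [{
    "type": "function",
    "function": {
        "name": "lookup_action",
        "description": "Look up the page id and allowed actions for a page keyword.",
        "parameters": {
            "type": "object",
            "properties": {"page_keyword": {"type": "string"}},
            "required": ["page_keyword"],
        },
    },
}]

print(lookup_action("contact"))
```

The first model instance would classify intent and extract `page_keyword`; a second instance (or plain application code) then executes the resolved action.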


Have you used wget to download websites? If you give it a URL, it will download the page, sniff all the links and resources in the page, then go to the next page and do the same thing. If you can implement a similar program, without downloading the actual pages or resources, just crawling the links, you can make a visual sitemap of the entire website. Because of the need for SEO, web devs structure websites to be intuitive, which means a contact page will most likely be at /contact, an about page at /about, or blogs at /blog. You can probably use AI to go further and find a specific blog entry by looking at the metadata, which I am sure each page will have, maybe using RAG.
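The link-crawling part of this can be sketched with the standard library alone. This example only parses a hard-coded HTML snippet rather than fetching pages over the network, and the URLs are made up for illustration:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collect internal <a href> links from a page, wget-style,
    without downloading any resources."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    url = urljoin(self.base_url, value)
                    # Keep only same-host links: external links don't
                    # belong in the sitemap.
                    if urlparse(url).netloc == urlparse(self.base_url).netloc:
                        self.links.add(url)

# Illustrative HTML; a real crawler would fetch each discovered page in turn.
sample_html = (
    '<a href="/contact">Contact</a> '
    '<a href="/blog">Blog</a> '
    '<a href="https://other.example/x">External</a>'
)
extractor = LinkExtractor("https://example.com/")
extractor.feed(sample_html)
print(sorted(extractor.links))
# → ['https://example.com/blog', 'https://example.com/contact']
```

Repeating this over each newly discovered link (with a visited set to avoid loops) yields the sitemap without pulling down page bodies or assets.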

Sounds like my first option mentioned in the post. If not, it would be great to hear more details 🙂

For now I’m working on the first step of identifying the page (actions will be the second step).

I tested the first solution I mentioned in the original post, and unsurprisingly it works great (at least on 4-turbo), but it is not token-efficient, especially for websites with hundreds of pages.

That is why I’m researching other ways to do the same.


I have a sitemap. The challenge is to solve the problem while keeping in mind potential ambiguity (there may be multiple pages that fit the user’s request) and token efficiency.

If it is using Swagger, you can share the Swagger documentation spec. Also look at Gorilla GPT. I have not tried Gorilla yet, but it looks like it is fine-tuned for APIs.


Considering you need the model to select a page from hundreds, the first option may not be suitable, as it struggles with handling similar pages consistently. Instead, I suggest using the Retrieval-Augmented Generation (RAG) approach, where the model generates a query and then calls an external API to perform a semantic search. First, prepare your data by transforming page descriptions and IDs into vectors—one vector per page. You may store the page ID or URL in the metadata of the vector database. Then, you can perform a semantic search to find the most relevant page for the query. Additionally, you might consider writing instructions for the model to ask users to confirm their desired page if multiple results are returned. You can define whether to show only the top result or the top three.
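A minimal sketch of that pipeline, with the page id and URL stored as metadata next to each vector as suggested. A real system would use an embedding model and a vector database; here a toy bag-of-words vector and cosine similarity stand in for both, purely for illustration:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for a real embedding model: bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# One vector per page; the id and URL live in the record's metadata,
# so a search result comes back with its identifier attached.
index = [
    {"vector": embed("contact us phone email address"),
     "meta": {"id": "p1", "url": "/contact"}},
    {"vector": embed("about our company history team"),
     "meta": {"id": "p2", "url": "/about"}},
]

def search(query: str, top_k: int = 3) -> list:
    """Return the metadata of the top_k most similar pages."""
    q = embed(query)
    scored = sorted(index, key=lambda rec: cosine(q, rec["vector"]), reverse=True)
    return [rec["meta"] for rec in scored[:top_k]]

print(search("Add a phone number to contact page"))
```

Showing only the top result versus the top three is then just a matter of the `top_k` value and the confirmation policy described above.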

Considering you need the model to select a page from hundreds, the first option may not be suitable, as it struggles with handling similar pages consistently.

Both approaches will struggle with this. It is inevitable and can only be solved by double-checking with the user in case of ambiguity.

Instead, I suggest using the Retrieval-Augmented Generation (RAG) approach

Did you use it specifically for finding identifiers, as in my case? My gut feeling, which a few colleagues share, is that RAG is not good at searching for identifiers.

I believe that the second approach, with RAG, might yield better results. You can use a vector database’s ability to store identifiers or URLs as metadata for each page’s embedding vector. When a semantic search retrieves a record, the database can also return the identifier or URL, which allows further operations to be carried out seamlessly.