Assistant that retrieves information from a website

I successfully built an assistant that retrieves specific information from input files.
I would like to upgrade it so I can feed it URLs too. It seems it cannot browse the web on its own, so I thought about downloading the HTML and artificially navigating to the pages where the specific information I am looking for might be. I made a tool function that retrieves link markers and tries to add context to them, so the model can then choose whether or not to navigate (meaning whether or not to download the next page's HTML).
But first, it doesn't work that well, and second, it can be very expensive in terms of tokens if a page is big or if the model chooses to browse many pages.
Does anyone have any ideas on how to optimize this? I would like to be able to browse any basic website.
To give a practical example: being able to find the name of the founder of a company just by feeding my assistant the company's website URL, e.g. https://www.ghst.io/
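
For reference, here is a simplified sketch of the kind of tool function I mean; the function name and the use of requests/BeautifulSoup are just illustrative, not my exact code:

```python
import requests
from bs4 import BeautifulSoup

def extract_link_markers(url: str) -> list[dict]:
    """Download a page and return its links plus nearby text, so the model
    can decide which pages are worth fetching next."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    markers = []
    for a in soup.find_all("a", href=True):
        parent = a.find_parent()
        context = parent.get_text(" ", strip=True)[:200] if parent else ""
        markers.append({
            "href": a["href"],               # where the link points
            "text": a.get_text(strip=True),  # the anchor text itself
            "context": context,              # surrounding text as cheap context
        })
    return markers
```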

How about an additional tool function, "site_index"? Give each page a position in a hierarchy, a description, a length, and a retrieval document number.

Then a tool that retrieves a page by its document number.

If the site index would be called upon constantly, you can place it in the instructions instead, to avoid the context cost and delay of an additional function call to retrieve it.
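
A rough sketch of what the two tools could look like; the field names and example entries are one possible layout, not a prescribed schema:

```python
SITE_INDEX = [
    # One entry per crawled page: position in the hierarchy, description,
    # length, and a retrieval document number. Entries here are made up.
    {"doc": 0, "path": "/",      "description": "Landing page",                  "length": 1842},
    {"doc": 1, "path": "/about", "description": "Company story, team, founders", "length": 3120},
]

PAGES: dict[int, str] = {}  # doc number -> extracted page text (filled by your crawler)

def site_index() -> list[dict]:
    """Tool 1: return the page hierarchy so the model can pick a target."""
    return SITE_INDEX

def get_document(doc: int) -> str:
    """Tool 2: retrieve a single page's text by document number."""
    return PAGES.get(doc, "")
```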


You need to use a scraper that can download the content from the website, then feed that content to a GPT as information.
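
For example, a minimal scrape-and-ask flow might look like this sketch; the model name, prompt, and truncation limit are assumptions, not requirements:

```python
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_about_page(url: str, question: str) -> str:
    html = requests.get(url, timeout=10).text
    # Flatten the page to plain text; crude, but enough to demonstrate the flow.
    text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any chat model works here
        messages=[
            {"role": "system", "content": "Answer using only the provided page text."},
            {"role": "user", "content": f"Page text:\n{text[:8000]}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```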

My project uses wordfreq to prune page retrievals down to only the semi-relevant text before asking an LLM to process it. This generally results in a factor-of-10 size reduction with little loss of relevant content. Perhaps some of that code might be useful; look in the 'google-search… .py' file.
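
Roughly, the idea is to drop lines made up entirely of very common words, since navigation and footer boilerplate scores high on word frequency. A minimal sketch of that idea (the real file differs in detail; the threshold is a guess):

```python
from wordfreq import zipf_frequency

def prune(text: str, lang: str = "en", max_zipf: float = 5.0) -> str:
    """Keep only lines that contain at least one reasonably rare word;
    lines of pure high-frequency filler get dropped."""
    kept = []
    for line in text.splitlines():
        words = [w.lower() for w in line.split() if w.isalpha()]
        # zipf_frequency returns ~7 for words like "the" and 0 for unknown
        # words, so rare/unknown content words keep a line alive.
        if words and any(zipf_frequency(w, lang) < max_zipf for w in words):
            kept.append(line)
    return "\n".join(kept)
```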


To optimize your assistant for web browsing, consider parsing the HTML more selectively instead of feeding whole pages to the model. Also focus on refining the tool's decision-making so it downloads fewer unnecessary pages. Techniques like content summarization, or prioritizing specific HTML elements, can cut token consumption substantially. To extract specific information such as a founder's name, use targeted queries or heuristics tailored to common page structures (about pages, team pages, meta tags). Experiment with the context window size and with how markers are identified, and test and iterate regularly to strike a balance between accuracy and token efficiency across diverse websites.
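
For instance, "prioritizing specific HTML elements" could be as simple as keeping only the tags most likely to carry facts and stripping the rest; a sketch, where the tag lists are assumptions to tune per site:

```python
from bs4 import BeautifulSoup

PRIORITY_TAGS = ["title", "h1", "h2", "h3", "p", "li"]

def distill(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop token-heavy, low-information markup first.
    for tag in soup(["script", "style", "nav", "footer", "header"]):
        tag.decompose()
    parts = [el.get_text(" ", strip=True) for el in soup.find_all(PRIORITY_TAGS)]
    return "\n".join(p for p in parts if p)
```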

I am new to building AI-based apps, and I just discovered what LangChain is about.
Do you think I should stick with the raw OpenAI API, or could LangChain help a lot? From what I understand, it seems very powerful.
I'd like to be aware of several strategies before starting to implement.

Anyway, thanks a lot for your answers, everyone; all your insights are much appreciated 🙂