Assistant that retrieves information from a website

I successfully built an assistant that retrieves specific information from input files.
I would like to upgrade it so I can feed it URLs too. It seems it cannot browse the web on its own, so I thought about downloading the HTML and artificially navigating to the pages where the specific information I am looking for might be. I made a tool function that retrieves link markers and tries to add context to them, so the model can then choose whether or not to navigate (meaning whether or not to download the next page's HTML).
But first, it doesn't work that well, and second, it can be very expensive in terms of tokens if a page is big or if the model chooses to browse many pages.
Does anyone have any ideas on how to optimize this? I would like to be able to browse any basic website.
To give a practical example: being able to find the name of the founder of a company just by feeding my assistant the company's website URL, e.g. https://www.ghst.io/
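
For reference, here is a simplified sketch of the kind of tool function I mean; the function name and the use of requests/BeautifulSoup are just illustrative, not my exact code:

```python
import requests
from bs4 import BeautifulSoup

def extract_link_markers(url: str) -> list[dict]:
    """Download a page and return its links plus nearby text, so the model
    can decide which pages are worth fetching next."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    markers = []
    for a in soup.find_all("a", href=True):
        parent = a.find_parent()
        context = parent.get_text(" ", strip=True)[:200] if parent else ""
        markers.append({
            "href": a["href"],               # where the link points
            "text": a.get_text(strip=True),  # the anchor text itself
            "context": context,              # surrounding text as cheap context
        })
    return markers
```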

How about an additional tool function, "site_index"? Give each page a position in a hierarchy, a description, a length, and a retrieval document number.

Then a tool that retrieves a page by its document number.

If the site index would be called upon constantly, you can place it in the instructions instead, to avoid the context cost and delay of an additional function call to retrieve it.
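
A rough sketch of what the two tools could look like; the field names and example entries are one possible layout, not a prescribed schema:

```python
SITE_INDEX = [
    # One entry per crawled page: position in the hierarchy, description,
    # length, and a retrieval document number. Entries here are made up.
    {"doc": 0, "path": "/",      "description": "Landing page",                  "length": 1842},
    {"doc": 1, "path": "/about", "description": "Company story, team, founders", "length": 3120},
]

PAGES: dict[int, str] = {}  # doc number -> extracted page text (filled by your crawler)

def site_index() -> list[dict]:
    """Tool 1: return the page hierarchy so the model can pick a target."""
    return SITE_INDEX

def get_document(doc: int) -> str:
    """Tool 2: retrieve a single page's text by document number."""
    return PAGES.get(doc, "")
```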


You need to use a scraper that can download the content from the website, then feed that content to a GPT as information.
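
For example, a minimal scrape-and-ask flow might look like this sketch; the model name, prompt, and truncation limit are assumptions, not requirements:

```python
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_about_page(url: str, question: str) -> str:
    html = requests.get(url, timeout=10).text
    # Flatten the page to plain text; crude, but enough to demonstrate the flow.
    text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any chat model works here
        messages=[
            {"role": "system", "content": "Answer using only the provided page text."},
            {"role": "user", "content": f"Page text:\n{text[:8000]}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```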

My project uses wordfreq to prune page retrievals down to only the semi-relevant text before asking an LLM to process it. This generally results in a factor-of-10 size reduction with little loss of relevant content. Perhaps some of that code might be useful; look in the 'google-search… .py' file.
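
Roughly, the idea is to drop lines made up entirely of very common words, since navigation and footer boilerplate scores high on word frequency. A minimal sketch of that idea (the real file differs in detail; the threshold is a guess):

```python
from wordfreq import zipf_frequency

def prune(text: str, lang: str = "en", max_zipf: float = 5.0) -> str:
    """Keep only lines that contain at least one reasonably rare word;
    lines of pure high-frequency filler get dropped."""
    kept = []
    for line in text.splitlines():
        words = [w.lower() for w in line.split() if w.isalpha()]
        # zipf_frequency returns ~7 for words like "the" and 0 for unknown
        # words, so rare/unknown content words keep a line alive.
        if words and any(zipf_frequency(w, lang) < max_zipf for w in words):
            kept.append(line)
    return "\n".join(kept)
```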


To optimize your assistant for web browsing, consider parsing the HTML more selectively instead of feeding whole pages to the model. Also focus on refining the tool's decision-making so it downloads fewer unnecessary pages. Techniques like content summarization, or prioritizing specific HTML elements, can cut token consumption substantially. To extract specific information such as a founder's name, use targeted queries or heuristics tailored to common page structures (about pages, team pages, meta tags). Experiment with the context window size and with how markers are identified, and test and iterate regularly to strike a balance between accuracy and token efficiency across diverse websites.
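
For instance, "prioritizing specific HTML elements" could be as simple as keeping only the tags most likely to carry facts and stripping the rest; a sketch, where the tag lists are assumptions to tune per site:

```python
from bs4 import BeautifulSoup

PRIORITY_TAGS = ["title", "h1", "h2", "h3", "p", "li"]

def distill(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop token-heavy, low-information markup first.
    for tag in soup(["script", "style", "nav", "footer", "header"]):
        tag.decompose()
    parts = [el.get_text(" ", strip=True) for el in soup.find_all(PRIORITY_TAGS)]
    return "\n".join(p for p in parts if p)
```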

I am new to building AI-based apps, and I just discovered what LangChain is about.
Do you think I should stick with the raw OpenAI API, or could LangChain help a lot? From what I understand, it seems very powerful.
I'd like to be aware of several strategies before starting to implement.

Anyway, thanks a lot for your answers, everyone; all your insights are much appreciated 🙂