How to provide URL content to assistants

Hello,
i’m building an assistant and as part of the messages users can post urls.
Now i want the assistant to access the content of the urls and use them as knowledge of that thread (not message)

what’s the way of doing it?
I image two solutions, but not sure what’s the best way:

  • I can pre-process the message in my python script, download the html and use it as input (but how this should be done?)
  • I can create a function that fetches the data (here as well, not sure how to do it properly)

any tip or help is welcomed.

Hi @Esseti and welcome to the community!

I would separate this out into two functions:

  1. get_url_from_messages(messages: List[str]) -> str
  2. get_contents_from_url(url: str) -> str

The first one would essentially use some regex to find any URLs. The second one you can make it as simple or complex as you like. In the simplest case, if you are using Python, you would use something like requests. In a more complex case, if you want more control and “intelligence”, you would use something like Apify web scraping service (I’ve used Apify before and it’s very good!).

You define those functions as part of the function calling feature, and implement the relevant API endpoints.

Some things to think about:

  • You may also want to indicate “intent” when it comes to URLs, i.e. should the URL be used for knowledge, True / False? For this you probably need to maybe implement a simple True / False classification, maybe using a smaller model like a mini?
  • You may want to look at caching solutions - so every time you scrape the contents from that URL, it takes time; you may want to cache this into a db and avoid this additional latency next time, but this leads to lot of other complexities