Passing webpages to GPT-3

I’d like to try to pass a webpage to GPT-3, for two use cases.

  1. Providing a summary of a custom length.
  2. Asking GPT-3 questions whose answers can be found in that webpage.

Should I use the “Search” API endpoint for this? That looks more like PageRank: an algorithm that recommends a relevant document but doesn’t actually answer a question.

Or “fine-tuning”? That looks more suited to passing structured data as examples.

Is there any good way to do this or do I have to convert the webpage to plaintext and pass it in chunks to the Completions endpoint?

Update:

Here’s my current vision for a “GPT-3 native” design for a website summarizer. I’m interested to hear if anyone has better ideas.

My idea is to write a GPT-3 powered “smart chunker”. Chunking seems unavoidable when working with GPT-3 right now: you have to break material into pieces that fit into a prompt. But the overall process will feel more elegant if the text is broken at meaningful points rather than arbitrary ones. I will prompt GPT-3 to look at a span of about 1,000 tokens and decide where a good cut-off falls, either by returning the truncated text or just a line or sentence index. Then I’ll take the next 1,000-token span of source text starting after GPT-3’s suggested break point. This way, any large document passed to GPT-3 will be chunked more favorably.
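Here’s a rough sketch of what I mean, written against the openai Python library. The function names, the prompt wording, the engine name, and the use of character counts as a stand-in for tokens are all my own assumptions, not a tested recipe:

```python
# Sketch of a GPT-3 powered "smart chunker": ask the model where a natural
# break point falls inside a window of text, then cut there.
import openai

def find_break_point(window: str) -> str:
    """Ask GPT-3 to repeat the last sentence that makes a natural break."""
    prompt = (
        "The following is the beginning of a longer document:\n\n"
        f"{window}\n\n"
        "Copy, word for word, the last complete sentence above that would "
        "make a natural stopping point for a summary section:"
    )
    response = openai.Completion.create(
        engine="text-davinci-002",   # assumption: any completion-capable engine
        prompt=prompt,
        max_tokens=100,
        temperature=0,
    )
    return response["choices"][0]["text"].strip()

def smart_chunks(text: str, window_chars: int = 4000):
    """Yield chunks, cutting each window at the model's suggested break point.

    Character counts stand in for tokens here to keep the sketch simple.
    """
    pos = 0
    while pos < len(text):
        window = text[pos:pos + window_chars]
        marker = find_break_point(window)
        cut = window.rfind(marker)
        end = pos + cut + len(marker) if cut != -1 else pos + window_chars
        yield text[pos:end]
        pos = end
```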

  1. I’ll work with GPT-3 (Codex) to find the most universal way to obtain the essential text content of a website. It could be a simple wget, Selenium, or a browser plaintext dump, with GPT-3 parsing, extracting, and cleaning the important/relevant content.

  2. I’ll smart chunk it.

  3. I’ll have GPT-3 summarize each chunk (a rough sketch of these steps follows below).
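Something like the following is what I have in mind. It’s just a sketch: requests/BeautifulSoup is only one of the extraction options mentioned in step 1, the engine name and prompt wording are assumptions, and smart_chunks() is the hypothetical helper sketched above:

```python
# Sketch of the three steps: extract page text, smart-chunk it, summarize each chunk.
import openai
import requests
from bs4 import BeautifulSoup

def page_text(url: str) -> str:
    """Fetch the page and keep only the visible body text."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()                      # drop non-content markup
    return " ".join(soup.get_text(separator=" ").split())

def summarize(chunk: str, length_hint: str = "three sentences") -> str:
    response = openai.Completion.create(
        engine="text-davinci-002",           # assumption
        prompt=f"Summarize the following text in {length_hint}:\n\n{chunk}\n\nSummary:",
        max_tokens=200,
        temperature=0.3,
    )
    return response["choices"][0]["text"].strip()

def summarize_page(url: str) -> str:
    text = page_text(url)
    return "\n".join(summarize(c) for c in smart_chunks(text))
```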

4 Likes

GPT-3 cannot access any information outside of its own training data. If you provide a URL, it cannot access the content found at that URL.

You would first need to scrape the web page content and then pass the text in your API request.
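For example, a minimal sketch of the “scrape first, then ask” pattern for the question-answering use case (the engine name and prompt wording are assumptions, not a recommended recipe):

```python
# Fetch the page, strip the markup, and put the resulting text into the prompt.
import openai
import requests
from bs4 import BeautifulSoup

def answer_from_page(url: str, question: str) -> str:
    html = requests.get(url, timeout=30).text
    text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
    context = text[:6000]   # crude cap so the prompt fits in the context window
    prompt = (
        "Answer the question using only the text below.\n\n"
        f"Text:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    response = openai.Completion.create(
        engine="text-davinci-002",
        prompt=prompt,
        max_tokens=150,
        temperature=0,
    )
    return response["choices"][0]["text"].strip()
```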

1 Like

I wrote a smart chunker that uses a paragraph and sentence tokenizer to look for possible break points in narrative text. In this case you could break at headings in the HTML using Beautiful Soup.
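A rough sketch of that approach, assuming Beautiful Soup for the headings and NLTK for sentence boundaries (each is just one option among several; NLTK’s “punkt” data needs to be downloaded first):

```python
# Split the HTML at headings, then fall back to sentence boundaries
# if a section is still too long.
from bs4 import BeautifulSoup
import nltk

def chunks_by_heading(html: str, max_chars: int = 4000):
    soup = BeautifulSoup(html, "html.parser")
    sections, current = [], []
    for el in soup.body.find_all(["h1", "h2", "h3", "p"]):
        if el.name.startswith("h") and current:
            sections.append(" ".join(current))   # a heading starts a new section
            current = []
        current.append(el.get_text(" ", strip=True))
    if current:
        sections.append(" ".join(current))

    # Split any oversized section on sentence boundaries.
    for section in sections:
        if len(section) <= max_chars:
            yield section
            continue
        piece = ""
        for sent in nltk.sent_tokenize(section):
            if len(piece) + len(sent) > max_chars and piece:
                yield piece
                piece = ""
            piece += " " + sent
        if piece:
            yield piece
```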

1 Like

I agree with this solution.

In general, ask yourself what data you’re looking to capture from an HTML page, and think of a way to scrape that info and pass it to GPT-3.

1 Like

I’ve already seen GPT-3 take in raw HTML text and answer questions about it from the body content. Might be worth checking out.

1 Like

Can is not the same as should.

1 Like

With typical webpages you’re paying GPT to analyze a lot of junky tokens with low semantic content; it’s probably more cost-effective to pour off some of the broth before you submit the query!

2 Likes

Hi @bakztfuture, where have you seen GPT-3 take in raw HTML text? My understanding is that GPT-3 can only take in JSON Lines data (for the search endpoint) and, I believe, CSV data (for the embeddings endpoint).

I have already “poured off some of the broth” from my web pages as @NimbleBooksLLC suggested, and created a JSON Lines file that’s been uploaded to GPT-3. The metadata in that file is my web post IDs. I’m trying to figure out a way to take a user’s query from my website, get its embedding from the embeddings endpoint, compare that embedding to a list of embeddings created from my JSON Lines file, and finally return the most relevant web posts to my users. This is proving to be a very tricky problem. I wonder if anyone else is tackling it? Leslie

Hi @NimbleBooksLLC, do you know how to grab a user’s search string from a website search box and send it to GPT-3 for embedding? And then compare the search string’s embedding to a list of embeddings contained in a file previously uploaded to GPT-3? And then send the n best-matched embeddings from OpenAI back to the website for display as search results? See my related question to @bakztfuture. I’m trying to figure out how hard my problem really is. My data size is pretty small: 1,862 rows, each row < 2,048 tokens. Thanks.
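To make those steps concrete, here is a rough sketch of that flow in Python, assuming the openai library, precomputed embeddings for each post held locally, and cosine similarity as the matching criterion (the model name is an assumption):

```python
# Embed the user's query, compare it against precomputed post embeddings,
# and return the top-n post ids with their similarity scores.
import numpy as np
import openai

def embed(text: str) -> np.ndarray:
    response = openai.Embedding.create(
        model="text-embedding-ada-002",   # assumption: any embedding model would do
        input=text,
    )
    return np.array(response["data"][0]["embedding"])

def top_matches(query: str, post_ids: list, post_embeddings: np.ndarray, n: int = 5):
    """post_embeddings: one precomputed embedding row per entry in post_ids."""
    q = embed(query)
    # Cosine similarity between the query and every stored post embedding.
    sims = post_embeddings @ q / (
        np.linalg.norm(post_embeddings, axis=1) * np.linalg.norm(q)
    )
    best = np.argsort(sims)[::-1][:n]
    return [(post_ids[i], float(sims[i])) for i in best]
```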

1 Like

I have not done that yet, but each individual step sounds doable. Are these rows short answers to legal questions? Do you have a definition for what “best-matched” embeddings means? Tweaking that definition might affect the quality of results. There are a lot of different ways you can measure similarity.
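For illustration, a few common scoring options side by side (a toy sketch; with unit-normalized embeddings, cosine and dot product give the same ranking, but in general they can differ):

```python
# Three common ways to score how well two embedding vectors match.
import numpy as np

def scores(a: np.ndarray, b: np.ndarray) -> dict:
    return {
        "cosine": float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))),
        "dot_product": float(a @ b),
        "euclidean_distance": float(np.linalg.norm(a - b)),  # lower = more similar
    }
```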