I’d like to try to pass a webpage to GPT-3, for two use cases.
- Providing a summary of a custom length.
- Asking GPT-3 questions whose answers can be found in that webpage.
Should I use the “Search” API endpoint for this? That looks more like PageRank: it recommends a relevant document but doesn’t actually answer a question.
Or “fine-tuning”? That looks more suited to passing structured examples.
Is there any good way to do this or do I have to convert the webpage to plaintext and pass it in chunks to the Completions endpoint?
Here’s my current vision for a “GPT-3 native” design of a website summarizer. Interested to hear if anyone has any better ideas.
My idea is to write a GPT-3 powered “smart chunker”. Chunking seems unavoidable when working with GPT-3 at the moment: you have to break material into pieces that fit into a prompt. But the overall process will feel more elegant if the text is broken at meaningful points rather than arbitrary ones. I will prompt GPT-3 to look at a span of 1,000 tokens and decide where a good cut-off is, either by returning the truncated text or just a line or sentence index. Then I’ll take the next 1,000-token span of source text starting after GPT-3’s suggested break point. This way, any large document passed to GPT-3 will be chunked in a more favorable way.
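To make the loop concrete, here’s a minimal sketch of that smart-chunking idea. The GPT-3 call that picks a good break point is replaced by a hypothetical stub, `find_break_point`, which here just cuts at the last sentence boundary; in the real version it would be a Completions request asking the model where to cut.

```python
def find_break_point(text: str) -> int:
    """Stand-in for the GPT-3 prompt that suggests a natural cut-off.
    Returns an index into `text`; here, the end of the last full sentence."""
    for i in range(len(text) - 1, -1, -1):
        if text[i] in ".!?":
            return i + 1
    return len(text)  # no sentence boundary found: cut at the window edge


def smart_chunks(text: str, window: int = 4000):
    """Yield chunks of at most `window` characters, each ending at a
    suggested break point, covering the whole input with no overlap."""
    pos = 0
    while pos < len(text):
        candidate = text[pos:pos + window]
        if pos + window >= len(text):
            # Final piece fits entirely in the window; no cut needed.
            yield candidate
            break
        cut = find_break_point(candidate)
        yield candidate[:cut]
        pos += cut
```

The window here is measured in characters for simplicity; a real implementation would count tokens with whatever tokenizer matches the model.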
I’ll work with GPT-3 (Codex) to find the most universal way to obtain the essential text content of a website. It could be as simple as wget, Selenium, or a browser plaintext dump, with GPT-3 handling the parsing, extraction, and cleaning of the important/relevant content.
I’ll smart chunk it.
I’ll have GPT-3 summarize each chunk.
GPT-3 cannot access any information outside of its own training data. If you provide a URL, it cannot access the content found at that URL.
You would first need to scrape the web page content and then pass the text in your API request.
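As a rough sketch of that scrape-first step, here is a stdlib-only way to boil HTML down to plain text before sending it to the Completions endpoint. Fetching the page (with `urllib.request`, `requests`, Selenium, etc.) is omitted; this just shows the extraction part, dropping tags plus script/style content. The `TextExtractor` class is my own illustration, not a standard recipe.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text from HTML, skipping non-content elements."""
    SKIP = {"script", "style", "head", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # >0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)
```

The resulting plain text is what you would then chunk and pass in your API request.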
I wrote a smart chunker that uses paragraph and sentence tokenizers to look for possible break points in narrative text. In your case you could break at the headings in the HTML using Beautiful Soup.
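A rough sketch of that heading-based splitting, using a regex on raw HTML so the example stays self-contained (the Beautiful Soup approach suggested above is more robust against messy markup; this only shows the shape of the idea):

```python
import re

# Zero-width lookahead: split *before* each <h1>..<h6> tag, keeping the
# heading at the start of its section.
HEADING_RE = re.compile(r"(?=<h[1-6][^>]*>)", re.IGNORECASE)


def chunk_at_headings(html: str) -> list:
    """Split an HTML document into sections, each starting at a heading."""
    return [s for s in HEADING_RE.split(html) if s.strip()]
```

Each section can then be cleaned to plain text and summarized independently.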
I agree with this solution.
In general, ask yourself what data you’re looking to capture off an HTML page, and think of a way to scrape that info and pass it to GPT-3.
I’ve already seen GPT-3 take in raw HTML text and answer questions about it from the body content. Might be worth checking out.
Can is not the same as should.
With typical webpages you’re paying GPT-3 to analyze a lot of junky tokens with low semantic content, so it’s probably more cost-effective to pour off some of the broth before you submit the query!
Hi @bakztfuture, where have you seen GPT-3 take in raw HTML text? My understanding is that GPT-3 can only take in JSON Lines data (for the search endpoint) and I believe CSV data (for the embeddings endpoint). I have already “poured off some of the broth” from my web pages as @NimbleBooksLLC suggested, and created a JSON Lines file that’s been uploaded to GPT-3. My metadata in that file is my web post IDs. I’m trying to figure out a way to take a user’s query from my website, get its embedding, compare that embedding to a list of embeddings created from my JSON Lines file, and finally return the most relevant web posts to my users. This is proving to be a very tricky problem. I wonder if anyone else is tackling this? Leslie
Hi @NimbleBooksLLC, do you know how to grab a user’s search string from a website search box and send it to GPT-3 for embedding? And then compare the search string’s embedding to a list of embeddings contained in a file previously uploaded to GPT-3? And then send the n best-matched embeddings from OpenAI back to the website for display as search results? See my related question to @bakztfuture. I am trying to figure out how hard my problem really is. My data size is pretty small - 1862 rows, each row < 2048 tokens. Thanks.
I have not done that yet, but each individual step sounds doable. Are these rows short answers to legal questions? Do you have a definition for what “best-matched” embeddings means? Tweaking that definition might affect the quality of results. There are a lot of different ways you can measure similarity.
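For the “compare one query embedding against a stored list” step, a minimal sketch follows. The embeddings themselves would come from the OpenAI embeddings endpoint; here they are hand-made toy vectors, and “best-matched” is defined as highest cosine similarity, which is the usual choice for these embeddings, though as noted above other similarity measures could be swapped in. The helper names are mine, not part of any API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_n(query_emb, corpus, n=3):
    """corpus: list of (post_id, embedding) pairs.
    Returns the n post ids whose embeddings best match the query."""
    ranked = sorted(
        corpus,
        key=lambda item: cosine_similarity(query_emb, item[1]),
        reverse=True,
    )
    return [post_id for post_id, _ in ranked[:n]]
```

At 1862 rows, a brute-force scan like this is entirely practical; no index structure is needed.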