I’d like to try to pass a webpage to GPT-3, for two use cases.
- Providing a summary of a custom length.
- Asking GPT-3 questions whose answers can be found in that webpage.
Should I use the “Search” API endpoint for this? That looks more like PageRank: it recommends a relevant document but doesn’t actually answer a question.
Or “fine-tuning”? That looks more suited to passing structured examples.
Is there any good way to do this or do I have to convert the webpage to plaintext and pass it in chunks to the Completions endpoint?
Here’s my current vision for a “GPT-3 native” design of a website summarizer. Interested to hear if anyone has any better ideas.
My idea is to write a GPT-3 powered “smart chunker”. Chunking seems unavoidable when working with GPT-3 at the moment: you have to break material into pieces that fit into a prompt. But the overall process will feel more elegant if the text is broken at meaningful points rather than arbitrary ones. I will prompt GPT-3 to look at a span of 1,000 tokens and decide where a good cut-off is, either by returning the truncated text or just a line or sentence index. Then I’ll take the next 1,000-token span of source text starting after GPT-3’s suggested break point. This way, any large document passed to GPT-3 will be chunked in a more favorable way.
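To make the loop concrete, here’s a minimal sketch of that smart-chunking idea. The GPT-3 call that picks a good break point is replaced by a hypothetical stub, `find_break_point`, which here just cuts at the last sentence boundary; in the real version it would be a Completions request asking the model where to cut.

```python
def find_break_point(text: str) -> int:
    """Stand-in for the GPT-3 prompt that suggests a natural cut-off.
    Returns an index into `text`; here, the end of the last full sentence."""
    for i in range(len(text) - 1, -1, -1):
        if text[i] in ".!?":
            return i + 1
    return len(text)  # no sentence boundary found: cut at the window edge


def smart_chunks(text: str, window: int = 4000):
    """Yield chunks of at most `window` characters, each ending at a
    suggested break point, covering the whole input with no overlap."""
    pos = 0
    while pos < len(text):
        candidate = text[pos:pos + window]
        if pos + window >= len(text):
            # Final piece fits entirely in the window; no cut needed.
            yield candidate
            break
        cut = find_break_point(candidate)
        yield candidate[:cut]
        pos += cut
```

The window here is measured in characters for simplicity; a real implementation would count tokens with whatever tokenizer matches the model.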
I’ll work with GPT-3 (Codex) to find the most universal way to obtain the essential text content of a website. It could be as simple as wget, Selenium, or a browser plaintext dump, with GPT-3 handling the parsing, extraction, and cleaning of the important/relevant content.
I’ll smart chunk it.
I’ll have GPT-3 summarize each chunk.
GPT-3 cannot access any information outside of its own training data. If you provide a URL, it cannot access the content found at that URL.
You would first need to scrape the web page content and then pass the text in your API request.
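As a rough sketch of that scrape-first step, here is a stdlib-only way to boil HTML down to plain text before sending it to the Completions endpoint. Fetching the page (with `urllib.request`, `requests`, Selenium, etc.) is omitted; this just shows the extraction part, dropping tags plus script/style content. The `TextExtractor` class is my own illustration, not a standard recipe.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text from HTML, skipping non-content elements."""
    SKIP = {"script", "style", "head", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # >0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)
```

The resulting plain text is what you would then chunk and pass in your API request.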
I wrote a smart chunker that uses paragraph and sentence tokenizers to look for possible break points in narrative text. In your case you could break at the headings in the HTML using Beautiful Soup.
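A rough sketch of that heading-based splitting, using a regex on raw HTML so the example stays self-contained (the Beautiful Soup approach suggested above is more robust against messy markup; this only shows the shape of the idea):

```python
import re

# Zero-width lookahead: split *before* each <h1>..<h6> tag, keeping the
# heading at the start of its section.
HEADING_RE = re.compile(r"(?=<h[1-6][^>]*>)", re.IGNORECASE)


def chunk_at_headings(html: str) -> list:
    """Split an HTML document into sections, each starting at a heading."""
    return [s for s in HEADING_RE.split(html) if s.strip()]
```

Each section can then be cleaned to plain text and summarized independently.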
I agree with this solution.
In general, ask yourself what data you’re looking to capture off an HTML page, and think of a way to scrape that info and pass it to GPT-3.
I’ve already seen GPT-3 take in raw HTML text and answer questions about it from the body content. Might be worth checking out.
Can is not the same as should.
With typical webpages you’re paying GPT-3 to analyze a lot of junky tokens with low semantic content, so it’s probably more cost-effective to pour off some of the broth before you submit the query!
Hi @bakztfuture, where have you seen GPT-3 take in raw HTML text? My understanding is that GPT-3 can only take in JSON Lines data (for the search endpoint) and I believe CSV data (for the embeddings endpoint). I have already “poured off some of the broth” from my web pages as @NimbleBooksLLC suggested, and created a JSON Lines file that’s been uploaded to GPT-3. My metadata in that file is my web post IDs. I’m trying to figure out a way to take a user’s query from my website, get its embedding, compare that embedding to a list of embeddings created from my JSON Lines file, and finally return the most relevant web posts to my users. This is proving to be a very tricky problem. I wonder if anyone else is tackling this? Leslie
Hi @NimbleBooksLLC, do you know how to grab a user’s search string from a website search box and send it to GPT-3 for embedding? And then compare the search string’s embedding to a list of embeddings contained in a file previously uploaded to GPT-3? And then send the n best-matched embeddings from OpenAI back to the website for display as search results? See my related question to @bakztfuture. I am trying to figure out how hard my problem really is. My data size is pretty small - 1862 rows, each row < 2048 tokens. Thanks.
I have not done that yet, but each individual step sounds doable. Are these rows short answers to legal questions? Do you have a definition for what “best-matched” embeddings means? Tweaking that definition might affect the quality of results. There are a lot of different ways you can measure similarity.
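For the “compare one query embedding against a stored list” step, a minimal sketch follows. The embeddings themselves would come from the OpenAI embeddings endpoint; here they are hand-made toy vectors, and “best-matched” is defined as highest cosine similarity, which is the usual choice for these embeddings, though as noted above other similarity measures could be swapped in. The helper names are mine, not part of any API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_n(query_emb, corpus, n=3):
    """corpus: list of (post_id, embedding) pairs.
    Returns the n post ids whose embeddings best match the query."""
    ranked = sorted(
        corpus,
        key=lambda item: cosine_similarity(query_emb, item[1]),
        reverse=True,
    )
    return [post_id for post_id, _ in ranked[:n]]
```

At 1862 rows, a brute-force scan like this is entirely practical; no index structure is needed.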