Summarizing or question answering from long Wikipedia articles?

My project would benefit from being able to summarize or extract answers from arbitrarily long Wikipedia articles, many of which are over 2048 tokens long. Are there API endpoints and/or settings I can use to work around this limit, or other techniques I should try?

From my previous work, I have the entirety of Wikipedia stored offline in a local SOLR instance, so I can search 6 million articles very quickly. The bottleneck is GPT-3’s context limit.


I suggest you try the /answers endpoint, which handles Q&A based on a larger context.


Just curious – what are the storage requirements for such a setup? Do you store images too?

Oh excellent, thank you. I hadn’t realized you could upload data. I will have to explore that as a possibility. Fortunately, GPT-3 is very good at generating questions to ask.

You can check out my git repo here: GitHub - daveshap/PlainTextWikipedia: Convert Wikipedia database dumps into plaintext files
Full English Wikipedia is about 80GB unzipped, and roughly 40GB once you strip out all the markup and HTML. The SOLR data ends up being a bit more than 40GB because of the index, though that’s partly offset by SOLR’s native compression. The heavier demand is on memory; I think my SOLR instance ended up using about 12GB of RAM.
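For anyone curious what querying that setup looks like, here is a rough Python sketch. It assumes the pysolr client and a core named "wikipedia" with "title" and "content" fields; those names are placeholders, not necessarily what the repo uses.

```python
# Rough sketch: querying a local SOLR index of Wikipedia from Python.
# Assumes the pysolr client and a core named "wikipedia" with "title" and
# "content" fields; the core and field names are placeholders.
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/wikipedia", timeout=10)

def search_wikipedia(query, rows=5):
    """Return the top-ranked articles for a free-text query."""
    results = solr.search(query, **{
        "defType": "edismax",     # flexible multi-field ranking
        "qf": "title^2 content",  # weight title matches above body matches
        "fl": "title,content,score",
        "rows": rows,
    })
    return [(doc["title"], doc["content"]) for doc in results]

for title, _ in search_wikipedia("nuclear fusion commercialization"):
    print(title)
```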


There’s a limit of 1GB per file for now. It wouldn’t make sense for us to let everyone duplicate Wikipedia, since that would very quickly become expensive to store. If Q&A based on Wikipedia turns out to be a common use case, we may consider hosting an up-to-date version of Wikipedia that’s accessible to anyone.

I don’t see a legitimate use case for us to store images, since all our tools are related to language processing only.


I see. This is some good info. Also thanks for the resource! :+1:


Oh man, it would be PHENOMENAL if we could just use GPT-3 to query Wikipedia through an endpoint. I have no problem doing it locally/offline, but as per the documentation, grounding answers in empirical sources like Wikipedia is a chief antidote to confabulation.

I’ve also just finished a NEWS service where I ingest a metric ton of RSS feeds into my SOLR instance so that my RAVEN project can be Mr. Current Events. Fortunately, RSS feeds are just titles and descriptions, so they lend themselves to easy summarization with GPT-3.
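In case it’s useful to anyone, the RSS side can be sketched in a few lines of Python. This assumes the feedparser library and the Completion endpoint; the feed URL, engine choice, and prompt wording are all placeholders, not what I actually run.

```python
# Sketch: pull RSS items (titles + descriptions) and summarize each with GPT-3.
# The feed URL, engine, and prompt wording are illustrative placeholders.
import feedparser
import openai

feed = feedparser.parse("https://example.com/world-news.rss")  # placeholder feed

for entry in feed.entries[:5]:
    # RSS items are just a title and a short description, so they fit easily
    # inside the prompt window.
    snippet = f"{entry.title}. {entry.get('summary', '')}"
    prompt = f"Summarize this news item in one sentence:\n\n{snippet}\n\nSummary:"
    response = openai.Completion.create(
        engine="curie",
        prompt=prompt,
        max_tokens=60,
        temperature=0.3,
    )
    print(response.choices[0].text.strip())
```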

Anyways, I’m afraid that uploading dozens of Wikipedia articles per conversation would make RAVEN too slow, so I’ll focus on using my local SOLR version for now. That said, if y’all do integrate GPT-3 and Wikipedia into the Answers endpoint, that would be a game-changer!

EDIT: For instance, I was testing Raven’s ability to hold a conversation about nuclear fusion, and one question that came up was “What are the challenges of commercializing nuclear fusion?” GPT-3 has some built-in knowledge about this topic, but having news and Wikipedia sources would only serve to improve accuracy.

I’ll take a note of that, thanks!


Such functionality will ultimately be required for GPT-3 (or future iterations) to be used reliably for medical, legal, and financial purposes. I envision a future where a large transformer is tightly integrated with a large corpus of data (or several, actually), so that the information it provides can be considered robust and reliable.

Please keep me updated if you make any headway in this domain, as I am intensely interested!


Back to your original question - you already have Wikipedia in your own database, and you can find relevant articles. These are sometimes larger than 2048 tokens, in which case you could use the /answers endpoint: dynamically upload chunks of those Wikipedia articles, and it will answer the question based on the most relevant chunk. Does that work for your use case for now? It doesn’t require uploading the entire Wikipedia at once, only a few documents, which aren’t that large at all.
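Roughly along these lines, as a sketch; the chunk size, models, and example values below are placeholders rather than recommendations.

```python
# Sketch: split a long article into chunks and pass them to the /answers
# endpoint, which picks the most relevant chunk before answering.
# Chunk size, models, file name, and the worked example are placeholders.
import openai

def chunk_text(text, max_words=300):
    """Split an article into roughly fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

article = open("nuclear_fusion.txt").read()  # article fetched from your local SOLR

response = openai.Answer.create(
    search_model="ada",
    model="curie",
    question="What are the challenges of commercializing nuclear fusion?",
    documents=chunk_text(article),
    examples_context="The capital of France is Paris.",
    examples=[["What is the capital of France?", "Paris."]],
    max_tokens=60,
)
print(response["answers"][0])
```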


Yeah, that may be worth testing; my greatest concern is speed, though. Raven is meant to reply in real time, and it only takes a few milliseconds to fetch articles from a local SOLR instance. I’ll have to do some testing, as it’s already relatively slow (sometimes it takes Raven 10+ seconds to reply). Granted, that’s somewhat realistic, since humans don’t reply instantly via chat. However, once I give Raven a voice interface, that would feel very weird and disjointed, since humans can reply verbally almost instantly. So my concern is the latency of uploading and handling files. But it will certainly be worth testing!

What about parsing the article titles, then a limited selection of tables of contents, then the relevant sections instead?

I’d also like to see this enrich Wikipedia by including translations of the relevant sections from other-language Wikipedias.

I started with just indexing titles, but you miss a lot of information that way. Human memory is not based on titles but on content. In fact, I include the title of the article as well as the main body in the indexed “content” field. SOLR provides powerful search and ranking tools, so I can have high confidence that I’m finding all the relevant articles.
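To illustrate, indexing looks something like this (pysolr again; the field names and the example article are placeholders):

```python
# Sketch: index the article title together with the body in one searchable
# "content" field. Field names and the example article are placeholders.
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/wikipedia", timeout=30)

def index_article(article_id, title, body):
    solr.add([{
        "id": article_id,
        "title": title,
        # Prepend the title so it is searchable alongside the body text.
        "content": f"{title}\n\n{body}",
    }])

index_article("12345", "Nuclear fusion", "Nuclear fusion is a reaction in which ...")
solr.commit()
```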

Dealing with other languages will come once I’ve got Raven up and running commercially in English. As Raven is an open-ended chatbot, the bar for approval is very high. Fortunately I’ve made some good headway on that front, but there is still testing to do.


Nice!

I’m curious what would come from having it rebuild those articles and then summarize them.

That’s a waste of energy/processing. There’s no point in summarizing an article when you don’t know why you’re summarizing it. For instance, the question might be about the economics of nuclear fusion OR China’s progress on nuclear fusion. Those two contexts drastically change which information you’re interested in.

Also, Wikipedia has 6 million articles with monthly updates. That’s a lot of inference time to summarize all of them.

Sorry for the miscommunication. I meant that once you’re down to a limited set of relevant articles, say fewer than 2^8, it would rebuild that selection of articles.

That’s possible. It also depends on the length. In the future, I imagine GPT-3 (or future iterations) might be used for legal or medical questions. That means the amount of relevant text required to answer queries will include many volumes of books, not just Wikipedia articles, so I’m not sure which cases would be limited enough to justify summarizing all the articles within a domain. Once you get to a certain volume of text, it might make more sense to train a model on it specifically.

Say, for instance, you accumulate all medical texts and then fine-tune GPT-3 on them so that you have a GPT-MED bot or something like that.
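Very roughly, something like this, assuming a fine-tuning interface like the one in the openai Python library; the file name, base model, and the single example record are all placeholders.

```python
# Very rough sketch of the "GPT-MED" idea: write a medical corpus as
# prompt/completion pairs and fine-tune a base model on it.
# File name, base model, and the example record are placeholders.
import json
import openai

records = [
    {"prompt": "What is the first-line treatment for strep throat?\n\n",
     "completion": " Penicillin or amoxicillin, unless the patient is allergic."},
]
with open("medical_corpus.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Upload the training file and kick off a fine-tune job.
upload = openai.File.create(file=open("medical_corpus.jsonl"), purpose="fine-tune")
job = openai.FineTune.create(training_file=upload["id"], model="curie")
print(job["id"])
```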


My girlfriend’s father is a physician, so I have a little bit of insight into how the industry perceives these tools. Most likely, such tools will be used as adjuncts by physicians or other medical personnel for the foreseeable future. Over time, with reliability tested in the field, AI agents could graduate to actually providing care, but that’s a long way off regardless of approvals, etc. The low-hanging fruit is adding value in PCP and urgent care environments where you may not have a full physician present at all times, but instead only RNs or an NPFM (nurse practitioner of family medicine). Such an adjunct tool could be used to rapidly assist nurses in lieu of a full MD. He did agree that hospitals will want to reduce staff as much as possible, since physicians are (1) expensive and (2) only human :wink:


I am intensely interested too. I am building a large corpus of specialized text that will eventually exceed 1 GB. I want my customers to have a Google-like search INPUT experience and to get only high-quality query results as OUTPUT. I’m using the search endpoint; only my data is returned to my customer in search results. GPT-3’s role in my workflow is to interpret my customer’s natural-language query, compare it to my technical dataset, and return the best-ranked results from my dataset. More generally, I believe there is tremendous opportunity for GPT-3’s customers to create “Good Google” products by removing both noise and bias before a search is done. GPT-3 was trained on the whole internet, and now GPT-3’s customers want to use their own collections, consisting of only the good stuff, to create search products devoid of junk. The ability to exceed the 1 GB limit will facilitate these very powerful use cases, as @vertinski said, where the data being searched is highly specialized or technical.
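To make that concrete, here is a stripped-down sketch of the flow with the search endpoint. The documents and query are placeholders, and a real corpus of this size would be uploaded as a file and referenced by ID rather than passed inline.

```python
# Sketch: rank the customer's own documents against a natural-language query
# using the /search endpoint. Documents and query are placeholders; a large
# corpus would be uploaded as a file and referenced by file ID instead.
import openai

documents = [
    "Torque specification for the model X-200 flange bolts is 45 Nm.",
    "The X-200 pump requires ISO VG 46 hydraulic oil.",
    "Routine inspection of the X-200 seals is recommended every 500 hours.",
]

results = openai.Engine("ada").search(
    documents=documents,
    query="how tight should the flange bolts be?",
)

# Sort by semantic relevance score so only the best-ranked results are shown.
ranked = sorted(results["data"], key=lambda d: d["score"], reverse=True)
for item in ranked:
    print(round(item["score"], 1), documents[item["document"]])
```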
