A potential new source of training data!?

andrewsilber · July 3, 2024, 6:22am

Hello!
This is not directly about the API per se, but it’s germane to OpenAI anyway:

Why does Sam & Co. not strike a deal with the US Government to digitize the entire Library of Congress? Presumably there would be some rights issues to deal with, of course. However, if such a large corpus of high-quality tokens would indeed allow our models to derive more and more high-level skills, then perhaps it should be considered a matter of national security, since it’s all about the “AI arms race” now.

Just my 2c…

sps · July 3, 2024, 1:03pm

Hi @andrewsilber,

I am not aware of the actual size of the Library of Congress; however, they already have digitized content available at: Digital Collections, Available Online | Library of Congress
