Hello OpenAI team,
I would like to propose a new feature that could significantly enhance the efficiency and sustainability of training GPT models. As someone deeply interested in developing specialized versions of GPT, I believe there is a need for a guidance tool that can assist users in curating, summarizing, and organizing their data before integrating it into the training process. Here are the main points of my suggestion:
- Guided Data Preparation: A tool that helps users identify redundant or irrelevant data and summarize PDFs before training, providing a “data quality score” so that only essential information is used.
- Real-Time Feedback on Training: An interactive interface that shows, in real time, how added data affects the model’s capacity and computational load. This would help users stay within token limits and manage compute resources efficiently.
- Educational Component: Integrated educational resources to teach developers and future generations about responsible data management in AI, emphasizing quality over quantity. This feature could foster a culture of intelligent and sustainable AI training.
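To make the first two bullets concrete, here is a minimal sketch of what such a pre-training data check might look like. Everything here is illustrative: the function name, the normalization rule, the scoring formula, and the 1.3 tokens-per-word ratio are assumptions for the sketch, not an existing OpenAI API or tokenizer.

```python
import hashlib

def data_quality_report(documents, tokens_per_word=1.3):
    """Flag exact duplicates (after whitespace/case normalization) and
    roughly estimate token load before any data goes into training.
    The tokens-per-word ratio is a crude heuristic, not a real tokenizer."""
    seen = {}
    duplicates = []
    total_words = 0
    for i, doc in enumerate(documents):
        normalized = " ".join(doc.split()).lower()
        total_words += len(normalized.split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in seen:
            duplicates.append(i)  # redundant: identical to an earlier document
        else:
            seen[key] = i
    return {
        "duplicate_indices": duplicates,
        "quality_score": 1 - len(duplicates) / len(documents),
        "estimated_tokens": int(total_words * tokens_per_word),
    }

report = data_quality_report(["Hello  world", "hello world", "A new, unique document"])
```

A real version would of course use semantic similarity rather than exact hashing, and a proper tokenizer for the token estimate, but even this simple pass surfaces redundancy before compute is spent.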
Such a tool would not only save OpenAI significant computational resources by discouraging unnecessarily large or inefficient training datasets, but would also lead to more focused and effective training outcomes. Additionally, it would help developers avoid “data inflation,” which often results from adding excessive information without assessing its relevance.
I strongly believe this feature could be a step towards more ethical AI development, and I would love to hear the community’s thoughts and suggestions on this idea.
Thank you for considering this suggestion!