Seeking serious beta testers for TRL5 validation: on-prem dataset cleaning pipeline using the OpenAI API

Hi everyone,

I’m looking for a small number of serious technical beta testers for Purify Factory, a model-agnostic on-prem pipeline for cleaning noisy text datasets at scale before downstream AI use.

This is not an open-source release and not a casual “try it if you feel like it” beta.
This phase is meant to support TRL5 validation, so I’m specifically looking for testers who can run real, structured tests and provide useful, reproducible feedback.

What Purify Factory does:

  • cleans noisy multilingual text datasets

  • processes JSONL datasets with sentence or text fields

  • produces auditable output with original text, cleaned text, token usage, cost, and provider metadata

  • supports multiple providers: OpenAI, Anthropic, Gemini, and local backends

What I need from testers:

  • Linux x86_64 environment

  • willingness to test on a real dataset, not toy samples

  • at least 1,000 records; 5,000+ preferred for stronger validation

  • ability to report installation issues, runtime behavior, failure modes, output quality, and edge cases in a structured way

  • willingness to share feedback that is actually usable for certification and release hardening
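
For anyone checking whether a dataset fits the profile above before requesting a license, a rough pre-flight check could look like the sketch below. It is not part of Purify Factory: the `preflight` function is hypothetical, and the only things it takes from this post are the JSONL format, the `sentence` / `text` field names, and the 1,000-record threshold. The exact schema the tool expects is an assumption.

```python
import json
from pathlib import Path

# Field names taken from the post; the exact schema is an assumption.
TEXT_FIELDS = ("sentence", "text")


def preflight(path, min_records=1000):
    """Count usable records in a JSONL file and list the lines that are not usable."""
    usable = 0
    problems = []
    with Path(path).open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # blank lines are skipped, not counted as problems
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                problems.append((lineno, "invalid JSON"))
                continue
            if not isinstance(record, dict):
                problems.append((lineno, "not a JSON object"))
                continue
            # A record is usable if either field holds a non-empty string.
            if any(
                isinstance(record.get(field), str) and record[field].strip()
                for field in TEXT_FIELDS
            ):
                usable += 1
            else:
                problems.append((lineno, "no non-empty sentence/text field"))
    return {
        "usable": usable,
        "problems": problems,
        "meets_minimum": usable >= min_records,
    }
```

Calling `preflight("my_dataset.jsonl")` returns the usable record count, the line numbers of malformed rows, and whether the 1,000-record minimum is met. The list of problem lines can also serve as a starting point for structured edge-case reports.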

Important constraints:

  • the repo is the beta access point, not a source-code release

  • access requires a personal free beta license

  • the license is machine-specific

  • testers must use their own API key / credits

  • this is best suited for developers, data engineers, NLP practitioners, or technical teams already working with text preprocessing / LLM data pipelines

If you are interested and you can run a serious validation pass, the beta repo is here:

https://github.com/mentoratechnologies/PurifyFactory-Beta

I’m especially interested in feedback from people already using the OpenAI API in production or evaluation pipelines, because I want to understand how Purify Factory behaves in real data-preparation workflows, not just in isolated demos.

If you match this profile and want to participate, reply in the thread or reach out through the repo instructions.


This is preposterously dumb.

Download closed-source Linux software and let it scrape and transmit data about your system.
Develop your own dataset with thousands of records and provide it to someone else.
Use your own API credits for whatever the code wants to perform.
All to benefit nobody but a closed, for-profit entity that joined the forum two days before advertising a repo with one contributor and nothing else in it.

Oh, and most amazingly, you also write the system message: "This quality standard is defined by you through the system prompt — describe the cleaning rules you want to apply in natural language, and PurifyFactory applies them consistently and verifiably to every record in the dataset." So you get to be the prompt engineer for someone who can’t do that themselves in order to deliver their product.

The “feedback” is that this deserves a lock and a de-listing from the forum. My rate is $50 per 15-minute increment of my time; I’ll be sending the bill.
