Hi everyone,
I’m looking for a small number of serious technical beta testers for Purify Factory, a model-agnostic on-prem pipeline for cleaning noisy text datasets at scale before downstream AI use.
This is not an open-source release and not a casual “try it if you feel like it” beta.
This phase is meant to support TRL5 validation, so I’m specifically looking for testers who can run real, structured tests and provide useful, reproducible feedback.
What Purify Factory does:
- cleans noisy multilingual text datasets
- processes JSONL datasets with `sentence` or `text` fields
- produces auditable output with original text, cleaned text, token usage, cost, and provider metadata
- supports multiple providers, including OpenAI, Anthropic, Gemini, and local backends
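For a sense of what an input record looks like, here is a minimal sketch of a JSONL pre-flight check. The `sentence`/`text` field names come from the description above; everything else (the helper name, the sample record) is illustrative, not part of Purify Factory itself:

```python
import json

def validate_record(line: str) -> bool:
    """Check that one JSONL line is a JSON object carrying a
    'sentence' or 'text' field, as expected by the pipeline."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    return isinstance(record, dict) and ("sentence" in record or "text" in record)

# Hypothetical noisy input record of the kind the pipeline would clean.
sample_input = '{"text": "Ths is  a noizy sentnce!!"}'
print(validate_record(sample_input))  # True
```

Running a check like this over your dataset before submitting it should surface malformed lines early, before they show up as pipeline failures.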
What I need from testers:
- a Linux x86_64 environment
- willingness to test on a real dataset, not toy samples: at least 1,000 records, with 5,000+ preferred for stronger validation
- ability to report installation issues, runtime behavior, failure modes, output quality, and edge cases in a structured way
- willingness to share feedback that is actually usable for certification and release hardening
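If you want to confirm your dataset meets the size threshold before starting, a line count is enough, since JSONL stores one record per line. This is a generic sketch (the throwaway demo file stands in for your real dataset path):

```python
import json
import os
import tempfile

def count_records(path: str) -> int:
    """Count non-empty lines (i.e. JSONL records) in a dataset file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())

# Demo on a throwaway file; point this at your real dataset instead.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    for i in range(3):
        f.write(json.dumps({"text": f"record {i}"}) + "\n")
    path = f.name

print(count_records(path))  # 3
os.remove(path)
```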
Important constraints:
- the repo is the beta access point, not a source-code release
- access requires a personal free beta license
- the license is machine-specific
- testers must use their own API key / credits
- this is best suited for developers, data engineers, NLP practitioners, or technical teams already working with text preprocessing / LLM data pipelines
If you are interested and you can run a serious validation pass, the beta repo is here:
https://github.com/mentoratechnologies/PurifyFactory-Beta
I’m especially interested in feedback from people already using the OpenAI API in production or evaluation pipelines, because I want to understand how Purify Factory behaves in real data-preparation workflows, not just in isolated demos.
If you match this profile and want to participate, reply in the thread or reach out through the repo instructions.