Text normalization

Hi there, I am trying to use OpenAI to normalize a bilingual corpus and turn it into a cleaned-up dataset for training AutoML for machine translation. Does anyone here have experience with this, or could someone help me figure out the right prompts? It's about 50k lines (deleting duplicates, normalization, tokenization, …). Thanks

I guess whichever model would work best for this task :blush:

I wonder whether I need to go through each of the 5-7 steps one by one, teaching the model each in turn, or whether there is a shorter way. I am also wondering about the prompt itself and the settings: how should they be written so that the system learns the task?

E.g.
Shorten segments to one sentence only (working with a bilingual Excel sheet). A sentence should have no more than 40 words/tokens.
Remove tags
Standardize text
Expand contractions
Tokenize
Remove punctuation
Remove non-standard characters

Would this be too much to ask of any of these models?
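
For illustration, most of the steps listed above could also be sketched as plain Python rules, without any model. This is only a rough sketch under assumptions: the contraction table is a tiny hypothetical sample, "standardize" is taken to mean Unicode NFKC plus lowercasing, tags are assumed to look like `<...>`, and tokenization is a simple whitespace split. Splitting a segment into single sentences is not covered; only the 40-word limit is checked.

```python
import re
import string
import unicodedata

# Hypothetical, deliberately tiny contraction table; a real one would be much longer.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "it's": "it is",
    "i'm": "i am",
}

MAX_WORDS = 40  # limit taken from the list above

TAG_RE = re.compile(r"<[^>]+>")  # assumes tags look like <b>, </b>, <br/>
PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def normalize_segment(text):
    """Apply the listed steps to one segment and return its tokens."""
    text = TAG_RE.sub(" ", text)                    # remove tags
    text = unicodedata.normalize("NFKC", text)      # standardize Unicode forms
    text = text.replace("\u2019", "'").lower()      # straight apostrophes, lowercase
    for short, full in CONTRACTIONS.items():        # expand contractions
        text = re.sub(rf"\b{re.escape(short)}\b", full, text)
    text = text.translate(PUNCT_TO_SPACE)           # replace ASCII punctuation with spaces
    # Drop anything that is not a letter, digit, or whitespace (non-standard characters).
    text = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
    return text.split()                             # whitespace tokenization


def within_limit(tokens):
    """True if the segment has no more than MAX_WORDS tokens."""
    return len(tokens) <= MAX_WORDS
```

With these rules, `normalize_segment("<b>It’s</b> a TEST, don't worry!")` returns `['it', 'is', 'a', 'test', 'do', 'not', 'worry']`. Contractions missing from the table are split at the apostrophe instead of expanded, so the table would need to be filled in for each language.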

OK, I see, so I would need to get on the waiting list there. Have you tried it already?

Got you. So a developer with experience and access to Codex?

The reason for short segments is probably to keep things simple so that the neural network can be trained more effectively with less input data.

Do you think it would be possible to do this with prompts in the Playground?
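
As a starting point for experimenting, a few-shot prompt could be assembled along these lines and pasted into the Playground. Everything here is an assumption for illustration: the instruction wording, the two example pairs, and the `Input:`/`Output:` layout are hypothetical and untested, and the helper only builds a string; it does not call any API.

```python
# Hypothetical instruction text; the wording is an assumption, not a tested prompt.
INSTRUCTION = (
    "Normalize each segment: remove tags, expand contractions, "
    "remove punctuation, and lowercase the text."
)

# Made-up example pairs showing the desired input -> output behaviour.
EXAMPLES = [
    ("<b>Hello</b>, world!", "hello world"),
    ("I can't come today.", "i cannot come today"),
]


def build_prompt(segment):
    """Return a few-shot prompt that ends where the model should continue."""
    lines = [INSTRUCTION, ""]
    for raw, clean in EXAMPLES:
        lines += [f"Input: {raw}", f"Output: {clean}", ""]
    lines += [f"Input: {segment}", "Output:"]
    return "\n".join(lines)


print(build_prompt("<i>Don't</i> stop!"))
```

One prompt per segment keeps each request short, but for about 50k lines that means about 50k requests, so a rule-based pass may be worth trying first for the mechanical steps.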

The idea is then to be able to upload an Excel file with input segments and get normalized text back in the target segments.
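
A minimal sketch of that round trip, assuming the Excel sheet is first exported to a two-column CSV (source segment, target segment) with no header row. The function names are hypothetical, and the duplicate check is an assumption too: two pairs count as duplicates if they match after lowercasing and collapsing whitespace.

```python
import csv
import io


def load_pairs(csv_text):
    """Read (source, target) pairs from two-column CSV text; skip short rows."""
    reader = csv.reader(io.StringIO(csv_text))
    return [(row[0].strip(), row[1].strip()) for row in reader if len(row) >= 2]


def drop_duplicates(pairs):
    """Keep the first occurrence of each pair, ignoring case and extra spaces."""
    seen = set()
    unique = []
    for source, target in pairs:
        key = (" ".join(source.lower().split()), " ".join(target.lower().split()))
        if key not in seen:
            seen.add(key)
            unique.append((source, target))
    return unique


def dump_pairs(pairs):
    """Write the pairs back out as CSV text, ready to reopen in Excel."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(pairs)
    return out.getvalue()
```

A normalization step for each segment would slot in between `drop_duplicates` and `dump_pairs`. For a real file, the text would come from `open(path, newline="", encoding="utf-8")` instead of an in-memory string.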