Hi there, I am trying to use OpenAI to normalize a bilingual corpus and turn it into a cleaned-up data set for training AutoML for machine translation. Does anyone here have experience, or could anyone help me figure out the right prompts? It's about 50k lines (deleting duplicates, normalization, tokenization, …). Thanks!
I guess whatever model would work best for this task.
I wonder if I need to go through each of the 5-7 steps one by one, teaching the model, or if there is a shorter way. On the other side, I am wondering about the prompt itself and the settings: how to write them so the system learns.
Shorten segments to one sentence only (I'm working with a bilingual Excel sheet). A sentence should have no more than 40 words / tokens.
Remove non-standard characters
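For what it's worth, the mechanical steps (deduplication, whitespace/character cleanup, length filtering) might not need a model at all; a plain script can handle them. Here's a minimal sketch, assuming the sheet has been loaded as (source, target) pairs; the naive whitespace tokenization and the punctuation whitelist are my assumptions, not anything from this thread:

```python
import re
import unicodedata

def clean_pairs(pairs, max_tokens=40):
    """Deduplicate, normalize, and length-filter (source, target) pairs."""
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        # Unicode-normalize (e.g. non-breaking spaces) and collapse whitespace
        src = " ".join(unicodedata.normalize("NFKC", src).split())
        tgt = " ".join(unicodedata.normalize("NFKC", tgt).split())
        # Strip non-standard characters; keep letters, digits,
        # common punctuation, and spaces (adjust the whitelist as needed)
        src = re.sub(r"[^\w\s.,;:!?'\"()\-%€$]", "", src)
        tgt = re.sub(r"[^\w\s.,;:!?'\"()\-%€$]", "", tgt)
        # Drop over-long segments (naive whitespace token count)
        if len(src.split()) > max_tokens or len(tgt.split()) > max_tokens:
            continue
        # Drop empty rows and case-insensitive duplicates
        key = (src.lower(), tgt.lower())
        if key in seen or not src or not tgt:
            continue
        seen.add(key)
        cleaned.append((src, tgt))
    return cleaned
```

Splitting multi-sentence segments into single sentences is the one step where a model (or a sentence splitter) would actually add value, since alignment between the two languages has to be preserved.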
Would this be crazy stuff for any of these models?
Ok, I see, so I would need to get on the waiting list there. Have you tried it already?
Got you. So a developer with experience and access to Codex?
The reason for short segments is probably to keep things simple so that the neural network can be trained more effectively with less input data.
Do you think it would be possible to do it with prompts in the Playground?
The idea is then to be able to upload an Excel sheet with the input segments and get normalized text back in the target segments.
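If the model does end up doing the normalization, one workable pattern is to batch a handful of rows per request with a fixed instruction, rather than teaching each of the 5-7 steps separately. A hypothetical prompt builder (the instruction wording, the tab-separated format, and the 40-word limit are just my reading of this thread):

```python
# Hypothetical instruction text; tune the wording in the Playground first.
INSTRUCTION = (
    "Normalize the following bilingual segments: split them into single "
    "sentences of at most 40 words, remove non-standard characters, and "
    "keep source and target aligned. Return one source<TAB>target pair per line."
)

def build_prompt(pairs):
    """Build one Playground/API prompt for a small batch of segment pairs."""
    lines = [f"{src}\t{tgt}" for src, tgt in pairs]
    return INSTRUCTION + "\n\n" + "\n".join(lines)
```

Keeping batches small (say, 10-20 rows) makes it easier to check that the model returns exactly one aligned pair per input row before scaling to the full 50k lines.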