Need help to translate text exceeding token window

rubyrain399 · October 11, 2023, 12:44pm

I am trying to write a program that translates documents from whatever language they are written in (among the supported languages, of course) into English, while keeping their structure (lists, comma-separated values, paragraphs, etc.). Working with a large text requires chunking, but chunking might ruin the structure of the document. Should I just write some code to “cut” the text better, or is there another approach to translating a big text?

udm17 · October 11, 2023, 12:47pm

Given the limit on the context window size, the best approach is to chunk the text into smaller pieces and translate them one at a time.
If you’re worried about context loss, you could overlap a piece (say, a paragraph) between consecutive chunks and remove the duplicate in the final join. That should reduce some of the context loss.
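A minimal sketch of the chunk-with-overlap idea, under some assumptions: paragraphs are separated by blank lines, and character count stands in for a real token count (you would swap in a tokenizer for your model). The function name `chunk_paragraphs` and the parameters `max_chars` and `overlap` are made up for illustration.

```python
def chunk_paragraphs(text, max_chars=2000, overlap=1):
    """Split text on blank lines into chunks made of whole paragraphs.

    Every chunk after the first starts with the last `overlap` paragraphs
    of the previous chunk, so the translator sees some prior context.
    Returns a list of chunks, each chunk being a list of paragraphs.
    Note: max_chars is a rough proxy for a token budget (assumption).
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = []
    size = 0
    for para in paragraphs:
        # Close the current chunk when adding this paragraph would exceed the budget.
        if current and size + len(para) > max_chars:
            chunks.append(current)
            # Carry the trailing paragraphs over as context for the next chunk.
            current = current[-overlap:] if overlap else []
            size = sum(len(p) for p in current)
        current.append(para)
        size += len(para)
    if current:
        chunks.append(current)
    return chunks
```

Because the cuts only ever fall between paragraphs, lists and CSV-like blocks inside a paragraph stay in one piece. A single paragraph longer than `max_chars` is still emitted whole, so it would need separate handling.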

rubyrain399 · October 11, 2023, 1:20pm

Thank you for the response. How would you suggest removing the overlaps later?

udm17 · October 12, 2023, 5:44am

I know what I’m proposing is not a sure-shot solution, but if you know that, say, the last 3 lines are being overlapped, you could remove them from the following text and then ask GPT to smooth it out for you (e.g. “I have two texts in xxx language with a small overlap between them. Could you smooth it out for me?”).

Maybe this will work for you?
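If the overlap is a known number of whole paragraphs, the removal step can be sketched as below. This assumes the translation keeps the blank-line paragraph breaks of the input, which a model will not always do, so check the paragraph counts before trusting it. The name `join_translated` and the `overlap` parameter are hypothetical.

```python
def join_translated(translated_chunks, overlap=1):
    """Join translated chunks, dropping the repeated leading paragraphs.

    `translated_chunks` is a list of strings, one per translated chunk.
    Every chunk after the first is assumed to begin with `overlap`
    paragraphs that were already translated at the end of the previous chunk.
    """
    merged = []
    for i, chunk in enumerate(translated_chunks):
        paragraphs = [p.strip() for p in chunk.split("\n\n") if p.strip()]
        if i > 0:
            # Skip the context paragraphs that duplicate the previous chunk.
            paragraphs = paragraphs[overlap:]
        merged.extend(paragraphs)
    return "\n\n".join(merged)
```

For example, `join_translated(["a\n\nb", "b\n\nc"])` keeps `b` only once. The optional smoothing pass through GPT would then run on the joined text, or just on the paragraphs around each seam.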
