I am trying to write a program that translates documents from whatever language they are written in (among the supported languages, of course) into English, while keeping their structure (lists, comma-separated values, paragraphs, etc.). Working with a large text requires chunking, but that might ruin the structure of the document. Should I just write some code to “cut” the text better, or is there another approach to translating a big text?
Given the limit on the context window size, the best approach would be to chunk the text into smaller pieces and then translate each piece.
If you’re worried about context loss, you could overlap a piece (say, a paragraph) between consecutive chunks when translating and then remove it in the final join. That might keep some of the context loss from happening.
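To make the idea concrete, here is a minimal sketch of chunking on paragraph boundaries with a one-paragraph overlap. The function names, the `max_chars` budget, and the assumption that paragraphs are separated by blank lines are all mine, not something from the question; a real limit would be counted in tokens rather than characters, and a single paragraph longer than the budget is not split here.

```python
import re

def split_paragraphs(text):
    # Assumption: paragraphs are separated by one or more blank lines.
    return [p for p in re.split(r"\n\s*\n", text.strip()) if p.strip()]

def chunk_with_overlap(text, max_chars=2000, overlap=1):
    """Pack whole paragraphs into chunks of roughly max_chars characters.

    Every chunk after the first starts with the last `overlap`
    paragraphs of the previous chunk, so the translator sees some context.
    Returns a list of chunks, each chunk being a list of paragraphs.
    """
    chunks = []
    current = []
    size = 0
    for para in split_paragraphs(text):
        # Close the current chunk when the next paragraph would overflow it.
        if current and size + len(para) > max_chars:
            chunks.append(current)
            # Carry the trailing paragraphs over as context for the next chunk.
            current = current[-overlap:] if overlap else []
            size = sum(len(p) for p in current)
        current.append(para)
        size += len(para)
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent for translation as `"\n\n".join(chunk)`. Because the cuts only ever fall between paragraphs, lists and CSV-like lines inside a paragraph stay together.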
Thank you for the response. How would you suggest removing the overlaps later?
I know what I'm proposing is not a one-shot solution, but if you know that, say, the last 3 lines are being overlapped, you could remove them from the following text and then ask GPT to smooth it out for you (e.g. "I have two texts in xxx language that have a small overlap between them. Could you smooth it out for me?").
Maybe this will work for you?
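If the overlap is a known number of paragraphs, the removal step could look roughly like the sketch below. It assumes the translation keeps the paragraph breaks (blank lines) of the input, which is not guaranteed, so it is worth checking the paragraph counts before trusting it. The function name `join_translated` is just illustrative.

```python
import re

def join_translated(translated_chunks, overlap=1):
    """Join translated chunks, dropping the overlapped leading paragraphs.

    Assumption: every chunk after the first was sent with `overlap`
    context paragraphs at its start, and the translator preserved
    the blank-line paragraph breaks.
    """
    merged = []
    for i, chunk in enumerate(translated_chunks):
        paragraphs = [p for p in re.split(r"\n\s*\n", chunk.strip()) if p.strip()]
        if i > 0:
            # Skip the context paragraphs already covered by the previous chunk.
            paragraphs = paragraphs[overlap:]
        merged.extend(paragraphs)
    return "\n\n".join(merged)
```

For example, `join_translated(["a\n\nb", "b\n\nc"])` gives `"a\n\nb\n\nc"`. The seams can still read a bit rough, which is where asking GPT to smooth the joined text comes in.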