Can we translate whole document(PDF,doc,execil) form one langues to other?

beltbasya · April 12, 2024, 6:17pm

Hi, I am new to OpenAI and I have a question: Can we translate any document from one language to another using OpenAI by making API calls? Like:

Provide any document to OpenAI through API.
Translate it into the desired language.
Receive the document back in its original format with translated language/text?

Or do we need to extract the text from the document and then feed it to OpenAI and get translated data?

vasyl · April 12, 2024, 8:48pm

Hi @beltbasya , sure - translation is possible using API to any of the available models (except embedding ones :). You have to take into account the context window limitations (the biggest context window of 128k tokens currently has the GPT-4 Turbo) and if your text is bigger you should chunk it and loop through it with multiple API calls. OpenAI doesn’t have any specific model for translation (like Google does) but for example Whisper can transcribe speech into text and translate many languages into English. Based on my experience, I’d suggest you to scrape the text from your PDFs on your side and send a clean text via API, knowing what you’re sending. Concerning your point nr. 3 - you receive text back from the OpenAI end-point, not a PDF file. You can use Python NLTK or other libraries or services to convert it into other formats. Hope that helps.

beltbasya · April 13, 2024, 5:04am

Thanks for your time @vasyl. So, I need to extract the text from my document (PDF, Excel, DOC), send this text via API to get the translated text back, and then create the document on our side, correct? I can’t just send a document and get back the translated version, can I?

vasyl · April 13, 2024, 9:53am

yep. the OpenAI API receives text and returns text. You can upload files to OpenAI - Link, but that’s not what you’re looking for- it’s for use in Assistants. In your case, use Python to extract the content

with open('/tmp/metadata.pdf', 'wb') as f:
    f.write(response.content)

then you’d send it to OpenAI, you’d get the text response back, and would convert it back into your format, like .doc or pdf using PyPDF2, for example.

beltbasya · April 13, 2024, 1:13pm

I am getting more confused; let me clarify what I want. I want to translate a document from one language to another. So, my question is: Do I need to send its text and then get back the translated text? Or can I upload the whole document and get back the translated document in the same format that I uploaded? I understand that I can get the translated text, but can I send the document and get the translated document? Thank you.

jr.2509 · April 13, 2024, 1:30pm

The API itself only processes the text. So prior to sending the text to the API you need to add a step to extract it from the document. Likewise, you have to add steps to your script to convert the translated text back into a formatted PDF or Word. There are python libraries like the ones @vasyl has referenced that you can leverage for this.

vasyl · April 13, 2024, 6:54pm

The short answer is - Only text. Whatever file format you have- you first extract the text and send it to OpenAI model via API which also returns - text.

Topic		Replies	Views
Can GPT 4 transalate a document(ppt or word) from one language to another without changing the format of the document? API gpt-4	1	2874	October 5, 2023
Can the OpenAI integration consume a document as part of a prompt? API api	5	840	December 15, 2023
Can I use word document as input to the Grammatical error correction API in OpenAI? API	0	434	July 8, 2022
Call GPT-4 APIs to translate languages from Java micro-service API gpt-4	1	164	April 18, 2024
Can I use my own pdf/text documents to train to get an article out API	6	4280	December 23, 2023

Can we translate whole document(PDF,doc,execil) form one langues to other?

Related Topics