Hello,
I am translating a large file containing XML-like markup from German to English by breaking it into 50-line sections and translating each section separately. This works well for somewhere between 10 and 30 requests, but then the API starts returning partial answers.
Here is the full script. The problem could still be a logic bug on my side. Please forgive my Python; it is not a language I usually use.
import os
from openai import OpenAI
import time
from dotenv import dotenv_values
client = OpenAI(api_key=dotenv_values()["API_KEY"])
# Only translate first section
dry_run = False
def split_file_content(lines):
    lines_per_section = 50
    # Join each section (list of lines) into a single string
    sections = [''.join(lines[i:i + lines_per_section]) for i in range(0, len(lines), lines_per_section)]
    return sections
def read_file(file_path):
    print(f"Starting translation of {os.path.basename(file_path)}")
    try:
        # Try to read the file using 'ISO-8859-1' encoding, commonly used for German text
        with open(file_path, 'r', encoding='ISO-8859-1') as file:
            lines = file.readlines()
    except UnicodeDecodeError:
        # If 'ISO-8859-1' doesn't work, try 'Windows-1252'
        print(f"ISO-8859-1 decoding failed for {file_path}. Trying 'Windows-1252'.")
        with open(file_path, 'r', encoding='Windows-1252') as file:
            lines = file.readlines()
    return lines
# Translates a section with ChatGPT
def translate_section(input):
    print("Translating section!")
    # Record the start time
    start_time = time.time()
    # Initial system prompt
    system_message = {"role": "system", "content": "You are a helpful assistant."}
    # Initial user prompt with the text to translate
    user_message = {
        "role": "user",
        "content": f"Translate all German (except city names) into English while preserving the code logic. Do not translate if you are not sure if it is German or not. Be concise. Output only plain text. Do not output markdown.\n\n{input['section']}"
    }
    # All sleeping done in this function is for the purpose of testing whether the rate limit might be getting hit...
    print("Sleeping for 10 seconds...")
    time.sleep(10)
    min_lines = 45
    response = None
    while response is None or len(response.choices[0].message.content.splitlines()) < min_lines:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[system_message, user_message],
            max_tokens=4000,
            temperature=0.2
        )
        # An attempt to see if it will work if I retry the issue
        if len(response.choices[0].message.content.splitlines()) < min_lines:
            print(f"Response only had {len(response.choices[0].message.content.splitlines())} lines which is less than {min_lines}; continuing the loop. Sleeping for 30 secs")
            print("Writing the input to disk...")
            with open("input.txt", 'a', encoding='utf-8') as file:
                file.write(input['section'])
                file.write("\n")
                file.flush()  # Ensure that the buffer is flushed to disk
            # Again, trying to see if it's a rate-limit thing
            time.sleep(30)
            continue
    message_response = response.choices[0].message.content
    with open(input['translated_name'], 'a', encoding='utf-8') as file:
        # Calculate and print the elapsed time in seconds
        elapsed_time = time.time() - start_time
        print(f"Writing temp file contents after {elapsed_time:.1f} seconds")
        print(f"Response had {len(response.choices[0].message.content.splitlines())} lines")
        file.write(message_response)
        file.write("\n")
        file.flush()  # Ensure that the buffer is flushed to disk
    return message_response
# Translates all sections with translate_section
def translate_all_sections(sections, translated_name):
    return list(map(lambda section: translate_section({"section": section, "translated_name": translated_name}), sections))
# Processes a file (translates and overwrites it)
def process_file(file):
    file_data = read_file(file)
    sections = split_file_content(file_data)
    file_name, file_extension = os.path.splitext(file)
    translated_file_name = f"{file_name}_translated{file_extension}"
    if dry_run:
        translated_sections = translate_all_sections([sections[0]], translated_file_name)
    else:
        translated_sections = translate_all_sections(sections, translated_file_name)
    with open(translated_file_name, 'w', encoding='utf-8') as file:
        print("Writing temp file contents")
        for section in translated_sections:
            file.write(section)
            file.write("\n")
        file.flush()  # Ensure that the buffer is flushed to disk
    print(len(translated_sections))
# Main function to process all files in all subdirectories
def process_all_files_in_directory(root_directory):
    i = 0
    for dirpath, dirnames, filenames in os.walk(root_directory):
        for filename in filenames:
            if dry_run == True and i == 1:
                print("Stopped early for dry-run")
                return
            file_path = os.path.join(dirpath, filename)
            process_file(file_path)
            i = i + 1
    print("Completed all translations")

# Run the script on the current directory and its subdirectories
if __name__ == "__main__":
    root_directory = './english'
    process_all_files_in_directory(root_directory)
The ./english directory contains the files that are partially in German. One is 13427 lines long and the other is 3193 lines long. They are written in a markup format I do not recognize, but it seems to work like XML tags (the entries are for the Great Library in a video game called Call to Power II). Here is a snippet of one of the entries:
# Advanced Composites # original
[ADVANCE_ADVANCED_COMPOSITES_PREREQ]
Voraussetzung:
<L:DATABASE_ADVANCES,ADVANCE_CONSERVATION>Umweltschutz<e>
<L:DATABASE_ADVANCES,ADVANCE_SUPERSONIC_FLIGHT>Überschallflüge<e>
[END]
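Since each entry in the snippet above closes with an [END] line, one option is to cut sections only at entry boundaries instead of at a fixed line count, so that no entry is split across two requests. This is only a sketch: the function name split_on_entry_end and its parameters are hypothetical, and it assumes every entry really does end with a line that is exactly [END].

```python
def split_on_entry_end(lines, max_lines=50, end_marker="[END]"):
    """Group lines into sections, cutting only after an end_marker line.

    Assumption: every entry ends with a line equal to end_marker.
    A section may grow past max_lines if a single entry is long.
    """
    sections = []
    current = []
    for line in lines:
        current.append(line)
        # Close the section only once it is long enough AND we are at an entry boundary.
        if len(current) >= max_lines and line.strip() == end_marker:
            sections.append(''.join(current))
            current = []
    # Keep whatever is left over as the final (possibly short) section.
    if current:
        sections.append(''.join(current))
    return sections
```

It takes the same list of lines that read_file returns, so it could stand in for split_file_content if the [END] assumption holds for the whole file.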
When I run the script against this file, it works for a seemingly random number of sections and then starts returning only 18 lines, even though the input section has the full 50 lines. The missing lines also appear to be cut from the beginning of the response, not the end.
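One thing worth ruling out is the retry check itself. The script compares every response against a fixed min_lines = 45, but 13427 and 3193 are not multiples of 50, so the last section of each file has only 27 and 43 lines respectively, and the while loop would never exit for those unless the model returned more lines than it was given. Below is a minimal sketch of a check that compares the output against the section's own line count and gives up after a few attempts. The names translate_with_retries, min_ratio and max_attempts are hypothetical, and translate stands for any function that takes the section text and returns the translated text.

```python
def translate_with_retries(section, translate, min_ratio=0.9, max_attempts=3):
    """Retry until the output has at least min_ratio of the input's line count.

    `translate` is any callable mapping section text to translated text
    (for example, a thin wrapper around the API call). Returns a tuple of
    (best output seen, number of attempts made).
    """
    expected = len(section.splitlines())
    best = None
    for attempt in range(1, max_attempts + 1):
        output = translate(section)
        got = len(output.splitlines())
        # Remember the longest output in case every attempt falls short.
        if best is None or got > len(best.splitlines()):
            best = output
        if got >= expected * min_ratio:
            return output, attempt
    # Bounded: never loops forever, unlike a bare `while` on a fixed threshold.
    return best, max_attempts
```

Logging the expected and actual line counts for each attempt would also show whether the 18-line responses line up with particular sections.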
Here is a run showcasing the 18-line issue:
Starting translation of Great_Library.txt
Translating section!
Sleeping for 10 seconds...
Writing temp file contents after 17.5 seconds
Response had 50 lines
Translating section!
Sleeping for 10 seconds...
Writing temp file contents after 19.0 seconds
Response had 49 lines
Translating section!
Sleeping for 10 seconds...
Writing temp file contents after 24.4 seconds
Response had 50 lines
Translating section!
Sleeping for 10 seconds...
Writing temp file contents after 22.5 seconds
Response had 48 lines
Translating section!
Sleeping for 10 seconds...
Writing temp file contents after 20.5 seconds
Response had 46 lines
Sleeping for 10 seconds...
Writing temp file contents after 22.5 seconds
Response had 48 lines
Translating section!
Sleeping for 10 seconds...
Writing temp file contents after 20.5 seconds
Response had 46 lines
Writing temp file contents after 22.5 seconds
Response had 48 lines
Translating section!
Sleeping for 10 seconds...
Writing temp file contents after 20.5 seconds
Response had 46 lines
Response had 48 lines
Translating section!
Sleeping for 10 seconds...
Writing temp file contents after 20.5 seconds
Response had 46 lines
Translating section!
Sleeping for 10 seconds...
Writing temp file contents after 20.5 seconds
Response had 46 lines
Translating section!
Sleeping for 10 seconds...
Writing temp file contents after 22.4 seconds
Response had 46 lines
Translating section!
Sleeping for 10 seconds...
Writing temp file contents after 22.4 seconds
Response had 50 lines
Translating section!
Translating section!
Sleeping for 10 seconds...
Writing temp file contents after 22.4 seconds
Response had 50 lines
Translating section!
Writing temp file contents after 22.4 seconds
Response had 50 lines
Translating section!
Sleeping for 10 seconds...
Writing temp file contents after 22.0 seconds
Response had 48 lines
Translating section!
Response had 50 lines
Translating section!
Sleeping for 10 seconds...
Writing temp file contents after 22.0 seconds
Response had 48 lines
Translating section!
Sleeping for 10 seconds...
Writing temp file contents after 34.7 seconds
Sleeping for 10 seconds...
Writing temp file contents after 22.0 seconds
Response had 48 lines
Translating section!
Sleeping for 10 seconds...
Writing temp file contents after 34.7 seconds
Response had 48 lines
Translating section!
Sleeping for 10 seconds...
Writing temp file contents after 34.7 seconds
Response had 49 lines
Translating section!
Sleeping for 10 seconds...
Sleeping for 10 seconds...
Writing temp file contents after 34.7 seconds
Response had 49 lines
Translating section!
Sleeping for 10 seconds...
Response only had 18 lines which is less than 45; continuing the loop. Sleeping for 30 secs
Writing the input to disk...
Response had 49 lines
Translating section!
Sleeping for 10 seconds...
Response only had 18 lines which is less than 45; continuing the loop. Sleeping for 30 secs
Writing the input to disk...
Translating section!
Sleeping for 10 seconds...
Response only had 18 lines which is less than 45; continuing the loop. Sleeping for 30 secs
Writing the input to disk...
Response only had 18 lines which is less than 45; continuing the loop. Sleeping for 30 secs
Writing the input to disk...
Am I running into an input or output token limit? I do not see how I could be, given that each request contains only a single system message and a single user message.
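The response object itself can help answer the token-limit question. Each choice carries a finish_reason, which is "length" when generation stopped because max_tokens was reached and "stop" when the model ended on its own, and response.usage reports the prompt and completion token counts. A small sketch follows; the helper name summarize_completion is hypothetical and it only reads fields, it does not call the API.

```python
def summarize_completion(response):
    """Collect the fields that show whether a chat completion was truncated."""
    choice = response.choices[0]
    usage = response.usage
    return {
        # "length" means the max_tokens cap was hit; "stop" means a natural end.
        "finish_reason": choice.finish_reason,
        "hit_output_limit": choice.finish_reason == "length",
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "output_lines": len(choice.message.content.splitlines()),
    }
```

Calling print(summarize_completion(response)) right after client.chat.completions.create(...) would show, for the 18-line responses, whether finish_reason is "stop" (the model chose to end early) or "length" (the 4000-token cap was hit).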
Could it be a logic bug, perhaps related to the lambda? I do not think so, but it is possible since (as mentioned) Python is not my most familiar language.
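On the lambda question: the lambda receives section as its own parameter rather than capturing a loop variable, so Python's late-binding closure pitfall should not apply here, and list(map(...)) calls the function once per section, in order. The sketch below puts the map-and-lambda form next to a plain loop so the two can be compared directly. The function names and the extra translate_section parameter are hypothetical and exist only so a stub can be passed in for testing.

```python
def translate_all_with_map(sections, translated_name, translate_section):
    # Same shape as the script: map + lambda, forced to run by list().
    return list(map(
        lambda section: translate_section({"section": section, "translated_name": translated_name}),
        sections,
    ))


def translate_all_with_loop(sections, translated_name, translate_section):
    # Equivalent explicit loop: one call per section, in order.
    results = []
    for section in sections:
        results.append(translate_section({"section": section, "translated_name": translated_name}))
    return results
```

If both forms give the same output with a stub that just echoes its input, the lambda is unlikely to be the source of the short responses.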
Thank you!