Need help creating a copy editor for novels and other long texts

Hi everyone,

I am very new to coding, so apologies if this seems very misguided and bad. Any advice here I would be grateful for. I have looked around and found various copy editing tools, but nothing quite like what I’m trying to create. I want to make a website where the user can submit a .txt file of basically any length (maybe up to 500,000 words, as there has to be a cap somewhere!) and get back a .txt file that has been copy edited.

I’ve been using the openai.Completion.create method. I tried the openai.Edit.create version and found it wasn’t very strong; it missed a lot of easy mistakes, such as “he” instead of “the” and whatnot.

I started by breaking up the submitted .txt files by paragraphs and making API calls with 1000-word chunks, then writing the results straight into a new .txt file. This worked pretty well, but if someone happens to submit a file that has gigantic paragraphs, then a submitted chunk of text will exceed the token limit. So now I’m trying to break the text down by character count, but I think this is screwing up my results, because now GPT keeps wanting to add content to the text instead of making simple corrections. I may try using the format shown in the grammar example, but I wanted to go ahead and ask for feedback before working on that. Is there a simpler way to get back results? I could not quite figure out how to submit an entire file to the API and then ask for results, so I’m trying this method for now.

import openai

def run_editor(key):
    #open the submitted file; the with-block closes it automatically
    with open("uploads/original.txt", "r", encoding='utf-8', errors="ignore") as f:
        original_text = f.read()

    #rebuild the text into one string so paragraphs are separated by blank lines
    submit_text = "\n\n".join(original_text.split("\n"))

    #opening with "w" truncates any previous edited.txt, so no separate clearing pass is needed
    with open("uploads/edited.txt", "w", encoding='utf-8', errors="ignore") as edited_text:
        #grab roughly the first 4000 characters of submit_text into submit_chunk,
        #delete the characters that were copied, and finish when submit_text is empty
        while submit_text:
            adjust = 0
            if len(submit_text) > 4000: #avoids an out-of-bounds error at the end
                #make sure not to end mid-word; also stop if we run out of text
                while 3999 + adjust < len(submit_text) and submit_text[3999 + adjust] != " ":
                    adjust += 1
            submit_chunk = submit_text[:4000 + adjust]
            submit_text = submit_text[4000 + adjust:]
            edited_text.write(openai_api(key, submit_chunk))


def openai_api(key, submitted_text):
    openai.api_key = key #passed in from the HTML page, or wherever

    prompt = "Act like a copyeditor and proofreader and edit this manuscript according to the Chicago Manual of Style. Focus on punctuation, grammar, syntax, typos, capitalization, formatting and consistency. Format all numbers according to the Chicago Manual of Style, spelling them out if necessary. Use italics and smart quotes. Ignore errors of fragmented sentences. Do not complete the end of the text. Begin here:\n\n"
    prompt += submitted_text

    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        temperature=0.2,
        max_tokens=2000,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0)
    return response['choices'][0]['text'] #grab only the text string from the response

Completion endpoints use completion engines. They (and GPT itself) are designed to continue producing text that matches the form of the prior input and the tokens already seen in the output.

text-davinci-003 is the model most highly trained to follow user instructions in various scenarios. However, it also understands the language of simpler fine-tunes based on text-processing functions.

Improve quality: """I started by breaking up the submitted .txt files by paragraphs and making API calls with 1000 word chunks, then writing the results straight into a new .txt file. This worked pretty well, but if someone happens to submit a file that has gigantic paragraphs, then a submitted chunk of text will exceed the token limit. So now I’m trying to break text down by character count, but I think this is screwing up my results, because now GPT keeps wanting to add content to the text instead of make simple corrections. I may try using the format shown in the grammar example but I just wanted to go ahead and ask for feedback before working on that. Is there a simpler way to get back results? I could not quite figure out how to submit an entire file to the API then ask for results, so I’m trying this method for now."""

AI: I started by breaking up the submitted .txt files into smaller chunks and making API calls with 1000 word chunks, then writing the results straight into a new .txt file. This worked well initially, but if a file had enormous paragraphs, then the submitted chunk of text would exceed the token limit. To solve this, I am now breaking down the text by character count, but I think this is causing some errors in the results I’m getting back. I may try to use the format shown in the grammar example instead for better results, but before I do that, I wanted to get some feedback. Is there a simpler way to get back results from the API? I am not sure what the best way is to submit an entire file if I can’t do it in one call.

If you start to get into reasoning and logic, you exceed where the model performs best. Commanding an AI that doesn’t follow commands is the problem here, along with the general drop in performance for all models as the input grows larger.

Your original prompt gives lots of direct instructions, addressed to somebody that doesn’t exist, to do what the AI already does; the example above is a rewritten pattern.

If you want to “chat” about your operations to be done, you must use a chat endpoint and a chat model like gpt-3.5-turbo.
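A minimal sketch of the same editing call against the chat endpoint (the function names and prompt wording here are illustrative, not a drop-in for your code; this assumes the 0.x Python SDK used elsewhere in the thread):

```python
def build_edit_messages(text):
    # The system message carries the instructions; the user message carries
    # only the manuscript text, so the model edits rather than continues it.
    system = ("You are a copy editor. Correct punctuation, grammar, typos, "
              "and capitalization per the Chicago Manual of Style. Return "
              "only the corrected text and do not add new content.")
    return [{"role": "system", "content": system},
            {"role": "user", "content": text}]

def edit_chunk_chat(key, text):
    import openai  # imported here so build_edit_messages stays testable offline
    openai.api_key = key
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=build_edit_messages(text),
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]
```

The key difference from the completion endpoint is that the instructions and the text to be edited live in separate messages, so the model is far less tempted to “continue” the manuscript.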


I see…

Okay, I tried restructuring the prompt how you suggested, and the result is much better. It’s still editing a bit too much, cutting out entire sentences without rewriting them. But it’s better. I will have to work on it a bit to see if I can get it tuned just right. Maybe temperature should just be 0?

I see that the chat completion uses GPT 3.5, which is fine. But I’m not so sure this approach is necessary, as copy editing shouldn’t require training the model on my specific sample. I just want it to use its past training on a style manual, as this is only for finding typos, missing commas, etc. The only reason I was considering using it was to have a clear separation of instructions and the sample text to be edited.

Any advice on doing this by submitting a file to the API? Or would that be pointless for the same reason (as there is no need to train the engine on the file’s specific style)? I tried installing and using LangChain, but I could not even get Python to recognize that the library was installed. (Spent like two hours trying to figure out the problem and got nowhere.)

A vector database does nothing for this particular case. It could only provide random examples of writing - while you could be providing specific examples of writing “in the style of” (although that technique doesn’t really work well anyway.)

There is no “submitting the file”. The API exposes exactly how much the model’s context length can take in, without obfuscation; anything you send must fit within it.

You want segments rewritten well? Isolate paragraphs or a few paragraphs if there is inter-paragraph context to inform the rewriting. As it seems you are doing.
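A minimal sketch of such a paragraph-level chunker (the function name and the 4000-character default are illustrative, chosen to mirror the code earlier in the thread):

```python
def chunk_paragraphs(text, limit=4000):
    """Group consecutive paragraphs into chunks of at most `limit`
    characters, so each API call gets whole paragraphs plus some
    inter-paragraph context. A single paragraph longer than `limit`
    becomes its own oversized chunk, to be split separately."""
    chunks = []
    current = ""
    for para in text.split("\n"):
        para = para.strip()
        if not para:
            continue  # skip blank lines between paragraphs
        candidate = (current + "\n\n" + para) if current else para
        if len(candidate) <= limit:
            current = candidate  # paragraph still fits in this chunk
        else:
            if current:
                chunks.append(current)  # flush the full chunk
            current = para  # start a new chunk with this paragraph
    if current:
        chunks.append(current)
    return chunks
```

Unlike cutting at a fixed character count, this never splits mid-sentence, which keeps the model from inventing a continuation for a truncated thought.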

If the default behavior of davinci seems brief, the output could be lengthened by more guidelines of what the copy editor will do.

gpt-3.5-turbo likes a specific length, and also is now crippled to repeat back almost exactly the form of what it got. That might work for you if your input is like the preferred output length. gpt-4 is particularly nerfed in this regard; 4000 tokens input “rewritten” is almost a summary back at you.

Another enhancement you can make is not to over-specify max_tokens, since that takes away from the amount that can be input. For a 4000-token model like davinci, if you are providing 2000 tokens of prompting plus text, then 2000 max_tokens of output gives enough room to just slightly expand the text. In code, you can handle an input that goes over by catching the error and then splitting your huge paragraph in the middle into two separate rewrites; it probably needs to be split somehow anyway.
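A rough sketch of that catch-and-split approach (`call_api` stands in for an existing request function like the openai_api one earlier in the thread; with the 0.x SDK, a context overflow raises openai.error.InvalidRequestError, so catching Exception here just keeps the sketch SDK-agnostic):

```python
def edit_with_split(chunk, call_api, min_len=200):
    """Edit `chunk` via `call_api`; if the request fails (e.g. the
    input exceeds the context length), split near the middle at a
    sentence or word boundary and edit each half recursively."""
    try:
        return call_api(chunk)
    except Exception:
        if len(chunk) <= min_len:
            raise  # too small to split further; re-raise the real error
        mid = len(chunk) // 2
        # prefer a sentence boundary, then any space, then a hard cut
        cut = chunk.rfind(". ", 0, mid)
        if cut == -1:
            cut = chunk.rfind(" ", 0, mid)
        if cut == -1:
            cut = mid
        return (edit_with_split(chunk[:cut + 1], call_api, min_len)
                + edit_with_split(chunk[cut + 1:], call_api, min_len))
```

Because each half is retried through the same function, a pathologically long paragraph keeps halving until every piece fits.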


@jtmoncri were you able to improve the editing behavior any further? I am wondering if any of the latest versions or enhancements to the models can produce better results now.