How to generate automatically applicable file diffs with ChatGPT?

Has anyone succeeded in getting ChatGPT to output suggested changes to a file in a form that can be applied to the file automatically?

Background: I experimented with this for a script that takes a file and a prompt describing what to change or extend in that file, and that uses the ChatGPT completion API. That’s something you can nicely use as an external tool in IntelliJ or for trying out automated development. But the only reliable way I could come up with so far is to make ChatGPT output the whole modified file again in a codeblock, which can be extracted and written back to the file (pending manual inspection, of course), with the following prompt fragment:

Please give a high level description in flowing text what changes to the first given file (main file) $mainfile are needed to fulfill your task and why they are necessary - e.g. what classes, identifiers, functions you need to add or modify, but absolutely no blocks.
Then output exactly one codeblock with the new contents of the whole file $mainfile surrounded by triple backticks, changed as described.
Important: your response must only include exactly one codeblock, not more, not less, except if you feel there is an error in your task and it cannot be fulfilled!

That does work, but that is obviously a waste of tokens that limits the size of the file and also sometimes encourages changes that aren’t really needed.
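For illustration, the extraction step could look roughly like the sketch below. This is not the actual script; the name `extract_single_codeblock` is made up here, and the code assumes the model wraps the file in an ordinary triple-backtick fence. It refuses to continue unless the response contains exactly one block, mirroring the rule in the prompt.

```python
import re

FENCE = "`" * 3  # a triple-backtick fence, built this way to keep the example readable

# Opening fence with optional language tag, lazily matched body, closing fence on its own line.
FENCE_RE = re.compile(
    rf"^{FENCE}[^\n]*\n(.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_single_codeblock(response: str) -> str:
    """Return the body of the only fenced code block in a model response."""
    blocks = FENCE_RE.findall(response)
    if len(blocks) != 1:
        # Zero blocks usually means the model refused; several mean it ignored the prompt.
        raise ValueError(f"expected exactly one code block, found {len(blocks)}")
    return blocks[0]
```

The returned text would then be written back to the file only after a manual look at the description that precedes the block.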

If you use the chat interface to ask for the necessary changes to a program, ChatGPT outputs the changes in a way that can easily be applied manually and is much briefer (“… add the function XYZ …”), but that’d be hard to process automatically. I tried to have it output the changes in patch file format or as unified diffs, but couldn’t get even ChatGPT-4 to produce something usable. Having it output line numbers to change didn’t work reliably either.

Do you have any ideas? Specifically, I’m wondering what kind of format I could get ChatGPT to output so that only the differences are given and a script / program could change the file accordingly, while it’s reasonably sure that the result is what ChatGPT actually meant. (Something like ‘Delete lines 35-39 and insert this instead…’ might work, but would produce unnoticed broken output if the line numbers are off, which they usually are. That’s why I first tried the unified diff format that has some context lines.) If possible, I’d like to work with the 3.5-turbo model.

Thanks a lot for any ideas!



It might be helpful to take a look at editing

As far as I know, capable models with a high degree of comprehension can usually handle editing. In fact, when I want to evaluate a model with a single prompt, I give it an editing task rather than asking a question.


Hi @kevin6 !

Thank you for your reply! Unfortunately, it seems the editing interface is only supported by older models like code-davinci-edit-001 and text-davinci-edit-001, which are marked as deprecated, and not by something more capable like gpt-3.5-turbo. I tried it a little in the playground, and it did some quite strange things to the code for tasks that gpt-3.5-turbo was easily able to do. So I still hope to find something doable with its chat interface.

Best regards,

I’ve attempted a few ways to make ChatGPT generate diffs that can be directly applied via the patch shell command.

Theoretically, the diff format would be ideal:

--- /path/to/original	timestamp
+++ /path/to/new	timestamp
@@ -1,3 +1,9 @@
+This is an important
+notice! It should
+therefore be located at
+the beginning of this
 This part of the
 document has stayed the
 same from version to
@@ -8,13 +14,8 @@
 compress the size of the

-This paragraph contains
-text that is outdated.
-It will be deleted in the
-near future.
 It is important to spell
-check this dokument. On
+check this document. On
 the other hand, a
 misspelled word isn't
 the end of the world.
@@ -22,3 +23,7 @@
 this paragraph needs to
 be changed. Things can
 be added after it.
+This paragraph contains
+important new additions
+to this document.

The main issue is that GPT models, because they operate on tokens, cannot accurately determine positions in the document and therefore cannot write a valid diff. I have attempted to provide line numbers on the source code that I send in the prompt. With that it gets better (but still not good enough), and then it may sometimes decide to write code with line numbers in it.
We would need a widely known diff format that doesn’t rely on line and column numbers (which doesn’t exist, to my knowledge).

For code, the only fully working method I could devise so far was to steer it to write python code that will completely overwrite the files with the code changes. I have an example where this is done on typescript files. Because I cannot include links in posts (why not?!), you have to look for the “jupyter-notebook-chatcompletion” repo on github and look for the file
test/notebooks/more-accurate-token-estimates.ipynb , where I used GPT-4 to implement the changes I wanted.

As you can see in that Jupyter Notebook, Example 1 and Example 2 are referred to in Example 3 to ensure it writes Python code that will apply the desired file changes. It works even better when you do an actual few-shot prompt with 3-4 examples (in which case you don’t instruct anything), but that eats up so many tokens that it doesn’t leave that much space for actual code. So I have, unfortunately, to rely on an instruction like “[…] and apply the changes by overwriting the files like you did in Example 1 and 2”.

Note that at least, in the case of structured files like JSON, YAML and XML, it will be smart enough to read the document into its respective object model, apply only the changes and then overwrite the file with those changes.
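The kind of code the model tends to write for a JSON file looks roughly like this. It is a minimal sketch, and the helper name `set_json_value` and the key path are invented for illustration. Note that a round trip through `json` normalizes the formatting, so the rewritten file can differ in whitespace from the original.

```python
import json


def set_json_value(text: str, path: list, value) -> str:
    """Load a JSON document, change one nested value, and serialize it again."""
    doc = json.loads(text)
    node = doc
    for key in path[:-1]:
        node = node[key]  # walk down to the parent of the target key
    node[path[-1]] = value
    return json.dumps(doc, indent=2) + "\n"
```

For example, `set_json_value(text, ["scripts", "test"], "vitest")` changes only that one entry and leaves the rest of the object model untouched.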

There’s a compromise that might work, but I haven’t had the time to try it yet. Just as ChatGPT understands that it can deserialize a JSON file, manipulate the resulting object and then serialize that object again, one could write a library that deserializes code files into something that can be manipulated and serialized again. So, for example, if ChatGPT wanted to overwrite a specific function, it could do something like:

// change the code within the function
code["function sayHello(message : string)"] = "print('lol')"

So basically, you need a document object model for code. The problem here again is that nothing like this was established before 2021, so one would need to provide extensive examples in the prompt, which defeats the purpose of trying to reduce the number of tokens produced by only emitting the delta (in comparison to completely overwriting the files).
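For Python sources, the standard `ast` module already gets fairly close to this idea. The sketch below is only an illustration under that assumption: `replace_function` is a made-up helper, it handles top-level functions only, and it leaves any decorator lines above the `def` in place.

```python
import ast


def replace_function(source: str, name: str, new_code: str) -> str:
    """Replace the top-level function `name` in `source` with `new_code`."""
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name:
            lines = source.splitlines(keepends=True)
            start = node.lineno - 1  # ast line numbers are 1-based
            end = node.end_lineno    # inclusive last line, so it works as a slice stop
            if not new_code.endswith("\n"):
                new_code += "\n"
            return "".join(lines[:start]) + new_code + "".join(lines[end:])
    raise KeyError(f"no top-level function named {name!r}")
```

The model would then only need to name the function and supply its new body, and the line positions come from the parser rather than from the model.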


Actually, the unified diff format or context diff format for patch are somewhat independent of the line numbers. You can often apply a patch even on a modified file, because it uses the context lines to find the right place. The main problem I saw with it is that it often messes up the indentation, because in the patch format it has to add a space of indentation to the unchanged context lines, and a + or - or ! to changed lines. That seems to be hard even for ChatGPT-4.

So, perhaps we could patch a patch implementation to ignore leading whitespace for finding the context lines, but that might still have broken whitespace on the added lines… Or we’d have to train a network to deal with ChatGPT’s humanized diffs, or make a program that discusses with gpt-3.5-turbo what gpt-4’s output actually means. :smile: If we come up with something else, that might still be worth it if the description is short enough.
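A rough sketch of that whitespace-tolerant idea follows. It is not a patched `patch`, just a hypothetical `apply_hunk_loosely` that takes the body lines of one hunk (no `---`/`+++`/`@@` headers), ignores the line numbers entirely, and locates the hunk by comparing stripped lines. Context lines are copied from the file itself, so only the added lines can still carry broken indentation.

```python
def apply_hunk_loosely(file_lines, hunk_lines):
    """Apply one unified-diff hunk body, matching context while ignoring whitespace."""
    ops = []  # (tag, text) pairs with tag in {" ", "-", "+"}
    for line in hunk_lines:
        if line.startswith(("+", "-")):
            ops.append((line[0], line[1:]))
        else:
            # Context line; tolerate a missing leading space marker.
            ops.append((" ", line[1:] if line.startswith(" ") else line))

    before = [text.strip() for tag, text in ops if tag != "+"]
    stripped = [line.strip() for line in file_lines]
    matches = [
        i for i in range(len(stripped) - len(before) + 1)
        if stripped[i:i + len(before)] == before
    ]
    if len(matches) != 1:
        raise ValueError(f"hunk matched {len(matches)} locations, expected 1")

    pos = matches[0]
    out = list(file_lines[:pos])
    for tag, text in ops:
        if tag == " ":
            out.append(file_lines[pos])  # keep the file's own indentation
            pos += 1
        elif tag == "-":
            pos += 1                     # drop the removed line
        else:
            out.append(text)             # added line, indentation as the model wrote it
    out.extend(file_lines[pos:])
    return out
```

Rejecting hunks that match zero or several places is a deliberate choice: it is better to fail loudly than to patch the wrong spot.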

BTW @riemaecker: I also wanted to add a link to my tools with the script and the prompt file and failed, but that works now. It seems they have some clever anti-spam measures in place, so that you can only post links once you have completed the chatbot’s tutorial and someone has deemed you worthy of a reply, or something.


I found a way to solve this for my problem. I use inheritance overrides.


Here is my code for a decision tree classifier.

class DecisionNode:
    def __init__(self, feature_index=None, threshold=None, left=None, right=None, info_gain=None, value=None):
        self.feature_index = feature_index
        self.threshold = threshold
        self.left = left

    ...lotsa code

            return self.make_prediction(x, tree.left)
            return self.make_prediction(x, tree.right)

I want to implement Exponential Entropy / Tsallis Entropy: These are generalizations of the standard Shannon entropy that introduce an additional parameter. They might provide interesting results in some cases, but they also complicate the interpretation of the results.

Please do it in a subclass

class DecisionTreeClassifierTsallis(DecisionTreeClassifier):
    def __init__(self, min_samples_split=2, max_depth=2, q=1.5):
        super().__init__(min_samples_split=min_samples_split, max_depth=max_depth)
        self.q = q
    def tsallis_entropy(self, y):
        class_labels = np.unique(y)
        entropy = 0
        for cls in class_labels:
            p_cls = len(y[y == cls]) / len(y)
            if self.q == 1:
                entropy += -p_cls * np.log2(p_cls)
            else:
                entropy += (p_cls**self.q - p_cls) / (1 - self.q)
        return entropy
    # Overriding the information_gain method to use Tsallis entropy
    def information_gain(self, parent, l_child, r_child):
        weight_l = len(l_child) / len(parent)
        weight_r = len(r_child) / len(parent)
        gain = self.tsallis_entropy(parent) - (weight_l * self.tsallis_entropy(l_child) + weight_r * self.tsallis_entropy(r_child))
        return gain

No diff required and it seems to handle it well. To avoid rewriting functions like the information_gain above, one could structure things with a bit more composition in mind.
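To make the composition remark concrete, here is a small sketch (not taken from the classifier above; all names are illustrative) in which the impurity measure is passed in as a function. Swapping Shannon for Tsallis entropy then needs neither a subclass nor a rewritten `information_gain`.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy in bits of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def tsallis_entropy(labels, q=1.5):
    """Tsallis entropy for q != 1."""
    n = len(labels)
    return (1 - sum((c / n) ** q for c in Counter(labels).values())) / (q - 1)


def information_gain(parent, left, right, impurity=shannon_entropy):
    """Impurity reduction of a split; the impurity measure is injected."""
    w_left = len(left) / len(parent)
    w_right = len(right) / len(parent)
    return impurity(parent) - (w_left * impurity(left) + w_right * impurity(right))
```

With this layout, asking the model for a new entropy variant means asking for one small standalone function.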

Pretty obvious in retrospect, I know.



Actually, the unified diff format or context diff format for patch are somewhat independent of the line numbers. You can often apply a patch even on a modified file, because it uses the context lines to find the right place.

Oh, thanks for the insights. That changes a lot of things. I reviewed my tests manually and didn’t try to patch any diff where the lines/columns didn’t seem correct, so I never knew it could still work!

I think I’m going to give this another round next week. Thanks for the links.

@qrdl Of course that would work in this one case, but it is not widely applicable. We want the LLM to do a little more here than add code or hide or override members. You wouldn’t subclass something to fix an error or to refactor code, and you can’t remove code or add documentation by subclassing. And besides code we have lots of configuration files (JSON, XML, YAML, etc.), and nowadays every language or framework has at least one of those at the project’s root.


I’ve played around with this a bit while building my DesktopGPT plugin that connects ChatGPT to your local files and applications.

The most reliable (but still finicky) way that I found is with simple find and replace. This obviously has its own downsides (like replacing parts of the code you didn’t intend to), but this was never an issue for me since there was always enough code context in the search string to uniquely identify the part of the code that needs fixing.
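On the applying side, such a find-and-replace step can guard against the unintended-replacement risk by insisting on a single match. The following is a minimal sketch, not the plugin's code, and `apply_search_replace` is a hypothetical name.

```python
def apply_search_replace(content: str, search: str, replacement: str) -> str:
    """Replace `search` with `replacement`, but only if it occurs exactly once."""
    if not search:
        raise ValueError("search string must not be empty")
    count = content.count(search)
    if count != 1:
        # 0 means the model misquoted the code; >1 means the context was too short.
        raise ValueError(f"search string matched {count} times, expected exactly 1")
    return content.replace(search, replacement, 1)
```

The error message can be fed back to the model so that it retries with a longer, more specific search string.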

Here is an example from my early testing.

The commands that ChatGPT issued in the last response were:

{
  "directoryName": "file:///Users/dounan/dev/desktop-gpt/mac/DesktopGPT/DesktopGPT/",
  "filePath": "RemoteControlViewModel.swift",
  "searchString": "        webSocketTask.resume()\n        self.webSocketTask = webSocketTask\n        \n        listenForCommands()\n    }",
  "replacementString": "        webSocketTask.resume()\n        self.webSocketTask = webSocketTask\n        \n        handleConnectionEstablished()\n        listenForCommands()\n    }\n    \n    private func handleConnectionEstablished() {\n        DispatchQueue.main.async {\n            self.status = \"Connected\"\n        }\n    }"
}
{
  "directoryName": "file:///Users/dounan/dev/desktop-gpt/mac/DesktopGPT/DesktopGPT/",
  "filePath": "RemoteControlViewModel.swift",
  "searchString": "                    self?.showErrorNotification(message: \"Connection failed: \\(error.localizedDescription)\")\n                }\n            }\n        }\n    }",
  "replacementString": "                    self?.showErrorNotification(message: \"Connection failed: \\(error.localizedDescription)\")\n                    self?.retryConnection()\n                }\n            }\n        }\n    }\n    \n    private func retryConnection() {\n        DispatchQueue.main.asyncAfter(deadline: .now() + retryInterval) { [weak self] in\n            self?.connectToServer()\n        }\n        // Increase the retry interval for the next attempt, up to a maximum of 10 seconds.\n        retryInterval = min(retryInterval * 2, 10.0)\n    }\n    \n    private var retryInterval: TimeInterval = 1.0"
}

Oh for sure. Though engineering by composition isn’t a terrible crime.

Actually, one idea that might work is to have it generate the code in full and then send it back to generate the diff. Lots of tokens, but it might work.
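A variation on this, swapped in here as an assumption rather than what was proposed: once the full file has been regenerated, the diff does not have to come from the model at all. Python's standard `difflib` can compute it locally, which gives a well-formed patch for review, although it saves no tokens.

```python
import difflib


def make_unified_diff(original: str, regenerated: str, filename: str = "file") -> str:
    """Compute a unified diff between the original and the regenerated file text."""
    diff = difflib.unified_diff(
        original.splitlines(keepends=True),
        regenerated.splitlines(keepends=True),
        fromfile=f"a/{filename}",
        tofile=f"b/{filename}",
    )
    return "".join(diff)
```

An empty result means the model reproduced the file unchanged, which is itself a useful signal.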

@dounan The problem with forced encoding is that it degrades reasoning. For simple changes this will work, but for more complex edits there is not enough attention to go around, it seems. This is why I prefer the override approach because it fits well with the reasoning being performed and doesn’t distract as much.

That said, by using multiple round trips you would be able to deal with this issue. Might even be able to use a less expensive model.

There is one interesting point I’ve seen made and have also experienced myself: if you force the AI to conform to a certain format, that takes up part of its reasoning capacity. Since the patch format with its additional indentation is troublesome, I tried to have it just add comments like “(inserted code)”, and similarly for deletions, along with a few context lines. I also tried to have it prepare for @dounan’s search and replace strategy by citing the original and the changed code. In both cases the responses seemed sensible, so that automatically processing them just might work, but there were aspects of my coding task that ChatGPT overlooked, whereas without any such additional constraints it could solve the task. So maybe adding another AI driven step to “translate” the changes into actual code changes would be the best way to go, and something probably ChatGPT-3.5 would be very well capable of.


Yep, I mentioned this a few times above. I feel that this is a very critical point to keep in mind when doing this.

My suggestion was to use multiple round trips and separate out encoding from reasoning / code generation. Admittedly, this can get expensive, but probably worth it for higher quality code generation.

So maybe adding another AI driven step to “translate” the changes into actual code changes would be the best way to go, and something probably ChatGPT-3.5 would be very well capable of.

Great minds think alike :slight_smile:

I would encourage folks to go back and revisit the concept of using composition. It helps ensure you don’t get caught up in the enfeeblement problem. Composition is also a great way to engineer, as well, as it’s very decoupled and testable.

I’ve been using this to great success with GPT4. If I know exactly how something should be implemented, I delegate to GPT4, but otherwise I do it myself.

One thing that is interesting is that you find you really need to guide things. You can’t rely on GPT4 to decompose properly.

It really is a great way to stay in touch with the code, but still leverage GPT4 as much as possible.

Knowing how things work is what makes you valuable as an Engineer. Don’t forget that.

I’m working on this problem as well. Unfortunately I have not yet had success in obtaining access to the GPT4 API so all of my efforts have been focused on GPT3.5.

I’ve had some luck by simply attempting to apply the patchfile. If any error is encountered, I return it to the model and ask it to generate a new patch. Once I have a patchfile that applies successfully, I can supply the model with the edited portions of the file, with some lookback and lookahead. Then the model confirms whether the output is as expected.

While this works for series of small changes, it is not reliable. I plan to continue iterating on this.
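The apply-and-retry loop can be sketched as below. Both callables are placeholders invented for this example: `request_patch` stands for the call to the model (it receives the previous error message, or `None` on the first attempt), and `try_apply` stands for whatever applies the patch and raises on failure.

```python
from typing import Callable, Optional


def patch_with_retries(
    request_patch: Callable[[Optional[str]], str],
    try_apply: Callable[[str], str],
    max_attempts: int = 3,
) -> str:
    """Ask for a patch, try to apply it, and feed any error back to the model."""
    error: Optional[str] = None
    for _ in range(max_attempts):
        patch_text = request_patch(error)
        try:
            return try_apply(patch_text)  # patched file content on success
        except Exception as exc:
            error = str(exc)              # becomes feedback for the next request
    raise RuntimeError(f"no applicable patch after {max_attempts} attempts: {error}")
```

The confirmation step, where the model checks the edited regions, would follow after a successful return.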

I’ve had the same problem that is mentioned in the original post with diffs.

I ended up spending a lot of time studying GPT’s behavior. Even if you ask GPT to always display the full code, it will eventually write placeholders and references to old code, which is a good way to save on tokens. So let’s say you are writing a 1000-word JavaScript file; it will sometimes refer to old code like this:

function myFunction() {
  // Code remains the same
}

function myNewFunction() {
  let someNewCode = 12;
}

Concretely, this is what you will often get if you diff what you had before against what you have after:

function myFunction() {
-   let a = 2;
-   let b = 3;
+  // Code remains the same

This can be characterized as a hunk where the bounding lines remain the same and the content goes from whatever to a single line starting with a comment.
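That characterization can be checked mechanically. The sketch below uses Python's `difflib.SequenceMatcher`; the function name and the list of comment prefixes are assumptions for illustration. It flags every replaced region whose new side is a single comment line. A legitimate edit that replaces code with a comment would be flagged too, so the result is a list of suspects, not proof.

```python
import difflib

COMMENT_PREFIXES = ("//", "#", "/*")  # assumed comment markers, adjust per language


def find_placeholder_hunks(old: str, new: str):
    """Return (old_start, old_end, new_line) for hunks that collapse to one comment."""
    old_lines, new_lines = old.splitlines(), new.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    suspects = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # A 'replace' whose new side is exactly one line starting with a comment marker.
        if tag == "replace" and j2 - j1 == 1:
            if new_lines[j1].strip().startswith(COMMENT_PREFIXES):
                suspects.append((i1, i2, new_lines[j1]))
    return suspects
```

The returned index pair is a 0-based, end-exclusive range into the old file's lines, which is what is needed to paste the old code back in.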

I cannot circumvent this problem even when 300 of the 1000 words in my prompt are about writing complete and full code, so I think it may be something that is hardcoded in the model.

Common problems when generating code autonomously are:

  • This problem with diffs (because it can’t retain the position of words)
  • Placeholders that refer to previous code (probably kindly hardcoded to save in tokens)
  • Unimplemented stuff (±5% of methods), even when I autodetected it and re-asked again and again
  • Files in the wrong folder, even if the file structure and current folder are displayed again and again
  • Not reading files before writing them, even when asked to.


Looks like you put some thought into this.

What are your thoughts on degraded reasoning about code when conflated with encoding, eg: generating diffs?

Two solutions discussed above are to separate code generation from diff generation or, my preferred one, to utilize composition.

Hi! Yeah. I put a LOT of time on this and started to go to bed late.

Your thoughts are interesting. What you call “enfeeblement” is happening randomly, and I’ve tried to fight it a lot.

Actually, I’ve found an interesting approach to fix most of the code. I have a function that searches for unimplemented code using hardcoded hints like “// Previous code here”, etc. GPT mostly uses the same placeholders over and over.

If some unimplemented code is found, then I have a prompt where I’m asking to merge the old code (saved on disk) and the new code (just generated), which is a rather easy task for GPT-4. When doing that, GPT has everything it needs to rebuild the whole file.

// First check: try to repair the file if there are any regression.
const referencesToPreviousCode = findReferencesToPreviousCode(nextFileContent);

if (referencesToPreviousCode.length) {
    console.log('Warning: Code has been rebuilt from a merge between previous file and incomplete GPT.');
    nextFileContent = await codeRebuilder(previousFileContent, nextFileContent);
}

The codeRebuilder prompt is like so:

Combine the Version 1 and Version 2 of the content of ‘file1.js’ as provided below by mashing up the two files together, maintaining the full content from both files by replacing placeholders that reference to Previous version 1 content with the actual previous content. Remove comments about reference to previous code. Produce a file that is a combination of version 1 and version 2 file content. But in the Resulting file, all comments that are referring to the code from the previous file content should be actually replaced with the real previous code.\n\nVersion 1 file:\n\n${previousCode}\n\nVersion 2 file:\n\n${newCode}\n\nPrint the Version 3 final content which is the mashup of both files.
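The placeholder search itself is not shown in this post. A guess at what such a detector could look like, written in Python and using only placeholder phrases that appear in this thread, is below. The regex and the function name are assumptions, not the actual implementation.

```python
import re

# A comment marker, optional dots or ellipsis, then one of the known placeholder phrases.
PLACEHOLDER_RE = re.compile(
    r"^\s*(?://|#|/\*)[\s.…]*(?:previous code|existing code|code remains the same)",
    re.IGNORECASE,
)


def find_references_to_previous_code(content: str):
    """Return (line_number, line) pairs for lines that look like placeholders."""
    return [
        (number, line)
        for number, line in enumerate(content.splitlines(), start=1)
        if PLACEHOLDER_RE.search(line)
    ]
```

An empty list means the file can be written as is; anything else would trigger the merge prompt.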

It seems aider has the interesting idea to (ab-)use the conflict marker format, which seems to be well known to the model:

<<<<<<< ORIGINAL
original lines
=======
modified lines
>>>>>>> UPDATED

That might be obsolete with the new function calls, but it might still have some advantages: function calls require the model to JSON-encode its output, you can’t have it generate explanatory prose before a function call (which would give it some “time to think”), and with conflict markers there can be several such changes in one message.
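Processing such a response could look roughly like the following. This is a sketch under stated assumptions, not aider's implementation: it assumes each block is delimited by `<<<<<<< ORIGINAL`, a `=======` separator line and `>>>>>>> UPDATED`, and that the original lines must occur exactly once in the file. Prose between the blocks is simply ignored.

```python
import re

EDIT_BLOCK_RE = re.compile(
    r"^<<<<<<< ORIGINAL\n(.*?)^=======\n(.*?)^>>>>>>> UPDATED[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def apply_edit_blocks(content: str, response: str) -> str:
    """Apply every ORIGINAL/UPDATED block found in a model response to `content`."""
    edits = EDIT_BLOCK_RE.findall(response)
    if not edits:
        raise ValueError("no edit blocks found in response")
    for original, updated in edits:
        count = content.count(original)
        if count != 1:
            raise ValueError(f"ORIGINAL section matched {count} times, expected 1")
        content = content.replace(original, updated, 1)
    return content
```

Because the blocks are applied in order, a later block can refer to text produced by an earlier one.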

In the context of a plugin, I tried giving it an operation that replaces a regex matched against the whole file, which (in theory) allows representing such changes quite compactly. In some cases that worked like a charm, but ChatGPT-4 was bad at multiline patterns, and group references in the replacement often trashed the indentation when they worked properly at all. I also built in functionality to reject patterns that match more than one place in a file, because it kept replacing too much otherwise. So I’m wondering whether I’ll take that out completely in favor of a simple search and replace. :rofl:

I struggled with this problem as well. I will share with you my best effort result so far, and why it is still not enough.

To the best of my knowledge (very limited at this point), the only reliable way to iterate on code is to have ChatGPT send over the entire file content. We can maybe instruct GPT to aggressively refactor code into as many files as possible to save some cost.

My best-effort solution so far

Here is ChatGPT’s summary of the protocol we’ve developed for machine readable code iterations. Of course the final output will need to be further made into machine readable formats. This I haven’t done yet.

Code Change Protocol:

  1. Versioning:
    Each change or set of changes is associated with a version, which helps in tracking modifications over time.
    • We started with a base version (e.g., Version 1.0.0).
    • Every subsequent change increases the version (e.g., Version 1.1.0, Version 1.2.0, and so on).
  2. Change Description:
    A brief description is provided to summarize what the code change involves.
  3. From and To Versions:
    Clearly state the version you’re transitioning from and the version you’re transitioning to, e.g.,
    From Version: 1.1.0
    To Version: 1.2.0
  4. File-wise Changes:
    • For each file that has modifications, the file name is mentioned.
    • Under each file, the changes are specified.
  5. Exact Code Blocks:
    For each modification:
    • Old Code Block: The exact lines of code from the current version that need to be modified or replaced are presented with proper indentation.
    • New Code Block: The new lines of code that should replace or modify the old code are provided with proper indentation.
    • If applicable, the exact line numbers for the changes are mentioned, e.g., “Replace lines 11-17.”
  6. Avoiding Omissions:
    • No ellipses (...) or shorthand notations are used. Complete code blocks are presented to ensure there’s no ambiguity.
  7. Clear Instructions:
    Explicit instructions are given about what to do with the code blocks, e.g., “Add after the existing code,” “Replace the existing block,” etc.


Change Description: Introduced a delay between tile movement and the appearance of the new tile.

From Version: 1.3.0
To Version: 1.4.0

File: script.js

Replace lines 25-26 with the following blocks:

Old Code (from Version 1.3.0, lines 25-26):

    if (JSON.stringify(board) !== original) placeRandomTile();

New Code (for Version 1.4.0, lines 25-30):

    if (JSON.stringify(board) !== original) {
        setTimeout(() => {
            placeRandomTile();
        }, 200);  // 200 milliseconds delay
    }

By following this protocol, you can get a clear and machine-friendly representation of code changes that ensures accuracy and ease of application.

Evaluation (GPT 4)

What it does well

  • Version tracking works perfectly so far.
  • Indentation in the diff is usually kept. I am sure if we remind GPT in every further prompt, the indentation requirement will be respected.

What still doesn’t work

Despite the instructions, GPT still makes these mistakes:

  • It can’t get line numbers right (as others have noted)
  • It still likes to omit part of the code, like Leon mentioned

I’ve tried all of this already: versioning, forcing it to fully implement code, giving clear instructions, as well as the aider approach (using diff).

@hpstoerr Using diff implies numbers and cartesian logic, and GPT doesn’t perform well in these fields. It will very often output a broken diff.

The more I look at this problem, the more I think about using a structured approach instead. i.e. Pushing a very rigid JSON schema to the function parameter in the prompt, and forcing GPT to create methods, variables, etc, that are strong because they respect this very specific JSON schema.

Let’s say your function has a pure part and an impure part. The impure part will modify variables scoped outside your function, so you need to be sure that these variables exist. Then, gradually, you build something that is more and more stable.

What would be interesting is to be able to use a solid approach with a cheaper model (i.e. 3.5 instead of 4).

I’ve managed to solve the problem with a very high level of accuracy, requesting to generate a “snippet of the changes” (place some // … existing code here … comments, and the like), and then requesting in a different prompt to re-create the file with the “snippet” and the original file.

The issue is how long it takes to re-create the file of course. Planning on using all the collected (“auto-completes”) to fine-tune GPT 3.5, and hopefully, it can do the task faster (keeping the same quality :crossed_fingers: ). By itself, GPT 3.5 is not able to perform this task to reconstruct the file correctly.

Something that I found important, is that I share the “rules of the file” within the prompt that reconstructs the file so that it “knows” what was expected from the file (i.e., things like methods it should contain, writing JSDocs or PHPDocs and the like).

But @leon.pelletier’s idea of using a more structured approach sounds solid if we want to speed things up. I’m already using code parsers to get ASTs, so it could be interesting to give it the parsed code to some extent and request modifications of the “JSON” instead (pointing out some specific sections that you want to modify and requesting that it output only that section as JSON).
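As a small illustration of what “parsed code as JSON” could mean, the sketch below turns a Python source file into a JSON outline of its classes and functions with their line ranges. It uses Python's own `ast` module as a stand-in for whatever parser is in use, and the outline shape is invented for this example.

```python
import ast
import json


def outline_as_json(source: str) -> str:
    """Describe the classes and functions of a Python source file as JSON."""
    items = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            items.append({
                "kind": type(node).__name__,
                "name": node.name,
                "start_line": node.lineno,    # 1-based, inclusive
                "end_line": node.end_lineno,  # 1-based, inclusive
            })
    items.sort(key=lambda item: item["start_line"])
    return json.dumps(items, indent=2)
```

The model could then be asked to name the section it wants to change and return only that section, while the line ranges stay under the program's control.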