How to generate automatically applicable file diffs with ChatGPT?

Have any of you succeeded in getting ChatGPT to output suggested changes to a file in a way that can be applied to the file automatically?

Background: I experimented with this for a script that takes a file and a prompt describing what to change or extend in that file, using the ChatGPT completion API. That’s something you can nicely use as an external tool in IntelliJ, or for trying automated development. But the only reliable way I’ve come up with so far is to make ChatGPT output the whole modified file again in a codeblock, which can be extracted and written back to the file (pending manual inspection, of course), with the following prompt fragment:

Please give a high level description in flowing text of what changes to the first given file (main file) $mainfile are needed to fulfill your task and why they are necessary - e.g. what classes, identifiers, or functions you need to add or modify, but absolutely no code blocks.
Then output exactly one codeblock with the new contents of the whole file $mainfile surrounded by triple backticks, changed as described.
Important: your response must only include exactly one codeblock, not more, not less, except if you feel there is an error in your task and it cannot be fulfilled!

That does work, but that is obviously a waste of tokens that limits the size of the file and also sometimes encourages changes that aren’t really needed.
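The extraction step for that whole-file approach is small enough to sketch. This is a minimal illustration in Python (the function name and the strictness check are my own, not the actual script), which refuses anything other than exactly one codeblock so a confused reply never silently overwrites the file:

```python
import re

def extract_single_codeblock(response: str) -> str:
    """Extract the contents of exactly one triple-backtick codeblock
    from the model's response; raise if there are zero or several,
    matching the 'exactly one codeblock' instruction in the prompt."""
    blocks = re.findall(r"```[^\n]*\n(.*?)```", response, re.DOTALL)
    if len(blocks) != 1:
        raise ValueError(f"expected exactly one codeblock, got {len(blocks)}")
    return blocks[0]
```

The result can then be written back to $mainfile, pending manual inspection.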

If you use the chat interface to ask for the necessary changes to a program, ChatGPT outputs the changes in a way that can easily be applied manually and is much briefer (“… add the function XYZ …”), but that would be hard to process automatically. I tried to have it output the changes in patch file format or as unified diffs, but couldn’t get even ChatGPT-4 to produce something usable. Having it output line numbers to change didn’t work reliably either.

Do you have any ideas? Specifically, I’m wondering what kind of format I could get ChatGPT to output so that only the differences are given and a script / program could change the file accordingly, while it’s reasonably sure that the result is what ChatGPT actually meant. (Something like ‘Delete lines 35-39 and insert this instead…’ might work, but would produce unnoticed broken output if the line numbers are off, which they usually are. That’s why I first tried the unified diff format that has some context lines.) If possible, I’d like to work with the 3.5-turbo model.

Thanks a lot for any ideas!


It might be helpful to take a look at editing

What I know is that capable models with a high degree of comprehension can usually handle editing. When I try to evaluate a model with a single prompt, I tend to give it an editing task rather than ask a question.

1 Like

Hi @kevin6 !

Thank you for your reply! Unfortunately, it seems the editing interface is only supported by older models like code-davinci-edit-001 and text-davinci-edit-001, which are marked as deprecated, and not by something more capable like gpt-3.5-turbo. I tried it a little in the playground, and it did some quite strange things to the code on tasks that gpt-3.5-turbo was easily able to do. So I still hope to find something doable with its chat interface.

Best regards,

I’ve attempted a few ways to make ChatGPT generate diffs that can be directly applied via the patch shell command.

Theoretically, the diff format would be ideal:

--- /path/to/original	timestamp
+++ /path/to/new	timestamp
@@ -1,3 +1,9 @@
+This is an important
+notice! It should
+therefore be located at
+the beginning of this
 This part of the
 document has stayed the
 same from version to
@@ -8,13 +14,8 @@
 compress the size of the

-This paragraph contains
-text that is outdated.
-It will be deleted in the
-near future.
 It is important to spell
-check this dokument. On
+check this document. On
 the other hand, a
 misspelled word isn't
 the end of the world.
@@ -22,3 +23,7 @@
 this paragraph needs to
 be changed. Things can
 be added after it.
+This paragraph contains
+important new additions
+to this document.

The main issue is that GPT models, due to how they are implemented with tokens, cannot accurately determine positions in the document and therefore cannot write a valid diff. I have attempted to provide line numbers on the source code that I send into the prompt, and with that it gets better (but still not good enough) - and then it may sometimes decide to write code with line numbers in it.
We would need a widely known diff format that doesn’t rely on line and column numbers (which doesn’t exist, to my knowledge).

For code, the only fully working method I could devise so far was to steer it to write Python code that completely overwrites the files with the code changes. I have an example where this is done on TypeScript files. Because I cannot include links in posts (why not?!), you have to look for the “jupyter-notebook-chatcompletion” repo on GitHub and find the file test/notebooks/more-accurate-token-estimates.ipynb, where I used GPT-4 to implement the changes I wanted.

As you can see in that Jupyter Notebook, Example 1 and Example 2 are referred to in Example 3 to ensure it writes Python code that will apply the desired file changes. It works even better when you do an actual few-shot prompt with 3-4 examples (in which case you don’t instruct anything), but that eats up so many tokens that it doesn’t leave that much space for actual code. So I have, unfortunately, to rely on an instruction like “[…] and apply the changes by overwriting the files like you did in Example 1 and 2”.

Note that, at least in the case of structured files like JSON, YAML and XML, it is smart enough to read the document into its respective object model, apply only the changes, and then overwrite the file with the result.
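As a minimal sketch of that round trip in Python (the dotted-key edit format and the function name here are my own invention for illustration, not something the model emits natively):

```python
import json

def apply_json_edits(path, edits):
    """Deserialize the file, apply key-level edits, serialize it back.
    The model only has to emit the changed keys, not the whole file."""
    with open(path) as f:
        doc = json.load(f)
    for dotted_key, value in edits.items():
        node = doc
        *parents, last = dotted_key.split(".")
        for key in parents:
            node = node.setdefault(key, {})  # descend, creating objects as needed
        node[last] = value
    with open(path, "w") as f:
        json.dump(doc, f, indent=2)
```

For example, `apply_json_edits("config.json", {"server.port": 8080})` touches only that one key and leaves the rest of the document as it was.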

There’s a compromise that might work, but I haven’t had time to try it yet. Just as ChatGPT understands that it can deserialize a JSON file, manipulate the resulting object, and then serialize that object again, one could write a library that deserializes code files into something that can be manipulated and serialized again. So for example, if ChatGPT wanted to overwrite a specific function, it could do something like:

// change the code within the function
code["function sayHello(message : string)"] = "print('lol')"

So basically, you need a document object model for code. The problem here, again, is that nothing like this was established before 2021, so one would need to provide extensive examples in the prompt - which defeats the purpose of trying to reduce the number of tokens by only transmitting the delta (in comparison to completely overwriting the files).
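For Python code at least, the standard ast module can stand in for a crude version of such a code object model: locate a function by name via the parser, then splice replacement text in by the line span the parser reports. This is only a sketch of the idea under that assumption (replace_function is a hypothetical helper, and it only handles top-level functions, not methods or decorated definitions):

```python
import ast

def replace_function(source: str, name: str, new_code: str) -> str:
    """Use the AST to find a named top-level function, then splice the
    replacement in at the line span the parser reports."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == name:
            lines = source.splitlines(keepends=True)
            start, end = node.lineno - 1, node.end_lineno
            return ("".join(lines[:start])
                    + new_code.rstrip("\n") + "\n"
                    + "".join(lines[end:]))
    raise ValueError(f"function {name!r} not found")
```

The nice property is that the model never has to know line numbers - it only names the function and supplies its new body.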


Actually, the unified diff format or context diff format for patch are somewhat independent of the line numbers. You can often apply a patch even on a modified file, because it uses the context lines to find the right place. The main problem I saw with it is that it often messes up the indentation, because in the patch format it has to add a space of indentation to the unchanged context lines, and a +, -, or ! to changed lines. That seems to be hard even for ChatGPT-4.

So perhaps we could patch a patch implementation to ignore leading whitespace when finding the context lines, though that might still leave broken whitespace on the added lines… Or we’d have to train a network to deal with ChatGPT’s humanized diffs, or make a program that discusses with gpt-3.5-turbo what gpt-4’s output actually means. :smile: If we come up with something along those lines, it might still be worth it if the change description stays short enough.
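The “ignore leading whitespace” idea could be prototyped roughly like this - match the hunk’s context lines against the file with surrounding whitespace stripped (find_hunk is a made-up name for this sketch, and a real implementation would also need to handle ambiguous matches):

```python
def find_hunk(file_lines, context_lines):
    """Return the index where the hunk's context matches the file,
    comparing with whitespace stripped so mangled indentation in the
    model's diff doesn't prevent a match; -1 if not found."""
    needle = [line.strip() for line in context_lines]
    for i in range(len(file_lines) - len(needle) + 1):
        window = [line.strip() for line in file_lines[i:i + len(needle)]]
        if window == needle:
            return i
    return -1
```

Once the hunk is located this way, the correct indentation can be recovered from the matched file lines themselves instead of trusting the diff.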

BTW @riemaecker: I also wanted to add a link to my tools with the script and the prompt file and failed, but that works now. It seems they have some clever anti-spam measures in place, so that you can only post links once you have completed the chatbot’s tutorial and someone has deemed you worthy of a reply, or something.

1 Like

I found a way to solve this for my problem. I use inheritance overrides.


Here is my code for a decision tree classifier.

class DecisionNode:
    def __init__(self, feature_index=None, threshold=None, left=None, right=None, info_gain=None, value=None):
        self.feature_index = feature_index
        self.threshold = threshold
        self.left = left
        self.right = right
        self.info_gain = info_gain
        self.value = value

    ...lotsa code

            if x[tree.feature_index] <= tree.threshold:
                return self.make_prediction(x, tree.left)
            else:
                return self.make_prediction(x, tree.right)

I want to implement Exponential Entropy / Tsallis Entropy: These are generalizations of the standard Shannon entropy that introduce an additional parameter. They might provide interesting results in some cases, but they also complicate the interpretation of the results.

Please do it in a subclass

import numpy as np

class DecisionTreeClassifierTsallis(DecisionTreeClassifier):
    def __init__(self, min_samples_split=2, max_depth=2, q=1.5):
        super().__init__(min_samples_split=min_samples_split, max_depth=max_depth)
        self.q = q

    def tsallis_entropy(self, y):
        class_labels = np.unique(y)
        entropy = 0
        for cls in class_labels:
            p_cls = len(y[y == cls]) / len(y)
            if self.q == 1:
                # q -> 1 recovers the ordinary Shannon entropy
                entropy += -p_cls * np.log2(p_cls)
            else:
                entropy += (p_cls**self.q - p_cls) / (1 - self.q)
        return entropy

    # Overriding the information_gain method to use Tsallis entropy
    def information_gain(self, parent, l_child, r_child):
        weight_l = len(l_child) / len(parent)
        weight_r = len(r_child) / len(parent)
        gain = self.tsallis_entropy(parent) - (weight_l * self.tsallis_entropy(l_child) + weight_r * self.tsallis_entropy(r_child))
        return gain

No diff required and it seems to handle it well. To avoid rewriting functions like the information_gain above, one could structure things with a bit more composition in mind.

Pretty obvious in retrospect, I know.


1 Like

Actually, the unified diff format or context diff format for patch are somewhat independent of the line numbers. You can often apply a patch even on a modified file, because it uses the context lines to find the right place.

Oh, thanks for the insights. That changes a lot of things. I reviewed my tests manually and didn’t try to patch any diff where the lines/columns didn’t seem correct, so I never knew it could still work!

I think I’m going to give this another round next week. Thanks for the links.

@qrdl Of course that would work in this one case, but it is not widely applicable. We want the LLM to do a little more here than add code or hide/override members. You wouldn’t subclass something to fix an error or to refactor code, and you can’t remove code or add documentation via subclassing. And besides code we have lots of configuration files - JSON, XML, YAML, etc. - and nowadays every language/framework has at least one of those at the project’s root.

1 Like

I’ve played around with this a bit while building my DesktopGPT plugin that connects ChatGPT to your local files and applications.

The most reliable (but still finicky) way that I found is with simple find and replace. This obviously has its own downsides (like replacing parts of the code you didn’t intend to), but this was never an issue for me since there was always enough code context in the search string to uniquely identify the part of the code that needs fixing.

Here is an example from my early testing.

The commands that ChatGPT issued in the last response were:

[
  {
    "directoryName": "file:///Users/dounan/dev/desktop-gpt/mac/DesktopGPT/DesktopGPT/",
    "filePath": "RemoteControlViewModel.swift",
    "searchString": "        webSocketTask.resume()\n        self.webSocketTask = webSocketTask\n        \n        listenForCommands()\n    }",
    "replacementString": "        webSocketTask.resume()\n        self.webSocketTask = webSocketTask\n        \n        handleConnectionEstablished()\n        listenForCommands()\n    }\n    \n    private func handleConnectionEstablished() {\n        DispatchQueue.main.async {\n            self.status = \"Connected\"\n        }\n    }"
  },
  {
    "directoryName": "file:///Users/dounan/dev/desktop-gpt/mac/DesktopGPT/DesktopGPT/",
    "filePath": "RemoteControlViewModel.swift",
    "searchString": "                    self?.showErrorNotification(message: \"Connection failed: \\(error.localizedDescription)\")\n                }\n            }\n        }\n    }",
    "replacementString": "                    self?.showErrorNotification(message: \"Connection failed: \\(error.localizedDescription)\")\n                    self?.retryConnection()\n                }\n            }\n        }\n    }\n    \n    private func retryConnection() {\n        DispatchQueue.main.asyncAfter(deadline: .now() + retryInterval) { [weak self] in\n            self?.connectToServer()\n        }\n        // Increase the retry interval for the next attempt, up to a maximum of 10 seconds.\n        retryInterval = min(retryInterval * 2, 10.0)\n    }\n    \n    private var retryInterval: TimeInterval = 1.0"
  }
]
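The core of that find-and-replace strategy is small enough to sketch. The uniqueness check below is my own assumption about how one would guard against over-matching - the names are hypothetical and this is not DesktopGPT’s actual code:

```python
def apply_edit(text: str, search: str, replacement: str) -> str:
    """Apply a find-and-replace edit, refusing edits whose search string
    is missing or matches more than one place in the file."""
    count = text.count(search)
    if count == 0:
        raise ValueError("search string not found - ask the model to retry")
    if count > 1:
        raise ValueError(f"search string matched {count} places - not unique")
    return text.replace(search, replacement)
```

Rejecting ambiguous matches turns the “replaced something I didn’t intend” failure mode into a visible error you can feed back to the model.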

Oh for sure. Though engineering by composition isn’t a terrible crime.

Actually, one idea that might work is to have it generate the code in full and then send it back to generate the diff. Lots of tokens, but it might work.

@dounan The problem with forced encoding is that it degrades reasoning. For simple changes this will work, but for more complex edits there is not enough attention to go around, it seems. This is why I prefer the override approach because it fits well with the reasoning being performed and doesn’t distract as much.

That said, by using multiple round trips you would be able to deal with this issue. Might even be able to use a less expensive model.

There is one interesting point I’ve seen made and also experienced myself: if you force the AI to conform to a certain format, that takes up part of its reasoning capacity just for the formatting. Since the patch format with its additional indentation is troublesome, I tried to have it just add comments like
(inserted code)
and similarly for deletions, with a few context lines. Or to prepare for @dounan 's search-and-replace strategy by citing the original and the changed code. In both cases the responses seemed sensible, so automatically processing them just might work, but there were aspects of my coding task that ChatGPT overlooked. Without any such additional constraints it could solve the task. So maybe adding another AI-driven step to “translate” the changes into actual code changes would be the best way to go - something ChatGPT-3.5 would probably be very well capable of.


Yep, I mentioned this a few times above. I feel that this is a very critical point to keep in mind when doing this.

My suggestion was to use multiple round trips and separate out encoding from reasoning / code generation. Admittedly, this can get expensive, but probably worth it for higher quality code generation.

So maybe adding another AI driven step to “translate” the changes into actual code changes would be the best way to go, and something probably ChatGPT-3.5 would be very well capable of.

Great minds think alike :slight_smile:

I would encourage folks to go back and revisit the concept of using composition. It helps ensure you don’t get caught up in the enfeeblement problem. Composition is also a great way to engineer, as well, as it’s very decoupled and testable.

I’ve been using this to great success with GPT4. If I know exactly how something should be implemented, I delegate to GPT4, but otherwise I do it myself.

One thing that is interesting is that you find you really need to guide things. You can’t rely on GPT4 to decompose properly.

It really is a great way to stay in touch with the code, but still leverage GPT4 as much as possible.

Knowing how things work is what makes you valuable as an Engineer. Don’t forget that.

I’m working on this problem as well. Unfortunately, I have not yet had success in obtaining access to the GPT-4 API, so all of my efforts have been focused on GPT-3.5.

I’ve had some luck by simply attempting to apply the patchfile. If any error is encountered, I return it to the model and ask it to generate a new patch. Once I have a patchfile that applies successfully, I can supply the model with the edited portions of the file, with some lookback and lookahead. Then the model confirms whether the output is as expected.

While this works for series of small changes, it is not reliable. I plan to continue iterating on this.
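That apply-and-retry loop might look roughly like this - a sketch assuming GNU patch is on the PATH, where the dry run keeps a failing patch from half-applying before the error is returned to the model:

```python
import subprocess

def try_apply_patch(workdir: str, patch_text: str):
    """Attempt to apply a model-generated patch; return (ok, output) so
    a failure message can be fed back to the model for another try."""
    # Dry run first so a bad hunk doesn't leave the tree half-patched.
    dry = subprocess.run(["patch", "-p1", "--dry-run"], input=patch_text,
                         text=True, capture_output=True, cwd=workdir)
    if dry.returncode != 0:
        return False, dry.stdout + dry.stderr
    real = subprocess.run(["patch", "-p1"], input=patch_text,
                          text=True, capture_output=True, cwd=workdir)
    return real.returncode == 0, real.stdout + real.stderr
```

On failure, the captured output is exactly the kind of error text the model can use to produce a corrected patch on the next round trip.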

I’ve had the same problem that is mentioned in the original post with diffs.

I ended up doing a lot of behavioral study (ethology, if you will) of GPT. Even if you ask GPT to always output the full code, it will eventually write placeholders and references to old code, which is a good way for it to save on tokens. So let’s say you are writing a 1,000-word JavaScript file; it will sometimes refer to old code this way:

function myFunction() {
  // Code remains the same
}

function myNewFunction() {
  let someNewCode = 12;
}

Concretely, you will often see something like this if you diff what you had before against what you have after:

function myFunction() {
-   let a = 2;
-   let b = 3;
+  // Code remains the same

This can be characterized as a hunk where the bounding lines remain the same and the content collapses from whatever it was into a single line starting with a comment.
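A heuristic for catching such placeholder hunks could look like this - sketched in Python, where the hint strings are just placeholders I have seen GPT emit, not an exhaustive list:

```python
PLACEHOLDER_HINTS = ("code remains the same", "previous code", "rest of the code")

def is_placeholder_hunk(removed_lines, added_lines):
    """True when a hunk replaces real content with a single comment-only
    line referencing earlier code, i.e. the model elided the body."""
    if not removed_lines or len(added_lines) != 1:
        return False
    line = added_lines[0].strip()
    return line.startswith(("//", "#", "/*")) and any(
        hint in line.lower() for hint in PLACEHOLDER_HINTS)
```

When such a hunk is detected, the safe move is to keep the original content for that span rather than apply the “change”.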

I cannot circumvent this problem even with 300 of my 1,000 prompt words devoted to demanding complete, full code, so I think it may be part of what is hardcoded into the model.

Common problems when generating code autonomously are:

  • This problem with diffs (because it can’t retain the positions of words)
  • Placeholders that refer to previous code (probably hardcoded to save on tokens)
  • Unimplemented stuff (±5% of methods), even when I autodetected it and re-asked again and again
  • Files in the wrong folder, even if the file structure and current folder are displayed again and again
  • Not reading files before writing them, even when asked to.
1 Like


Looks like you put some thought into this.

What are your thoughts on degraded reasoning about code when conflated with encoding, eg: generating diffs?

Two solutions discussed above is to separate out code generation with diff generation, or my preferred, which is to utilize composition.

Hi! Yeah. I put a LOT of time on this and started to go to bed late.

Your thoughts are interesting. What you call “enfeeblement” is happening randomly, and I’ve tried to fight it a lot.

Actually, I’ve found an interesting approach that fixes most of the code. I have a function that searches for unimplemented code using hardcoded hints like “// Previous code here”, etc. GPT mostly uses the same placeholders every time.

If some unimplemented code is found, I have a prompt asking to merge the old code (saved on disk) and the new code (just generated), which is a rather easy task for GPT-4. When doing that, GPT has everything it needs to rebuild the whole file.

// First check: try to repair the file if there is any regression.
const referencesToPreviousCode = findReferencesToPreviousCode(nextFileContent);

if (referencesToPreviousCode.length) {
    console.log('Warning: Code has been rebuilt from a merge between the previous file and incomplete GPT output.');
    nextFileContent = await codeRebuilder(previousFileContent, nextFileContent);
}

The codeRebuilder prompt is like so:

Combine Version 1 and Version 2 of the content of ‘file1.js’ as provided below by mashing the two files together, maintaining the full content from both files and replacing placeholders that reference the Version 1 content with the actual previous content. Remove comments that are references to previous code. Produce a file that is a combination of the Version 1 and Version 2 file content. In the resulting file, all comments that refer to code from the previous file content must be replaced with the real previous code.\n\nVersion 1 file:\n\n${previousCode}\n\nVersion 2 file:\n\n${newCode}\n\nPrint the Version 3 final content, which is the mashup of both files.

It seems aider has the interesting idea to (ab)use the conflict-marker format, which is well known to the model:

<<<<<<< ORIGINAL
original lines
=======
modified lines
>>>>>>> UPDATED

That might be obsolete with the new function calls, but there may be some advantages: function calls require the model to JSON-encode its output, and you can’t have it generate explanatory prose before the function call, which would give it some “time to think”; also, there can be a couple of such changes in one message.
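Applying such ORIGINAL/UPDATED blocks boils down to exact search-and-replace. A sketch of the idea (my own parsing, not aider’s actual implementation):

```python
import re

CONFLICT_BLOCK = re.compile(
    r"<<<<<<< ORIGINAL\n(.*?)\n=======\n(.*?)\n>>>>>>> UPDATED",
    re.DOTALL)

def apply_conflict_blocks(file_text: str, response: str) -> str:
    """Pull each ORIGINAL/UPDATED pair out of the model response and
    apply it as an exact, unique search-and-replace on the file."""
    for original, updated in CONFLICT_BLOCK.findall(response):
        if file_text.count(original) != 1:
            raise ValueError("ORIGINAL block missing or ambiguous in file")
        file_text = file_text.replace(original, updated)
    return file_text
```

Since the model reproduces the original lines verbatim, no line numbers are needed, and a hallucinated ORIGINAL block fails loudly instead of corrupting the file.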

In the context of a plugin, I tried to give it an operation that replaces a regex matched against the whole file - which in theory allows representing such operations quite compactly. In some cases that worked like a charm, but ChatGPT-4 was bad with multiline patterns, group references in the replacement often trashed the indentation when they worked at all, and I built in functionality to reject patterns that match more than one place in a file, because otherwise it kept replacing too much. So I’m wondering whether I’ll take that out completely in favor of a simple search and replace. :rofl:
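For reference, the multi-match rejection I mean is essentially the trivial check below (a sketch; the plugin’s real operation has more to it):

```python
import re

def regex_edit(text: str, pattern: str, replacement: str) -> str:
    """Apply a whole-file regex edit, rejecting patterns that match
    more or less than exactly once - otherwise the model's loose
    patterns tend to replace far too much."""
    matches = list(re.finditer(pattern, text, re.MULTILINE))
    if len(matches) != 1:
        raise ValueError(f"pattern matched {len(matches)} times; need exactly 1")
    return re.sub(pattern, replacement, text, count=1, flags=re.MULTILINE)
```

With the check in place, an over-broad pattern becomes a retryable error instead of a silent mass replacement.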