Looking for best practices to build a GitHub Copilot clone (DIY code completion)?

As part of my research, I am considering building my own prototype of an AI-based code completion tool, essentially a clone of GitHub Copilot. That is, as the user starts typing code into an editor, “ghost text” completions should appear next to the cursor.

I am working in a system where the original tool is not available, so reusing it is not an option. However, I find this task very challenging for several reasons. Below, I am sharing some of my open issues. I would love to hear whether you have any experiences, best practices, evidence, related work, other resources, or further ideas to share.

Choosing The Right Model

A few years ago, the best choice for this task would probably have been the Codex model. However, it has since been deprecated, the completion API has been labeled as legacy, and the official recommendation is to use chat completions instead. So, I am currently using the latest GPT-4 Turbo model. However, its latency, costs, and ability to strictly complete the existing code are far from ideal (see prompting below).

Are there any other models that might be at least as suitable as GPT-4 for my task, but most importantly, faster and cheaper? Unfortunately, I also do not have access to high-end GPU resources …


Prompting

A simple prompt for a chat completion model could look like this:

System: You are a code completion assistant tool for the language XYZ. You will be provided an existing snippet typed in by the user, plus a couple of relevant existing methods. Your task is to provide a completion of the existing snippet.

System: Existing snippet typed by the user:

User: function parseDate(string) {

System: Similar methods:

User: function parseTime(string) {
    const timePattern = /^(\d{2}):(\d{2})(?::(\d{2}))?$/;
    const match = string.match(timePattern);
    const hours = parseInt(match[1], 10);
    const minutes = parseInt(match[2], 10);
    const seconds = match[3] ? parseInt(match[3], 10) : undefined;
    return {
        hours: hours,
        minutes: minutes,
        seconds: seconds

System: Now answer the completion of the user snippet.
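Assembled programmatically, the prompt above could look roughly like this (a sketch of the message array in the Chat Completions format; `buildMessages` and its inputs are hypothetical names, not part of my actual implementation):

```javascript
// Sketch: build the chat messages for the completion prompt shown above.
// `snippet` and `similarMethods` are hypothetical inputs gathered by the editor.
function buildMessages(language, snippet, similarMethods) {
  return [
    {
      role: "system",
      content: `You are a code completion assistant tool for the language ${language}. ` +
        "You will be provided an existing snippet typed in by the user, plus a couple of " +
        "relevant existing methods. Your task is to provide a completion of the existing snippet.",
    },
    { role: "system", content: "Existing snippet typed by the user:" },
    { role: "user", content: snippet },
    { role: "system", content: "Similar methods:" },
    // One user message per retrieved method keeps the context items separable.
    ...similarMethods.map((method) => ({ role: "user", content: method })),
    { role: "system", content: "Now answer the completion of the user snippet." },
  ];
}
```

The resulting array can be passed directly as the `messages` parameter of a chat completion request.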

This works to some extent and yields completions of usable quality, but the conversational style is obviously less suited to pattern completion than the classical completion models were. For instance, the assistant sometimes repeats the provided prefix snippet and sometimes prints only the completion. Sometimes it formats the code as Markdown with backticks (```), sometimes it answers with plain code, etc. Instructions like “Only answer the new code” did not work too reliably for me. Currently, I am post-processing the response to extract the actual completion, but this seems less than ideal in terms of reliability and speed/token usage (see below).
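Such post-processing could look roughly like the following heuristic (a sketch; `extractCompletion` is a hypothetical helper, and the exact rules depend on the failure modes one observes in practice):

```javascript
// Sketch of a post-processing heuristic: normalize the chat response into a
// bare completion by handling Markdown fences and a repeated prefix.
function extractCompletion(response, prefix) {
  let text = response.trim();
  // Strip a surrounding Markdown code fence, with or without a language tag.
  const fence = text.match(/^```[^\n]*\n([\s\S]*?)\n?```$/);
  if (fence) text = fence[1];
  // If the model repeated the prefix snippet, keep only the continuation.
  if (text.startsWith(prefix)) text = text.slice(prefix.length);
  return text;
}
```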

Instructing the model to have the completion end in a given suffix (i.e., the existing text to the right of the user’s cursor in the editor, which was supported by the classical completion models) also did not work reliably for me.
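Since instructing the model did not work, one fallback heuristic would be to truncate the completion client-side at the first occurrence of the suffix (a sketch under the assumption that the suffix, when respected, appears verbatim in the generation):

```javascript
// Sketch: emulate the suffix support of the legacy completion API by cutting
// the generated completion at the first occurrence of the editor suffix.
function truncateAtSuffix(completion, suffix) {
  if (!suffix) return completion;
  const i = completion.indexOf(suffix);
  // Drop the suffix and everything after it; the editor already contains it.
  return i >= 0 ? completion.slice(0, i) : completion;
}
```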

Fine-tuning might be another way to teach the model a) the current task and b) the existing language and framework. However, fine-tuning for GPT-4 is currently in closed beta, and in general, the costs for fine-tuning and especially for using the fine-tuned models look very high.

Chain of Thought

CoT prompts (i.e., “think aloud before you print the completion” or “explain your solution”) often seem to yield better results. However, they increase the effort for post-processing the response (see above) and, more importantly, significantly reduce the speed and increase the token usage (see below).
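One way to keep the post-processing of CoT responses manageable would be to ask the model in the prompt to emit a sentinel line before the final code, and cut the response there (a sketch; the delimiter string and the `stripReasoning` helper are arbitrary assumptions, not something I have validated):

```javascript
// Sketch: with CoT prompts, instruct the model to end its reasoning with a
// sentinel line, then keep only what follows it.
const SENTINEL = "---COMPLETION---"; // arbitrary delimiter, announced in the prompt

function stripReasoning(response) {
  const i = response.indexOf(SENTINEL);
  // If the model ignored the sentinel, fall back to the whole response.
  return i >= 0 ? response.slice(i + SENTINEL.length).trim() : response.trim();
}
```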

Providing Context

There are two reasons for providing more context than just the current line or method stub to the model:

  1. Reuse existing methods and patterns from the current solution: e.g., call helper methods defined above, see the big picture of the current method, and apply existing coding styles.
  2. Know the existing language and framework: This is less likely an issue when asking the model to use regular expressions in JS, since both are highly popular techniques. In my situation, however, I am working with a more niche system (Smalltalk), of which the model has little knowledge of either the syntax or the available framework, and it tends to hallucinate.

So, I am inserting additional context into the prompt (essentially RAG), currently including:

  • Definition of the current class (list of variables and methods).
  • A list of related methods (retrieved using a similarity search in an embedding database of the entire system AND graph-based strategies such as references to the current method).
  • A list of related classes (retrieved using similarity search and graph-based strategies) plus their definitions.
  • And even a generic code sample explaining the general syntax of the programming language.
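Fitting all of these sources into the prompt requires some budgeting. A sketch of a greedy packing step (the priority order and the characters-per-token estimate are assumptions; a real implementation would use an actual tokenizer such as tiktoken):

```javascript
// Sketch: greedily pack retrieved context items into a fixed token budget.
// estimateTokens uses a rough 4-characters-per-token heuristic instead of a
// real tokenizer.
const estimateTokens = (text) => Math.ceil(text.length / 4);

function packContext(items, budgetTokens) {
  // items: [{ kind, text, priority }], lower priority number = more important
  const sorted = [...items].sort((a, b) => a.priority - b.priority);
  const packed = [];
  let used = 0;
  for (const item of sorted) {
    const cost = estimateTokens(item.text);
    if (used + cost > budgetTokens) continue; // skip items that do not fit
    packed.push(item);
    used += cost;
  }
  return packed;
}
```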

However, as you might imagine, this results in a pretty large prompt.

Token Usage and Latency

Summarizing all of the above, a typical prompt includes several thousand tokens, resulting in one to a few cents per completion and a latency of up to 10 seconds. Token usage grows even further because a single completion is often not good enough, so I request about 10 different completions at each invocation. For a tool that is expected to insert completions into your editor all day as you type, this is too expensive and too slow.

One thing that I have experimented with is splitting up the completion into two stages: generation and contextualization. I believe GitHub Copilot uses a similar mechanism. Generation is invoked whenever the major context changes (e.g., the name of the function you are typing), and contextualization is invoked whenever you type a new character. Thus, only generation requires a large prompt context, while contextualization only requires the original generation result and the current prefix typed by the user. As the task of contextualization is specified much more clearly by these inputs, I do not need to use an expensive GPT-4 model for it; GPT-3.5 Turbo, which is currently 20x cheaper and much faster, suffices. I might even employ heuristics for this step when the user just uses a different variable name than in the generation, etc., combined with sophisticated caching strategies.
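A minimal sketch of the contextualization stage under these assumptions (the cache key and the prefix-matching rule are deliberately simplified; real matching would need fuzzier rules, e.g., for renamed variables):

```javascript
// Sketch of the two-stage idea: a cached generation is produced once for a
// stable context key; contextualization then adapts it to each keystroke
// without a new model call whenever a simple prefix match suffices.
class CompletionCache {
  constructor() {
    this.key = null;      // e.g., the signature of the function being typed
    this.generation = ""; // full generated body for that context
  }
  store(key, generation) {
    this.key = key;
    this.generation = generation;
  }
  // Returns the remaining ghost text, or null if a model call is needed.
  contextualize(key, typedPrefix) {
    if (key !== this.key) return null; // major context changed: regenerate
    if (!this.generation.startsWith(typedPrefix)) return null; // fall back to the model
    return this.generation.slice(typedPrefix.length);
  }
}
```

A `null` return would then trigger the cheap model (or, failing that, a fresh generation).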

Related Work

Unfortunately, I have not found much related work specifically regarding the practical implementation of an interactive code completion tool, while there are a lot of publications and posts on code generation in general.

Some resources that I found slightly relevant so far are:

tl;dr: Despite the current advent of AI code completion tools, reproducing them comes with more challenges than you might assume. My largest unsolved issues today revolve around reducing the latency and token usage of the code completion while still including enough information about an unknown framework/language in the prompt and keeping the completion up to date for every character the user types.

Have you ever worked on a similar problem, or do you know any relevant materials I should check out? Have you got any other tips or ideas for me to try out? Or do you know other communities which are more specialized for this question? Whatever is on your mind, I would love to hear it!


Here’s a VSCode plugin I created that can call OpenAI. You could use it as a template to get started with, or just to see how this would be done in VSCode, at least as far as the app scaffolding goes.

I haven’t run it in a while, but it should work unless OpenAI has made any breaking changes on their end.


Are you sharing your progress on this tool publicly? I would love to have a pointer to your repo if it is public.


Thanks for your interest! Unfortunately, I have not yet been able to open-source it, but I have set a reminder to notify you when I have done so!

I think providing project-specific context is the hardest part. I would love to hear about anyone’s solutions.
