Codex completion returns lines of text of an arbitrary width

Hi community,

I use Codex to get code explanations. These completions often contain lines of text wrapped to an arbitrary width, for example between 70 and 100 characters. After a certain length, a \n character is inserted, breaking the flow of a sentence.

We post-process the answers from Codex; however, I’m finding this problem a bit tricky to solve once I take other cases into consideration. I was wondering whether you have seen this formatting style before and how you fixed it with prompt engineering.
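For what it’s worth, a minimal post-processing sketch for unwrapping hard-wrapped paragraphs might look like the following. This is only an illustration under the assumption that the text is plain prose: it would need extra guards for code blocks, lists, and other cases like the ones mentioned above.

```python
import re

def unwrap(text: str) -> str:
    """Join hard-wrapped lines within a paragraph into a single line.

    Single newlines become spaces; blank lines (paragraph breaks) are
    kept. Note: this naive version would also join wrapped lines inside
    code blocks, which is exactly one of the tricky cases.
    """
    return re.sub(r"(?<!\n)\n(?!\n)", " ", text)

print(unwrap("The width of the video\nis 620 pixels."))
# The width of the video is 620 pixels.
```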

Here’s an example of a completion wrapped at an arbitrary width:

<video width="620" controls poster="https://upload.wikimedia.org/wikipedia/commons/e/e8/Elephants_Dream_s5_both.jpg" >
"""
Explain what the code is doing:
The code is using the video tag to display a video. The width of the video 
is 620 pixels and it has controls for play, pause, volume etc.

Expected result:

<video width="620" controls poster="https://upload.wikimedia.org/wikipedia/commons/e/e8/Elephants_Dream_s5_both.jpg" >
"""
Explain what the code is doing:
The code is using the video tag to display a video. The width of the video is 620 pixels and it has controls for play, pause, volume etc.

Eddie

Hello @ediardo,

It looks like the comments you have are Python comments, so I am going out on a limb and assuming that you are working with Python.

According to PEP 8, the style guide for good Python code, comments in your source code should not exceed 72 characters:

The Python standard library is conservative and requires limiting lines to 79 characters (and docstrings/comments to 72).

Source

This is simply good practice, which is why Codex inserts newline characters around that length. If you don’t want ‘\n’ to occur at all, you need to set the ‘logit_bias’ parameter in your Completion API call.

‘\n’ is token ID 198, according to the tokenizer tool that OpenAI provides.

Among the other parameters, you need to add logit_bias={"198":-100} to prevent the newline token from being generated. However, that will suppress newlines for the whole completion, so you may need to modify the prompt so that Codex only generates comments after """\n.

You could also try telling Codex to ignore standard Python Enhancement Proposals (PEP) practices and that may help Codex produce desired results!

Hope this helps!


Thanks for the reply @DutytoDevelop. I certainly learned something new from you, as I didn’t know you could use logit_bias that way.

I’m not using Python, and there isn’t a single hint of Pythonic code in the prompt. Could it be that the way Python programmers format code influences how Codex decides what good completions should look like? Short lines of code are encouraged in other languages as well. My hypothesis is that Codex formats natural-language explanations the same way it would format code: ideally, lines should be less than 100 characters long.


Thanks for the correction; sorry for assuming Python. May I ask which programming language you are using? Typically, you get better results when you use the multi-line comment syntax paired with the programming language you are getting Codex to produce.

Also, that is a good theory @ediardo, and you could be right! Codex was trained on a large amount of public code from GitHub, so it’s possible it picked up typical per-line comment lengths from that data as well.