Adding regex support to the API

I currently use Codex a lot and I love it. However, it frustrates me that I can’t tell Codex what kind of output I want.

Often when I want to generate code, Codex generates comments instead. Especially when I use Codex in the shell (tom-doerr/zsh_codex on GitHub, a ZSH plugin that lets you use OpenAI's Codex in the command line) and generate just a single line, I have to regenerate the line over and over until it finally isn’t a comment.

The opposite is true as well. When I only want comments, it starts generating code at some point.

It would be great if you could add support for regexes, so that only tokens still allowed by the regex can be selected. For example, if I could pass the regex r'^[^#]' to the API, I could be sure that I get code instead of a comment.

This feature might also be useful for generating CSV, JSON, etc. files.

What do you think?

Thank you for the tip! I wasn’t aware of that feature; it might help to some degree.
However, there are disadvantages compared to regexes.

  • #, ##, ### and #### are different tokens, and there are probably many more that contain a #. If I wanted to suppress all of them to avoid comments, I would need to adjust the logit_bias for every one.
  • A suppressed token could no longer be used in code either, e.g. list.append('#') could no longer be generated.
  • If I want to force comments, adjusting logit_bias wouldn’t work. I would have to generate the text line by line with every line starting with a #.
  • For my shell plugin I only want the beginning of the next line to not contain a comment. For the lines after that comments can be quite helpful.
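
To make the first two points concrete, here is a minimal sketch of what a logit_bias-style mapping would have to look like. The vocabulary and its token ids are invented for illustration (real ids come from the model's tokenizer), and ban_tokens_containing is a hypothetical helper, not part of any API.

```python
# Toy vocabulary: token id -> token text. The ids are made up for illustration;
# real ids would come from the model's tokenizer.
TOY_VOCAB = {0: "#", 1: "##", 2: "###", 3: " #", 4: "git", 5: " branch", 6: "'#'"}


def ban_tokens_containing(vocab, needle, bias=-100):
    """Build a logit_bias-style mapping that suppresses every token containing `needle`."""
    return {str(token_id): bias for token_id, text in vocab.items() if needle in text}


logit_bias = ban_tokens_containing(TOY_VOCAB, "#")
print(logit_bias)
```

Even in this tiny vocabulary, five of the seven tokens have to be suppressed, and one of them is the quoted '#' that legitimate code such as list.append('#') would need.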

Hello @tom.doerr,

Do you have an example or two that you can provide us that results in Codex generating comments when they’re not desired? When you’ve run into this issue, did the file contain a lot of single-line comments?

I could try to help by tweaking the prompt to see if I can find a good technique to get the desired output you’re wanting!

I believe for real regular expressions (modelling a regular language) you don’t ever need to backtrack. It’s a property of regular languages. The engine can check at every step what tokens are allowed and select the most likely token from the set of allowed tokens.
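
A rough sketch of what I mean, assuming the engine exposes next-token log-probabilities. The automaton is hand-written for r'^[^#]' followed by anything, and the log-probabilities are made up; step, run and pick_token are hypothetical names, not an existing API.

```python
import math


def step(state, ch):
    """One DFA transition: 'start' moves to 'ok' on any character except '#'."""
    if state == "start":
        return "ok" if ch != "#" else "dead"
    return state  # 'ok' and 'dead' are absorbing states


def run(state, text):
    """Feed every character of a token through the DFA."""
    for ch in text:
        state = step(state, ch)
    return state


def pick_token(state, logprobs):
    """Greedily pick the most likely token that does not lead into the dead state."""
    allowed = {tok: lp for tok, lp in logprobs.items() if run(state, tok) != "dead"}
    token = max(allowed, key=allowed.get)
    return token, run(state, token)


# Made-up next-token log-probabilities; a real model would supply these.
logprobs = {"# usage": math.log(0.6), "git": math.log(0.3), "##": math.log(0.1)}
token, state = pick_token("start", logprobs)
print(token, state)  # git ok
```

The automaton state is carried forward from token to token, so nothing already emitted ever has to be undone. For a general regex, "dead" would have to mean "no accepting state is reachable any more", which can be precomputed once per pattern.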

I could generate a lot of completions and then check what completions match my regex, but that seems wasteful, slow and isn’t guaranteed to give me a solution in a reasonable amount of time.
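
For comparison, the generate-and-filter workaround looks roughly like this. The sample completions are hard-coded stand-ins for repeated API calls, and first_match is a hypothetical helper.

```python
import re

# The constraint from the first post: the completion must not start with '#'.
NOT_A_COMMENT = re.compile(r"^[^#]")


def first_match(completions, pattern=NOT_A_COMMENT):
    """Return the first completion matching the pattern, or None if every sample fails."""
    for text in completions:
        if pattern.match(text):
            return text
    return None


# Stand-ins for several sampled completions; each one would cost an API call.
samples = ["# usage: git-recent-branches [branch]", "#", "git branch --sort=-committerdate"]
print(first_match(samples))
```

Every rejected sample is a wasted request, and if all of them fail the function returns None, which is exactly the missing guarantee.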

If I don’t get the desired completion I am looking for, I’ll spend some time tweaking my prompt until I have made the changes necessary that are robust enough to prevent any undesirable traits in the completion.

For my shell plugin I only want the beginning of the next line to not contain a comment. For the lines after that comments can be quite helpful.

You may have already tried this, but did you emphasize to Codex that you only want the beginning of the next line to not contain a comment? If you did try this already, then it’s possible that we need to modify the prompt. I’ve encountered situations where a request placed towards the end of the prompt gets ignored by Codex because earlier requests take precedence.

@DutytoDevelop Here’s an example:
I wrote the comment

~ # git list all branches with recent activity.

in my shell and completed it using Codex. This is the result I got:

#
# usage: git-recent-branches [branch]
#
# if no branch is given, all branches will be listed.
#
# The output is suitable for usage with a pager like 'less' or 'more'.
#
# inspired by:
# http://stackoverflow.com/questions/29058322/git-list-all-branches-with-recent-activity
#
# Copyright (C) 2014 by Andreas Wegmann <andreas.wegmann@vectris.de>
#
# This program is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the Free
# Software Foundation; either version 2 of the License, or (at your option)
# any later version.
#
# This program is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
# FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
# more details.
#
# You should have received a copy of the GNU General Public License along with
# this program; if not, write to the Free Software Foundation, Inc., 59 Temple
# Place, Suite 330, Boston, MA 02111-1307 USA

All text I complete in the shell is appended to a ZSH shebang line, i.e. my prompt looks like this:

#!/bin/zsh

# git list all branches with recent activity. 

How would you emphasize to Codex that the beginning of the next line should not be a comment?

EDIT 10/20/2021: I fixed the completion below, @tom.doerr. I hadn’t set the max_tokens parameter high enough, so it chopped off the rest of the command. Whoops!

#!/bin/zsh

# Write a one or two line code-snippet for the following:
# git list all branches with recent activity. 

# Code:

git for-each-ref --sort=-committerdate refs/heads/ --format='%(HEAD) %(color:yellow)%(refname:short)%(color:reset) - %(color:red)%(objectname:short)%(color:reset) - %(contents:subject) - %(authorname) (%(color:green)%(committerdate:relative)%(color:reset))'

Note: I’m not sure if the above completion was the correct git command @tom.doerr, but I did get Codex to veer away from comment-only completions!

‘# Write a one or two line code-snippet for the following:’ and ‘# Code:’ are important because your previous prompt contains only a commented description of the git command, so the initial pattern that Codex picks up on is that the rest of the completion should be commented as well.

To change this behavior, you will need to find ways to ask it for a certain output. After a couple of tries, I came up with the prompt template shown above.

Adding information that specifies how Codex should respond makes the desired output much more likely, so the trick is finding the right way to engineer the prompt. The blank line before ‘# Code:’ helps steer Codex away from the pattern that every line should be a comment, and the marker itself specifies that the next section of the file should be code.
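
If you want to apply the template automatically, a minimal sketch could look like this. build_prompt is a hypothetical helper of mine, not something from the plugin, and it simply reproduces the template shown above around an arbitrary task description.

```python
def build_prompt(task, shebang="#!/bin/zsh"):
    """Wrap a task description in the instruction line and the '# Code:' marker."""
    return (
        f"{shebang}\n\n"
        "# Write a one or two line code-snippet for the following:\n"
        f"# {task}\n\n"
        "# Code:\n\n"
    )


prompt = build_prompt("git list all branches with recent activity.")
print(prompt)
```

The prompt ends with a blank line after ‘# Code:’, so the completion starts on a fresh line rather than continuing the comment.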

Good suggestions here. Also note that the last token in the prompt strongly biases the model to output text in the format you want. For example, if the last token is a curly open brace {, the probability of the next tokens being generated as comments is essentially 0.

I encountered this a lot in the playground as well. I found the easiest solution is to just type the first letter of the new line, so in your case just:

#!/bin/zsh

# git list all branches with recent activity. 

g

This way the model is more or less guaranteed to continue with code. It can be slightly annoying in cases where you don’t know what the line should start with.
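
In a script, the same trick might look like the sketch below. Both helpers are hypothetical, and the string 'it branch -a' only stands in for whatever the model would actually return.

```python
def seed_prompt(prompt, first_chars):
    """Append the characters the completion must start with to the end of the prompt."""
    return prompt + first_chars


def join_completion(first_chars, completion):
    """Re-attach the seeded characters, since the model only returns what follows them."""
    return first_chars + completion


prompt = "#!/bin/zsh\n\n# git list all branches with recent activity.\n\n"
seeded = seed_prompt(prompt, "g")
# 'it branch -a' is a stand-in for the model's completion of the seeded prompt.
print(join_completion("g", "it branch -a"))
```

The seeded characters have to be added back to the result, because the completion only contains the text that comes after the prompt.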
