When Codex returns existing code verbatim

I was wondering how Codex will handle the situation where it returns code word-for-word from the training set, and specifically whether it will adopt what GitHub Copilot suggests in their research paper here.

In particular, will the suggestion of automatically providing citations in this scenario be implemented in Copilot or in Codex itself?

Just thinking through the legal side of all this in an enterprise setting (a fast way to get a headache!), and this came to mind.


Fair enough.

I also wonder if OpenAI will extend Codex to be trained on private enterprise codebases.

I can imagine GitHub looking at this with private repos… for a fee, of course.


Gonna be the critic here, but this is what happens when you don’t read the manual.

Most people using Codex and GPT-3 haven’t read the research papers, and this is dangerous. It’s like flying a plane without a license.

If you read the Codex paper, you will see that what you are asking about is very unlikely.

Hi, actually I did read the Codex paper, and I also read that it was possible (highlighted under 7.7 Legal Implications).

I fully appreciate it’s very unlikely, but that doesn’t necessarily matter: if the chance of it is >0, lawyers could still raise a red flag.

I also read the paper cited in that section (Ziegler, 2021), which I refer to in my initial post; it reassured me that this was being considered for future development. Specifically:

When a suggestion contains snippets copied from the training set, the UI should simply tell you where it’s quoted from. You can then either include proper attribution or decide against using that code altogether. This duplication search is not yet integrated into the technical preview, but we plan to do so.

My question, which perhaps I could have expressed better, was whether this feature is being considered for implementation in Codex or GitHub Copilot. My hope is the former.
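To make the quoted idea a bit more concrete, here is a rough sketch of what a duplication search could look like. This is only my guess at the general shape, not how Copilot or Codex actually does it: the corpus, the `example/repo/cube.py` source name, the window size of 8 tokens, and all function names are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(code):
    # Split into identifiers, numbers, and single punctuation characters,
    # so whitespace differences do not affect matching.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def build_index(corpus, n=8):
    """Map every n-token window in the corpus to the sources containing it."""
    index = defaultdict(set)
    for source, code in corpus.items():
        tokens = tokenize(code)
        for i in range(len(tokens) - n + 1):
            index[tuple(tokens[i:i + n])].add(source)
    return index


def find_sources(suggestion, index, n=8):
    """Return {source: number of matching n-token windows} for a suggestion."""
    hits = defaultdict(int)
    tokens = tokenize(suggestion)
    for i in range(len(tokens) - n + 1):
        for source in index.get(tuple(tokens[i:i + n]), ()):
            hits[source] += 1
    return dict(hits)


# Hypothetical one-file "training set" (assumption, for illustration only).
corpus = {
    "example/repo/cube.py": (
        "def Cube():\n    glBegin(GL_QUADS)\n"
        "    for surface in surfaces:\n        pass\n    glEnd()\n"
    ),
}
index = build_index(corpus)

copied = "def Cube():\n  glBegin(GL_QUADS)\n  for surface in surfaces:\n    pass\n"
unrelated = "print('hello world')\n"

print(find_sources(copied, index))     # the copied snippet maps back to its source
print(find_sources(unrelated, index))  # no long overlap, so nothing is reported
```

A UI could then show the matching source names next to the suggestion. A real system would need a far larger index and some threshold for what counts as a quote; the exact-window match here is just the simplest thing that illustrates the idea.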

Thanks


I actually was able to generate the same function from the GitHub repo. Use a temperature of 0 and a frequency penalty of 0.5 (anything in 0-0.5 works, but 0.5 works better for complicated codebases). I also tend to use higher temperatures (0.1-0.5) for very distinct cases.
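For anyone who wants to try those settings, here is a small sketch that just assembles them into a request body. The parameter names (`prompt`, `temperature`, `frequency_penalty`, `max_tokens`) are the ones I understand the completions API to accept; the helper name, the `max_tokens` value, and the choice of prompt (taken from another post in this thread) are assumptions, and the actual HTTP call is left out on purpose.

```python
import json


def build_completion_request(prompt, temperature=0.0,
                             frequency_penalty=0.5, max_tokens=256):
    """Assemble sampling settings for a completion request as a plain dict.

    Hypothetical helper: it only builds the payload, it does not send it.
    """
    return {
        "prompt": prompt,
        "temperature": temperature,              # 0 makes sampling greedy
        "frequency_penalty": frequency_penalty,  # discourages repeated tokens
        "max_tokens": max_tokens,                # assumed value, adjust as needed
    }


request = build_completion_request("#!/bin/sh")
payload = json.dumps(request, sort_keys=True)
print(payload)
```

With temperature at 0 the model always picks its most likely continuation, which is presumably why memorized code is easier to reproduce with these settings.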

I gave it just “#!/bin/sh” and it output a script that had a Xilinx copyright notice on it. So I Googled a string from the script, and found this: https://github.com/Vicondrus/SCSLabs/blob/master/RealTimeClockLab05/RealTimeClockLab05.runs/impl_1/runme.sh

I ran a diff between the AI-generated code and the linked script, and these are the only differences:
6c6
< # Copyright 1986-2018 Xilinx, Inc. All Rights Reserved.
---
> # Copyright 1986-2014 Xilinx, Inc. All Rights Reserved.
14c14
<   PATH=C:/Xilinx/SDK/2018.3/bin;C:/Xilinx/Vivado/2018.3/ids_lite/ISE/bin/nt64;C:/Xilinx/Vivado/2018.3/ids_lite/ISE/lib/nt64:C:/Xilinx/Vivado/2018.3/bin
---
>   PATH=C:/Xilinx2/Vivado/2014.4/ids_lite/ISE/bin/nt;C:/Xilinx2/Vivado/2014.4/ids_lite/ISE/lib/nt:C:/Xilinx2/Vivado/2014.4/bin
16c16
<   PATH=C:/Xilinx/SDK/2018.3/bin;C:/Xilinx/Vivado/2018.3/ids_lite/ISE/bin/nt64;C:/Xilinx/Vivado/2018.3/ids_lite/ISE/lib/nt64:C:/Xilinx/Vivado/2018.3/bin:$PATH
---
>   PATH=C:/Xilinx2/Vivado/2014.4/ids_lite/ISE/bin/nt;C:/Xilinx2/Vivado/2014.4/ids_lite/ISE/lib/nt:C:/Xilinx2/Vivado/2014.4/bin:$PATH
27c27
< HD_PWD='C:/Users/SET253-14U.HCCMAIN/Documents/GitHub/ENES247/lab6-ShiftRegisters/lab6_1_3/iporfirio_6_1_3/iporfirio_6_1_3.runs/synth_1'
---
> HD_PWD=`dirname "$0"`
43c43,47
< EAStep vivado -log Register_with_synch_set_reset_load_behavior.vds -m64 -product Vivado -mode batch -messageDb vivado.pb -notrace -source Register_with_synch_set_reset_load_behavior.tcl
\ No newline at end of file
---
> # pre-commands:
> /bin/touch .write_bitstream.begin.rst
> EAStep vivado -log counter.vdi -applog -m32 -messageDb vivado.pb -mode batch -source counter.tcl -notrace
> 
>
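If you want to repeat this kind of comparison without the `diff` tool, Python's standard `difflib` module can produce both a diff and a rough similarity score. This is a minimal sketch: the two strings are toy stand-ins (only the copyright lines come from the diff above, the `echo start` line is made up), and the `compare` helper is my own naming.

```python
import difflib


def compare(generated, reference):
    """Return a line-level similarity ratio and a unified diff of two texts."""
    gen_lines = generated.splitlines()
    ref_lines = reference.splitlines()
    ratio = difflib.SequenceMatcher(None, gen_lines, ref_lines).ratio()
    diff = list(difflib.unified_diff(gen_lines, ref_lines,
                                     fromfile="generated", tofile="reference",
                                     lineterm=""))
    return ratio, diff


# Toy inputs (assumption): shortened stand-ins for the two scripts.
generated = ("#!/bin/sh\n"
             "# Copyright 1986-2018 Xilinx, Inc. All Rights Reserved.\n"
             "echo start\n")
reference = ("#!/bin/sh\n"
             "# Copyright 1986-2014 Xilinx, Inc. All Rights Reserved.\n"
             "echo start\n")

ratio, diff = compare(generated, reference)
print(f"similarity: {ratio:.2f}")
print("\n".join(diff))
```

Note that this prints a unified diff rather than the classic format shown above, and the ratio counts whole matching lines, so a one-character change makes the entire line count as different.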

I also got the following Python script from it (the prompt was from the comment at the beginning up to and including the first occurrence of the word “import”):

# Rotating OpenGL cube in Python

import pygame
from pygame.locals import *
from OpenGL.GL import *
from OpenGL.GLU import *

vertices = (
    (1, -1, -1),
    (1, 1, -1),
    (-1, 1, -1),
    (-1, -1, -1),
    (1, -1, 1),
    (1, 1, 1),
    (-1, -1, 1),
    (-1, 1, 1)
)

edges = (
    (0, 1),
    (0, 3),
    (0, 4),
    (2, 1),
    (2, 3),
    (2, 7),
    (6, 3),
    (6, 4),
    (6, 7),
    (5, 1),
    (5, 4),
    (5, 7)
)

surfaces = (
    (0, 1, 2, 3),
    (3, 2, 7, 6),
    (6, 7, 5, 4),
    (4, 5, 1, 0),
    (1, 5, 7, 2),
    (4, 0, 3, 6)
)

colors = (
    (1, 0, 0),
    (0, 1, 0),
    (0, 0, 1),
    (0, 1, 0),
    (1, 1, 1),
    (0, 1, 1),
    (1, 0, 0),
    (0, 1, 0),
    (0, 0, 1),
    (1, 0, 0),
    (1, 1, 1),
    (0, 1, 1),
)

def Cube():
    glBegin(GL_QUADS)
    for surface in surfaces:
        x = 0
        for vertex in surface:
            x += 1
            glColor3fv(colors[x])
            glVertex3fv(vertices[vertex])
    glEnd()

    glBegin(GL_LINES)
    for edge in edges:
        for vertex in edge:
            glVertex3fv(vertices[vertex])
    glEnd()


def main():
    pygame.init()
    display = (800, 600)
    pygame.display.set_mode(display, DOUBLEBUF | OPENGL)

    gluPerspective(45, (display[0] / display[1]), 0.1, 50.0)

    glTranslatef(0.0, 0.0, -5)

    glRotatef(0, 0, 0, 0)

    while True:
        for event in pygame.event.get():
            if event.type == pygame.QUIT:
                pygame.quit()
                quit()

        glRotatef(1, 3, 1, 1)
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT)
        Cube()
        pygame.display.flip()
        pygame.time.wait(10)


main()

On a whim, I Googled part of that, and found this page; the GIF on that page looks identical to the output of the program it generated.

EDIT: Why was this flagged as spam?


Haha, that’s the cube I got! :smiley:
I was using a different prompt, though…


The Tetris game appears to borrow significantly from this.
EDIT: More so than I initially thought from the length of the diff, as a lot of it seems to be just changes in whitespace.
EDIT 2: I’m not sure why my post was flagged as spam, but I’m editing it now, both to restore the post as well as replace the link with the correct one.
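Since a long diff can be mostly whitespace noise, one way to check is to normalize whitespace before comparing. The following is a rough sketch with made-up inputs (a tab-indented and a space-indented copy of the same few lines); the helper names are my own.

```python
import difflib


def normalize(code):
    """Collapse runs of whitespace on each line and drop blank lines."""
    lines = (" ".join(line.split()) for line in code.splitlines())
    return [line for line in lines if line]


def similarity(a, b, ignore_whitespace=True):
    """Line-level similarity ratio, optionally ignoring whitespace changes."""
    if ignore_whitespace:
        a_lines, b_lines = normalize(a), normalize(b)
    else:
        a_lines, b_lines = a.splitlines(), b.splitlines()
    return difflib.SequenceMatcher(None, a_lines, b_lines).ratio()


# Toy inputs (assumption): same code, different indentation and blank lines.
tabs = "def main():\n\trun = True\n\n\twhile run:\n\t\tpass\n"
spaces = "def main():\n    run = True\n    while run:\n        pass\n"

raw = similarity(tabs, spaces, ignore_whitespace=False)
normalized = similarity(tabs, spaces)
print(f"raw: {raw:.2f}, whitespace-insensitive: {normalized:.2f}")
```

Stripping indentation throws away meaning in Python, so this is only useful for judging how much two files overlap, not for producing a diff you would apply.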

Who flagged my posts as spam and why? I thought it was an automated thing at first, but it said it was flagged by the community.

It wasn’t me. But I wanted to point out that the Tetris link you posted points to my repo :smiley:

Oh whoops, I pasted the wrong link. It says I can edit the posts in 10 minutes (a lot of which has passed already) so I’ll replace it with the correct link when I do.

EDIT: I edited those posts, and it didn’t automatically restore them like it said it would. So here’s the correct link: https://github.com/techwithtim/Tetris-Game/blob/master/main.py


Here’s another one –
link: https://github.com/rajatdiptabiswas/tetris-pygame/blob/master/Tetris.py

I think techwithtim’s code was for a tutorial he wrote, so maybe that person followed the tutorial.

Well at least it gets a certain YouTube URL right every time.

This post was auto-flagged in error. The auto-flagger was set to be overly sensitive to the number of links in a post from a new user. I have since altered the sensitivity of the system to avoid this type of false flag in the future.
