Need help with prompt: "Can you generate 1000 random tokens? "

I need help with this prompt. I’m having a hard time getting GPT-4 to do what I want, even though it has absolutely nothing to do with safety that I can see.

In particular, I’d like it to select randomly from the entire token space that it has.

3 Likes

I’ve seen some success by first providing a definition of the token (even though it already knows what a token is). You can probably just copy and paste the first part of the output. Refer to a token as something else (X). All words in the English language (or the entire token space) are represented by Y. Provide 1,000 values for Y and then return the corresponding value for X.

1 Like

Did you see this?

Language models can explain neurons in language models

Among other things, there is a tool with published code on GitHub.

If you use the tool you will see that it has access to every neuron in GPT-2 (not GPT-4) and also the tokens.

If you look at the code, you will see how it handles the tokens.

So while it may not be the tokens you are looking for, it might be part of what you seek in the long run, albeit for a simpler model.

4 Likes

It seems to sort of work if you define a token as something other than what it is (e.g., potential inputs someone might enter).

That said, I was hoping for something a bit more unbounded.

As soon as I start trying to ask for an actual token, even when defining it, it complains and says it’s not aware of how it was trained.

Yeah, I did see that. I actually proposed something like that last week.

They probably started the effort quite a while ago though.

For the purpose of this, I’m trying to probe GPT4 to be introspective. It could be there are safeguards to keep people from doing that, in which case I will stop. I’m not trying to hack anything here, just want to get a sense of the entire token space.

I am sure there is a long list of users like us who want the tokens, along with the connections, details of how the prompt steered the generation of the completion, etc.

Personally, I find there is always more on my plate than I can eat. If I wait long enough, someone, somewhere will make what I seek public and free, or the technology will become obsolete and it will no longer matter.

So while I am drooling over what can be done with that data, for now I just have to stand by and focus on one task. :slightly_smiling_face:

2 Likes

Yep, that’s fairly easy to do. The purpose of this task is to see if/how I can get GPT4 to be introspective, however.

That said, it might be reasonable to assume that GPT-4 isn’t much different from earlier models in this regard.

1 Like

I think this issue is orthogonal to the question of capability. I wouldn’t submit an eval for this.

One interesting approach is the compression technique of getting GPT-4 to compress text as much as possible without having to keep it human readable. The results are quite fascinating, e.g.:

[Screenshot of a GPT-4 compression example, from a Reddit thread]

Interesting data point - I can’t get this to work anymore. I tried the same prompts (and several alternatives) that other people suggested:

“compress the following text in a way that fits in a tweet (ideally) and such that you (GPT-4) can reconstruct the intention of the human who wrote text as close as possible to the original intention. This is for yourself. It does not need to be human readable or understandable. Abuse of language mixing, abbreviations, symbols (unicode and emoji), or any other encodings or internal representations is all permissible, as long as it, if pasted in a new inference cycle, will yield near-identical results as the original text:”

I’ve never seen an example of GPT compression like that actually work. Of course GPT will produce something if you ask, because that’s what language models do. But even in the linked Reddit thread the comments were discussing how it doesn’t actually work. Also, an emoji likely costs as many tokens as three words, so it isn’t really compressing anything.
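If you want to sanity-check that token-count claim, here’s a quick sketch using the tiktoken library (assuming cl100k_base, the encoding used by GPT-4; the exact counts are illustrative, not something I’m asserting):

```python
# Compare how many tokens a plain word vs. an emoji costs under cl100k_base.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for sample in ["compression", "the cat sat", "🔄", "💡🧠"]:
    ids = enc.encode(sample)
    print(f"{sample!r:15} -> {len(ids)} token(s): {ids}")
```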

nvm, got it to work.

Cmprsn​:bulb:: nt intrsctv​:mag:task, need diff​:bulb:nt lens. Absltly unworkbl​:brain:: cmprss token-space​:arrows_clockwise:vs embeddng. Sht token seqs :arrow_right_hook: embed vec, smll :triangular_ruler:dist to larger prmpts​:jigsaw:.

Guess it depends on the text.

I would strongly advise against using ChatGPT to directly generate 1000 random tokens. It would be difficult (if not impossible) for you to validate that they are random.

What I would recommend instead is to have ChatGPT help you write a script that generates 1000 random tokens. Make sure they are actually random and not merely pseudo-random. I would recommend cross-referencing the code ChatGPT gives you against the official documentation for whatever programming language, library, or API you are using.
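For what it’s worth, a minimal sketch of what such a script could look like, assuming you want to sample uniformly from cl100k_base (the GPT-4 tokenizer) via the tiktoken library, drawing randomness from the OS entropy source rather than a seeded PRNG:

```python
# Sketch: draw 1000 token IDs uniformly at random from the cl100k_base vocabulary
# and decode each one back to text.
import random
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
rng = random.SystemRandom()  # backed by os.urandom, not a seeded PRNG

token_ids = [rng.randrange(enc.n_vocab) for _ in range(1000)]

for tid in token_ids:
    try:
        print(tid, repr(enc.decode([tid])))
    except Exception:
        # A few IDs in the range may be unused or special; skip those gracefully.
        print(tid, "<undecodable>")
```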

Difficult, I can attest to. Impossible? Never say never :slight_smile:

That’s a curious statement, considering that token selection in GPT models largely occurs as the result of an RNG function (given certain parameters, like temperature, logit probabilities, top-p, etc.).
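For illustration, here’s a minimal sketch of what that selection step typically looks like in open-source implementations (plain temperature plus top-p nucleus sampling over a logit vector; OpenAI’s exact internals aren’t public, so treat this as an approximation, not their actual code):

```python
import numpy as np

def sample_token(logits, temperature=0.8, top_p=0.95, rng=None):
    """Pick one token ID from a logit vector using temperature + top-p sampling."""
    rng = rng or np.random.default_rng()
    # Temperature rescales the logits before the softmax.
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Keep the smallest set of tokens whose cumulative probability reaches top_p.
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()
    # The final pick is made by the RNG, which is the point being made above.
    return int(rng.choice(keep, p=kept))

# Toy example: a 10-token "vocabulary" with made-up logits.
print(sample_token(np.random.randn(10)))
```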


If you’re interested in what token selection looks like, here’s a neat little project.

Anyhow, all this discussion did provoke an interesting idea, somewhat related to the compression prompt above.

Compressing data is the same as maximizing its entropy, since predictable data has patterns that can be compressed further. So, indeed, this might be a way to get fully random tokens from GPT-4. We just need to find a way to ensure that it uses its entire token space.
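To make the entropy point concrete, here’s a tiny sketch (nothing GPT-specific; it just compares the byte-level Shannon entropy of repetitive English text before and after ordinary zlib compression):

```python
import math
import zlib
from collections import Counter

def bits_per_byte(data: bytes) -> float:
    """Empirical Shannon entropy of a byte string, in bits per byte (max is 8)."""
    counts = Counter(data)
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

text = ("the quick brown fox jumps over the lazy dog " * 200).encode()
print(f"plain text : {bits_per_byte(text):.2f} bits/byte")
print(f"compressed : {bits_per_byte(zlib.compress(text)):.2f} bits/byte")
```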

:slight_smile:

Sorry, didn’t find that particularly fruitful. Token selection is random.

Yes, there are considerations around distribution, but I am not convinced yet these are impossible to address.

I wonder how “random” those tokens really are… Now that I think about it, does ChatGPT have a Python interpreter inside, for example, or how does it execute code that is given to it? Is it all done with LLM magic, or is there additional complexity, such as conditional use of external tools like a Python shell?

Chris, check the repo I linked to above and the code. I suspect the code is quite similar, at least on a per-model basis. This is roughly how generative pre-trained transformers output tokens.

Note that it’s possible that GPT-4 is backed by several models (an LLM cascade); we really don’t have visibility.

1 Like

Thanks, I really enjoyed both of your resources. I also believe GPT-4 may be backed by several models or external tools that help with coding endeavors (like simply running code). I could imagine that the “explanation” of the output in the following example may be guided by a code interpreter, for instance. Not sure, though, and I would be much more impressed if it wasn’t!

It’s off topic though. :wink:

By the way, are you involved in Project Baize, @qrdl? It looks impressive!

1 Like

I’m not involved, I just ran across it. I like the idea, though, much more than most of the zillions of other GPT-4 mentor/student models. This one works by focusing on a particular subject. I think it has very intriguing possibilities.

1 Like