The text-embedding-ada-002 model was pretty much released as a black box. The only thing we get is its dimensionality, 1536, compared with the roughly 1000 to 11000 dimensions of the GPT-3 embedding models.
Given that GPT-4's training was completed over a year ago, the embedding model could even be one of its specialized sub-models, some reportedly more optimized and a bit smaller, according to one plausible GPT-4 article and another that seemed more insider and detailed.
It is interesting that for embeddings, for the Codex work, and for the choice made in this astro project, the jump from 13B to 175B, while giving a massive increase from "cute" to "OMG" in language inference, often doesn't benefit, and can even penalize, specialized use.
I have not. I haven’t played with LLaMA at all yet, and it’s not possible with the OpenAI models (yet anyway).
It looks like it’s incredibly straightforward with LLaMA 2.
I imagine OpenAI could (if they wanted to) include a parameter in the chat/completions/ API to return an embedding as an amalgamation of the final hidden layer, with or without continuing on to the generation phase. Who knows, maybe that’s how all the first generation text embedding models worked? They simply computed the hidden layers of the input, grabbed the last layer and reduced that to an embedding vector?
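To make that concrete, here is a minimal sketch of the idea (mine, not anything OpenAI exposes or the AstroLLaMA authors' code): run the input through an open causal LM, grab the final hidden layer, and pool it into one embedding vector. The model name and mean-pooling choice are just assumptions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)  # in practice you'd load fp16 on a GPU
model.eval()

def embed(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)           # forward pass only, no generation phase
    hidden = outputs.last_hidden_state      # (1, seq_len, hidden_size)
    return hidden.mean(dim=1).squeeze(0)    # mean-pool over tokens -> (hidden_size,)

vector = embed("Lithium abundances in exoplanet host stars")
print(vector.shape)  # torch.Size([4096]) for a 7B LLaMA-family model
```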
It would be absurdly more expensive than text-embedding-ada-002 to use a fine-tuned gpt-3.5-turbo just for embeddings, but… if it were possible to collect the embeddings while using the model anyway, it might make sense.
The thing I don't know is whether it's possible to get an equivalent embedding vector of the response essentially "for free," the same way it is for the input. If so, that would change the math somewhat, since you could get the (presumably) higher-quality embedding of a hypothetical response at no additional cost, which might make an upfront investment of seeding your VDB knowledge base with much more expensive embeddings worthwhile.
It would also greatly increase the value of fine-tuning a model because you’d also get a fine-tuned embedding model essentially free.
All of that is predicated on the hypothetical that you could get these sorts of embeddings of the outputs without additional computation—which I’m skeptical of.
Edit: the study posted by Curt does the same thing, just with individual words instead of sentences, and they reach basically the same conclusion:
We find that the word vectors are not centered around the origin, and the average cosine similarity between two random words is much higher than zero, which indicates that the word vectors are distributed in a narrow cone and deteriorate the representation capacity of word embedding.
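For anyone who wants to reproduce that observation on their own vectors, a rough sketch of the check: sample random pairs of embeddings and average their cosine similarity. `vectors` is assumed to be whatever (N, d) matrix of embeddings you already have.

```python
import numpy as np

def mean_random_cosine(vectors: np.ndarray, pairs: int = 10_000, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    i = rng.integers(0, len(unit), size=pairs)
    j = rng.integers(0, len(unit), size=pairs)
    keep = i != j                                   # drop accidental self-pairs
    return float(np.mean(np.sum(unit[i[keep]] * unit[j[keep]], axis=1)))

# Isotropic vectors would give a mean near 0; ada-002 embeddings reportedly
# come out well above zero, consistent with the quoted "narrow cone" finding.
```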
We just wanted to take a moment to express our gratitude for this engaging discussion from the community. We're primarily a team of astronomers with a background in computer science, working on this challenge. The insightful conversations we've had will undoubtedly shape our thinking for future iterations. Rest assured, we're not stopping here. Thanks again for your invaluable input! And clearly, contributions are more than welcome!
+1 on that, I’m always happy to see other researchers join the forum!
@curt.kennedy good paper, their proposed solution seems very elegant, but it also seems more computationally intensive than just masking out a large randomly distributed set of tokens during each iteration of the training process. Is there something I'm missing here?
You probably aren't missing anything. It usually takes considerable computational work to make the embedding space more evenly distributed. For example, batch processing a set of embeddings using ABTT is pretty expensive.
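For reference, a rough sketch of the ABTT ("All-But-The-Top") post-processing mentioned above, as I understand it from Mu & Viswanath (2017): center the embeddings, then remove their projections onto the top few principal components. The choice of D is the paper's rule of thumb (about d/100), not something tuned here.

```python
import numpy as np

def abtt(embeddings: np.ndarray, D=None) -> np.ndarray:
    n, d = embeddings.shape
    if D is None:
        D = max(1, d // 100)                      # heuristic from the paper
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Top-D principal directions via SVD of the centered matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:D]                                  # (D, d)
    return centered - centered @ top.T @ top      # remove the dominant directions

# Note this is batch post-processing: every new embedding must get the same mean
# and components applied, which is part of why it is awkward to maintain downstream.
```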
In the end, it may not be wise for the end user to redistribute the embeddings, and it may also be prohibitive for foundational model developers to do the same.
Also, in the end, you get what you compute! Until there is massive shaming of foundational models’ embeddings not being isotropic, it will continue. Just don’t be surprised that “all my embeddings are in a small cone”.
The vector geometry coming out of the models is nonsensical. User beware, I suppose.
I really enjoyed your paper and was excited to foster some discussion around it here.
First, I would like to encourage you to apply to the OpenAI Researcher Access Program. I would be very excited to see the results you would get using a much stronger base model than LLaMA 2 7B.
Next, since you’re here I hoped you might be able to join in the discussion and provide some more insights.
Aside from the Perplexity score, did you consider any other metrics for the quality of the completions?
Astrophysics isn’t my area of expertise, so looking at your examples didn’t give me much insight into the relative quality of the responses. Can you add a little about how you found the results to be generally? Do they tend to make sense or do they read like a smart second-year bullshitting their way through something they didn’t prepare for?
Regarding the embedding results, again without knowing much about astrophysics it's difficult to tell whether the retrievals are better or worse than text-embedding-ada-002. Did you consider creating an astrophysics-specific retrieval benchmark against which to measure? Because, as has been noted here already, the variance in the embedding space isn't a particularly good measure of the quality of the embeddings.
Jo here from universeTBD. Thank you so much for your recommendation. I will apply to the Researcher Access Program ASAP so we can hopefully get access to OpenAI's models. Yay!
Regarding the Perplexity score, we’ve been thinking about using MAUVE to compare the generated abstract to the validation/ground-truth abstract. We are also currently building a framework for expert evaluation, but that will take a bit more time to complete.
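For readers following along, here is a minimal sketch of how perplexity over a held-out abstract could be computed with a Hugging Face causal LM. This is the generic recipe, not UniverseTBD's exact evaluation code, and the model name is a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean next-token
        # cross-entropy loss; perplexity is its exponential.
        out = model(**enc, labels=enc["input_ids"])
    return float(torch.exp(out.loss))

print(perplexity("We present JWST observations of ..."))
```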
Based on our internal expert evaluation, the generated abstracts show a high degree of logical consistency, though of course the quantitative details don't make much sense. However, this is something that can be fixed fairly easily with fine-tuning or by instructing the model not to give quantitative details.
We are happy to think more about creating the astrophysics-specific retrieval benchmark and will discuss this with the team. Thank you so much again and thank you to the community for engaging with our first project!
Just circling back about the Researcher Access Program. We were supported by the Microsoft Accelerate Foundational Model Research Initiative (and hence have some OpenAI tokens) and are currently in the renewal evaluation phase. Any fresh thoughts on whether it's still a good move for us to apply for OpenAI's Researcher Access Program?
On the retrieval front, the comment about taking sets of embeddings from each model, finding the top-n matches, and then algorithmically scoring them is absolutely pertinent. I totally agree that retrieval is a key benchmark. As Jo mentioned, we'll be running some tests soon. Interestingly, some astronomy groups are already diving into this using AstroLLaMa.
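As a sketch of the kind of head-to-head test being suggested (all names and the scoring choice are placeholders, not our pipeline): embed the same corpus and queries with each model, take the top-k nearest neighbours by cosine similarity, and score against known-relevant papers, e.g. citation links or expert labels.

```python
import numpy as np

def recall_at_k(query_vecs, corpus_vecs, relevant, k: int = 10) -> float:
    """query_vecs: (Q, d); corpus_vecs: (N, d); relevant: list of sets of corpus indices."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = q @ c.T                                  # (Q, N) cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = [len(set(row) & rel) / max(1, len(rel)) for row, rel in zip(topk, relevant)]
    return float(np.mean(hits))

# Run the same function once with AstroLLaMa embeddings and once with ada-002
# embeddings of the identical corpus, and compare the scores on the labelled set.
```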
I also wanted to touch on why we showcased those instances where AstroLLaMa and Ada disagree, especially in relative rankings rather than absolute variance. In my experience, AstroLLaMa seems to nail the context; it really "gets" the relationship between papers even when they don't explicitly mention the same topics. For instance, a paper that doesn't directly talk about exoplanets but focuses on their impact on stars (like Lithium abundances) still ranks high in similarity with other exoplanet papers. Ada, meanwhile, seems to fixate on certain keywords and miss the overall context. While this is not the same as a retrieval task, we showcased it because some of the downstream tasks we have in mind (e.g., constructing a knowledge graph) often rely only on correct pairwise similarity between papers rather than general retrieval. That said, we totally agree that we should test the latter too, since it is a different benchmark.
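To illustrate that pairwise-similarity use case, a small sketch of turning a matrix of paper embeddings into a similarity graph by keeping edges above a threshold. The threshold and the use of networkx are illustrative assumptions, not our actual knowledge-graph pipeline.

```python
import numpy as np
import networkx as nx

def similarity_graph(embeddings: np.ndarray, titles: list, threshold: float = 0.8) -> nx.Graph:
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T                      # pairwise cosine similarities
    graph = nx.Graph()
    graph.add_nodes_from(titles)
    n = len(titles)
    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] >= threshold:       # keep only strongly related pairs
                graph.add_edge(titles[i], titles[j], weight=float(sims[i, j]))
    return graph
```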
As for the generative capabilities of AstroLLaMa, as Jo mentioned, it's pretty context-aware and logical, and it doesn't just spit out generic text like you'd find in a pop science book. Even though it's a small model, it's impressing me (and my astronomy instincts) on both abstract completion and embedding/paper-similarity tasks. This is why we thought it could be a community asset and shamelessly released it even though it is a super small model.
Finally, we're not eyeing a GPT-3.5 training run right now; we're headed in a different direction. But the discussion here suggests that going down that route could be relatively straightforward and cost-effective. We're definitely open to other contributions, particularly if we can stick to the same training data for an apples-to-apples comparison.
Keep those insightful comments coming, and don’t hesitate to push us hard. We’re learning a ton from all of you.
I don't think there's any rule or restriction against doing both, but if you have a contact within the Microsoft program it wouldn't hurt to ask. To the best of my knowledge they're entirely separate programs.
This is exciting to read, it’s a shame about text-embedding-ada-002, but the idea of a domain-specific fine-tuned embedding model is very interesting!
Following on from what @elmstedt has said, absolutely apply for the OpenAI Researcher Access Program; this is exactly what it was created for. It would be super exciting to see GPT-3.5 fine-tuned with similar data.