Ahhhh! Funny thing is that isotopic knot loops are also called “embeddings”, so my mind was quite literally doing “loops”.
I found this approach to calculating an anisotropy score in a different paper:
I do like it, it seems to make sense intuitively.
Edit: the study posted by Curt does the same thing, just with individual words instead of sentences; they also conclude basically the same thing:
We find that the word vectors are not centered around the origin, and the average cosine similarity between two random words is much higher than zero, which indicates that the word vectors are distributed in a narrow cone and deteriorate the representation capacity of word embedding.
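That “average cosine similarity between two random vectors” measure is easy to compute yourself. Here is a minimal numpy sketch (the function name and the number of sampled pairs are my choices, not from either paper): an isotropic embedding space scores near 0, while the “narrow cone” geometry described above scores well above 0.

```python
import numpy as np

def anisotropy_score(embeddings: np.ndarray, n_pairs: int = 10_000, seed: int = 0) -> float:
    """Estimate anisotropy as the mean cosine similarity between
    randomly sampled pairs of embedding vectors."""
    rng = np.random.default_rng(seed)
    n = len(embeddings)
    i = rng.integers(0, n, n_pairs)
    j = rng.integers(0, n, n_pairs)
    keep = i != j  # skip self-pairs, which trivially score 1
    a, b = embeddings[i[keep]], embeddings[j[keep]]
    sims = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return float(sims.mean())
```

Random Gaussian vectors centered at the origin score close to 0; shift the same vectors away from the origin (the “cone” situation) and the score jumps toward 1.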
I think a paper that has a deeper explanation of what is going on in Ada-002 embeddings is this one:
The main problem here is the Representation Degeneration Problem. It’s a fight between rare tokens and the hidden states within the model.
TL;DR, it’s this paragraph:
We just wanted to take a moment to express our gratitude for this engaging discussion from the community. We’re primarily a team of astronomers with a background in computer science, working on this challenge. The insightful conversations we’ve had will undoubtedly shape our thinking for future iterations. Rest assured, we’re not stopping here. Thanks again for your invaluable input! And clearly, contribution is more than welcome!
Hi and welcome to the Developer Forum!
Nice to meet one of the authors of the paper!
+1 on that, I’m always happy to see other researchers join the forum!
@curt.kennedy good paper, their proposed solution seems very elegant, but it also seems more computationally intensive than just masking out a large randomly distributed set of tokens during each iteration of the training process. Is there something I’m missing here?
You probably aren’t missing anything. Usually it takes considerable computational work to get the embedding space to be more evenly distributed. For example, batch post-processing a set of embeddings using ABTT (All-But-The-Top) is pretty expensive.
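For reference, the ABTT (“All-But-The-Top”, Mu & Viswanath, 2018) post-processing mentioned above fits in a few lines of numpy: subtract the mean vector, then remove projections onto the top principal components, which carry most of the common anisotropic energy. The `n_components` default below is an assumption (the paper suggests roughly dim/100):

```python
import numpy as np

def all_but_the_top(embeddings: np.ndarray, n_components: int = 2) -> np.ndarray:
    """ABTT post-processing: center the embeddings, then strip their
    projections onto the top `n_components` principal directions."""
    centered = embeddings - embeddings.mean(axis=0)
    # Principal directions via SVD of the centered matrix
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:n_components]                      # (n_components, dim)
    return centered - centered @ top.T @ top
```

The expense comes from the SVD over the whole batch, which is why doing this client-side over a large embedding store is not cheap.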
In the end, it may not be wise for the end user to redistribute the embeddings, and it may also be prohibitive for foundational model developers to do the same.
Also, in the end, you get what you compute! Until there is massive shaming of foundational models’ embeddings not being isotropic, it will continue. Just don’t be surprised that “all my embeddings are in a small cone”.
The vector geometry is nonsensical coming out of the models. User beware, I suppose.
Alright, just wanted to make sure I wasn’t missing something important
Agreed, although I think there’s some wiggle room here depending on who the end user is.
I was that user. I was so giddy sitting on 80k different embedding vectors. Excited, I created my first Maximum Inner Product search function to return the top 10 correlations near +1, 0, and -1.
I wanted to know what phrases were orthogonal and also opposite of what I put in.
In the end, everything was near +1 using ada-002 embeddings.
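A minimal sketch of that kind of search, for anyone who wants to reproduce the experiment (using cosine similarity rather than raw inner products; the function name and defaults are mine): it returns the corpus vectors whose similarity to the query is closest to a chosen target of +1, 0, or -1.

```python
import numpy as np

def nearest_to_target(query: np.ndarray, corpus: np.ndarray,
                      target: float, k: int = 10):
    """Indices of the k corpus vectors whose cosine similarity with
    `query` is closest to `target` (+1 similar, 0 orthogonal, -1 opposite)."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = c @ q
    order = np.argsort(np.abs(sims - target))
    return order[:k], sims[order[:k]]
```

With ada-002, the punchline above is that all three targets return results crowded near +1, because nothing in the corpus sits anywhere near 0 or -1.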
Hey, I just wanted to welcome you to the forum!
I really enjoyed your paper and was excited to foster some discussion around it here.
First, I would like to encourage you to apply to the OpenAI Researcher Access Program. I would be very excited to see the results you would get using a much stronger base model than LLaMA 2 7B.
Next, since you’re here I hoped you might be able to join in the discussion and provide some more insights.
Aside from the Perplexity score, did you consider any other metrics for the quality of the completions?
Astrophysics isn’t my area of expertise, so looking at your examples didn’t give me much insight into the relative quality of the responses. Can you add a little about how you found the results to be generally? Do they tend to make sense or do they read like a smart second-year bullshitting their way through something they didn’t prepare for?
Regarding the embedding results, again without knowing much about astrophysics it’s difficult to understand whether the retrievals are better or worse than
text-embedding-ada-002. Did you consider creating an astrophysics-specific retrieval benchmark against which to measure? Because, as has been noted here already, the variance in the embedding space isn’t a particularly good measure of the quality of the embeddings.
Jo here from universeTBD, thank you so much for your recommendation. I will make the application for the Researcher Access Program ASAP so we can hopefully get access to OpenAI’s models, yay!
Regarding the Perplexity score, we’ve been thinking about using MAUVE to compare the generated abstract to the validation/ground-truth abstract. We are also currently building a framework for expert evaluation, but that will take a bit more time to complete.
Based on our internal expert evaluation, the generated abstracts show a high degree of logical consistency, but of course the quantitative details don’t make much sense. However, this is something that can be easily fixed with fine-tuning or by instructing the model not to give quantitative details.
We are happy to think more about creating the astrophysics-specific retrieval benchmark and will discuss this with the team. Thank you so much again and thank you to the community for engaging with our first project!
Just circling back about the Research Access Program. We were supported by the Microsoft Accelerate Foundational Model Research Initiative (and hence have some OpenAI tokens) and are currently in the renewal evaluation phase. Got any fresh thoughts on whether it’s still a good move for us to apply for OpenAI’s Research Access Program?
On the retrieval front, the comment of taking sets of embeddings from each model, finding top-n matches, and then algorithmically scoring them is absolutely pertinent. Totally agree that retrieval is a key benchmark. Jo mentioned we’ll be running some tests soon. Interestingly, some astronomy groups are already diving into this using AstroLLaMa.
I also wanted to touch on why we showcased those instances where AstroLLaMa and Ada disagree, especially in relative rankings, disregarding the absolute variance. In my experience, AstroLLaMa seems to nail the context: it really “gets” the relationship between papers even if they don’t explicitly mention the same topics. For instance, a paper that doesn’t directly talk about exoplanets but focuses on their impact on stars (like Lithium abundances) still ranks high in similarity with other exoplanet papers. Ada, meanwhile, seems to fixate on certain keywords and miss the overall context. While this is not the same as a retrieval task, we showcased it because some of the downstream tasks we have in mind (e.g., constructing a knowledge graph) often rely only on the correct pairwise similarity between papers rather than on general retrieval. That said, we totally agree that we should test the latter too, because it is a different benchmark.
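One way to quantify that kind of ranking disagreement, independent of each model’s absolute similarity scale, is rank correlation between the two models’ similarity orderings over the same corpus. A hedged numpy sketch (function names are mine, not from the paper; ties are assumed absent, which holds for continuous similarities):

```python
import numpy as np

def spearman(a: np.ndarray, b: np.ndarray) -> float:
    """Spearman rank correlation between two score vectors (no ties)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

def ranking_agreement(q_a, corpus_a, q_b, corpus_b) -> float:
    """How similarly two embedding models rank the same corpus against
    the same query, ignoring absolute similarity scale."""
    def sims(q, c):
        qn = q / np.linalg.norm(q)
        cn = c / np.linalg.norm(c, axis=1, keepdims=True)
        return cn @ qn
    return spearman(sims(q_a, corpus_a), sims(q_b, corpus_b))
```

A score near 1 means the two models agree on which papers are most related even if one of them (like ada-002) compresses everything into a narrow similarity band.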
As for the generative capabilities of AstroLLaMa, as Jo mentioned, it’s pretty context-aware and logical, and it doesn’t just spit out generic text like you’d find in a pop science book. Even though it’s a small model, it’s impressing me (and my astronomy instincts) for both abstract completion and embedding/paper-similarity tasks. This is why we thought it could be a community asset and shamelessly released it, even though it is a super small model.
Finally, we’re not eyeing a GPT-3.5 training session right now; we’re headed in a different direction. But the discussion here suggests that going down that route could be relatively straightforward and cost-effective. We’re definitely open to other contributions, particularly if we can stick to the same training data for an apples-to-apples comparison.
Keep those insightful comments coming, and don’t hesitate to push us hard. We’re learning a ton from all of you.
I don’t think there’s any rule or restriction against doing both, but if you have a contact with the Microsoft program it wouldn’t hurt to ask. To the best of my knowledge they’re entirely separate programs.
This is exciting to read! It’s a shame about
text-embedding-ada-002, but the idea of a domain-specific fine-tuned embedding model is very interesting!
Following on from what @elmstedt has said, absolutely apply for the OpenAI Research Program; this is exactly what it was created for. It would be super exciting to see GPT-3.5 fine-tuned with similar data.
Jo will shortly submit an application!
Thank you so much for all the encouragement for our little “hobby” project with a group of enthusiastic astronomers!
Can I ask what the total training time/cost was for this model?
Also I’m curious if you and your team saw this similar paper.
Very interesting indeed! I am not aware of the radiology-LLM, but it is certainly of interest.
As for the training time, for the three-epoch training, if I am not mistaken, it was 4 × A100 for 1 day.
I figured since they also fine-tuned a domain-specific LLaMA 2-7B there might be something in there of interest or value to you and your team.
Only 24 hours? That’s amazing. Some back-of-the-envelope math puts that at just over $100 USD in training time.
I’m curious if anyone else here has any experience with fine-tuning on an H100?
NVIDIA claims “the H100 is up to nine times faster for AI training and 30 times faster for inference than the A100.” LambdaLabs has a 4xA100 @ $4.40/hr and an 8xH100 @ $20.72/hr…
So, if the H100 really is 9 times faster for training, with twice as many units, it could in theory be done in just an hour and twenty minutes at a cost of $27.63—which would be absurd!
And only a quarter million dollars per 8U, 10 kilowatt server…
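To double-check the arithmetic above (prices as quoted in the thread: LambdaLabs 4×A100 @ $4.40/hr, 8×H100 @ $20.72/hr; the 9× training speedup is NVIDIA’s claim, not a measurement):

```python
# Back-of-the-envelope check of the training-cost estimates above.
a100_cost = 4.40 * 24        # 4xA100 for 24 hours
speedup = 9 * (8 / 4)        # claimed 9x per GPU, twice as many GPUs
h100_hours = 24 / speedup    # wall-clock time on 8xH100
h100_cost = 20.72 * h100_hours

print(round(a100_cost, 2), round(h100_hours * 60), round(h100_cost, 2))
# prints: 105.6 80 27.63
```

So roughly $105.60 on the A100 cluster versus about 80 minutes and $27.63 on the H100 cluster, if the 9× claim holds in practice.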