[Paper] AstroLLaMA: Towards Specialized Foundation Models in Astronomy


Large language models excel in many human-language tasks but often falter in highly specialized domains like scholarly astronomy. To bridge this gap, we introduce AstroLLaMA, a 7-billion-parameter model fine-tuned from LLaMA-2 using over 300,000 astronomy abstracts from arXiv. Optimized for traditional causal language modeling, AstroLLaMA achieves a 30% lower perplexity than Llama-2, showing marked domain adaptation. Our model generates more insightful and scientifically relevant text completions and embedding extraction than state-of-the-arts foundation models despite having significantly fewer parameters. AstroLLaMA serves as a robust, domain-specific model with broad fine-tuning potential. Its public release aims to spur astronomy-focused research, including automatic paper summarization and conversational agent development.


Two things from this paper jumped out at me:

  1. With a relatively small fine-tuning run of 230M tokens, the 7B-parameter AstroLLaMA was able to generate text completions of quality comparable to GPT-4's.
  2. Using the fine-tuned model as an embeddings model is fascinating and raises some interesting ideas.

With respect to (1), I wonder how impactful an identical fine-tuning process would be using gpt-3.5-turbo, which is presumably a much larger model (GPT-3 had 175B parameters; OpenAI has not published a size for gpt-3.5-turbo). Fine-tuning gpt-3.5-turbo on this same job would cost less than $2,000.

With respect to (2), there isn’t any obvious (to me anyway) reason why OpenAI couldn’t pass back a similar embedding vector[1] for all inputs (and possibly outputs?) which would have interesting implications for HyDE. @curt.kennedy I’d be interested to read your thoughts.

  1. To get text embeddings from AstroLLaMA, we pass an input through the model and extract its final hidden states, which contain embeddings for all tokens in the input. Then, we omit the [BOS] token and take the average of all other tokens’ embeddings to get the final result. ↩︎
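For anyone who wants the pooling step from that footnote spelled out, here is a minimal sketch in plain Python. It assumes the final hidden states are already available as one vector per token with [BOS] at position 0; the function name and the toy numbers are mine, not the paper's code.

```python
def mean_pool_without_bos(hidden_states):
    """Average the per-token vectors, skipping the first ([BOS]) position.

    hidden_states: list of token vectors from the model's final layer,
    ordered as the tokens appear in the input (assumption: [BOS] is first).
    """
    tokens = hidden_states[1:]  # drop the [BOS] embedding at index 0
    if not tokens:
        raise ValueError("need at least one token after [BOS]")
    dim = len(tokens[0])
    # Mean over tokens, one dimension at a time.
    return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]


# Toy final-layer states for "[BOS] a b": three tokens, hidden size 2.
toy_states = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0]]
embedding = mean_pool_without_bos(toy_states)
print(embedding)
```

With a real model the per-token vectors would come from the last hidden layer instead of a hand-written list; the averaging itself stays the same.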


As we have mentioned before, I wonder how much of even the 7b llama model is given over to multilingualism rather than “knowledge”.

As for how much effect it would have: given that the announcement text and anecdotal evidence seem to suggest that GPT-3.5 fine-tuning can be influenced by even a small number of examples, the effect should be profound.

Would be worth applying to OpenAI/MS for a grant to try it.


Very interesting paper, thanks for sharing!

This is also the question I’m asking myself

My guess is that it would be much the same, but handle error correction a lot better in situations where the input contains misspellings or misinterpretation.

Here’s my (dumb) prompt for testing that:

tell me about the song "secret asian man"

GPT-3.5-T answers:

I believe you might be referring to the song “Secret Agent Man.” It’s a popular song originally performed by Johnny Rivers in 1966. The song…

Llama-7B answers:

secret Asian man is an American sitcom…

In the example above, GPT-3.5-T actually gets the question right, while Llama-7B just hallucinates an American sitcom that doesn’t exist (I’ve checked; the closest thing is this comic strip).

My personal takeaway from this is that I’m not going to use Llama-7b for HyDE anytime soon :sweat_smile:

This same job would hit the per-job maximum of $400. Also consider that more than one epoch is likely needed.

  • The maximum number of total tokens trained per job is 50 million tokens (tokens_in_dataset * n_epochs).

  • Each file is currently limited to 50 MB
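To make the arithmetic behind those numbers explicit, here is a rough sketch. The per-token rate is not stated in the thread; it is only what the $400 cap and the 50-million-token cap imply together, so treat it as an assumption. The function name is made up for illustration.

```python
import math

MAX_TOKENS_PER_JOB = 50_000_000  # per-job limit quoted above
MAX_COST_PER_JOB = 400.0         # dollars, the "max of $400" quoted above

# Assumption: the rate implied by the two caps (about $8 per million tokens).
PRICE_PER_TOKEN = MAX_COST_PER_JOB / MAX_TOKENS_PER_JOB


def fine_tune_cost(tokens_in_dataset, n_epochs=1):
    """Return (total dollars, number of jobs needed under the per-job cap)."""
    trained_tokens = tokens_in_dataset * n_epochs
    jobs = math.ceil(trained_tokens / MAX_TOKENS_PER_JOB)
    return trained_tokens * PRICE_PER_TOKEN, jobs


# The paper's 230M-token dataset, one epoch.
cost, jobs = fine_tune_cost(230_000_000)
print(f"~${cost:,.0f} spread over {jobs} jobs")
```

Under that implied rate, one pass over 230M tokens lands just under the $2,000 mentioned earlier, but it cannot fit in a single job, and every extra epoch scales the total linearly.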

Contrast this with OpenAI’s paper, Evaluating Large Language Models Trained on Code:

We fine-tune GPT models containing up to 12B parameters on code to produce Codex. In contrast with GPT, Codex displays non-trivial performance on the HumanEval dataset.

We train Codex using the same learning rate as the corresponding GPT model, with a 175 step linear warmup and cosine learning rate decay. We train for a total of 100 billion tokens, using the Adam optimizer with β1 = 0.9, β2 = 0.95, ε = 10^−8, and a weight decay coefficient of 0.1. In order to maximally leverage text representat…

No, of course not.

My thinking in terms of HyDE was that if the architecture allows it[1], returning the final hidden-state embeddings of the input (and output if possible) from an API call to gpt-4 or gpt-3.5-turbo could make implementing HyDE simpler, faster, and better.

  1. I imagine it should. ↩︎
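For context, this is roughly the HyDE flow I have in mind, as a minimal sketch. `generate` and `embed` are placeholder callables, not real API calls; the toy versions below exist only so the example runs. If an API returned the hidden-state embedding together with the generated text, those two steps would collapse into a single call.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def hyde_search(query, generate, embed, index, top_n=3):
    """HyDE: embed a generated hypothetical answer instead of the raw query.

    index: list of (doc_id, vector) pairs built with the same `embed`.
    """
    hypothetical_answer = generate(query)    # step 1: LLM writes a guess
    query_vec = embed(hypothetical_answer)   # step 2: embed the guess
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_n]]


# Toy stand-ins (hypothetical): a canned "LLM" and a bag-of-words "embedding".
VOCAB = ["galaxy", "star", "song", "agent"]


def toy_embed(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def toy_generate(query):
    return "a galaxy is a system of star clusters"


index = [("astro", toy_embed("galaxy star star")), ("music", toy_embed("song agent song"))]
print(hyde_search("what is a galaxy?", toy_generate, toy_embed, index, top_n=1))
```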


Yeah that seems sensible enough, I don’t see any reason why that shouldn’t be possible.

Speaking more generally, I really think we need some standard way to quantify the “quality” of responses, beyond just measuring the difference between the response and training data.

Standard way? See how the results converge in the fine-tuning job’s report, which covers a statistically significant validation file if you also upload one.

Yeah, I’m aware of the limitations, I imagine for certain enterprise and research partners the limits could be lifted.


I’m aware, that’s not what I’m talking about. I was talking about needing a different way to quantify “quality” beyond that, like A/B testing but without the human :thinking:

“Ask GPT-4 about the correctness or adherence to desired goals” has been the basis of a lot of automated evaluations, but it also has its limitations, as we can assume GPT-4 would agree with its own conclusions.


Exactly, I’m also assuming it’s going to rate responses written by GPT-4 higher even if the alternative has the exact same conclusion.

I think the paper makes a pretty big mistake when it assumes that, because the Ada-002 embeddings are tightly clustered compared to their internal embeddings, their embeddings must be much better.

This is because the scale of Ada-002 is 10x tighter than normal (it’s non-isotopic). Also, Ada-002 uses a much larger token vocabulary than Llama (100k in Ada vs. 32k in Llama).

There appears to be evidence that Ada-002 was trained on so much stuff, including lots of garbage tokens, that the tight correlations are partially explained by the vast amount of token sequences it was trained on.

But also, I don’t think they were monitoring the distribution of the vectors during training, and were not penalizing vectors that get too close to one another when they shouldn’t. (So not using Kullback–Leibler Divergence)

I wish I had a better explanation for what is going on in Ada-002 embeddings.

But saying that “Ada-002 embeddings are tight, therefore my embedding model is good” seems to me like an incorrect conclusion.

I wonder whether it is possible, within the model, to force the latent space of embeddings to be closer to a normal distribution around similar terms. Something like this should spread the embeddings out.

The paper shows they are far from a normal distribution in Ada-002.


100% agree that this is a bad assumption on their part.

What I would have liked to see instead is taking a set of embeddings from each model, finding the top-n matches from a vector database of all of their embeddings, and having some algorithmic way to score the results.

Basically create something like AstroBEIR, to show the embeddings are actually better, not just with greater variance.
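As a rough sketch of what that kind of scoring could look like: the example below ranks a corpus by cosine similarity and computes recall@n against a set of known-relevant documents. Everything here (function names, the toy vectors, the relevance labels) is made up for illustration; a real benchmark would need labelled astronomy queries, which the paper does not provide as far as I can tell.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def top_n(query_vec, corpus, n):
    """Return the ids of the n corpus vectors most similar to query_vec."""
    ranked = sorted(corpus, key=lambda doc_id: cosine(query_vec, corpus[doc_id]), reverse=True)
    return ranked[:n]


def recall_at_n(queries, corpus, relevant, n=2):
    """Mean fraction of each query's relevant docs found in its top-n results.

    queries: {query_id: vector}, corpus: {doc_id: vector},
    relevant: {query_id: set of doc_ids judged relevant}.
    """
    scores = []
    for query_id, vec in queries.items():
        hits = set(top_n(vec, corpus, n)) & relevant[query_id]
        scores.append(len(hits) / len(relevant[query_id]))
    return sum(scores) / len(scores)


# Toy data: run the same function once per embedding model and compare.
corpus = {"d1": [1.0, 0.0], "d2": [0.9, 0.1], "d3": [0.0, 1.0]}
queries = {"q1": [1.0, 0.05]}
relevant = {"q1": {"d1", "d2"}}
print(recall_at_n(queries, corpus, relevant, n=2))
```

The point is that the score depends on whether the right documents come back, not on how spread out the vectors are.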

Edit: More generally, what do you think about the concept of fine-tuned embeddings?


I think fine-tuned embeddings can be amazingly useful because they are reflecting your values more than some generic meaning.

I haven’t tried any internal embeddings myself, have you? How would you get these?

A while ago I started to set up a model where I take the Ada-002 embedding as an input (so 1536 floats) and feed it through my own deep classification network.

Would I take the vector in the final layer prior to classification as the embedding vector or something?
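One common reading of that idea, sketched in plain Python: run the input through the network and keep the activations of the last hidden layer, just before the classification layer. The layer sizes, the random untrained weights, and all function names are placeholders of mine; this only shows where the vector would be taken from, not a trained model.

```python
import random


def dense(vec, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]


def relu(vec):
    return [max(0.0, x) for x in vec]


def random_layer(n_in, n_out, rng):
    """Untrained stand-in weights (assumption: a real model would be trained)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(vec, hidden_layer, output_layer):
    """Return (class logits, penultimate activations)."""
    penultimate = relu(dense(vec, *hidden_layer))  # candidate embedding
    logits = dense(penultimate, *output_layer)     # classification head
    return logits, penultimate


rng = random.Random(0)
hidden_layer = random_layer(1536, 64, rng)  # 1536 floats in, as with Ada-002
output_layer = random_layer(64, 3, rng)     # 3 hypothetical classes
logits, embedding = forward([0.01] * 1536, hidden_layer, output_layer)
print(len(logits), len(embedding))
```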

I haven’t messed with the internals of Llama to grab embeddings, which sounds like the paper did, any idea if this is easy?


I think fine-tuned embeddings would be a logical addition to OpenAI’s arsenal, especially now that fine-tuning has been introduced.

I haven’t tried this with Llama specifically, but it should be fairly simple to get the data. I tried with a different model, and it just seemed to dump everything on my device.

The ada-002 embeddings model was pretty much released as a black box. The only thing we get is its dimensionality, 1536, in comparison to the 1024 to 12288 of the first-generation GPT-3 embedding models.

Given that GPT-4 training was done over a year ago, the embeddings model could even be one of the sub-models that was trained on a specialization, with some of them more optimized and a bit smaller, according to one plausible GPT-4 article and another that seemed more insider and detailed.

It is interesting that for embeddings, for the Codex work, and in the choice of model for this astro project, the jump from 13B to 175B, while giving a massive increase from “cute” to “OMG” in language inference, often doesn’t benefit, and even penalizes, specialized use.

I have not. I haven’t played with LLaMA at all yet, and it’s not possible with the OpenAI models (yet anyway).

It looks like it’s incredibly straightforward with LLaMA 2.

I imagine OpenAI could (if they wanted to) include a parameter in the chat/completions/ API to return an embedding as an amalgamation of the final hidden layer, with or without continuing on to the generation phase. Who knows, maybe that’s how all the first generation text embedding models worked? They simply computed the hidden layers of the input, grabbed the last layer and reduced that to an embedding vector?

It would be absurdly more expensive than text-embedding-ada-002 to use a fine-tuned gpt-3.5-turbo just for embeddings, but… If it was possible to collect the embeddings while using the model… It might make sense.

The thing I don’t know is whether it’s possible to get an equivalent embedding vector of the response basically “for free”, in the same way it is for the input. If it is, that would change the math somewhat, since you could get the (presumably) higher-quality embedding of a hypothetical response at no additional cost, which might make an upfront investment of seeding your VDB knowledge base with much more expensive embeddings worthwhile.

It would also greatly increase the value of fine-tuning a model because you’d also get a fine-tuned embedding model essentially free.

All of that is predicated on the hypothetical that you could get these sorts of embeddings of the outputs without additional computation—which I’m skeptical of.


I wondered if you could explain the non-isotopic in this context, my understanding is that isotopic is a property of knot loops in topology.

Sorry meant to say non-isotropic, with the ‘r’.

Here is a paper for more context.

Previously I have used ABTT, mentioned in the paper, to help resolve this.
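For readers who haven't seen it, ABTT ("all-but-the-top") as I understand it has two steps: subtract the mean vector, then remove the projection onto the top few principal directions. Below is a small pure-Python sketch of that idea. It uses power iteration to find each direction, which is my choice for keeping the example dependency-free, not necessarily how anyone does it in practice, and the toy data is invented.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def center(vectors):
    """Subtract the mean vector from every vector."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(dim)]
    return [[v[i] - mean[i] for i in range(dim)] for v in vectors]


def top_direction(rows, iters=200, seed=0):
    """Approximate the top principal direction of centered rows (power iteration)."""
    rng = random.Random(seed)
    u = [rng.gauss(0.0, 1.0) for _ in rows[0]]
    norm = math.sqrt(dot(u, u))
    u = [x / norm for x in u]
    for _ in range(iters):
        proj = [dot(row, u) for row in rows]  # X u
        nxt = [sum(p * row[i] for p, row in zip(proj, rows)) for i in range(len(u))]  # X^T X u
        norm = math.sqrt(dot(nxt, nxt))
        if norm == 0.0:
            break
        u = [x / norm for x in nxt]
    return u


def all_but_the_top(vectors, n_components=1):
    """Center the vectors, then strip their top n_components directions."""
    rows = center(vectors)
    for _ in range(n_components):
        u = top_direction(rows)
        stripped = []
        for row in rows:
            p = dot(row, u)
            stripped.append([x - p * ui for x, ui in zip(row, u)])
        rows = stripped
    return rows


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all pairs; near 1.0 means tightly clustered."""
    total, count = 0.0, 0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            denom = math.sqrt(dot(vectors[i], vectors[i]) * dot(vectors[j], vectors[j]))
            if denom:
                total += dot(vectors[i], vectors[j]) / denom
                count += 1
    return total / count if count else 0.0


# Toy "anisotropic" embeddings: a large shared offset plus one dominant direction.
rng = random.Random(1)
toy = [[5.0 + rng.gauss(0, 1.0), 5.0 + rng.gauss(0, 0.3), rng.gauss(0, 0.1)] for _ in range(50)]
processed = all_but_the_top(toy, n_components=1)
print(mean_pairwise_cosine(toy), mean_pairwise_cosine(processed))
```

On data like this the raw vectors all look alike under cosine similarity because of the shared offset, and the processed ones are spread out much more evenly.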