Large language models excel in many human-language tasks but often falter in highly specialized domains like scholarly astronomy. To bridge this gap, we introduce AstroLLaMA, a 7-billion-parameter model fine-tuned from LLaMA-2 using over 300,000 astronomy abstracts from arXiv. Optimized for traditional causal language modeling, AstroLLaMA achieves 30% lower perplexity than LLaMA-2, showing marked domain adaptation. Our model generates more insightful and scientifically relevant text completions and embeddings than state-of-the-art foundation models, despite having significantly fewer parameters. AstroLLaMA serves as a robust, domain-specific model with broad fine-tuning potential. Its public release aims to spur astronomy-focused research, including automatic paper summarization and conversational agent development.
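For anyone who wants to sanity-check the perplexity claim on their own texts: perplexity here is just the exponential of the mean token-level cross-entropy under the model. A minimal sketch with Hugging Face transformers (the repo ids and sample text below are placeholders, not the paper's actual evaluation setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_name: str, text: str) -> float:
    """Perplexity = exp(mean token-level cross-entropy) of `text` under the model."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(input_ids=ids, labels=ids).loss  # mean NLL over shifted tokens
    return torch.exp(loss).item()

# Baseline vs fine-tuned comparison (second repo id is a placeholder for the released checkpoint):
sample = "We present a study of dark matter halo profiles in dwarf galaxies..."
print(perplexity("meta-llama/Llama-2-7b-hf", sample))
# print(perplexity("<astrollama-checkpoint>", sample))
```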
With a relatively small fine-tuning run of 230M tokens, the 7B-parameter AstroLLaMA was able to generate text completions of quality comparable to GPT-4.
Using the fine-tuned model as an embeddings model is fascinating and raises some interesting ideas.
With respect to (1), I wonder how impactful an identical fine-tuning process would be on gpt-3.5-turbo with its (reportedly) 175B parameters. The same job would cost less than $2,000 to run through gpt-3.5-turbo fine-tuning.
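For reference, kicking off such a job is only a few lines with the current OpenAI Python SDK; a sketch (the training file name and its contents are placeholders, and the cost estimate above is mine, not something the API reports up front):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a JSONL file of chat-formatted training examples, e.g. one abstract-completion
# pair per line: {"messages": [{"role": "user", ...}, {"role": "assistant", ...}]}
training_file = client.files.create(
    file=open("astro_abstracts.jsonl", "rb"),  # placeholder file name
    purpose="fine-tune",
)

# Create the fine-tuning job on gpt-3.5-turbo and poll its status later.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)
```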
With respect to (2), there isn't any obvious (to me anyway) reason why OpenAI couldn't pass back a similar embedding vector[1] for all inputs (and possibly outputs?), which would have interesting implications for HyDE. @curt.kennedy I'd be interested to read your thoughts.
As we have mentioned before, I wonder how much of even the 7B LLaMA model is given over to multilingualism rather than "knowledge".
As far as how much effect it would have… given that the announcement text and anecdotal evidence seem to suggest that GPT-3.5 fine-tuning can be influenced by even a small number of examples, the effect should be profound.
Would be worth applying to OpenAI/MS for a grant to try it.
My guess is that it would be much the same, but handle error correction a lot better in situations where the input contains misspellings or misinterpretation.
Here's my (dumb) prompt for testing that:
tell me about the song "secret asian man"
GPT-3.5-T answers:
I believe you might be referring to the song "Secret Agent Man." It's a popular song originally performed by Johnny Rivers in 1966. The song…
Llama-7B answers:
secret Asian man is an American sitcom…
In the example above, GPT-3.5-T actually gets the question right, while Llama-7B is just hallucinating about some American sitcom that doesn't exist (I've checked; the closest thing is this comic strip).
My personal takeaway from this is that I'm not going to use Llama-7B for HyDE anytime soon.
This same job would hit the maximum of $400. And consider that more than one epoch is likely needed.
The maximum number of total tokens trained per job is 50 million tokens (tokens_in_dataset * n_epochs).
Each file is currently limited to 50 MB
Contrast that with OpenAI's paper Evaluating Large Language Models Trained on Code:
We fine-tune GPT models containing up to 12B parameters on code to produce Codex. In contrast with GPT, Codex displays non-trivial performance on the HumanEval dataset.
…
We train Codex using the same learning rate as the corresponding GPT model, with a 175 step linear warmup and cosine learning rate decay. We train for a total of 100 billion tokens, using the Adam optimizer with β1 = 0.9, β2 = 0.95, ε = 10^-8, and a weight decay coefficient of 0.1.
In order to maximally leverage text representat…
My thinking in terms of HyDE was that, if the architecture allows it[1], returning the final hidden-state embeddings of the input (and output, if possible) from an API call to gpt-4 or gpt-3.5-turbo could make implementing HyDE simpler, faster, and better.
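For anyone not familiar with HyDE, here's roughly the round trip it involves today, and the second API call that a returned hidden-state vector could remove. A sketch, where vector_db.search is a hypothetical stand-in for whatever vector store you use:

```python
from openai import OpenAI

client = OpenAI()

def hyde_search(question: str, vector_db, k: int = 5):
    # 1. Generate a hypothetical answer to the question.
    hypothetical = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": f"Write a short passage answering: {question}"}],
    ).choices[0].message.content

    # 2. Embed the hypothetical answer (the extra call that a returned
    #    hidden-state embedding would make unnecessary).
    vec = client.embeddings.create(
        model="text-embedding-ada-002",
        input=hypothetical,
    ).data[0].embedding

    # 3. Retrieve real documents whose stored embeddings are closest to it.
    return vector_db.search(vec, top_k=k)  # hypothetical vector-store call
```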
Yeah, that seems sensible enough; I don't see any reason why that shouldn't be possible.
Speaking more generally, I really think we need some standard way to quantify the "quality" of responses, beyond just measuring the difference between the response and the training data.
I'm aware; that's not what I'm talking about. I was talking about needing a different way to quantify "quality" beyond that, something like A/B testing but without the human.
"Ask GPT-4 about correctness or adherence to the desired goals" has been the basis of a lot of automated evaluations, but it also has its limitations, as we can assume GPT-4 would agree with its own conclusions.
I think the paper makes a pretty big mistake when it assumes that, because the Ada-002 embeddings are tightly clustered compared to their internal embeddings, their own embeddings must be much better.
This is because the scale of Ada-002 is 10x tighter than normal (it's non-isotropic). Also, the tokenizer vocabulary used by Ada-002 is much larger than Llama's (roughly 100k tokens vs 32k).
There appears to be evidence that Ada-002 was trained on so much stuff, including lots of garbage tokens, that the tight correlations are partially explained by the sheer volume of token sequences it was trained on.
But also, I don't think they were monitoring the distribution of the vectors during training, or penalizing vectors that get too close to one another when they shouldn't (so not using Kullback-Leibler divergence).
I wish I had a better explanation for what is going on in Ada-002 embeddings.
But saying "Ada-002 embeddings are tight, therefore my embedding model is good" seems to me like an incorrect conclusion.
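One way to make the "tightness" point concrete is to look at the distribution of pairwise cosine similarities from each model over the same documents; a quick sketch, assuming you already have the two embedding matrices:

```python
import numpy as np

def cosine_stats(embs: np.ndarray):
    """Mean/std of pairwise cosine similarities for an (n_docs, dim) embedding matrix."""
    unit = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = unit @ unit.T
    off_diag = sims[~np.eye(len(sims), dtype=bool)]
    return off_diag.mean(), off_diag.std()

# ada_embs, astro_embs: embeddings of the *same* documents from each model
# print("ada-002:   ", cosine_stats(ada_embs))    # high mean similarity = "tight"
# print("astrollama:", cosine_stats(astro_embs))  # lower mean = more spread out
```

A high mean off-diagonal similarity just says the space is anisotropic; on its own it says nothing about retrieval quality either way.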
Wondering if it's possible, within the model, to force the latent space of the embeddings toward more of a normal distribution around similar terms. Something like this should spread the embeddings out.
The paper shows they are far from a normal distribution in Ada-002.
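Something along those lines can also be done post-hoc, without touching the model, by whitening the embedding matrix (centre it and decorrelate the dimensions); a sketch of that idea, not anything from the paper:

```python
import numpy as np

def whiten(embs: np.ndarray, out_dim=None) -> np.ndarray:
    """Centre and decorrelate an (n_docs, dim) embedding matrix so the cloud
    of vectors is closer to isotropic (roughly normal in every direction)."""
    mu = embs.mean(axis=0, keepdims=True)
    cov = np.cov((embs - mu).T)           # (dim, dim) covariance of the centred data
    u, s, _ = np.linalg.svd(cov)
    W = u @ np.diag(1.0 / np.sqrt(s))     # whitening transform
    if out_dim is not None:
        W = W[:, :out_dim]                # optional dimensionality reduction
    return (embs - mu) @ W
```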
A better test would have been taking a set of embeddings from each model, finding the top-n matches from a vector database of all their embeddings, and having some algorithmic way to score the results.
Basically create something like AstroBEIR, to show the embeddings are actually better, not just with greater variance.
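Even a crude version of that scoring would do, e.g. recall@k over a small set of labelled query-to-relevant-paper pairs; a sketch of the scoring side, with the labelled data assumed to already exist:

```python
import numpy as np

def recall_at_k(query_embs, doc_embs, relevant, k=10):
    """Fraction of queries whose relevant document appears among the top-k nearest docs.

    query_embs: (n_queries, dim), doc_embs: (n_docs, dim),
    relevant[i]: index of the document judged relevant to query i.
    """
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    topk = np.argsort(-(q @ d.T), axis=1)[:, :k]
    hits = [relevant[i] in topk[i] for i in range(len(q))]
    return float(np.mean(hits))

# Run the same queries and documents through both models and compare:
# recall_at_k(ada_queries, ada_docs, relevant) vs recall_at_k(astro_queries, astro_docs, relevant)
```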
Edit: More generally, what do you think about the concept of fine-tuned embeddings?
I think fine-tuned embeddings can be amazingly useful because they are reflecting your values more than some generic meaning.
I haven't tried any internal embeddings myself, have you? How would you get these?
A while ago I started to set up a model where I take the Ada-002 embedding as an input (so 1536 floats) and feed it through my own deep classification network.
Would I take the vector in the final layer prior to classification as the embedding vector or something?
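Concretely, something like this is what I mean, sketched in PyTorch; the layer sizes are arbitrary, and the penultimate layer's activations are what you'd reuse as the "tuned" embedding:

```python
import torch
import torch.nn as nn

class AdaClassifier(nn.Module):
    """Small classification head over 1536-d ada-002 vectors; the penultimate
    layer's activations double as a task-tuned embedding."""
    def __init__(self, n_classes: int, emb_dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(1536, 512), nn.ReLU(),
            nn.Linear(512, emb_dim), nn.ReLU(),   # <- penultimate layer
        )
        self.head = nn.Linear(emb_dim, n_classes)

    def forward(self, ada_vec: torch.Tensor):
        emb = self.backbone(ada_vec)    # reusable "fine-tuned" embedding
        return self.head(emb), emb      # logits for training, embedding for retrieval
```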
I haven't messed with the internals of Llama to grab embeddings, which it sounds like the paper did. Any idea if this is easy?
I think fine-tuned embeddings would be a logical addition to OpenAI's arsenal, especially now that fine-tuning has been introduced.
I haven't tried this with Llama specifically, but it should be fairly simple to get the data. I tried with a different model, and it just seemed to dump everything onto my device.
ada-002 embeddings were pretty much released as a black box. The only thing we get is the dimensionality, 1536, in comparison to the 1024 to 12288 dimensions of the first-generation GPT-3 embedding models.
Given that GPT-4's training was done over a year ago, the embeddings model could even be one of its sub-models trained on a specialization, some of them reportedly more optimized and a bit smaller, according to one plausible GPT-4 article and then another that seemed more insider and detailed.
It is interesting that, for the embeddings, for the Codex work, and for the choice of this astro project, the jump from 13B to 175B, while giving a massive increase from "cute" to "OMG" in language inference, often doesn't benefit, and can even penalize, specialized use.
I have not. I haven't played with LLaMA at all yet, and it's not possible with the OpenAI models (yet, anyway).
It looks like it's incredibly straightforward with LLaMA 2.
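For what it's worth, here's roughly what that looks like with transformers, mean-pooling the last hidden layer (whether the paper pooled this way or took a single token's vector, I haven't checked):

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"   # gated repo; any Llama-2 derivative works the same way
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def embed(text: str) -> torch.Tensor:
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)                        # last_hidden_state: (1, seq_len, 4096)
    mask = inputs["attention_mask"].unsqueeze(-1)    # only matters if you batch with padding
    return (out.last_hidden_state * mask).sum(1) / mask.sum(1)   # mean pool -> (1, 4096)

print(embed("Dark matter halos and galaxy rotation curves").shape)   # torch.Size([1, 4096])
```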
I imagine OpenAI could (if they wanted to) include a parameter in the chat/completions API to return an embedding as an amalgamation of the final hidden layer, with or without continuing on to the generation phase. Who knows, maybe that's how all the first-generation text embedding models worked? They simply computed the hidden layers of the input, grabbed the last layer, and reduced that to an embedding vector.
It would be absurdly more expensive than text-embedding-ada-002 to use a fine-tuned gpt-3.5-turbo just for embeddings, but… if it were possible to collect the embeddings while using the model, it might make sense.
The thing I don't know is whether it's possible to get an equivalent embedding vector of the response basically "for free", in the same way it is for the input. If it is, that would change the math somewhat, since you could get the (presumably) higher-quality embedding of a hypothetical response at no additional cost, which might make an upfront investment of seeding your VDB knowledge base with much more expensive embeddings worthwhile.
It would also greatly increase the value of fine-tuning a model, because you'd also get a fine-tuned embedding model essentially for free.
All of that is predicated on the hypothetical that you could get these sorts of embeddings of the outputs without additional computation, which I'm skeptical of.
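For a local model it does at least look possible: transformers' generate can hand back the hidden states of the generated tokens alongside the text, so an "output embedding" falls out of the same forward passes. A sketch (the mean-pooling choice is mine, not anything standard):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tok("Explain HyDE retrieval in one sentence.", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=64,
    output_hidden_states=True,      # keep the hidden states the model computes anyway
    return_dict_in_generate=True,
)

# out.hidden_states: one tuple per generation step; each is a tuple of per-layer tensors.
# Take the final layer's vector for each newly generated token and mean-pool them.
per_token = [step[-1][:, -1, :] for step in out.hidden_states]
output_embedding = torch.stack(per_token, dim=1).mean(dim=1)     # (1, hidden_size)

answer = tok.decode(out.sequences[0], skip_special_tokens=True)
```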