I get a bit skeptical when these guys talk about RL. I mean, super obvious idea, why are they talking about it now?
Either (a) they've been deceptively not talking about it, or (b) they're out of the loop relative to where they should be.
Either way, it reduces my trust in them quite a bit.
If they said something like "wow, I really flaked on this" I'd probably trust them more.
That said, I like this quote:
If the U.S. continues to stymie open source, China will come to dominate this part of the supply chain and many businesses will end up using models that reflect China's values much more than America's.
Though I can see open-source models getting banned.
TBH, I'm getting a bit skeptical of this RL stuff. I mean, it's a super obvious idea. Pretty sure everyone tried it. I remember calling it "RLCF", for reinforcement learning from compiler feedback, way back.
I just did a search, and I wasn't the only one who came up with that term: [2305.18341] Coarse-Tuning Models of Code with Reinforcement Learning Feedback
Like I say, pretty trivial idea.
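The core idea is just to turn compiler feedback into a scalar reward and feed it to whatever policy-gradient method you like. Something like the sketch below; this is my own illustration, not code from that paper, and the gcc invocation and scoring are placeholders:

```python
import os
import subprocess
import tempfile

def compiler_reward(code: str) -> float:
    """Toy reward: 1.0 if the generated C code compiles cleanly, else 0.0."""
    with tempfile.NamedTemporaryFile("w", suffix=".c", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            ["gcc", "-fsyntax-only", path],  # syntax/type check only, no binary produced
            capture_output=True,
            timeout=10,
        )
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
    finally:
        os.remove(path)
```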
If RL post-training is a big deal, we should see some quick jumps on the Hugging Face leaderboard very soon.
FWIW, I also did some tests here: GRPO Llama-1B · GitHub
All I saw was some improvement in formatting, though admittedly it was only about 2 hours of training on an H100.
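(For anyone curious, a run like that is only a handful of lines with TRL's GRPOTrainer these days. Rough sketch below, not my actual gist; the model name, dataset, and toy reward are placeholders.)

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def format_reward(completions, **kwargs):
    # Toy reward: +1 if the completion wraps its answer in <answer> tags, else 0.
    return [1.0 if "<answer>" in c and "</answer>" in c else 0.0 for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")  # any prompt-only dataset works

config = GRPOConfig(
    output_dir="grpo-llama-1b",
    num_generations=8,          # group size used for the relative advantage estimate
    max_completion_length=256,
    per_device_train_batch_size=8,
    learning_rate=1e-6,
)

trainer = GRPOTrainer(
    model="meta-llama/Llama-3.2-1B-Instruct",  # placeholder 1B model
    reward_funcs=format_reward,
    args=config,
    train_dataset=dataset,
)
trainer.train()
```

With a formatting-only reward like that, an improvement in formatting and not much else is about what you'd expect from a couple of hours of training.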