Thanks for sharing the comparison.
I have been testing Llama 2 for the last couple of days. So far it only works well with English and fails in any other language, just as advertised. However, the Falcon models work very well in other languages. Currently, I am also testing using multiple models such as Falcon and Llama 2 together to achieve something similar to GPT-4. Once we have that going, I think we will have a better comparison.
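The multi-model idea can be sketched as a simple language-based router. The model names and the ASCII heuristic below are illustrative assumptions, not a real setup:

```python
# Minimal sketch: route each prompt to the model that handles its language best.
# Model names and the heuristic are placeholders; a real router would use a
# proper language-detection library and actual inference endpoints.

def looks_english(text: str) -> bool:
    # Crude heuristic: treat mostly-ASCII text as English.
    ascii_chars = sum(ch.isascii() for ch in text)
    return ascii_chars / max(len(text), 1) > 0.9

def pick_model(prompt: str) -> str:
    # Llama 2 did best on English in our tests; Falcon handled other languages.
    return "llama-2-70b" if looks_english(prompt) else "falcon-40b"

print(pick_model("What is the capital of France?"))  # llama-2-70b
print(pick_model("法国的首都是哪里？"))                # falcon-40b
```

In practice you would replace the heuristic with real language detection and fan the prompt out to the chosen endpoint.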
In terms of cost, I think GPT-4 is quite affordable unless you have extremely high usage. Hosting an inference endpoint for models such as Falcon-40B or Llama-2-70B will cost at least $2K per month.
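You can sanity-check that claim with a quick break-even calculation. The per-token price below is an illustrative assumption, not a quoted rate:

```python
# Back-of-the-envelope break-even between a per-token API and a fixed-cost
# self-hosted endpoint. Both numbers below are rough assumptions.

HOSTING_PER_MONTH = 2_000.0      # assumed floor for a Falcon-40B / Llama-2-70B endpoint
API_PRICE_PER_1K_TOKENS = 0.05   # assumed blended per-1K-token API rate

breakeven_tokens = HOSTING_PER_MONTH / API_PRICE_PER_1K_TOKENS * 1_000
print(f"Break-even: {breakeven_tokens / 1e6:.0f}M tokens per month")  # 40M
```

Below that monthly volume, paying per token is cheaper; above it, self-hosting starts to win, before counting engineering time.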
The customers we are working with want to switch to open source models because they want to keep their data inside their organization.
These are just some of the observations I have made over the last couple of months.