Is Llama 2 Better than ChatGPT?

Let’s compare Llama 2 and GPT-4 in terms of their architecture, parameter sizes, and accessibility:

  1. Model Architecture:
  • Llama 2: Llama 2 is an auto-regressive language model built on an optimized transformer architecture.
  • GPT-4: GPT-4 is reported to be a Mixture of Experts (MoE) system composed of eight models with roughly 220 billion parameters each.
  2. Parameter Sizes:
  • Llama 2: Llama 2 comes in a range of parameter sizes: 7 billion, 13 billion, and 70 billion.
  • GPT-4: GPT-4 reportedly totals approximately 1.76 trillion parameters (eight expert models at roughly 220 billion parameters each).
  3. Language Support:
  • Llama 2: Llama 2 is intended for use in English.
  • GPT-4: The available information doesn’t specify GPT-4’s language support.
  4. Availability and Accessibility:
  • Llama 2: Llama 2 is open source and freely available for commercial and research use, making it accessible to startups, established businesses, and solo developers at no cost.
  • GPT-4: GPT-4 is a paid, API-based service with a commercial licensing model.

Considering the above information, we can draw some initial comparisons:

  • Model Size: GPT-4 is reportedly far larger than Llama 2, at approximately 1.76 trillion parameters versus 70 billion for Llama 2’s largest version.
  • Language Support: Llama 2 is intended for use in English, while GPT-4’s language support is not explicitly documented in the information above.
  • Accessibility: Llama 2 has the advantage here, being open source and free to use, while GPT-4 is a paid service.
  • Local Use: Llama 2 can be downloaded and run locally, including on Windows (see the sketch at the end of this post).

Ultimately, the choice between Llama 2 and GPT-4 depends on the user’s specific requirements and budget. A much larger model like GPT-4 can offer stronger performance and capabilities, but Llama 2’s free availability makes it an attractive option for anyone seeking a cost-effective chatbot solution.
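As a rough sketch of what local use looks like, here is a minimal example with the llama-cpp-python bindings; the model path, thread count, and prompt are illustrative assumptions, not a recommended setup:

# Minimal local-inference sketch (pip install llama-cpp-python).
# The checkpoint path below is hypothetical; point it at whatever
# quantized GGML file you downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b-chat.ggmlv3.q4_0.bin",  # hypothetical path
    n_ctx=4096,    # context window
    n_threads=8,   # CPU threads; tune for your machine
)

output = llm(
    "[INST] <<SYS>> You are a helpful assistant. <</SYS>> "
    "Say hello in one sentence. [/INST]",
    max_tokens=128,
    temperature=0.7,
)
print(output["choices"][0]["text"])

The same GGML files work on Windows, macOS, and Linux, since llama.cpp runs entirely on the CPU by default.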

Thanks for sharing the comparison.
I have been testing Llama 2 for the last couple of days. So far it only works well with English and fails in other languages, just as advertised. The Falcon models, however, work very well in other languages. I am currently experimenting with using multiple models, such as Falcon and Llama 2 together, to achieve something similar to GPT-4 (see the sketch below). Once we have that working, I think we will have a better comparison.
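As a rough illustration of the multi-model idea, the setup is just a thin language-based router in front of the two models. Everything here is a sketch: the generate_* functions are hypothetical stand-ins for whatever inference backend you actually use.

# Toy sketch: route English prompts to Llama 2, other languages to Falcon.
from langdetect import detect  # pip install langdetect

def generate_with_llama2(prompt: str) -> str:
    return "[llama2 would answer] " + prompt  # placeholder backend call

def generate_with_falcon(prompt: str) -> str:
    return "[falcon would answer] " + prompt  # placeholder backend call

def route(prompt: str) -> str:
    try:
        lang = detect(prompt)   # ISO 639-1 code, e.g. "en", "fr"
    except Exception:
        lang = "en"             # fall back to English on detection failure
    return generate_with_llama2(prompt) if lang == "en" else generate_with_falcon(prompt)

A real router would probably also consider task type and cost, but language alone already captures the Llama 2 vs. Falcon split described above.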

In terms of cost, I think GPT-4 is quite affordable unless you have extremely high usage. Running a single inference endpoint for models such as Falcon-40B or Llama-2-70B will cost at least $2K per month.
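Back-of-the-envelope, that implies a break-even point. A quick sketch, assuming GPT-4’s 8K-context API pricing from mid-2023 and an average request of 1K prompt plus 1K completion tokens:

# Rough break-even estimate: GPT-4 API vs. ~$2K/month self-hosting.
GPT4_PROMPT_PER_1K = 0.03       # $ per 1K prompt tokens (8K-context rate)
GPT4_COMPLETION_PER_1K = 0.06   # $ per 1K completion tokens
HOSTING_PER_MONTH = 2000        # $ to self-host Falcon-40B / Llama-2-70B

cost_per_request = GPT4_PROMPT_PER_1K + GPT4_COMPLETION_PER_1K  # $0.09
print(f"Break-even: ~{HOSTING_PER_MONTH / cost_per_request:,.0f} requests/month")
# -> ~22,222 requests/month before self-hosting becomes cheaper

Below roughly that volume, the API is the cheaper option even before you count the operations effort.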

The customers we are working with want to switch to open source models because they want to keep their data inside their organization.

Just some of the observations I have seen in the last couple months.


I have tested the Llama2 models locally at various quantization levels, including the 70B model, on my Mac Studio.

But my bigger concern for production is hosting cost (you need high uptime and low-ish latency). Like @nelson said, $2K per month. You might be able to get that price down, but it will still cost more than what the average API user pays.

Which is why API-based models are so appealing: lower cost than self-hosting, usually less hassle, and still low-ish latency.

But Llama2 is fun to mess around with at home. Just running on CPUs on a Mac Studio here.


Very interesting.
@curt.kennedy what is the typical response time for 1K total tokens when running on your local machine?

@nelson

76 sec for 1K tokens using Llama 2, 13 billion parameters @ 4-bit quantization

./main -t 16 -m ./models/llama2/llama-2-13b-chat.ggmlv3.q4_0.bin --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 --in-prefix-bos --in-prefix ' [INST] ' --in-suffix ' [/INST]' -i -p "[INST] <<SYS>> You are a helpful, respectful and honest assistant. <</SYS>> Write a story about llamas. [/INST]"
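# (flag notes: -t 16 = CPU threads, -c 4096 = context size, --temp 0.7 = sampling
#  temperature, --repeat_penalty 1.1 = discourage loops, -n -1 = generate until EOS,
#  --in-prefix/--in-suffix = wrap interactive input in Llama 2's [INST] tags)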
llama_print_timings:        eval time = 50167.78 ms /   645 runs   (   77.78 ms per token,    12.86 tokens per second)
llama_print_timings:        eval time = 57875.98 ms /   762 runs   (   75.95 ms per token,    13.17 tokens per second)
llama_print_timings:        eval time = 38968.11 ms /   510 runs   (   76.41 ms per token,    13.09 tokens per second)

84 sec for 1K tokens using Llama 2, 13 billion parameters @ 8-bit quantization

./main -t 16 -m ./models/llama2/llama-2-13b-chat.ggmlv3.q8_0.bin --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 --in-prefix-bos --in-prefix ' [INST] ' --in-suffix ' [/INST]' -i -p "[INST] <<SYS>> You are a helpful, respectful and honest assistant. <</SYS>> Write a story about llamas. [/INST]"
llama_print_timings:        eval time = 41616.74 ms /   494 runs   (   84.24 ms per token,    11.87 tokens per second)
llama_print_timings:        eval time = 37273.97 ms /   444 runs   (   83.95 ms per token,    11.91 tokens per second)
llama_print_timings:        eval time = 55865.34 ms /   652 runs   (   85.68 ms per token,    11.67 tokens per second)

187 sec for 1K tokens using Llama 2, 70 billion parameters @ 4-bit quantization

./main -t 16 -gqa 8 -m ./models/llama2/llama-2-70b-chat.ggmlv3.q4_0.bin --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 --in-prefix-bos --in-prefix ' [INST] ' --in-suffix ' [/INST]' -i -p "[INST] <<SYS>> You are a helpful, respectful and honest assistant. <</SYS>> Write a story about llamas. [/INST]"
llama_print_timings:        eval time = 88664.63 ms /   471 runs   (  188.25 ms per token,     5.31 tokens per second)
llama_print_timings:        eval time = 87636.06 ms /   470 runs   (  186.46 ms per token,     5.36 tokens per second)
llama_print_timings:        eval time = 105132.86 ms /   564 runs   (  186.41 ms per token,     5.36 tokens per second)
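(For reference, the headline seconds-per-1K-token numbers above are just the average ms-per-token figures from the logs, converted; a quick sanity check:)

# Convert llama.cpp's average ms-per-token into seconds per 1K tokens.
avg_ms_per_token = {
    "13B @ q4_0": (77.78 + 75.95 + 76.41) / 3,    # three runs above
    "13B @ q8_0": (84.24 + 83.95 + 85.68) / 3,
    "70B @ q4_0": (188.25 + 186.46 + 186.41) / 3,
}
for config, ms in avg_ms_per_token.items():
    seconds_per_1k = ms * 1000 / 1000  # ms/token * 1000 tokens, then ms -> s
    print(f"{config}: ~{seconds_per_1k:.1f} s per 1K tokens")
# -> ~76.7, ~84.6, ~187.0, matching the headline figures within rounding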

Note: For whatever reason, the 70B model only uses 2 CPUs and 1.4 GB of memory, while the 13B model uses 16 CPUs and 2.5 GB of memory.

I pulled the llama.cpp code a few days ago and recently updated it to support the 70B model; over time, hopefully they can speed up the 70B model to utilize more of my machine’s resources.

Here is a sample output using the 70B model at 4-bit quantization:

Once upon a time, in the rolling hills of the Andes, there lived a group of llamas. These llamas were known for their soft, warm fur and their gentle dispositions. They spent their days roaming the green fields, munching on grass and enjoying the fresh mountain air.

One llama in particular, named Luna, was very curious. She loved to explore the surrounding hills and valleys, always looking for new adventures. One day, while wandering through a dense thicket of trees, Luna stumbled upon a hidden cave.

Inside the cave, Luna found a treasure trove of glittering crystals and shiny rocks. She had never seen anything like it before and was immediately captivated. She spent hours admiring the sparkling gems and even tried to imitate their colors by twirling her fur in different ways.

As the sun began to set, Luna reluctantly left the cave and returned to her herd. But she couldn’t stop thinking about the crystals and rocks she had seen. She told all her friends about her discovery, but they didn’t believe her.

“There’s no way you found a cave full of treasure,” said one llama. “You must have been seeing things.”

Luna was determined to prove them wrong. The next day, she led the herd to the cave, and they were all amazed by its beauty. Together, they explored every nook and cranny, marveling at the sparkling gems and shiny rocks.

From that day on, the llamas made regular visits to the cave, always discovering new hidden treasures. And Luna, the curious llama who had found it all, was hailed as a hero by her herd. She had shown them that even in their own backyard, there was still so much to explore and discover.

As the years went by, the llamas continued to visit the cave, and it became a special place for them. They would go there to celebrate special occasions, like birthdays and anniversaries, and they would always leave an offering of grass or leaves as a thank you to the cave for its treasures.

And Luna, well, she never lost her sense of curiosity and wonder. She continued to explore the world around her, always looking for new adventures and hidden treasures. But she never forgot the magical cave that had started it all, and she made sure to visit it often, remembering the day she discovered its secrets and the joy it had brought to her and her herd.


Fantastic explanation and metrics on your tests, thank you!
It’s good to be able to run locally for development purposes.
For real-time responses we probably need a GPU, but I can see opportunities to use CPUs for post-processing, and even to run those workloads on spot instances.


@mkussmaan @curt.kennedy
Just wanted to share with you that I have been able to run a production version of Llama2-70B using multiple Nvidia 100 GPUs. The inference response time is under 30 seconds. In most cases, Llama2-70B gives better results than GPT-4 for English; GPT-4 is still much better for non-English languages.
If you’d like to take Llama2-70B for a test drive, you can get free access at https://www.superinsight.ai
You can switch between the Llama2 and GPT models to compare your results.
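For anyone curious about the general shape of a multi-GPU setup, here is a sketch using the Hugging Face stack; this is not our exact production configuration, and it assumes transformers plus accelerate and access to the meta-llama weights on the Hub:

# Sketch: shard Llama-2-70B across all visible GPUs with device_map="auto".
# pip install transformers accelerate
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # halves memory vs. fp32
    device_map="auto",          # accelerate spreads layers across the GPUs
)

inputs = tokenizer("[INST] Hello! [/INST]", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))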


I downloaded the Meta Llama 2 models locally, then tried to run the 7B model using this command: “torchrun --nproc_per_node 1 example_text_completion.py --ckpt_dir llama-2-7b --tokenizer_path tokenizer.model --max_seq_len 128 --max_batch_size 4”. When I run it, I get the error “Distributed package doesn’t have NCCL built in”, so I installed NCCL 2.16.5, which should support CUDA 11.8, but it still doesn’t work. I want to find a way to run the script; I have been blocked for 3 days. Please help if there is a solution. Thanks in advance.

Have you tried using this framework from Hugging Face?
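A minimal sketch with transformers (assuming you have been granted access to the meta-llama repo on the Hub); note that this runs in a single process, so it avoids torchrun, and with it the NCCL requirement, entirely:

# Single-process Llama 2 inference -- no torchrun, no NCCL needed.
# pip install transformers accelerate
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    device_map="auto",  # uses your GPU if present, otherwise CPU
)
result = generator("[INST] Write a haiku about llamas. [/INST]", max_new_tokens=64)
print(result[0]["generated_text"])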
