7 Largest AI Models Ever Trained by Parameter Count

Ranking the seven largest AI models by parameter count, from Switch Transformer (1.6T) to GPT-3 (175B), with specs and training compute.

6 min read · Updated 6/2026
Elena Vasquez

The largest AI models ever trained now exceed one trillion parameters, pushing the boundaries of language understanding, reasoning, and generation. This guide ranks the seven biggest models by confirmed parameter count, with details on architecture, training compute, and publication date. Figures are based on publicly available data as of early 2026 and may have changed.

1. Switch Transformer (1.6 Trillion Parameters)

Google’s Switch Transformer, introduced in a January 2021 preprint and published in JMLR in 2022, remains the largest confirmed model on this list. It is not a dense model: it reaches its scale through a sparse mixture-of-experts (MoE) design, in which only a fraction of the parameters are active for any given token. The largest variant has 1.6 trillion total parameters and uses top-1 routing, selecting one of 2048 experts per token. The authors reported a 4x pre-training speedup over the dense T5-XXL baseline, with competitive results on C4 language modeling and SuperGLUE. Despite its size, inference is feasible because only about 9.5 billion parameters are activated per token.
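
Top-1 routing can be illustrated with a small, self-contained sketch. This is a toy under stated assumptions, not Google's implementation: it uses 4 experts instead of 2048, random router weights instead of learned ones, and simple scaling functions in place of feed-forward experts. The names `top1_route`, `moe_layer`, and `router_weights` are hypothetical, and details of the real system such as expert capacity limits and the load-balancing loss are left out.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top1_route(token, router_weights):
    """Return (expert_index, gate_probability) for one token vector.

    router_weights is a hypothetical list of per-expert weight vectors;
    each router logit is the dot product of the token with one of them.
    """
    logits = [sum(t * w for t, w in zip(token, weights)) for weights in router_weights]
    probs = softmax(logits)
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]

def moe_layer(token, router_weights, experts):
    """Run only the selected expert and scale its output by the gate value."""
    expert, gate = top1_route(token, router_weights)
    return [gate * y for y in experts[expert](token)], expert

# Toy setup (assumption): 4 experts instead of 2048, each a simple scaling function.
random.seed(0)
DIM, NUM_EXPERTS = 8, 4
router_weights = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(NUM_EXPERTS)]

token = [random.gauss(0, 1) for _ in range(DIM)]
output, chosen = moe_layer(token, router_weights, experts)
print(f"token routed to expert {chosen}; output has {len(output)} dimensions")
```

Only the chosen expert's function is ever called, which is why per-token compute tracks the size of one expert rather than the total parameter count.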

2. GLaM (1.2 Trillion Parameters)

Google’s Generalist Language Model (GLaM), described in a December 2021 paper, packs 1.2 trillion parameters across 64 experts using MoE architecture. GLaM was trained on 1.6 trillion tokens from web pages, books, and news, and achieved strong zero-shot and one-shot results across 29 NLP tasks. Despite having 7× more parameters than GPT-3, GLaM required only 1/3 of the training energy due to its sparse activation. The model was never publicly released, but its architecture influenced later MoE designs.

3. PaLM (540 Billion Parameters)

Google’s Pathways Language Model (PaLM), announced in April 2022, is a dense 540-billion-parameter transformer trained on 780 billion tokens. PaLM was trained on a cluster of 6,144 TPU v4 chips and demonstrated few-shot reasoning breakthroughs on BIG-bench, mathematical problem-solving (GSM8K), and code generation (HumanEval). Its scaling curve showed that larger models continue to benefit from increased training data. PaLM was later succeeded by PaLM 2 (which has an undisclosed parameter count) and eventually Gemini.

[Figure: Growth of parameter counts across major AI models from 2018 onwards, showing the exponential increase]

4. Megatron-Turing NLG (530 Billion Parameters)

NVIDIA and Microsoft jointly announced Megatron-Turing NLG (MT-NLG) in October 2021, a dense 530-billion-parameter model. It posted strong results in natural language generation, reading comprehension, and commonsense reasoning. Trained using NVIDIA’s Megatron-LM for tensor parallelism and Microsoft’s DeepSpeed for pipeline parallelism, MT-NLG was at the time the largest dense model ever trained. It demonstrated that scaling dense architectures could yield consistent gains without the complexity of MoE.

5. Llama 3 (405 Billion Parameters)

Meta’s Llama 3 405B, released in July 2024 as part of the Llama 3.1 family, is a dense 405-billion-parameter model and the largest openly released model in its class. It was trained on over 15 trillion tokens from publicly available data, including web pages, code, and multilingual content. Llama 3 405B achieves results competitive with GPT-4 on many benchmarks (MMLU, HumanEval, GSM8K) while being freely available for download and fine-tuning. Its open release has accelerated research and deployment across industries, including robotics applications where smaller derived models run on edge hardware.

6. BLOOM (176 Billion Parameters)

The BigScience Large Open-science Open-access Multilingual (BLOOM) model, released in July 2022, is a 176-billion-parameter decoder-only transformer trained collaboratively by over 1,000 researchers. It was trained on 366 billion tokens across 46 natural languages and 13 programming languages. BLOOM is one of the largest truly open-weight models, enabling reproducible research. Its training used the Jean Zay supercomputer and the Megatron-DeepSpeed framework.

7. GPT-3 (175 Billion Parameters)

OpenAI’s GPT-3, described in the May 2020 paper “Language Models are Few-Shot Learners,” has 175 billion parameters and launched the modern scaling race. It demonstrated few-shot and zero-shot capabilities across translation, question-answering, and text generation. GPT-3 was trained on a mix of filtered CommonCrawl (about 570 GB of text), WebText, books, and Wikipedia. Despite being surpassed in size, GPT-3’s influence is hard to overstate: it showed that scaling models (and data) dramatically improves task performance, and it opened the door to commercial products such as the OpenAI API and, later, ChatGPT.

Comparison Table: 7 Largest AI Models

| Model | Parameter Count | Architecture | Training Compute | Year Released | Open Source |
|---|---|---|---|---|---|
| Switch Transformer | 1.6T (9.5B active) | MoE (2048 experts) | ~1,200 TPU-days | 2022 | No |
| GLaM | 1.2T | MoE (64 experts) | ~4,900 TPU-days | 2021 | No |
| PaLM | 540B | Dense Transformer | ~8,600 TPU-days | 2022 | No |
| Megatron-Turing NLG | 530B | Dense Transformer | ~6,500 GPU-days (NVIDIA A100) | 2021 | No |
| Llama 3 | 405B | Dense Transformer | ~30.8M GPU-hours (H100) | 2024 | Yes |
| BLOOM | 176B | Dense Transformer | ~3.5M GPU-hours (A100) | 2022 | Yes |
| GPT-3 | 175B | Dense Transformer | ~1.5M GPU-days (V100) | 2020 | No |
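
The training compute column mixes TPU-days, GPU-days, and GPU-hours on different hardware, so the rows are hard to compare directly. A hardware-independent way to put the dense models side by side is the common rule of thumb that training takes roughly 6 FLOPs per parameter per token. The sketch below applies it to the three dense models for which this article quotes both a parameter count and a token count. The 6ND approximation is an assumption brought in for illustration, not a figure reported by the model authors, and the Llama 3 token count is the lower bound of "over 15 trillion".

```python
# Parameter and token counts as quoted in this article (dense models only).
MODELS = {
    "PaLM": (540e9, 780e9),
    "Llama 3 405B": (405e9, 15e12),  # "over 15 trillion" tokens, so a lower bound
    "BLOOM": (176e9, 366e9),
}

def approx_training_flops(params, tokens):
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per training token."""
    return 6 * params * tokens

def tokens_per_parameter(params, tokens):
    """How many training tokens the model saw for each of its parameters."""
    return tokens / params

for name, (params, tokens) in MODELS.items():
    flops = approx_training_flops(params, tokens)
    ratio = tokens_per_parameter(params, tokens)
    print(f"{name:>13}: ~{flops:.2e} FLOPs, {ratio:.1f} tokens per parameter")
```

The tokens-per-parameter figure shows how much training practice shifted between 2022 and 2024: by these numbers Llama 3 405B saw roughly 37 tokens per parameter, against fewer than 3 for PaLM and BLOOM. The estimate does not transfer directly to MoE models, where the active parameter count would be the more relevant input.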

Frequently Asked Questions

What is the largest AI model by parameter count? The largest confirmed model is Google’s Switch Transformer at 1.6 trillion parameters, though GPT-4 may be larger (unconfirmed). Newer Mixture-of-Experts models such as DeepSeek V2 have also reported high total parameter counts.

Why do some trillion-parameter models only use a fraction of parameters at inference? MoE architectures activate only a few experts per token, so inference remains efficient even with trillion-scale total parameters — active parameters often stay in the low billions.
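
To make the gap between total and active parameters concrete, here is a short calculation using the Switch Transformer figures quoted in this article. The function name `active_fraction` is only illustrative.

```python
def active_fraction(total_params, active_params):
    """Share of a sparse model's parameters used for a single token."""
    return active_params / total_params

# Switch Transformer figures as quoted in this article.
total, active = 1.6e12, 9.5e9
print(f"active share per token: {active_fraction(total, active):.2%}")
# prints: active share per token: 0.59%
```

By these figures, well under 1% of the weights take part in processing any one token, although all of them still have to be stored somewhere.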

How are these models relevant to robotics? Large language models are increasingly used as the cognitive layer in humanoid robots, enabling natural language commands, task planning, and code generation for manipulation.

Is there a limit to how large AI models can become? Physical constraints like compute, memory bandwidth, and energy consumption impose limits, but research into sparse attention, quantization, and distributed training continues to push the boundary.
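
The memory side of that limit, and the effect of quantization on it, can be estimated with simple arithmetic: weights occupy roughly parameter count times bytes per parameter. The sketch below applies this to a 405-billion-parameter model. It is a rough estimate under simplifying assumptions: it counts weights only and ignores activations, the KV cache, and the small overhead of quantization scales.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_params, precision):
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for precision in BYTES_PER_PARAM:
    size = weight_memory_gb(405e9, precision)
    print(f"405B parameters at {precision:>9}: {size:,.1f} GB")
```

Even at 4 bits per weight, a model of this size needs about 200 GB for weights alone, which is why serving it still requires multiple accelerators.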

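Which of these models is available for commercial use? Llama 3 405B and BLOOM both have openly downloadable weights, and both licenses allow commercial use subject to conditions. The Llama community license and BigScience’s RAIL license each include use restrictions, so neither is a fully permissive open-source license. GPT-3 is closed-source but accessible via OpenAI’s API.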

Conclusion

The race to build ever-larger AI models has seen parameter counts grow from 175 billion in 2020 to 1.6 trillion among publicly confirmed models. MoE architectures enable trillion-scale models without proportional compute cost, while openly released models like Llama 3 405B broaden access. Future breakthroughs may come not from raw size alone, but from more efficient training, better data, and specialized architectures that combine reasoning with real-world interaction, which is exactly what robotics demands.

Do you think sparse MoE models or dense models will dominate the next generation of large AI systems?
