Investigating Google Pathways - Switch Transformers (Argmax MoE) + Sharding - Maths behind it

Hello everyone! You might or might not have heard of Google Pathways: Introducing Pathways: A next-generation AI architecture.

Google seems to claim that Pathways is some “new” huge model which handles multiple tasks efficiently. They haven’t released it; however, at first glance it seems they’re essentially making an efficient version of WuDao. WuDao used the Mixture of Experts (MoE) approach: after Multi-Head Attention, we use M dense layers (“experts”) rather than one (i.e. each dense layer “specializes” in, say, verbs, nouns, etc.).

The routing is done via a softmax, so the MoE output is a weighted average over the outputs of all M FC layers.
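To make the routing concrete, here's a minimal NumPy sketch of a dense MoE layer. All shapes and names (`router_weights`, `expert_weights`, T tokens, M experts, hidden size d) are my own illustrative assumptions, not from any particular paper's code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(tokens, expert_weights, router_weights):
    """Classic (dense) MoE: each token's output is a softmax-weighted
    average of ALL experts' outputs -- every expert runs on every token."""
    # tokens: (T, d), router_weights: (d, M), expert_weights: (M, d, d)
    gate = softmax(tokens @ router_weights)                         # (T, M) routing probabilities
    expert_out = np.einsum('td,mde->tme', tokens, expert_weights)   # (T, M, d) all experts' outputs
    return np.einsum('tm,tme->te', gate, expert_out)                # weighted average per token

rng = np.random.default_rng(0)
T, d, M = 4, 8, 3
y = moe_layer(rng.normal(size=(T, d)),
              rng.normal(size=(M, d, d)) * 0.1,
              rng.normal(size=(d, M)))
print(y.shape)  # (4, 8)
```

Note the cost: every token touches every expert, which is exactly what Switch Transformers avoids.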

In Switch Transformers ([2101.03961] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity), instead of taking a weighted average of M dense layers, we take the ARGMAX and route each token through just ONE FC layer. Quite ingenious.
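A minimal sketch of that top-1 routing, assuming the same made-up shapes as before. One detail from the paper worth showing: the chosen expert's output is still scaled by its gate probability, so the router receives gradient even though only one expert runs:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def switch_layer(tokens, expert_weights, router_weights):
    """Switch Transformer routing: argmax picks ONE expert per token;
    output is scaled by that expert's gate probability so the router
    stays trainable."""
    gate = softmax(tokens @ router_weights)    # (T, M)
    top1 = gate.argmax(axis=-1)                # (T,) chosen expert index per token
    out = np.empty_like(tokens)
    for t, m in enumerate(top1):
        out[t] = gate[t, m] * (tokens[t] @ expert_weights[m])  # only ONE expert runs
    return out, top1

rng = np.random.default_rng(0)
T, d, M = 4, 8, 3
out, chosen = switch_layer(rng.normal(size=(T, d)),
                           rng.normal(size=(M, d, d)) * 0.1,
                           rng.normal(size=(d, M)))
print(out.shape)  # (4, 8)
```

So compute per token is that of a single FC layer, no matter how large M grows.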

In sharding ([2006.16668] GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding), since MoE has M dense layers (e.g. M = 2048 experts in GShard’s largest model) rather than 1, we give each of K devices M/K dense layers. Likewise, prefetching data, etc., and using a bfloat16-to-float32 casting trick is pretty neat.
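A toy sketch of both ideas. The expert-to-device assignment is trivial bookkeeping; for the precision trick, NumPy has no bfloat16, so float16 stands in here (an assumption on my part): keep tensors in low precision, but upcast to float32 just for the numerically sensitive router softmax.

```python
import numpy as np

def shard_experts(num_experts, num_devices):
    """Assign M experts to K devices, M/K each (assumes K divides M)."""
    assert num_experts % num_devices == 0
    per = num_experts // num_devices
    return {dev: list(range(dev * per, (dev + 1) * per))
            for dev in range(num_devices)}

print(shard_experts(8, 4))  # {0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}

# Precision trick (float16 as a stand-in for bfloat16): store/communicate
# in low precision, upcast only for the softmax.
logits = np.random.default_rng(2).normal(size=(4, 8)).astype(np.float16)
stable_probs = np.exp(logits.astype(np.float32))
stable_probs /= stable_probs.sum(axis=-1, keepdims=True)
```

With the assignment above, the all-to-all step only has to ship each token to the one device hosting its chosen expert.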

It’ll be quite fascinating if FNet ([2105.03824] FNet: Mixing Tokens with Fourier Transforms), i.e. replacing attention with 2 FFTs and a smart Real() trick to discard the imaginary part, is used as well.
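The FNet mixing step really is this small, a FFT over the hidden dimension, a FFT over the sequence dimension, then keep only the real part (batch dimension omitted here for simplicity):

```python
import numpy as np

def fnet_mixing(x):
    """FNet token mixing: 1-D FFT over the hidden dim, then over the
    sequence dim, keeping only the real part -- replaces self-attention."""
    return np.fft.fft(np.fft.fft(x, axis=-1), axis=-2).real

x = np.random.default_rng(1).normal(size=(16, 32))  # (seq_len, hidden)
print(fnet_mixing(x).shape)  # (16, 32)
```

Since the FFT is separable, this is equivalent to a single 2-D FFT over both axes, and it has no learned parameters at all.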

Likewise, distillation ([1503.02531] Distilling the Knowledge in a Neural Network) + the lottery ticket hypothesis ([1803.03635] The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks) could be integrated into Pathways.
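For the distillation side, the core of Hinton et al.'s method is just a cross-entropy between temperature-softened teacher and student distributions. A minimal sketch (names and the toy logits are my own; in practice this term is mixed with the ordinary hard-label loss):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Soft-target distillation loss: cross-entropy between
    temperature-softened teacher and student distributions, scaled by
    T^2 so gradients stay comparable across temperatures."""
    p_teacher = softmax(teacher_logits / T)
    log_p_student = np.log(softmax(student_logits / T) + 1e-12)
    return -(T ** 2) * (p_teacher * log_p_student).sum(axis=-1).mean()

rng = np.random.default_rng(3)
student, teacher = rng.normal(size=(5, 10)), rng.normal(size=(5, 10))
print(distillation_loss(student, teacher))
```

A higher T flattens both distributions, pushing the student to match the teacher's relative probabilities on wrong classes too, which is where much of the "dark knowledge" lives.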

From the looks of it, Pathways combines multiple “tricks” and may in fact have multiple TRILLIONS of parameters.

ANYWAYS, was just asking for everyone’s comments / analysis on Pathways. Is my understanding flawed or somewhat correct?

Likewise, I shared previously Mathematics behind GPT3 - Masked Multihead Self Attention about the maths behind Self Attention. I was a bit inundated recently, but if anyone’s interested to chat more / or I could even post here, I was gonna go over backprop, MoE, Switch Transformers, Sharding, Distillation, Lottery Ticket Hypothesis, FNet etc.

Likewise, I shared the stuff I’m working on with my bro - Moonshot - Predicting the future and making JARVIS! - although Pathways’s premise is somewhat similar (ie trying to create world models), Moonshot’s aim is to forecast futures of everything and give back insight on why the future will be like the way it is. If anyone is interested in helping, feel free to msg me!


This is merely biomimetic in nature. The human brain “recycles” neural circuits, co-opting them as a form of “deduplication”. It does this for tasks, memories… everything. From a neuroscience perspective this is just like “Well, yeah”. Still a cool innovation.


Ye sounds about right! It’s quite interesting to see people combine ideas up together!