Near the end of 2023, there was a buzz around the French startup Mistral AI [1] after it released an open-source model rivaling ChatGPT in performance. One of its more powerful models is named "Mixtral of Experts" [2] and is described as a "Sparse Mixture-of-Experts" (SMoE) model, but what is that? In this article, we will explore Mixture-of-Experts models and discuss the idea behind the gating mechanism used by the Sparse Mixture-of-Experts. We will also discuss the use of Mixture-of-Experts models in the Transformer architecture.

## What are Mixture-of-Experts?

The first version of Mixture-of-Experts (MoE) was presented in the seminal work "Adaptive Mixtures of Local Experts" [3], where the authors explore an idea akin to that of ensemble learning methods: a supervised learning procedure for a system composed of separate networks, or experts, each handling a different subset of the training cases. In this scenario, each "expert network" specializes in a different region of the input space. The combination of these networks into the final output of the model is handled by a gating mechanism trained alongside the experts. Initially, the experts were expected to take on complex, high-level tasks, such as acting as math or logic experts. In practice, however, obtaining quality models with that approach proved difficult.

In 2017, there was an exciting breakthrough in the area of natural language processing. The work of Shazeer et al. [4] explored the idea of "token"-level experts within an LSTM architecture (which was standard at the time) by providing an MoE layer that routes each token to the corresponding experts.

In their most basic form, the experts are simple feed-forward networks. However, they can be more complex networks, including other MoEs, creating a hierarchy within the network.

The tricky part of this architecture lies in training the gating mechanism; if the training is done naively, it can end up hurting performance and increasing the computational cost of the network. The authors dealt with this problem with the help of sparsity and conditional computation, which keep the model efficient even at a larger scale. MoEs have allowed the training of models with over a trillion parameters [5].

## What are Sparse Mixture-of-Experts?

As explained above, Mixture-of-Experts (MoE) models depend on a good implementation of the gating mechanism; otherwise, they can end up costing more computation overall. Sparse Mixture-of-Experts builds on the idea of conditional computation. Since any expert that receives a signal, no matter how small, must still be fully computed, the authors of the Sparse Mixture-of-Experts (SMoE) model [4] used a gating function to control the passing of information:
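In the notation of Shazeer et al. [4], for a layer with $n$ experts $E_1, \dots, E_n$ and a gating network $G$, the output $y$ of the MoE layer for an input $x$ can be written as:

$$
y = \sum_{i=1}^{n} G(x)_i \, E_i(x)
$$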

Conceptually, an input that goes through the MoE layer above is passed through all the experts E. However, the gating function G is trained so that, for a given input, it outputs exactly 0 for most experts. The experts whose gate value equals zero are not computed for that input.
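The conditional computation described here can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the names `moe_forward` and `make_expert` are hypothetical, the experts are toy functions that scale their input, and the gate values are given by hand instead of being produced by a trained gating network.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Expert = Callable[[Sequence[float]], Vector]


def moe_forward(x: Sequence[float], gates: Sequence[float],
                experts: Sequence[Expert]) -> Vector:
    """Weighted sum of expert outputs, skipping experts whose gate is zero."""
    if len(gates) != len(experts):
        raise ValueError("need exactly one gate value per expert")
    y: Vector = []
    for gate, expert in zip(gates, experts):
        if gate == 0.0:
            continue  # conditional computation: this expert is never evaluated
        out = expert(x)
        if not y:
            y = [0.0] * len(out)
        for j, value in enumerate(out):
            y[j] += gate * value
    return y


# Toy experts (hypothetical): each one scales the input and records that it ran.
calls: List[int] = []


def make_expert(index: int, scale: float) -> Expert:
    def expert(x: Sequence[float]) -> Vector:
        calls.append(index)
        return [scale * v for v in x]
    return expert


experts = [make_expert(i, s) for i, s in enumerate([1.0, 2.0, 3.0, 4.0])]
y = moe_forward([1.0, -1.0], [0.0, 0.25, 0.0, 0.75], experts)
print(y, calls)  # only experts 1 and 3 were evaluated
```

With four experts and two non-zero gates, only half of the expert computations are performed, which is the whole point of the sparse formulation.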

The gating function can be anything that maps to zero under certain conditions. A straightforward starting point is a Softmax over a learned linear projection of the input, but a plain Softmax never outputs exactly zero, so the authors of the SMoE paper proposed some tweaks to it:
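Following the noisy top-K gating of Shazeer et al. [4], and using that paper's symbol names, the tweaked gating function is:

$$
G(x) = \mathrm{Softmax}\big(\mathrm{KeepTopK}(H(x), k)\big)
$$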

Here, H is a function that computes a score for each expert from the input x and adds tunable noise. The noise helps balance the load across experts so that the same expert is not selected for every single input:
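In the formulation of [4], where $W_g$ and $W_{noise}$ are trainable weight matrices, the score for expert $i$ is:

$$
H(x)_i = (x \cdot W_g)_i + \mathrm{StandardNormal}() \cdot \mathrm{Softplus}\big((x \cdot W_{noise})_i\big)
$$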

Finally, from that output, we select the top K experts with the KeepTopK function. In practice, this value should be low (1 or 2) to limit the number of experts computed per input. In their original work, the authors route each input to more than one expert (K of at least 2) so that the gating function can better learn how to route to different experts:
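As defined in [4], KeepTopK keeps the $k$ largest entries of a vector $v$ and sets the rest to $-\infty$:

$$
\mathrm{KeepTopK}(v, k)_i =
\begin{cases}
v_i & \text{if } v_i \text{ is among the top } k \text{ elements of } v \\
-\infty & \text{otherwise}
\end{cases}
$$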

As you can see, the elements that are not part of the top K are set to negative infinity before the Softmax, so their gate value is exactly 0, which effectively conditions the computation on the selected experts.
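Putting the pieces together, the noisy top-K gate can be sketched in plain Python. This is a minimal illustration and not the authors' implementation: the helper names (`softplus`, `matvec`, `noisy_top_k_gating`) and the toy weight matrices are assumptions, and the Softmax is taken only over the K kept scores, which gives the same result as setting the other scores to negative infinity first.

```python
import math
import random
from typing import List, Sequence


def softplus(z: float) -> float:
    # log(1 + e^z), written in a numerically stable form
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def matvec(x: Sequence[float], w: Sequence[Sequence[float]]) -> List[float]:
    # x has length d, w has d rows and n columns; returns x . W (length n)
    return [sum(x[r] * w[r][c] for r in range(len(x))) for c in range(len(w[0]))]


def noisy_top_k_gating(x, w_gate, w_noise, k: int, rng: random.Random) -> List[float]:
    clean = matvec(x, w_gate)                               # (x . W_g)_i
    noise_scale = [softplus(v) for v in matvec(x, w_noise)]  # Softplus((x . W_noise)_i)
    h = [c + rng.gauss(0.0, 1.0) * s for c, s in zip(clean, noise_scale)]
    # KeepTopK: indices of the k largest noisy scores
    top = sorted(range(len(h)), key=lambda i: h[i], reverse=True)[:k]
    # Softmax over the kept scores; all other gates stay exactly 0
    m = max(h[i] for i in top)
    exps = {i: math.exp(h[i] - m) for i in top}
    total = sum(exps.values())
    gates = [0.0] * len(h)
    for i in top:
        gates[i] = exps[i] / total
    return gates


# Toy example (hypothetical weights): 3 input features, 4 experts, K = 2.
x = [0.5, -1.0, 2.0]
w_gate = [[0.1, 0.2, -0.3, 0.4],
          [0.0, -0.1, 0.2, 0.3],
          [0.5, 0.1, 0.0, -0.2]]
w_noise = [[0.0] * 4 for _ in range(3)]  # zero weights give a constant noise scale
gates = noisy_top_k_gating(x, w_gate, w_noise, k=2, rng=random.Random(0))
print(gates)  # two non-zero entries that sum to 1, two exact zeros
```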

### Load Balancing of Experts

The authors observed that even after the added noise in the gating mechanism, the network still converged to a state where it favored the same few experts by assigning them large weights. The imbalance is self-reinforcing since the favored experts are trained more rapidly than the others, which in turn makes the gating function select them more often overall. To overcome this problem, the authors designed an extra loss function. First, they define the importance of an expert relative to a batch of training examples to be the batch-wise sum of the gate values for that expert:
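In the notation of [4], for a batch $X$ of training examples, the importance is a vector with one entry per expert:

$$
\mathrm{Importance}(X) = \sum_{x \in X} G(x)
$$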

Using the importance value, they can calculate the auxiliary loss function as the square of the coefficient of variation of the set of importance values, multiplied by a hyperparameter scaling factor:
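Written out as in [4], with $w_{importance}$ as the hand-tuned scaling factor and $CV$ as the coefficient of variation:

$$
L_{importance}(X) = w_{importance} \cdot CV\big(\mathrm{Importance}(X)\big)^2
$$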

The coefficient of variation is the ratio of a population's standard deviation to its mean. In this case, the population is the set of importance values, so the loss measures how unevenly the experts are used: it is large when a select few experts receive most of the gate weight and small when all experts are used about equally.
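As a small worked example, the importance values and the squared coefficient of variation can be computed by hand in plain Python. This is a sketch under assumptions: the function names are hypothetical, the gate values are made up, and the population variance is used, as in the definition given in this section.

```python
from typing import List, Sequence


def importance(gate_batch: Sequence[Sequence[float]]) -> List[float]:
    """Batch-wise sum of the gate values, one entry per expert."""
    n_experts = len(gate_batch[0])
    return [sum(gates[i] for gates in gate_batch) for i in range(n_experts)]


def cv_squared(values: Sequence[float]) -> float:
    """Squared coefficient of variation: population variance / mean^2."""
    mean = sum(values) / len(values)
    if mean == 0.0:
        return 0.0
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return variance / mean ** 2


def importance_loss(gate_batch: Sequence[Sequence[float]], w_importance: float) -> float:
    return w_importance * cv_squared(importance(gate_batch))


# Two toy batches of gate vectors for 4 experts (made-up values).
balanced = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]  # every expert gets 0.5
skewed = [[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]    # expert 0 gets everything

print(importance_loss(balanced, w_importance=0.1))  # no penalty when usage is even
print(importance_loss(skewed, w_importance=0.1))    # positive penalty when one expert dominates
```

For the skewed batch the importance vector is `[2, 0, 0, 0]`, with mean 0.5 and population variance 0.75, so the squared coefficient of variation is 3 before scaling.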

## Adding Transformers to the Mix

### GShard and Scaling Computations

The Mixtral of Experts model architecture is based on the Transformer, not on LSTMs. Its design follows that of GShard [6], which replaces every other FFN layer with an MoE layer using top-2 gating in both the encoder and the decoder.

In the case of Mixtral of Experts, every FFN sub-block is replaced with an MoE layer. This kind of architecture distributes well across multiple devices, which makes it beneficial for large-scale computing: when scaled to multiple devices, the MoE layer is sharded across devices (each device holds a subset of the experts), and all the other layers are replicated.

The GShard authors made a couple of changes to the training process. For one, they added random routing: with top-2 expert selection, the first expert is always picked, while the second expert is picked with a probability proportional to its weight. They also introduced the concept of expert capacity, where each expert is assigned a fixed maximum number of tokens it can process. If an expert is at maximum capacity, it does not process the token; a token whose selected experts are all full is considered overflowed and is passed on to the next layer through the residual connection.
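The two ideas in this paragraph, random routing of the second expert and a per-expert capacity, can be sketched as follows. This is an illustration under stated assumptions, not GShard's code: `route_top2` and its parameters are hypothetical names, the two selected gates are normalized to sum to one, and the proportionality constant `second_scale` is an assumption of this sketch. What happens to an overflowed token is left out.

```python
import random
from typing import List, Sequence


def route_top2(gates: Sequence[float], load: List[int], capacity: int,
               rng: random.Random, second_scale: float = 2.0) -> List[int]:
    """Return the experts that will process one token, updating `load` in place."""
    order = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)
    first, second = order[0], order[1]
    total = gates[first] + gates[second]
    g2 = gates[second] / total if total > 0 else 0.0  # normalized second gate

    chosen: List[int] = []
    if load[first] < capacity:  # the top expert is used whenever it has room
        load[first] += 1
        chosen.append(first)
    # Random routing: the second expert is used with probability
    # proportional to its (normalized) weight, and only if it has room.
    if load[second] < capacity and rng.random() < second_scale * g2:
        load[second] += 1
        chosen.append(second)
    return chosen  # an empty list means the token overflowed


rng = random.Random(0)
load = [0, 0, 0]  # tokens assigned so far to each of 3 experts
first_token = route_top2([0.5, 0.5, 0.0], load, capacity=1, rng=rng)
second_token = route_top2([0.5, 0.5, 0.0], load, capacity=1, rng=rng)
print(first_token, second_token, load)
```

In this toy run both tokens prefer experts 0 and 1. With a capacity of one token per expert, the first token fills both experts and the second token overflows.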

### Switch Transformer

A more recent work that explores the use of Mixture-of-Experts in the Transformer architecture is the Switch Transformer [5] model, which scales to even more parameters than GShard. The authors released the model publicly via Hugging Face [7], with 2048 experts totaling 1.6 trillion parameters.

The authors replace each FFN layer with an MoE layer; in this case, there's a special "Switch Transformer Layer", illustrated in the paper with two inputs (i.e., two different tokens) and four experts to choose from. Unlike GShard and the SMoE case, the top 2 experts are not selected; instead, each token is routed to a single expert (top-1 routing). This decision reduces the computation spent on routing while preserving model quality.

The authors of the Switch Transformer paper also explore the expert capacity constraint. The capacity is defined as the number of tokens per batch divided by the number of experts, multiplied by a capacity factor, which is a hyperparameter. In general, the capacity factor is kept low and close to one, since a larger value leads to more computation and more expensive communication between the devices hosting the experts.
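The capacity formula and its effect on top-1 routing can be illustrated with a short sketch. The function names, the rounding up to an integer, and the toy routing decisions are assumptions made for this example; in this sketch, tokens sent to a full expert are simply reported as overflowed.

```python
import math
from typing import List, Sequence, Tuple


def expert_capacity(tokens_per_batch: int, num_experts: int,
                    capacity_factor: float) -> int:
    # (tokens per batch / number of experts) * capacity factor,
    # rounded up to a whole number of tokens (the rounding is an assumption)
    return math.ceil(tokens_per_batch / num_experts * capacity_factor)


def dispatch_top1(expert_ids: Sequence[int], num_experts: int,
                  capacity: int) -> Tuple[List[int], List[int]]:
    """Split token positions into those that fit and those that overflow."""
    load = [0] * num_experts
    kept: List[int] = []
    overflowed: List[int] = []
    for token, expert in enumerate(expert_ids):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(token)
        else:
            overflowed.append(token)
    return kept, overflowed


# 8 tokens, 4 experts; the router (not shown) picked these experts.
expert_ids = [0, 0, 0, 1, 1, 2, 3, 3]

tight = expert_capacity(8, 4, capacity_factor=1.0)   # 2 tokens per expert
loose = expert_capacity(8, 4, capacity_factor=1.25)  # 3 tokens per expert

print(dispatch_top1(expert_ids, 4, tight))  # the third token sent to expert 0 overflows
print(dispatch_top1(expert_ids, 4, loose))  # every token fits
```

A larger capacity factor leaves fewer overflowed tokens but reserves more buffer space per expert, which is the trade-off behind keeping the factor close to one.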

## What are experts learning?

Zoph et al. [8] explored in more detail what each expert in a Mixture-of-Experts model is learning. They observed that encoder experts are more specialized than decoder experts. The experts usually specialize in a group of tokens or shallow concepts, such as punctuation, proper nouns, numbers, adjectives, etc. They also explored this in a multilingual setting but, contrary to what one might expect, no expert specialized in a particular language, because the load balancing spreads the tokens of each language across experts.

The following is a table showing some of the examples of what each expert specializes in according to the work of Zoph et al.:

## Sparse Mixture-of-Experts vs. Classic Dense Models

One thing to consider when deciding whether to use a Sparse Mixture-of-Experts or a classic dense model is throughput. If we have a fixed computational budget for pre-training a model, chances are that we will get more out of that budget with an SMoE model than with a dense model, primarily because it has a better ratio of cost to final quality. If that budget doesn't exist and we can train the model for a much longer period of time, the dense model will likely outperform the SMoE. Also, training this type of SMoE model requires much more VRAM, since all experts must be held in memory; thus, when those resources are constrained, the dense model will also be the better alternative.

One thing to consider here is that even though GShard had 600 billion parameters, compared to the 175 billion parameters of GPT-3, these numbers are not directly comparable: a dense model uses all of its parameters for every token, while a sparse model only activates the parameters of the selected experts, a small fraction of the total.
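A rough back-of-the-envelope calculation shows why the raw parameter counts of sparse and dense models mean different things. The function name and all the numbers are hypothetical and chosen only to keep the arithmetic simple; they do not describe GShard, GPT-3, or any other real model, and the small router parameters are ignored.

```python
from typing import Tuple


def moe_param_counts(shared_params: int, params_per_expert: int,
                     num_experts: int, experts_per_token: int,
                     num_moe_layers: int) -> Tuple[int, int]:
    """Return (total parameters, parameters active for a single token)."""
    total = shared_params + num_moe_layers * num_experts * params_per_expert
    active = shared_params + num_moe_layers * experts_per_token * params_per_expert
    return total, active


# Hypothetical sparse model: 1B shared parameters, 10 MoE layers,
# 8 experts of 100M parameters each, top-2 routing.
total, active = moe_param_counts(
    shared_params=1_000_000_000,
    params_per_expert=100_000_000,
    num_experts=8,
    experts_per_token=2,
    num_moe_layers=10,
)
print(total, active)  # 9B parameters stored, 3B used per token
```

In this made-up configuration the model stores 9 billion parameters, but each token only touches 3 billion of them, so its per-token compute is closer to that of a 3-billion-parameter dense model.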

## Where to find different Mixture-of-Experts models?

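Hugging Face provides a large collection of pre-trained models in its repository [9]; these are some of the most popular ones:

- The Switch Transformers release collection [10]
- NLLB-MoE [11]
- The models published by Fuzhao Xue [12]
- The Mistral AI models [13], including Mixtral [14]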

## Concluding Remarks

Mixture-of-Experts models, and in particular their sparse version, provide a very ingenious way to expand the number of parameters while keeping the computational cost per token roughly constant.

This approach has shown impressive results with the release of Mixtral of Experts by Mistral AI, which, I have to add, is an Apache-licensed model. Its results are comparable to those of GPT-3.5, and it comes as both a regular large language model and an instruct-tuned one. It also outperforms Llama 2 70B with roughly two-thirds of the parameters and, at the time of writing, has become the top non-proprietary licensed model.

On the other hand, models like the Switch Transformer are still fresh, and there are a lot of paths to explore in terms of model capability and limits. Especially after the impressive results of Mixtral of Experts, there will be many more models following suit and advancing in these techniques.

## References

[1] Mistral AI. Frontier AI in your hands. https://mistral.ai/

[2] Mixtral of Experts. A high-quality Sparse Mixture-of-Experts. https://mistral.ai/news/mixtral-of-experts/

[3] Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3(1), 79-87.

[4] Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., & Dean, J. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.

[5] Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120), 1-39.

[6] Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., & Chen, Z. (2020). GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668.

[7] Hugging Face Model Repository. Switch Transformers C - 2048 experts (1.6T parameters for 3.1 TB). https://huggingface.co/google/switch-c-2048

[8] Zoph, B., Bello, I., Kumar, S., Du, N., Huang, Y., Dean, J., Shazeer, N., & Fedus, W. (2022). ST-MoE: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906.

[9] Hugging Face (2024). Mixture-of-Experts Models. https://huggingface.co/models?other=moe&sort=trending

[10] Hugging Face. Switch Transformers Release. https://huggingface.co/collections/google/switch-transformers-release-6548c35c6507968374b56d1f

[11] Hugging Face. NLLB-MoE Model Card. https://huggingface.co/facebook/nllb-moe-54b

[12] Hugging Face. Fuzhao Xue. https://huggingface.co/fuzhao

[13] Hugging Face. Mistral AI. https://huggingface.co/mistralai

[14] Hugging Face. Welcome Mixtral - a SOTA Mixture of Experts on Hugging Face. https://huggingface.co/blog/mixtral
