What is Mixture of Experts (MoE), and how does it make DeepSeek far more efficient than other models?

via Coldfusion:

OpenAI’s latest model, GPT-4, attempts to be Einstein, Shakespeare, and Picasso rolled into one. DeepSeek, on the other hand, is more like a university divided into expert departments. This structure allows the AI to identify the type of query it receives and delegate it to a specialized part of its digital brain. The remaining sections stay switched off, saving time, energy, and most importantly, computing power.

A useful way to think about this is through the analogy of a call center. Imagine a call center receiving all kinds of inquiries daily. If a single human had to manually handle each call before deciding where to direct it, it wouldn’t be the most efficient approach.

Instead, an automated system prompts the caller to choose the type of support they need. Only after making a selection is the call routed to the appropriate department or human representative.

This Mixture of Experts (MoE) technique enables DeepSeek’s models to perform at a superior level compared to other open-source models while also significantly reducing the cost of training.
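To make the "departments and router" idea concrete, here is a minimal PyTorch sketch of top-k MoE routing. It is illustrative only, not DeepSeek's actual implementation: the MoELayer name, the layer sizes, and the choice of 8 experts with 2 active per token are assumptions. A small gating network scores every expert for each token, only the top-scoring experts run, and their outputs are blended by the gate's weights, while the unselected experts never compute anything.

```python
# Minimal sketch of top-k Mixture-of-Experts routing (illustrative, not DeepSeek's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The "departments": small independent feed-forward networks.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        # The "automated call router": a linear gate that scores every expert per token.
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x):                              # x: (tokens, d_model)
        scores = self.gate(x)                          # (tokens, num_experts)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)       # blend weights for the chosen experts
        out = torch.zeros_like(x)
        # Only the selected experts ever run; the rest stay "switched off".
        for slot in range(self.top_k):
            for e in topk_idx[:, slot].unique():
                mask = topk_idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[int(e)](x[mask])
        return out

tokens = torch.randn(10, 64)                           # 10 token embeddings
print(MoELayer()(tokens).shape)                        # torch.Size([10, 64])
```

With 8 experts and only 2 active per token, roughly three quarters of the expert parameters sit idle on any given token, which is where the savings in compute (and training cost) come from.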

via the paper, DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model:

In practice, compared to DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance while simultaneously reducing training costs by 42.5%, cutting KV cache usage by 93.3%, and increasing maximum generation throughput by 5.76 times.

I’ll take more efficiency for less cost any day.
