Tsinghua University Professor Zhai Jidong: The System Revolution Behind DeepSeek’s 100x Computational Efficiency

DeepSeek, a Chinese AI company, has developed a powerful model with only 2048 GPUs, challenging the focus on massive computational resources. Its success lies in optimizing algorithms and systems rather than scaling computing power.

At the start of 2025, DeepSeek took the global AI community by storm. While OpenAI announced its $500 billion "Stargate" plan and Meta was building a data center with over 1.3 million GPUs, this Chinese team broke the traditional logic of the AI arms race. With just 2048 H800 GPUs, they trained a model in two months that rivals the best models worldwide.

This breakthrough has shaken even Nvidia’s trillion-dollar market value and sparked reflection across the industry: On the path to AGI, are we overly focused on scaling computational power, while overlooking a more pragmatic and innovative approach?

Unlike the "bigger is better" mindset of 2023, AI development in 2025 could become more like technical alchemy: maximizing model efficiency with minimal resources and achieving peak performance in specific scenarios. DeepSeek has already shown the power of this approach. Developers tend to choose cost-effective open-source solutions, and as countless applications are built on DeepSeek's foundation, the resulting ecosystem could reshape the entire AI industry.

China's AI media outlet "Machine Heart" invited Zhai Jidong, professor of computer science at Tsinghua University and director of its High-Performance Computing Institute, to discuss how AI computational power can be optimized in the era of large models. According to Professor Zhai, one of the key reasons DeepSeek achieved a roughly 100x improvement in computational efficiency was its deep innovation at the system software level.

“Performance optimization is an endless process,” said Professor Zhai. In a country facing resource challenges, increasing computational efficiency through software innovation is crucial to the industry’s breakthrough. This requires breakthroughs not only in programming languages, compilers, communication libraries, and frameworks, but also in building a complete software system.

A thought-provoking phenomenon today is that, despite the continuous rise in AI computational demand, many of China's intelligent computing centers remain underutilized. This supply-demand mismatch highlights a weakness in the foundational software systems.

But this challenge also presents an opportunity: How can we connect applications, system software, and homegrown chips to find a development path suited to China’s reality? This is not just a matter of technical innovation, but also a strategic decision.

As computational power becomes the key to AI competitiveness, how can we ensure that every ounce of computational power delivers its maximum value? The answer to this question is just as important as the question itself.

Machine Heart: Professor Zhai, welcome to "Wisdom Interview" on Machine Heart. There's been a lot of talk recently about new trends in the AI computational power market. First, many are asking whether the Scaling Law has hit a wall. Second, with the launch of OpenAI's o1/o3 models, it seems that increasing inference-time compute can significantly improve model performance. This makes us rethink how best to allocate computational power.

We’ve seen a growing focus on improving computational efficiency, and we're excited to have you here to discuss this from the perspective of system software.

DeepSeek's Inspiration: Endless Pursuit of Performance Optimization

Zhai Jidong: Thank you for the warm welcome. It's a great pleasure to be here. Dr. Ilya Sutskever once mentioned at a forum that the Scaling Law as we know it is nearing its end. I believe this can be examined from several angles. First, high-quality text data online is becoming scarcer, but there is still a lot of untapped potential in multimodal data, such as images and videos, which will have a huge impact on future model training.

Second, complex reasoning systems like OpenAI's o1/o3, which use reinforcement learning (RL) in post-training, are generating a lot of new data, contributing to a rise in computational demand. Third, training a base model today could take weeks or even months, but with more computational resources, a good model could be pre-trained in just days, greatly improving production efficiency.

Moreover, for end users, the pursuit of performance, including precision, is endless.

Machine Heart: DeepSeek has recently sparked widespread discussion in the industry for training a model comparable to top foreign models at a much lower cost. Based on publicly available information, where do you think the main improvements lie?

Zhai Jidong: The first key area is algorithmic innovation. They've adopted a new MoE architecture that combines shared experts with many fine-grained routed experts. By compressing general knowledge into the shared experts, they reduce redundant parameters and improve parameter efficiency. Partitioning the routed experts at a finer granularity, in turn, allows more flexible and targeted knowledge expression. Furthermore, their load-balancing algorithm alleviates the training inefficiency caused by load imbalance in traditional MoE models.
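To make the shared-plus-routed structure concrete, here is a minimal PyTorch sketch of such an MoE layer. The dimensions, expert counts, expert shapes, and softmax-top-k gating are illustrative assumptions for exposition, not DeepSeek's actual architecture or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Toy MoE layer: a few always-on shared experts plus many small
    routed experts, of which only top_k fire per token."""
    def __init__(self, dim=1024, n_shared=2, n_routed=64, top_k=6):
        super().__init__()
        # Shared experts: process every token, absorbing common knowledge.
        self.shared = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim))
            for _ in range(n_shared))
        # Fine-grained routed experts: each is small; specialization comes
        # from having many of them and activating only a few per token.
        self.routed = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(n_routed))
        self.gate = nn.Linear(dim, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):                      # x: (tokens, dim)
        out = sum(e(x) for e in self.shared)   # shared path, every token
        probs = F.softmax(self.gate(x), dim=-1)
        weight, idx = probs.topk(self.top_k, dim=-1)  # per-token expert choice
        for k in range(self.top_k):
            for e, expert in enumerate(self.routed):
                mask = idx[:, k] == e          # tokens whose k-th pick is expert e
                if mask.any():
                    out[mask] += weight[mask, k, None] * expert(x[mask])
        return out

y = MoELayer()(torch.randn(16, 1024))
print(y.shape)  # torch.Size([16, 1024])
```

The point of the structure is visible in the parameter math: every token pays for the shared experts plus only top_k of the routed ones, so total parameters can grow with n_routed while per-token compute stays nearly flat.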

On the system software side, DeepSeek has implemented a range of refined system engineering optimizations. For example, in parallel strategies, they use a bidirectional pipeline parallelism mechanism that overlaps computation with communication, shrinking the idle "bubbles" that pipeline parallelism normally introduces. In terms of computation, they use mixed precision such as FP8 to lower computational cost. In communication, they use low-precision communication strategies and token routing control to reduce communication overhead.
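As one illustration of the low-precision communication idea, the sketch below quantizes a tensor block-wise to FP8 before it would be sent and dequantizes it afterward. The block size, scaling rule, and function names are assumptions for exposition; they are not DeepSeek's actual kernels or communication path. It requires a recent PyTorch build with float8 dtypes.

```python
import torch

def quantize_fp8(x: torch.Tensor, block: int = 128):
    """Scale each block so its largest value maps near the FP8 E4M3
    maximum (~448), then cast. Returns the FP8 payload plus the scales."""
    blocks = x.reshape(-1, block)
    scale = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / 448.0
    return (blocks / scale).to(torch.float8_e4m3fn), scale

def dequantize_fp8(q: torch.Tensor, scale: torch.Tensor, shape) -> torch.Tensor:
    return (q.to(torch.float32) * scale).reshape(shape)

x = torch.randn(4, 1024)
q, s = quantize_fp8(x)
err = (dequantize_fp8(q, s, x.shape) - x).abs().max().item()
print(f"max abs round-trip error: {err:.4f}")
# The FP8 payload is 1 byte/element plus one scale per 128 elements,
# roughly halving traffic versus BF16 on a bandwidth-bound all-to-all.
```

Per-block scaling is what keeps the precision loss tolerable: a single global scale would let one outlier crush the resolution of every other value.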

These algorithmic and software innovations significantly reduced training costs. DeepSeek shows us how, even with limited computational power, algorithmic and software collaboration can fully unleash hardware performance—a crucial lesson for China's AI future.

Machine Heart: With the limited computational resources in China, does pursuing performance optimization conflict with model innovation? Can they coexist?

Zhai Jidong: From the system software perspective, performance optimization is largely decoupled from algorithm development, so the two don't conflict. These optimization techniques are just as applicable in environments with abundant computational resources: they could be applied in research settings in the U.S., and they won't hinder the development of higher-level models.

Machine Heart: There doesn’t seem to be an objective evaluation standard for computational power efficiency yet. From your perspective, how should we scientifically and objectively assess computational power utilization?

Zhai Jidong: That’s a great question. Some reports mention "GPU utilization," but it’s difficult to evaluate a system’s performance with a single metric, much like evaluating a person’s performance by just one dimension.

In large model training, GPU utilization is just one factor. Large clusters also include network and storage devices, so focusing only on GPU utilization while neglecting network and memory efficiency isn't ideal. We should aim for a balanced approach, optimizing network and memory use so that expensive GPU cycles aren't wasted.

Metrics also vary depending on the scenario. In training, we focus on the overall cluster (including accelerators, storage, network, etc.), while in inference, users care about latency and throughput. An often-overlooked but important metric is cost, particularly the cost per token. When we incorporate cost constraints, discussions on system throughput and latency become more practical. Lowering inference costs will be essential for widespread AI adoption.
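To make the cost-per-token metric concrete, here is a back-of-the-envelope calculation. All inputs below (GPU-hour price, replica size, throughput) are hypothetical figures chosen only to show the arithmetic, not measured numbers from DeepSeek or any real deployment.

```python
# Hypothetical inputs -- chosen only to demonstrate the arithmetic.
gpu_hour_price = 2.0      # USD per GPU-hour (assumed)
gpus_per_replica = 8      # GPUs serving one model replica (assumed)
throughput = 20_000       # output tokens per second per replica (assumed)

tokens_per_hour = throughput * 3600
cost_per_m_tokens = gpu_hour_price * gpus_per_replica / tokens_per_hour * 1e6
print(f"${cost_per_m_tokens:.3f} per million tokens")  # -> $0.222
# Better batching or KV-cache reuse raises throughput; fewer GPUs per
# replica cuts the numerator -- either way the cost per token falls,
# which is what makes throughput/latency trade-offs concrete.
```

Framing the metric this way shows why throughput and latency can't be discussed in isolation: aggressive batching raises throughput and lowers cost per token, but at the price of higher per-request latency.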

Machine Heart: Given the hardware differences between China and the U.S., there’s concern over whether we’ll develop separate software stacks or "tech trees." What’s your take on this?

Zhai Jidong: China faces different challenges in system software. In the U.S. and Europe, AI infrastructure is mostly built around NVIDIA GPUs. But in China, accessing the latest NVIDIA GPUs is a challenge.

NVIDIA's GPUs are popular because of their mature ecosystem. Their software stack has been developed over many years of collaboration with institutions like Tsinghua. In contrast, many of China’s AI chip companies have only been around for a few years, and there’s still a long way to go in optimizing compilers and multi-GPU communication.

We face a dual challenge: improving the ease of use of Chinese chips while compensating for lag in chip manufacturing processes. This makes system software optimization in China even more crucial.

Original: https://mp.weixin.qq.com/s/Elby5usJVFjEHU45MNDYWA