Closed-Door Discussion Between Chinese and American AI Entrepreneurs: Changes and New Trends in AI Startups After DeepSeek-R1
Where does DeepSeek's innovation lie? Why are DeepSeek's costs so low? This article will explore these questions in detail.
DeepSeek has undoubtedly become the focal point at the beginning of 2025. From topping the free app charts on the Apple App Store to various cloud providers rushing to deploy DeepSeek-R1, DeepSeek has even become the first AI product for many users to experience.
On February 2, Founder Park organized a closed-door discussion, inviting over 60 founders and tech experts from AI companies across Silicon Valley, China, London, Singapore, Japan, and other regions. The discussion covered technical innovations, product implementation, the shortage of computing power, and other aspects, providing an in-depth exploration of the new technical directions and product trends triggered by DeepSeek.
Below are the key takeaways from this closed-door discussion.
Where is the Innovation in DeepSeek?
At the end of December 2024, DeepSeek released its V3 base model, a large MoE (Mixture of Experts) model and one of the most powerful open-source models in the industry, with 671 billion total parameters of which 37 billion are activated per token.
The "Aha moment" of the R1 model, released in January 2025, refers to the model's ability to exhibit a certain degree of self-reflection during inference. For example, during problem-solving, the model might realize that a certain method is no longer applicable and adjust to a more effective approach. This self-reflection ability comes from reinforcement learning (RL).
R1 is DeepSeek's flagship model, with reasoning capabilities on par with OpenAI's o1. Its training can be summarized as two stages of reinforcement learning (RL) interleaved with two stages of supervised fine-tuning (SFT). The RL and SFT in the first two stages mainly serve to build a teacher model that generates data for the third stage, and the overall goal is to produce the most powerful reasoning model currently available.
The core innovation of the DeepSeek R1-Zero model lies in skipping the traditional supervised fine-tuning (SFT) stage and optimizing reasoning directly through reinforcement learning (RL). In addition, using DeepSeek R1 as a teacher model to distill smaller open-source models (such as Qwen 1.5B/7B/14B/32B) can significantly improve the capabilities of those smaller models.
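As a rough illustration of this kind of distillation, here is a minimal sketch that fine-tunes a small student model on teacher-generated reasoning traces with a standard Hugging Face SFT loop; the student checkpoint, data format, and hyperparameters are illustrative assumptions, not DeepSeek's actual recipe.

```python
# Minimal sketch: distilling teacher reasoning traces into a small student model via SFT.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

# Teacher-generated (prompt, chain-of-thought, answer) traces -- in practice these
# would be sampled from DeepSeek-R1 and filtered for correctness before training.
traces = [
    {"text": "Question: What is 17 * 24?\n"
             "<think>17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408</think>\n"
             "Answer: 408"},
    # ... many thousands more verified traces
]

student_name = "Qwen/Qwen2.5-7B"  # illustrative student checkpoint
tokenizer = AutoTokenizer.from_pretrained(student_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(student_name)

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=2048)

dataset = Dataset.from_list(traces).map(tokenize, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="distilled-student", num_train_epochs=2,
                           per_device_train_batch_size=4, learning_rate=1e-5),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```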
In terms of coding ability, DeepSeek's R1 is on par with OpenAI's newly released o3-mini, which is slightly stronger overall. The difference is that R1 is open-source, which encourages more applications to adopt it.
The core of DeepSeek's success lies in a highly integrated engineering solution that reduces costs. When breaking down their methods, each approach can be found in last year's papers, but DeepSeek applies the latest methods aggressively. These methods themselves may have side effects, such as extra storage overhead, but they significantly reduce cluster idle time.
For a model designed to serve a large-scale audience, the MLA (Multi-head Latent Attention) architecture might actually have side effects. Many of DeepSeek's methods deliver their full performance gains only in specific scenarios and environments; used independently, they may backfire. The system design is intricate enough that taking any single technique out of context would not produce the same results.
It’s not advisable to train only a process reward model because doing so could lead to overfitting, and the final results may not meet expectations. DeepSeek opted for the most original reinforcement learning method, using heuristic rules to score the final results, then using traditional reinforcement learning to adjust the process. This approach was developed through trial and error, aided by DeepSeek's highly efficient infrastructure.
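As a sketch of what scoring only the final result with heuristic rules could look like, the snippet below combines a format check with a verifiable-answer check into a single scalar reward; the specific rules and weights are illustrative assumptions, not DeepSeek's actual reward design.

```python
import re

# Sketch of a rule-based outcome reward: score only the final result with
# heuristics (format + correctness), instead of training a process reward model
# over intermediate steps. Rules and weights here are illustrative.

def outcome_reward(response: str, reference_answer: str) -> float:
    reward = 0.0

    # Format rule: reasoning should be wrapped in <think>...</think>,
    # followed by a final "Answer:" line.
    has_think = bool(re.search(r"<think>.*?</think>", response, flags=re.S))
    answer_match = re.search(r"Answer:\s*(.+)", response)
    if has_think and answer_match:
        reward += 0.2  # small bonus for following the expected format

    # Accuracy rule: exact match against a verifiable reference answer
    # (suitable for math- or code-style tasks with a single correct result).
    if answer_match and answer_match.group(1).strip() == reference_answer.strip():
        reward += 1.0

    return reward

# Each sampled rollout receives a scalar score; a standard policy-gradient step
# (e.g., PPO or GRPO) then pushes the policy toward higher-reward outputs.
print(outcome_reward("<think>21 * 2 = 42</think>\nAnswer: 42", "42"))  # 1.2
```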
Even though DeepSeek hasn’t released its inference code, other teams can roughly deduce which methods were used. The open-source model weights are sufficient for other teams to replicate its performance, but the challenge lies in figuring out how to replicate some of the specific configurations, which requires time.
Reward models based solely on data annotations are unlikely to achieve superhuman intelligence. A reward model based on real-world data or feedback from real environments is needed to realize more advanced reward optimization and, ultimately, superhuman intelligence capabilities.
Technical Speculation: If the base model itself has strong generalization capabilities, combining this with mathematical and coding abilities will enhance its generalization power. For instance, if there is an intelligent base model that performs well in writing, adding reinforcement learning in mathematics and coding could result in strong generalization, ultimately leading to very powerful abilities. Specifically, this could enable the model to write a range of works, from parallel prose to regulated poems, whereas other models might not perform as well in this area.
Why is DeepSeek’s Cost So Low?
The model's sparsity is extremely high. Despite being a model with over 600 billion parameters, the actual number of activated parameters for each token during inference is quite small, only 37 billion. This means that during inference, its speed and resource consumption are similar to those of a 37-billion-parameter model. However, achieving this requires significant design changes across the entire system.
In DeepSeek V3, the MoE architecture contains 256 expert modules, but only a small subset of them is activated during each inference. Under high load, it can dynamically adjust resource usage, theoretically reducing costs to 1/256 of the original. This design reflects DeepSeek’s forward-thinking approach to software architecture. If system optimization is done well, prices can be significantly reduced at the same scale.
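As a toy illustration of why this sparsity keeps per-token compute low, the sketch below implements top-k expert routing in NumPy; the dimensions and the value of k are illustrative and simplified (DeepSeek-V3 additionally uses a shared expert and a more elaborate router), not the model's exact configuration.

```python
import numpy as np

# Toy sketch of MoE top-k routing: each token runs through only k of the
# num_experts experts, so per-token compute scales with k, not with 256.
# Sizes below are illustrative, not DeepSeek-V3's actual configuration.

rng = np.random.default_rng(0)
d_model, num_experts, k = 64, 256, 8

router_w = rng.standard_normal((d_model, num_experts)) * 0.02
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(num_experts)]

def moe_layer(x):                       # x: one token, shape (d_model,)
    logits = x @ router_w               # router score for every expert
    top = np.argsort(logits)[-k:]       # indices of the k highest-scoring experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over top-k
    # Only k expert matmuls run for this token; the other experts stay idle.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

token = rng.standard_normal(d_model)
print("output shape:", moe_layer(token).shape, f"-- activated {k}/{num_experts} experts")
```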
Model training generally involves three axes of parallelism, that is, partitioning the workload along three dimensions. The first is data-level parallelism, known as Data Parallelism. The second is model-level parallelism, in which the model is split by layers into stages placed on different devices, known as Pipeline Parallelism. The third is partitioning the weights within each layer across different GPUs, called Tensor Parallelism. To accommodate the sparse model design, DeepSeek made significant adjustments to the training framework and pipeline. During training, they discarded Tensor Parallelism, relying solely on Data Parallelism and Pipeline Parallelism, and further refined Expert Parallelism. By partitioning the experts finely (up to 256), DeepSeek distributes different experts across different GPUs. Moreover, discarding communication-heavy Tensor Parallelism sidesteps the H800's reduced interconnect bandwidth, allowing its training performance to be nearly identical to the H100's.
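A toy sketch of the expert-parallel placement idea follows: each GPU in the expert-parallel group holds only a slice of the 256 experts, and routed tokens are dispatched to whichever devices own their experts; the group size and the round-robin mapping are illustrative assumptions, not DeepSeek's actual scheme.

```python
# Toy sketch of expert parallelism: spread 256 experts across an expert-parallel
# group of GPUs so each device stores only a slice of the MoE weights, while
# data and pipeline parallelism are handled separately. Mapping is illustrative.

NUM_EXPERTS = 256
NUM_GPUS = 32  # illustrative expert-parallel group size

def expert_to_gpu(expert_id: int) -> int:
    """Round-robin placement: expert i lives on GPU i % NUM_GPUS."""
    return expert_id % NUM_GPUS

# Each GPU ends up holding NUM_EXPERTS / NUM_GPUS experts (8 here). At runtime,
# a token's routed expert IDs determine which GPUs it must be dispatched to.
placement: dict[int, list[int]] = {}
for e in range(NUM_EXPERTS):
    placement.setdefault(expert_to_gpu(e), []).append(e)

print({gpu: len(expert_ids) for gpu, expert_ids in list(placement.items())[:4]})
# {0: 8, 1: 8, 2: 8, 3: 8}
```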
In terms of model deployment, experiments show that its computational costs are controllable, and the technical difficulty is not high. Typically, replication can be completed in just one to two weeks, which is very advantageous for many application developers.
A potential model architecture: Instead of limiting reasoning RL to large language models themselves, adding an external "thinking machine" to complete the entire reasoning capability could reduce the overall cost by several orders of magnitude.
Vertical Scene AI Deployment Becomes Easier
For relatively vertical tasks, task evaluation can be achieved through rule systems without relying on complex reward models. For predefined vertical tasks, setups like TinyZero or a 7B model can quickly produce usable results.
For a predefined vertical task, a model distilled from DeepSeek with 7 billion parameters or more can be trained to reach the "Aha moment" quickly. From a cost perspective, simple arithmetic tasks or tasks with clear answers, such as blackjack-style (21-point) games, require only 2-4 H100 or H200 GPUs on a 7B model, and the model can converge to a usable state in less than half a day.
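For tasks like these, the reward can simply be a verifier over the final answer. The sketch below checks a countdown-style arithmetic result, in the spirit of TinyZero's setup; the task format and scoring rules are illustrative assumptions, not any team's exact configuration.

```python
import re

# Sketch of a verifiable reward for a countdown-style arithmetic task: the model
# must combine the given numbers into an expression equal to the target.
# The prompt format and rules are illustrative only.

def countdown_reward(response: str, numbers: list[int], target: int) -> float:
    match = re.search(r"Answer:\s*([\d\s\+\-\*/\(\)]+)", response)
    if not match:
        return 0.0
    expr = match.group(1).strip()

    # Rule 1: the expression must use each provided number exactly once.
    used = sorted(int(n) for n in re.findall(r"\d+", expr))
    if used != sorted(numbers):
        return 0.0

    # Rule 2: the expression must evaluate to the target value.
    try:
        value = eval(expr, {"__builtins__": {}}, {})  # digits/operators only, enforced by the regex above
    except Exception:
        return 0.0
    return 1.0 if abs(value - target) < 1e-6 else 0.0

print(countdown_reward("<think>6 times 4 is 24</think>\nAnswer: 6 * 4", [4, 6], 24))  # 1.0
```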
In vertical fields, especially for tasks with clear answers, such as mathematical calculations or physical rule judgments (e.g., object placement, determining if motion follows certain rules), DeepSeek R1's performance is indeed superior to other models, and its cost is controllable. Therefore, it can be applied widely in vertical fields. However, for tasks without clear answers, such as judging whether something is aesthetically pleasing or whether an answer makes people happy, these more subjective evaluations cannot be effectively solved through rule-based methods. This area may require waiting three to six months for better methods to emerge that can address these issues.
With supervised fine-tuning (SFT) or similar methods, building datasets is time-consuming, and the domain distribution of those datasets often fails to cover every level of a task comprehensively. Now, with a better toolset and a high-quality model, the earlier difficulties in data collection and in vertical tasks with clear answers can be addressed.
While rule-based systems can define relatively clear rules for tasks like mathematics and coding, they become very difficult to apply to more complex or open-ended tasks. As a result, researchers may eventually explore more suitable models to evaluate the results of these complex scenarios. Methods like ORM (Outcome Reward Model) might be explored instead of PRM (Process Reward Model), or other similar approaches. Ultimately, a "world model"-like simulator might be built to provide better feedback for decision-making across various models.
When training reasoning abilities with smaller models, there may be no need to rely on token-based solutions at all. In one e-commerce solution, the reasoning capability was separated out of the Transformer-based model and handled by a smaller model, which worked together with the Transformer to accomplish the whole task.
For companies developing models for their own use (such as hedge funds), the challenge lies in the cost. Large companies can offset costs by bringing in clients, but small teams or companies struggle to bear the high development costs. DeepSeek’s open-source approach is highly significant for them, as teams previously unable to afford high R&D costs can now build models.
In the financial sector, especially in quantitative funds, analyzing large volumes of financial data, such as company financial reports and Bloomberg data, is common. These companies typically build their own datasets and perform supervised training, but the cost of data labeling is very high. For these companies, the application of reinforcement learning (RL) during the fine-tuning stage can significantly improve model performance, leading to a qualitative leap.
More Powerful Agents and Cross-Application Invocation Capabilities
For many vertical fields, the capabilities of agents will improve significantly. One can start with a base model and structure some domain rules into a rule model, which may be a purely engineering solution; the base model can then iterate and train against this rule model. The result may be that some superhuman capabilities start to emerge. On top of this, further preference tuning can make the responses more human-readable, yielding a stronger reasoning agent within a specific vertical domain.
However, this raises a potential issue: you may not end up with an agent that generalizes strongly across all vertical fields. After being trained for a specific field, an agent may only function effectively within that domain and fail to generalize to others. But this is a practical direction, since DeepSeek's inference cost is very low: one can choose a model, run a series of reinforcement learning passes, and the resulting agent will serve one specific vertical domain without concerning itself with others. For vertical AI companies, this could be an acceptable solution.
From an academic perspective, an important trend over the next year is that existing methods in reinforcement learning will be transferred to large-model applications to solve the current problems of insufficient generalization and inaccurate evaluation, further improving model performance and generalization ability. With the application of reinforcement learning, the ability to output structured information will be greatly enhanced, ultimately better supporting various application scenarios, especially the generation of charts and other structured content.
More and more people will be able to use R1 for post-training, enabling everyone to create their own agents. The model layer will evolve into different agent models, each using different tools to solve problems in different domains, ultimately achieving a multi-agent system.
2025 could mark the dawn of the agent era, with many companies launching agents capable of task planning. However, there is currently a lack of sufficient data to support these tasks. For example, planning tasks could involve helping users order food, book travel, or check ticket availability for tourist attractions. These tasks require massive data and reward mechanisms to evaluate the model's accuracy: for example, how to plan a trip to Zhangjiajie, how to judge which plans are correct or incorrect, and how to train the model on that feedback. These challenges will become research hotspots in the near future, and reasoning capabilities will ultimately be used to solve practical problems.
In 2025, cross-application invocation capabilities will become a hot topic. In the Android ecosystem, thanks to its open-source nature, developers can achieve cross-application operations through low-level permissions. In the future, agents will be able to control your browser, phone, computer, and other devices. In the Apple ecosystem, however, strict permission management means agents face significant challenges in fully controlling all applications on a device; Apple itself would need to develop agents capable of controlling all apps. And while Android is open-source, opening up low-level permissions on devices such as smartphones, tablets, and computers still requires collaboration with manufacturers like OPPO and Huawei, enabling data acquisition and supporting the development of intelligent agents.