Recent arXiv Papers on O1 and DeepSeek-R1 – Summary and Analysis
Introduction
OpenAI’s O1 (often referenced as OpenAI-o1-1217) is a state-of-the-art large language model, while DeepSeek-R1 is a newly released open-source reasoning LLM from DeepSeek-AI (January 2025). Over the past month, several arXiv papers have analyzed these models. We review the latest papers that mention both O1 and DeepSeek-R1, summarizing their main findings – including each paper’s contributions, experimental results, and key insights. We also compare the strengths and weaknesses of O1 and DeepSeek-R1 based on these analyses, and suggest promising research directions emerging from the findings.
DeepSeek-R1: Incentivizing Reasoning in LLMs via RL (DeepSeek-AI, Jan 2025)
Contributions & Approach: This paper introduces two “first-generation” reasoning models, DeepSeek-R1-Zero and DeepSeek-R1, trained with a novel reinforcement learning (RL) pipeline. Notably, DeepSeek-R1-Zero is trained purely via large-scale RL, without any supervised fine-tuning (SFT) beforehand. This approach allows the model to autonomously explore multi-step chain-of-thought reasoning. The authors report that DeepSeek-R1-Zero developed powerful reasoning behaviors (e.g. self-verification, reflection, very long solutions) purely from RL. This is the first open evidence that an LLM’s reasoning ability can be incentivized entirely through RL, without SFT as a preliminary step. However, the RL-only model had issues: its answers were often hard to read and mixed languages (e.g. interleaving Chinese and English). To address these weaknesses, the team introduced DeepSeek-R1, which uses a multi-stage training pipeline: a small amount of “cold-start” supervised data to seed good behavior, followed by two RL fine-tuning stages (one for reasoning skills and one for alignment with human preferences). They also performed knowledge distillation from DeepSeek-R1 into smaller models (1.5B–70B parameters) so that even compact models could inherit its reasoning patterns. All models (DeepSeek-R1-Zero, DeepSeek-R1, and six distilled versions from 1.5B to 70B) are released openly to support research.
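The RL stages reportedly use simple rule-based rewards rather than a learned reward model: an accuracy reward for a correct final answer plus a format reward for keeping the reasoning inside designated tags. The sketch below illustrates that idea in Python; the tag names, matching rules, and equal weighting are assumptions for illustration, not DeepSeek-AI’s actual implementation.

```python
import re

# Hedged sketch of a rule-based reward of the kind described for DeepSeek-R1-Zero:
# reward (a) following the <think>...</think><answer>...</answer> template and
# (b) producing the correct final answer. Details here are illustrative assumptions.

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the expected <think>/<answer> template."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.search(pattern, completion, flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the text inside <answer>...</answer> matches the reference answer."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference.strip() else 0.0

def total_reward(completion: str, reference: str) -> float:
    # Simple sum; the real reward shaping is not fully specified in the paper.
    return format_reward(completion) + accuracy_reward(completion, reference)

sample = "<think>2 + 2 equals 4 because ...</think> <answer>4</answer>"
print(total_reward(sample, "4"))  # -> 2.0
```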
Experimental Results: DeepSeek-R1 was evaluated on a wide range of benchmarks and compared to OpenAI’s O1 model and other baselines. DeepSeek-R1-Zero (the pure RL model) already achieves strong results across math, coding, and logic benchmarks _without any supervised data_. For example, with majority voting over 64 sampled answers it scores 86.7% on the AIME 2024 exam, exceeding the OpenAI o1-0912 comparison model on that test. However, R1-Zero’s Codeforces rating was lower, indicating weaker coding performance than O1 (1444 vs. ~1843 Elo). The refined model DeepSeek-R1 is even more competitive. It achieves 79.8% Pass@1 on the AIME 2024 math exam, slightly surpassing OpenAI’s O1 (~79.2%). On the challenging MATH-500 benchmark, DeepSeek-R1 reaches 97.3%, essentially on par with O1’s 96.4%. In coding, DeepSeek-R1 attains a Codeforces Elo rating of 2029, outperforming 96.3% of human participants and approaching O1’s rating of 2061. On knowledge-intensive evaluations like MMLU (multi-domain academic questions), DeepSeek-R1 scores 90.8%, slightly below O1’s **91.8%**. Similarly, on factual Q&A (the GPQA Diamond and SimpleQA benchmarks), DeepSeek-R1 trails O1 (71.5% vs. 75.7% on GPQA; 30.1% vs. 47.0% on SimpleQA). On some tasks DeepSeek-R1 even outperforms prior frontier models: for instance, it exceeds Claude-3.5 and GPT-4o on math benchmarks. The authors highlight that reinforcement learning was highly effective at instilling reasoning: DeepSeek-R1 matches O1’s level on many reasoning tasks despite a fraction of the training cost and data (benefiting from a Mixture-of-Experts architecture that reduces computation).
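For clarity on the metrics quoted throughout this section, the following sketch shows how Pass@1 and majority-vote (consensus) accuracy are typically computed from multiple sampled answers per problem; the sampling setup here is an assumption for illustration, not the paper’s exact evaluation harness.

```python
from collections import Counter

# Pass@1 averages correctness over k sampled answers per problem, while
# majority voting (e.g. "cons@64" in the R1 paper) grades only the most
# frequent sampled answer. k and the sampling temperature are assumptions.

def pass_at_1(samples: list[str], reference: str) -> float:
    """Fraction of sampled answers that are correct (averaged Pass@1 estimate)."""
    return sum(s.strip() == reference.strip() for s in samples) / len(samples)

def majority_vote(samples: list[str], reference: str) -> float:
    """1.0 if the most common sampled answer matches the reference, else 0.0."""
    most_common, _ = Counter(s.strip() for s in samples).most_common(1)[0]
    return 1.0 if most_common == reference.strip() else 0.0

# Example: 3 of 4 samples agree on the correct answer.
answers = ["42", "42", "41", "42"]
print(pass_at_1(answers, "42"))      # 0.75
print(majority_vote(answers, "42"))  # 1.0
```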
Key Insights: Reinforcement learning can reliably elicit complex reasoning in LLMs. The pure RL training led to emergent behaviors such as the model “rethinking” a problem mid-solution (“aha” moments) during training. However, some supervision was needed to fix formatting and language issues, which DeepSeek-R1 addressed with a hybrid RL+SFT approach. The comparisons show DeepSeek-R1 is competitive with OpenAI’s O1 across a wide range of tasks: it slightly surpasses O1 in mathematical reasoning and roughly matches it in coding, thanks to its intensive chain-of-thought training, while O1 maintains a small edge on certain knowledge and QA tasks. The paper also demonstrates that distillation of reasoning is highly effective: even a 1.5B-parameter distilled model (DeepSeek-R1-Distill-Qwen-1.5B) achieves 28.9% on AIME and 83.9% on MATH-500, outperforming much larger models like GPT-4o and Claude-3.5 on those math tasks. This suggests that complex reasoning strategies learned via RL can be transferred to much smaller models, which is promising for making advanced reasoning more accessible.
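Since the distillation step amounts to supervised fine-tuning of a small model on reasoning traces produced by the large one, a hedged sketch of how such a training set might be assembled is shown below. The `teacher_generate` callable and the crude substring-based correctness filter are illustrative assumptions, not DeepSeek-AI’s curation pipeline.

```python
import json

# Sketch of building a distillation SFT set: the teacher (DeepSeek-R1) writes
# long chain-of-thought solutions, only apparently-correct ones are kept, and
# the resulting JSONL can feed a standard SFT pipeline for a smaller student.

def build_distillation_set(problems, teacher_generate, out_path="distill_sft.jsonl"):
    kept = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for prob in problems:
            solution = teacher_generate(prob["question"])   # CoT + final answer
            if prob["answer"] in solution:                   # crude correctness filter
                record = {"prompt": prob["question"], "completion": solution}
                f.write(json.dumps(record, ensure_ascii=False) + "\n")
                kept += 1
    return kept  # number of retained teacher traces
```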
Strengths and Limitations: DeepSeek-R1’s strengths lie in its reasoning depth and open accessibility. It leverages long, structured solutions to attain high accuracy on complex problems, essentially matching a proprietary OpenAI model on many benchmarks. Unlike O1, it is openly released, along with multiple scaled-down versions, which is a significant contribution to the community. A noted weakness is efficiency and polish: the RL-only model initially had poor readability and would mix languages in its answers. While the final DeepSeek-R1 mitigates this, it remains prompt-sensitive – the authors observed that few-shot prompts degrade its performance and recommend zero-shot usage for best results. DeepSeek-R1 also did not substantially improve over its predecessor on some software-engineering tasks, because those long-horizon problems are slow and expensive to evaluate during RL training. The authors acknowledge that multi-turn dialogue, tool use (function calling), and structured outputs (such as JSON) are areas where DeepSeek-R1 still lags behind its predecessor DeepSeek-V3 (and likely behind OpenAI’s models). These limitations point to the need for further refinement beyond pure reasoning ability.
Token-Hungry, Yet Precise: DeepSeek-R1’s Multi-Step Reasoning vs Speed (Evstafev, Jan 2025)
Objective: This short empirical study focuses on DeepSeek-R1’s performance on challenging math problems from the MATH dataset. The author investigates whether DeepSeek-R1’s reasoning-centric approach can solve math problems that other models could not when given unlimited time and tokens. Many models had previously failed on these problems under strict output-length or time limits. Here, DeepSeek-R1 was allowed to generate very detailed, step-by-step solutions without time constraints, to see whether it could find correct answers via exhaustive reasoning. Its performance is compared against four other models of similar scale (including Gemini 1.5 Flash 8B, GPT-4o-mini, Llama 3.1 8B, and Mistral 8B) across a range of decoding settings.
Findings: DeepSeek-R1 solved problems that none of the other models could, demonstrating superior accuracy on 30 extremely challenging MATH questions. Notably, R1 achieved the highest accuracy across 11 different temperature settings, indicating its robustness when reasoning through difficult questions. The key trade-off observed is that DeepSeek-R1 is “token-hungry”: it produces far more tokens (longer explanations) than the other models to reach a solution. R1 often generated significantly longer chains of thought to crack problems, whereas the other models gave up or produced short (incorrect) answers. This confirms DeepSeek-R1’s intended design: it relies on extensive multi-step reasoning to maximize accuracy. However, this comes at the cost of efficiency – R1’s solutions are slower and longer, which may be impractical for applications needing quick responses. The author quantitatively illustrates the trade-off between accuracy and speed: DeepSeek-R1 outperforms the smaller models in accuracy by a large margin, but those models produce answers with far fewer tokens. The study emphasizes that for tasks like advanced math, accuracy gains from thorough reasoning outweigh the cost in latency, whereas for simpler tasks a faster, less verbose model might be preferable.
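For readers who want to reproduce this kind of comparison, the sketch below outlines the accuracy-versus-token-count measurement implied by the study: each model answers the same hard problems at several temperatures while correctness and output length are logged. The `generate` and `count_tokens` callables are placeholders for whatever inference stack is used, not the author’s actual code.

```python
# Hedged sketch of an accuracy-vs-token-count sweep across temperature settings.

def evaluate(model_name, problems, temperatures, generate, count_tokens):
    results = []
    for temp in temperatures:
        correct, tokens = 0, 0
        for prob in problems:
            answer_text = generate(model_name, prob["question"], temperature=temp)
            tokens += count_tokens(answer_text)
            if prob["answer"] in answer_text:   # crude correctness check
                correct += 1
        results.append({
            "temperature": temp,
            "accuracy": correct / len(problems),
            "avg_tokens": tokens / len(problems),  # the "token-hungry" axis
        })
    return results
```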
Key Insight: DeepSeek-R1’s strength lies in the thoroughness of its multi-step reasoning – it will use as many steps as needed to get the correct answer, which is why it succeeded on problems others missed. This highlights a broader point: long chain-of-thought reasoning can solve complex tasks that short answers cannot, but at the cost of speed and conciseness. The paper underscores the importance of choosing models based on task needs: if the task demands the highest accuracy on complex reasoning (e.g. difficult proofs or puzzles), a model like DeepSeek-R1 is ideal; if the task requires quick responses or is time-sensitive, a more efficient model may be better. In summary, DeepSeek-R1 excels in precision but at a computational cost, reinforcing the notion that current state-of-the-art reasoning models will need to balance reasoning depth with efficiency in future iterations.
Brief Analysis of DeepSeek-R1 and Implications for Generative AI (Mercer et al., Feb 2025)
Context and Goals: This article is a high-level “think piece” discussing the release of DeepSeek-R1 and what it signifies for the AI field. It places DeepSeek-R1 in the context of recent Chinese-developed models and ongoing GPU export restrictions. The authors note that DeepSeek-R1 was released in late January 2025 “at a fraction of the cost” of OpenAI’s models – despite Chinese labs facing U.S. GPU export bans – yet it remains _competitive with OpenAI’s best_. This accomplishment is analyzed as part of a trend: in recent weeks, several cutting-edge models from China have been released, all sharing some key strategies. The report briefly surveys these models (without going in depth on each) and identifies common technical themes: innovative use of Mixture-of-Experts (MoE) architectures, reinforcement learning (RL) for training, and other clever engineering tricks are “key factors” enabling these models’ capabilities. Essentially, Chinese research groups are leveraging smart algorithmic approaches to compensate for limited access to ultra-large hardware. DeepSeek-R1 is highlighted as a prime example, using MoE (37B active parameters out of 671B total) and large-scale RL to achieve performance on par with a much more expensively trained OpenAI model.
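To ground the MoE point, the toy PyTorch layer below shows the basic routing mechanism that makes such sparse models cheap to run: a router picks the top-k experts per token, so the active parameter count stays far below the total. All sizes and routing details are simplifications for illustration, not DeepSeek-V3/R1’s actual architecture (which additionally uses shared experts and load-balancing mechanisms).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Toy sparse MoE layer: only k of n_experts run per token."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                # x: (n_tokens, d_model)
        scores = self.router(x)                          # (n_tokens, n_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)         # mix only the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e            # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

layer = TinyMoELayer()
print(layer(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```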
Main Points: The report discusses how DeepSeek-R1’s release could shape the landscape of generative AI. First, it demonstrates that OpenAI’s dominance can be challenged by significantly lower-budget efforts – a development likely to spur more open competition. The fact that R1 is open-sourced is significant: labs worldwide can now study and build upon a model that nearly matches a top proprietary model (O1) in reasoning. The piece suggests that the ecosystem is diversifying, with non-Western labs innovating on architecture (e.g. MoE for efficient scaling) and optimization (RL for reasoning) to produce competitive models. Another point is the role of policy and hardware constraints: the U.S. GPU export ban could have stifled AI development, but instead it encouraged alternative solutions. By focusing on efficient algorithms and distributed training, models like DeepSeek-R1 managed to work around hardware limitations. This implies that future research may increasingly emphasize compute-efficient training techniques, which benefit the entire field by lowering barriers to entry. The authors also identify several further areas of research prompted by R1’s development (we detail some in Future Research Directions below). Overall, this analysis portrays DeepSeek-R1 not as an incremental model release but as a milestone illustrating how targeted innovations (like MoE and RL-driven reasoning) can yield state-of-the-art results without the massive resources that companies like OpenAI command. It calls on the community to learn from these approaches and foreshadows that continued progress in generative AI will come from both scale and algorithmic ingenuity.
To illustrate how DeepSeek-R1 stacks up against OpenAI’s O1, we compare their performance on selected benchmarks reported in the DeepSeek-R1 paper. Table 1 highlights key metrics across knowledge, reasoning, and coding tasks:
Table 1 – Key Benchmark Performance: DeepSeek-R1 vs OpenAI O1-1217

| Benchmark (Task) | Metric | OpenAI O1-1217 | DeepSeek-R1 |
|---|---|---|---|
| MMLU (knowledge) | Pass@1 (%) | 91.8 | 90.8 |
| GPQA Diamond (QA) | Pass@1 (%) | 75.7 | 71.5 |
| SimpleQA (factual) | Accuracy (%) | 47.0 | 30.1 |
| Codeforces (coding) | Elo rating | 2061 | 2029 |
| AIME 2024 (math) | Pass@1 (%) | 79.2 | 79.8 |
| MATH-500 (math) | Pass@1 (%) | 96.4 | 97.3 |
Strengths and Weaknesses of Each Model: Based on the above results and the analyses in the papers, we can identify the following for DeepSeek-R1 and O1:
- DeepSeek-R1 – Strengths: Exceptional multi-step reasoning and complex problem-solving ability (especially in math and logic) – it matches or exceeds O1 on advanced benchmarks like AIME and MATH-500. It has near state-of-the-art coding skills (competitive-programming Elo ~2029) despite a fraction of the training cost. The model is open-source and was trained with innovative techniques (pure RL and MoE), demonstrating a high return on computational investment. DeepSeek-R1’s open release, along with distilled smaller models, is a major strength for the research community, enabling wider access to a high-quality reasoning model.
- DeepSeek-R1 – Weaknesses: Its efficiency is a concern – R1 tends to generate very long explanations, consuming many tokens (and much time) to reach an answer. This “token-hungry” behavior means it may be less suited to real-time or succinct applications. R1 also shows sensitivity to prompt format (few-shot prompting hurts performance; a minimal zero-shot vs. few-shot illustration appears after this list), indicating it can be finicky outside of its tuned usage. In terms of breadth of capability, R1 is optimized for reasoning but less proficient in some general-purpose tasks: for example, it struggles with certain interactive or structured outputs (function calls, multi-turn dialogue) where the older DeepSeek-V3 or OpenAI’s models do better. While R1 handles English and Chinese well, it may mix languages or default to English on queries in other languages, reflecting limited multilingual training. Finally, R1’s factual recall is slightly lower than O1’s – for instance, it underperforms on simple QA tasks, suggesting its knowledge integration is not as strong as its reasoning (likely due to the focus on RL over massive knowledge ingestion).
- OpenAI O1 – Strengths: O1 (specifically the December 2024 “1217” version) represents one of OpenAI’s top models, and it excels as a well-rounded generalist. It holds a slight performance edge in knowledge-intensive and factual tasks, topping DeepSeek-R1 on benchmarks like MMLU and SimpleQA. O1 also achieves marginally higher performance in competitive coding (the highest Codeforces rating of the group), indicating strong problem-solving in coding domains. Unlike the RL-heavy R1, OpenAI’s model likely underwent extensive supervised fine-tuning and RLHF, giving it strong alignment with user needs and format requirements (for example, O1 does not exhibit the readability or prompt-sensitivity issues observed in R1). O1’s outputs also tend to be more concise, solving tasks with fewer tokens, since it was engineered to optimize user-facing quality. In summary, O1’s strength is its balanced excellence across tasks – it performs at or near state of the art in reasoning, coding, and knowledge, without major weak spots in usability.
- OpenAI O1 – Weaknesses: As a proprietary model, O1’s main drawbacks are external to performance – notably lack of accessibility and high development cost. The DeepSeek analysis notes that R1 achieved comparable results at a fraction of the cost, implying O1 required far greater computational resources for its training and RLHF. This heavy reliance on scale is a weakness in terms of reproducibility and research openness. O1 is closed-source and not publicly available, so researchers cannot inspect its architecture or fine-tune it, whereas R1 is openly released. Indeed, O1’s architecture and training details are not disclosed at all (in contrast to R1’s documented 671B-parameter MoE with 37B active parameters), so the community has little insight into how it achieves its performance. Another potential weakness is that O1 may not be as specialized in deep reasoning as R1 – R1’s RL training allows extremely long chains of thought, whereas O1 (optimized for deployment) may limit reasoning depth to maintain speed. The consequence is that in certain edge cases (like the hardest MATH problems), O1 could miss solutions that R1 finds through exhaustive reasoning. Overall, however, O1’s weaknesses are subtle; the papers make clear that OpenAI’s O1 remains a slightly higher benchmark on average, especially on knowledge and practical tasks, albeit at significantly greater training expense.
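As a small illustration of the prompt-sensitivity point above, the snippet below contrasts the zero-shot style recommended for DeepSeek-R1 with a few-shot prompt of the kind reported to hurt it. The exact instruction wording is an assumption for illustration, not the paper’s evaluation prompt.

```python
# Hypothetical zero-shot vs. few-shot prompts for an R1-style reasoning model.
question = "If 3x + 5 = 20, what is x?"

zero_shot_prompt = (  # recommended: describe the problem directly, no exemplars
    "Please reason step by step, and put your final answer in \\boxed{}.\n\n"
    f"Problem: {question}"
)

few_shot_prompt = (   # in-context exemplars of this kind reportedly degrade R1
    "Problem: What is 2 + 2?\nSolution: 2 + 2 = 4. Answer: 4\n\n"
    "Problem: What is 10 / 5?\nSolution: 10 / 5 = 2. Answer: 2\n\n"
    f"Problem: {question}\nSolution:"
)

print(zero_shot_prompt)
```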
Future Research Directions
Based on the findings and limitations highlighted in these papers, several interesting research directions emerge:
1. Enhancing General-Purpose Abilities of Reasoning Models: DeepSeek-R1 shows top-tier performance on focused reasoning tasks, but it falls short on general assistant capabilities like multi-turn dialogue, tool use (e.g. code execution or function calling), and output formatting. Future work should explore how to combine R1’s reasoning prowess with broader interactive skills. For example, integrating chain-of-thought reasoning into dialogue agents or enabling R1 to follow API-call instructions could produce models that are both smart and practically useful. This may involve additional supervised fine-tuning on conversational data or multi-modal training to handle complex interactions.
2. Multi-Lingual and Cross-Lingual Reasoning: DeepSeek-R1 currently handles English and Chinese well, but it struggles with queries in other languages (tending to respond in English). Research could focus on training reasoning models that maintain language fidelity – e.g. performing chain-of-thought reasoning in the user’s language. Approaches might include multilingual RL training data, or adding translation and language-identification mechanisms so that reasoning remains grounded in the input language. Achieving robust multi-lingual reasoning would broaden the applicability of models like R1 globally.
3. Improving Efficiency of Chain-of-Thought Reasoning: The “token-hungry” nature of DeepSeek-R1 highlights a need for more efficient reasoning strategies. One research avenue is to develop methods that preserve R1’s accuracy while reducing the length of its solutions (and thus inference time). Ideas include optimized planning that avoids redundant steps, or adaptive computation that allocates just enough reasoning to each problem. Another idea is step refinement: having the model generate a concise plan first, then expand it only if needed. The goal is to resolve the trade-off so that models deliver fast, concise answers for easy queries and detailed solutions for hard ones, without always defaulting to maximum length. Techniques like dynamic stopping criteria for chain-of-thought or token-budget awareness during RL training could be explored (a minimal sketch of such a stopping criterion appears after this list).
4. Robust Prompting and Few-Shot Generalization: The sensitivity of DeepSeek-R1 to prompting suggests that future models should be trained or adapted to be more prompt-robust. Research can investigate why few-shot prompts degrade R1’s performance (e.g. does the reasoning provided in the prompt conflict with its learned strategy?) and how to fix this. Potential directions include RL or adversarial training in which the model is exposed to varied prompt styles and must learn to handle them gracefully. Additionally, enabling models to learn from very limited examples on the fly (improving few-shot learning) without losing their core reasoning ability would make them more flexible. This might involve meta-learning techniques applied on top of the RL-honed model.
5. Scaling Mixture-of-Experts and RL Further: The reviewed papers credit Mixture-of-Experts (MoE) architectures and reinforcement learning as key to recent breakthroughs. An important research direction is to study how far this paradigm can be pushed. For example, can even larger MoE models (with thousands of experts) be trained via RL for reasoning, and what challenges arise (e.g. stability, credit assignment to experts)? Investigating hybrid models that use MoE for some functions (like knowledge retrieval) and dense layers for others (like fine-grained reasoning) could also yield efficiency gains. Further analysis of DeepSeek-R1’s training could reveal how expert gating is utilized during reasoning, guiding improvements in MoE design. Ultimately, understanding the interplay between model architecture (MoE) and training algorithm (RL) will help build next-generation models that are both powerful and resource-efficient.
6. Resource-Efficient Training & Democratization of LLMs: A broader direction, underscored by DeepSeek-R1’s success under hardware constraints, is developing techniques that reduce reliance on giant compute clusters. This includes better algorithms for distributed RL (so that many smaller GPUs can replace a few large ones), quantization and memory optimization for huge models, and leveraging transfer learning (as seen in R1’s distillation) to reuse knowledge. The goal is to lower the entry barrier for cutting-edge LLM research. As noted in the analysis, R1’s development under a GPU export ban is a case study in ingenuity – future research could formalize these ad-hoc strategies into a coherent methodology for training competitive models with limited resources. This will likely involve interdisciplinary work blending systems engineering with machine learning (for example, asynchronous RL algorithms that make better use of hardware, or curriculum learning to maximize sample efficiency). By making training more accessible, we ensure that more research groups can contribute innovations, echoing the open-source spirit of DeepSeek-R1.
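To make direction 3 concrete, here is a minimal, hypothetical sketch of a budget-aware decoding loop: the chain of thought is generated in chunks and cut off once a final-answer marker appears or a token budget is exhausted. The `step_generate` callable and the `\boxed{` marker are assumptions for illustration, not an existing API or a proposal from the reviewed papers.

```python
# Sketch of a dynamic stopping criterion for chain-of-thought generation.

def budgeted_chain_of_thought(prompt, step_generate, max_tokens=2048,
                              chunk=128, answer_marker="\\boxed{"):
    text, used = "", 0
    while used < max_tokens:
        piece = step_generate(prompt + text, max_new_tokens=chunk)
        if not piece:                 # model stopped on its own
            break
        text += piece
        used += chunk                 # approximation; real code would count tokens
        if answer_marker in piece:    # final answer reached: stop reasoning early
            break
    return text, used

# Easy queries should terminate well under the budget; hard ones may use all of it.
```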
Conclusion: The emergence of DeepSeek-R1 and its comparison with OpenAI’s O1 over the last month has provided valuable insights. Through clever training (RL-driven reasoning) and architecture (MoE), an open model can closely match a top-tier proprietary model. Each model exhibits distinct strengths – O1 excels in broad knowledge and polish, while R1 leads in rigorous reasoning – and understanding these differences can guide where to focus improvements. Going forward, integrating the best of both (the generalist and the specialist) is an exciting challenge. The research directions outlined above aim to address current weaknesses and open new avenues, from making reasoning models more efficient and versatile to pushing the boundaries of training paradigms. As the field advances, the synergy between community-driven efforts (like DeepSeek-AI’s) and established industry research will likely accelerate progress toward more powerful, versatile, and accessible AI systems. The rapid developments of just the last month suggest that 2025 will be a year of significant leaps in reasoning-capable AI, driven by both competitive benchmarking and collaborative innovation.