How Huawei Outperformed Nvidia’s AI Chips — Despite Being a Generation Behind

Huawei Technologies has found a workaround to U.S. tech sanctions by developing a powerful AI data center architecture that allows its Ascend chips to outperform Nvidia’s H800 GPUs on a specific AI workload.


According to a technical study, Huawei’s advanced data center architecture, CloudMatrix 384, enabled the company’s Ascend chips to surpass the performance of Nvidia’s H800 GPUs in running the DeepSeek R1 large language model (LLM).


The study, co-authored by researchers from Huawei and SiliconFlow—a Chinese AI infrastructure startup—describes CloudMatrix 384 as a "super node" designed specifically for intensive AI workloads.


Huawei believes CloudMatrix "will redefine the foundation of AI infrastructure," according to the research paper released this week. The system consists of 384 Ascend 910C neural processing units (NPUs) and 192 Kunpeng server CPUs, interconnected via a unified bus that offers ultra-high bandwidth and low latency, as reported by SCMP and seen by Al Arabiya Business.
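The published composition works out to exactly two NPUs per CPU. As a minimal sketch, assuming nothing beyond the figures stated above (the Python dataclass and its field names are ours, for illustration only):

```python
from dataclasses import dataclass

# Illustrative summary of the node composition reported in the paper;
# field names are ours, and only the figures stated above are used.
@dataclass(frozen=True)
class CloudMatrix384:
    ascend_910c_npus: int = 384
    kunpeng_cpus: int = 192
    interconnect: str = "unified bus (ultra-high bandwidth, low latency)"

node = CloudMatrix384()
print(node.ascend_910c_npus // node.kunpeng_cpus, "NPUs per CPU")  # 2
```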


The CloudMatrix-Infer service, optimized for LLM inference, leveraged this architecture to outperform some of the world’s top systems in running the DeepSeek R1 reasoning model, which has 671 billion parameters.


This architecture reflects Huawei's efforts to bypass the U.S. technology restrictions, as the company seeks to enhance AI system performance despite sanctions.


What Makes CloudMatrix Special?


Data centers are facilities that house high-capacity servers and data storage systems, with multiple power sources and high-bandwidth internet connections. Companies increasingly use them to host and manage AI computing infrastructure.


In the prefill phase—the stage where the model processes the initial prompt—CloudMatrix-Infer achieved a throughput of 6,688 tokens per second per NPU for prompts of 4,000 tokens. This equates to a computational efficiency of 4.45 tokens per second per TFLOPS (trillion floating-point operations per second).
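As a rough sanity check, dividing the reported throughput by the reported efficiency implies the per-NPU compute the authors normalized against, roughly 1,500 TFLOPS. Note that this figure is inferred here, not a published Ascend 910C specification:

```python
# Back-of-the-envelope check; the implied TFLOPS value is an inference
# from the two reported numbers, not a published chip specification.
prefill_throughput = 6688    # tokens per second per NPU (reported)
prefill_efficiency = 4.45    # tokens per second per TFLOPS (reported)
implied_tflops = prefill_throughput / prefill_efficiency
print(f"Implied per-NPU compute: {implied_tflops:,.0f} TFLOPS")  # ≈ 1,503
```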


Tokens are the basic units LLMs—such as those powering services like ChatGPT—use to process text. The number of tokens directly affects cost, processing time, and the amount of context the model can take into account when responding.
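To make the unit concrete, here is a short example using the open-source tiktoken tokenizer (an OpenAI library unrelated to the systems benchmarked above, chosen only because it is easy to run):

```python
# pip install tiktoken -- an open-source tokenizer library from OpenAI,
# used here purely to illustrate what "tokens" are.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("Huawei's CloudMatrix 384 is a 384-NPU super node.")
print(len(ids), ids)       # token count and the integer IDs a model consumes
print(enc.decode(ids))     # decoding round-trips back to the original text
```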


In the decoding phase, which generates the model's output, CloudMatrix achieved 1,943 tokens per second per NPU with a key-value (KV) cache of 4,000 tokens—a memory structure that stores attention state from earlier tokens so the processor does not recompute it for each new one.


This stage consistently kept output generation below 50 milliseconds per token, translating to an efficiency of 1.29 tokens per second per TFLOPS.
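Below is a minimal sketch of the key-value caching idea, assuming toy dimensions and single-head attention (production engines such as CloudMatrix-Infer shard and manage this cache in far more sophisticated ways):

```python
import numpy as np

d = 64                                  # head dimension (toy value)
rng = np.random.default_rng(0)
k_cache, v_cache = [], []               # grows by one entry per generated token

def decode_step(q, k_new, v_new):
    """Append the new token's key/value, then attend over the entire cache."""
    k_cache.append(k_new)
    v_cache.append(v_new)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)         # query similarity with each cached key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()            # softmax over cached positions
    return weights @ V                  # weighted sum of cached values

# Each step attends to the cache instead of reprocessing the whole history,
# which is what keeps per-token decode latency low at 4,000-token contexts.
for _ in range(4):
    out = decode_step(rng.standard_normal(d),
                      rng.standard_normal(d),
                      rng.standard_normal(d))
```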


According to the study, these performance metrics outperformed the SGLang framework running on Nvidia’s flagship H100 GPU, as well as another system running DeepSeek R1 on Nvidia’s H800 GPU.


A Strategic Milestone for Huawei


This is the first time Huawei has officially disclosed detailed performance data for its flagship AI accelerator, the Ascend 910C.


It also aligns with recent remarks by Huawei founder and CEO Ren Zhengfei, who acknowledged that the company’s Ascend chips are still a generation behind U.S. competitors. However, he emphasized that methods like stacking and clustering have enabled them to match the computing performance of the world’s leading systems.


Interestingly, Nvidia CEO Jensen Huang seemed to agree with this sentiment during a CNBC interview at VivaTech in Paris. He stated:


> "AI is a parallel problem, so if not every computer can handle it... just use more computers."




He added that China, with its vast energy resources, can use more chips to overcome hardware limitations, and noted that China remains a key strategic market for the U.S. due to its massive pool of AI talent and its status as the world’s second-largest economy.
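The "just use more computers" argument reduces to simple throughput arithmetic. A sketch of that reasoning (all chip counts and per-chip figures below are placeholders, not real specifications):

```python
# All numbers are illustrative placeholders, not real chip specifications.
def cluster_tflops(chips, tflops_per_chip, scaling_efficiency=0.9):
    """Aggregate compute, discounted by an assumed parallel-scaling factor."""
    return chips * tflops_per_chip * scaling_efficiency

# Many weaker chips can match fewer stronger ones, provided the interconnect
# keeps scaling efficiency high -- the role CloudMatrix's unified bus plays.
print(cluster_tflops(384, 1000))   # hypothetical older-generation parts: 345,600
print(cluster_tflops(256, 1500))   # hypothetical newer-generation parts: 345,600
```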
