China’s first general-purpose GPU chip with 77 billion transistors, Biren Technology BR100 debuts overseas – Hardware – cnBeta.COM

by admin

On August 9, Biren Technology (Birentech), a Chinese technology innovation company, officially released its BR100 series of general-purpose computing GPUs, claiming the highest computing power of any GPU in China and specifications that match or even surpass those of international flagship products. On August 22, local time, the first day of the 34th Hot Chips conference, the NVIDIA Hopper, AMD Instinct MI200, and Intel Ponte Vecchio general-purpose GPUs were presented one after another, and Biren Technology's BR100 appeared alongside them.

At the conference, Hong Zhou, co-founder and CTO of Biren Technology, and Xu Lingjie, co-founder and president of Biren Technology, gave a keynote titled “Biren BR100 GPGPU: Accelerating Datacenter Scale AI Computing”, presenting the features of the BR100 chip and the details of its in-house chip architecture to a professional audience from around the world.

According to the presentation, BR100 is a GPGPU chip designed mainly to accelerate general-purpose computing at data center scale, and it offers extremely high compute density: a single card delivers 16-bit floating-point performance at the PFLOPS level, backed by high-speed on-chip and off-chip interconnect bandwidth.

BR100 is built on a 7nm process using a chiplet design and CoWoS 2.5D packaging. It is deployed as an OAM module, and eight cards can be connected in a point-to-point, fully interconnected topology on a standard UBB baseboard.
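To make "8-card point-to-point full interconnect" concrete: in a full mesh, every card has a direct link to every other card. The minimal sketch below enumerates those links, assuming exactly one direct link per card pair; the article does not state the number of physical links or the per-link bandwidth, and `full_mesh` is a hypothetical helper for illustration only.

```python
from itertools import combinations

def full_mesh(num_cards: int) -> list[tuple[int, int]]:
    """Return one (a, b) link for every unordered pair of cards.

    Assumes a single direct link per pair (an illustrative simplification).
    """
    return list(combinations(range(num_cards), 2))

links = full_mesh(8)
print(len(links))  # n * (n - 1) / 2 links for n cards
for card in range(8):
    # Each card connects directly to every other card.
    peers = [b if a == card else a for a, b in links if card in (a, b)]
    print(card, "->", peers)
```

For eight cards this gives 8 × 7 / 2 = 28 links, with each card having 7 direct peers, so any two cards can talk in a single hop.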

To feed this compute, BR100 is equipped with more than 300MB of on-chip cache for staging and reusing data, along with 64GB of HBM2E high-speed memory.

Its compute core consists of a large number of general-purpose stream processors, which provide both general-purpose computing and dedicated tensor acceleration based on a 2.5D GEMM architecture.
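GEMM (general matrix multiplication, C = A × B) is the core operation that tensor accelerators speed up. The sketch below is a plain blocked (tiled) GEMM in Python, shown only to illustrate the idea of splitting a matrix multiply into tiles that can be kept in fast local storage and reused. It is not a model of Biren's 2.5D GEMM architecture, whose internals the article does not describe; `blocked_gemm` and the `tile` parameter are hypothetical names.

```python
def blocked_gemm(a: list[list[float]], b: list[list[float]], tile: int = 2) -> list[list[float]]:
    """Compute C = A x B by iterating over square tiles of the output and of K.

    A is M x K, B is K x N. Each (i0, j0, k0) step multiplies one tile of A
    by one tile of B and accumulates into the matching tile of C.
    """
    m, k = len(a), len(a[0])
    if len(b) != k:
        raise ValueError("inner dimensions must match")
    n = len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                # Multiply-accumulate one tile pair into the output tile.
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c

print(blocked_gemm([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]]))
```

The tile loop order only changes how data is visited, not the result; hardware designs choose tile shapes so that operands stay in on-chip storage while they are reused.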

At the architecture level, Biren Technology provides a series of dataflow enhancements tailored to the computing characteristics of general-purpose workloads such as deep learning, including a dedicated C-Warp cooperative concurrency mode, a tensor data access accelerator (TDA), NUMA/UMA memory access modes, and near-memory computing. According to Biren, these features are the key to BR100 reaching world-leading levels of computing power and energy efficiency.

In addition, Biren Technology introduced a new TF32+ data type that offers higher precision than the TF32 data type.
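TF32 keeps FP32's 8-bit exponent but only 10 explicit mantissa bits, so "higher precision" for a TF32-style format means keeping more mantissa bits. The sketch below emulates such formats by rounding an FP32 value to a given mantissa width (round-to-nearest-even). The 15-bit mantissa used for TF32+ here is an assumption for illustration; the article does not give TF32+'s bit layout, and `round_to_mantissa` is a hypothetical helper.

```python
import struct

FP32_MANTISSA_BITS = 23
TF32_MANTISSA_BITS = 10        # TF32: 1 sign, 8 exponent, 10 mantissa bits
TF32_PLUS_MANTISSA_BITS = 15   # ASSUMPTION for illustration, not stated in the article

def round_to_mantissa(x: float, mantissa_bits: int) -> float:
    """Round x to FP32, then keep only `mantissa_bits` explicit mantissa bits.

    Uses round-to-nearest-even and keeps FP32's 8-bit exponent.
    Intended for finite values within FP32 range.
    """
    if not 0 < mantissa_bits < FP32_MANTISSA_BITS:
        raise ValueError("mantissa_bits must be between 1 and 22")
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    drop = FP32_MANTISSA_BITS - mantissa_bits
    lsb = (bits >> drop) & 1                # last kept bit, for ties-to-even
    bias = (1 << (drop - 1)) - 1 + lsb      # rounds halfway cases to even
    bits = ((bits + bias) >> drop) << drop  # clear the dropped mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

x = 1 / 3
for name, m in [("TF32", TF32_MANTISSA_BITS), ("TF32+ (assumed)", TF32_PLUS_MANTISSA_BITS)]:
    y = round_to_mantissa(x, m)
    print(f"{name}: {y!r}, abs error {abs(y - x):.3e}")
```

Each extra mantissa bit halves the rounding step, so the wider format shows a noticeably smaller error for values like 1/3 while keeping the same exponent range as FP32 and TF32.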

On the software side, Biren Technology also introduced the BIRENSUPA™ software stack. Its core programming model provides a C/C++ programming interface and a runtime API, in a style similar to mainstream GPGPU development languages and programming paradigms.

According to Biren, this lets developers program the BR100 easily and greatly reduces the effort of porting code, enabling seamless migration from mainstream programming environments to the BIRENSUPA platform.

According to published figures, the Biren Technology BR100 integrates as many as 77 billion transistors, comparable in scale to the number of nerve cells in the human brain and very close to the 80 billion transistors of NVIDIA's GH100 compute die. The BR100 series chips also powered on successfully on the first attempt.

In terms of performance, BR100 delivers 2048 TOPS of INT8 integer compute (2,048 trillion operations per second), 1024 TFLOPS of BF16 floating-point compute (1,024 trillion floating-point operations per second), 512 TFLOPS of TF32+ floating-point compute (512 trillion per second), and 256 TFLOPS of FP32 single-precision floating-point compute (256 trillion per second).
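One way to read peak figures like these is to estimate the shortest possible time for a matrix multiply, which needs 2 × M × N × K operations (one multiply and one add per term). The sketch below assumes 100% utilization of the quoted peaks, which real workloads do not reach, and ignores memory and interconnect limits; `PEAK_OPS_PER_SEC` and `ideal_gemm_seconds` are hypothetical names.

```python
# Peak throughput quoted in the article, in operations per second.
PEAK_OPS_PER_SEC = {
    "INT8": 2048e12,   # 2048 TOPS (integer operations)
    "BF16": 1024e12,   # 1024 TFLOPS
    "TF32+": 512e12,   # 512 TFLOPS
    "FP32": 256e12,    # 256 TFLOPS
}

def gemm_ops(m: int, n: int, k: int) -> int:
    """Multiply-add count for C (m x n) = A (m x k) x B (k x n), counted as 2 ops each."""
    return 2 * m * n * k

def ideal_gemm_seconds(m: int, n: int, k: int, dtype: str) -> float:
    """Lower bound on runtime, assuming the quoted peak is sustained (an idealization)."""
    return gemm_ops(m, n, k) / PEAK_OPS_PER_SEC[dtype]

for dtype in PEAK_OPS_PER_SEC:
    t = ideal_gemm_seconds(8192, 8192, 8192, dtype)
    print(f"{dtype}: {t * 1e3:.2f} ms")  # BF16 line prints 1.07 ms
```

Because each listed format halves the peak of the one before it, the same 8192 × 8192 × 8192 multiply takes twice as long at each step down, from INT8 to FP32.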

In addition, its external I/O bandwidth reaches 2.3TB/s. It supports 64 channels of video encoding and 512 channels of video decoding, as well as the PCIe 5.0 and CXL interconnect protocols.
