May 30, 2025 Insurance Analysis

Intel's GPU Manufacturing Process

Intel’s Alchemist architecture established the company as a serious player in high-performance graphics. With the launch of the Arc A770, Intel showed it could deliver solid gaming performance on its first attempt. Now the company is shifting focus to its new Battlemage graphics architecture, which aims to build on that foundation.

The Arc B580, treated here as the A770’s successor, is a smaller GPU overall. The B580 has five render slices to the A770’s eight. That cuts the number of FP32 lanes from 4,096 on the A770 to 2,560 on the B580, so peak compute throughput is lower on paper.

Battlemage’s memory subsystem is also narrower than its predecessor’s. The B580 uses a 192-bit GDDR6 bus running at 19 GT/s, for a theoretical bandwidth of 456 GB/s. The A770 reaches 560 GB/s thanks to a wider 256-bit bus running at 17.5 GT/s.
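The bandwidth figures follow directly from bus width and transfer rate. A minimal sketch of that arithmetic is shown here; the function name is only illustrative, and the inputs are the numbers quoted in this article.

```python
def peak_bandwidth_gbs(bus_width_bits: int, rate_gtps: float) -> float:
    """Theoretical peak bandwidth in GB/s for a memory bus.

    Each transfer moves bus_width_bits / 8 bytes, and the bus performs
    rate_gtps billion transfers per second.
    """
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * rate_gtps


# Figures quoted in the article for each card.
b580_bw = peak_bandwidth_gbs(192, 19.0)
a770_bw = peak_bandwidth_gbs(256, 17.5)

print(f"B580: {b580_bw:.0f} GB/s, A770: {a770_bw:.0f} GB/s")
print(f"B580 / A770 ratio: {b580_bw / a770_bw:.0%}")
```

By this calculation the B580 has roughly four fifths of the A770's theoretical memory bandwidth.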

Connectivity has been cut back as well: the B580 narrows the host interface to a single PCIe 4.0 x8 link, half the width of the A770’s full-sized x16 link. If the B580 is to surpass the benchmarks set by the A770, its architecture has considerable ground to make up.
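To put the narrower host link in numbers, the sketch below estimates per-direction PCIe bandwidth. It assumes the standard PCIe 4.0 signalling rate of 16 GT/s per lane with 128b/130b encoding, and it assumes the A770's x16 link is also PCIe 4.0; the article itself does not state these figures. Protocol overhead beyond line encoding is ignored.

```python
# Assumed PCIe 4.0 parameters (not taken from the article).
PCIE4_GTPS_PER_LANE = 16.0        # giga-transfers per second per lane
ENCODING_EFFICIENCY = 128 / 130   # 128b/130b line encoding


def pcie_bandwidth_gbs(lanes: int,
                       rate_gtps: float = PCIE4_GTPS_PER_LANE,
                       efficiency: float = ENCODING_EFFICIENCY) -> float:
    """Approximate one-direction link bandwidth in GB/s."""
    # Each transfer carries one bit per lane; divide by 8 for bytes.
    return lanes * rate_gtps * efficiency / 8


x8_gbs = pcie_bandwidth_gbs(8)    # B580 host link
x16_gbs = pcie_bandwidth_gbs(16)  # A770 host link (assumed PCIe 4.0)

print(f"x8:  {x8_gbs:.2f} GB/s")
print(f"x16: {x16_gbs:.2f} GB/s")
```

Under these assumptions the x8 link offers a little under 16 GB/s in each direction, against a little under 32 GB/s for x16.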

Battlemage’s architectural changes begin at the Xe core. The most significant changes to the Xe core actually debuted with Lunar Lake. That revision streamlined the processing units, in particular the Xe Vector Engines (XVEs): Intel consolidated Alchemist’s XVEs into wider, more efficient execution units. Per-core throughput is unchanged, however, at 128 FP32 operations per cycle.
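The lane counts, slice counts, and per-core throughput quoted so far are mutually consistent, which a short calculation can confirm. The Xe core totals and the cores-per-slice value below are derived from the article's numbers rather than stated in it, and the helper name is hypothetical.

```python
FP32_LANES_PER_XE_CORE = 128  # per-core figure quoted in the article


def xe_core_layout(total_fp32_lanes: int, render_slices: int) -> tuple[int, int]:
    """Return (total Xe cores, Xe cores per render slice)."""
    xe_cores = total_fp32_lanes // FP32_LANES_PER_XE_CORE
    return xe_cores, xe_cores // render_slices


b580_cores, b580_per_slice = xe_core_layout(2560, 5)
a770_cores, a770_per_slice = xe_core_layout(4096, 8)

print(f"B580: {b580_cores} Xe cores, {b580_per_slice} per slice")
print(f"A770: {a770_cores} Xe cores, {a770_per_slice} per slice")
```

Both cards come out at the same number of Xe cores per slice, so the B580's reduction comes from having fewer slices rather than smaller ones.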

The design also features an instruction cache shared by all XVEs in a Xe core, which supplies each XVE with its instructions.

Alchemist had a 96 KB instruction cache, and Battlemage very likely has at least as much, which works out to roughly 6,000 to 12,000 instructions. That capacity plays a crucial role in efficiency when shaders are large and complex.
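The 6,000 to 12,000 range can be reproduced with a simple division. The sketch assumes instructions are 16 bytes in their full form and 8 bytes in a compacted form; those sizes are an assumption chosen because they match the quoted range, and the article does not state them.

```python
ICACHE_BYTES = 96 * 1024     # 96 KB instruction cache (Alchemist figure)

# Assumed instruction encodings, not stated in the article.
FULL_INSTRUCTION_BYTES = 16
COMPACT_INSTRUCTION_BYTES = 8

# Fewest instructions fit when every instruction uses the full encoding,
# most fit when every instruction uses the compacted encoding.
min_instructions = ICACHE_BYTES // FULL_INSTRUCTION_BYTES
max_instructions = ICACHE_BYTES // COMPACT_INSTRUCTION_BYTES

print(f"Nominal capacity: {min_instructions} to {max_instructions} instructions")
```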

Looking closer, the XVE is the smallest partition in Intel’s GPU design. Each XVE tracks up to eight threads and switches between them to hide latency and keep its execution units fed. A 64 KB register file holds thread state, giving each thread up to 8 KB of registers even at full occupancy. Intel’s register file is also notably flexible in how registers can be addressed, which allows operations that are more constrained on Nvidia or AMD architectures.
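The relationship between register file size, per-thread allocation, and occupancy can be sketched as follows. The 64 KB, 8 KB, and eight-thread figures come from the article; the 16 KB case is a hypothetical illustration of what a larger per-thread allocation would do, not a documented mode.

```python
REGISTER_FILE_BYTES = 64 * 1024  # per-XVE register file
MAX_THREADS_PER_XVE = 8          # thread slots tracked by one XVE


def resident_threads(bytes_per_thread: int) -> int:
    """Threads that fit in the register file, capped by the thread slots."""
    fits = REGISTER_FILE_BYTES // bytes_per_thread
    return min(MAX_THREADS_PER_XVE, fits)


full_occupancy = resident_threads(8 * 1024)      # article's 8 KB per thread
reduced_occupancy = resident_threads(16 * 1024)  # hypothetical larger allocation

print(f"8 KB per thread:  {full_occupancy} threads")
print(f"16 KB per thread: {reduced_occupancy} threads")
```

With 8 KB per thread all eight thread slots can be filled, which is why the register file does not limit occupancy at that allocation.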

On Battlemage, math operations generally use 16-wide or 32-wide vectors, dropping the SIMD8 mode used on Alchemist. Wider vectors reduce instruction control overhead, because each instruction covers more lanes.
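The effect of vector width on instruction count is easy to illustrate. The sketch below counts how many vector instructions are needed to apply one operation to a batch of work items at each SIMD width; the batch size of 256 is an arbitrary example value.

```python
import math


def vector_instructions(work_items: int, simd_width: int) -> int:
    """Instructions needed to apply one operation to every work item."""
    return math.ceil(work_items / simd_width)


WORK_ITEMS = 256  # arbitrary example batch size

counts = {width: vector_instructions(WORK_ITEMS, width) for width in (8, 16, 32)}

for width, count in counts.items():
    print(f"SIMD{width}: {count} instructions")
```

Each doubling of vector width halves the number of instructions that must be fetched, decoded, and scheduled for the same amount of work.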

Battlemage also improves its handling of thread divergence, which arises when lanes within the same vector take different paths. With this improvement, Battlemage copes with divergence more gracefully than the A770 while still benefiting from the longer vector lengths.

Like Alchemist, Battlemage uses two ports (ALU0 and ALU1) for most calculations. ALU0 handles FP32 and FP16 operations, while ALU1 handles integer math and less common instructions. The port layout resembles Nvidia’s Turing, with some differences in which instruction types can issue in each cycle.

The XMX matrix multiplication unit, inherited from Alchemist, remains an important part of Battlemage.

Intel claims three-way co-issue, which would place XMX on its own port. However, analysis tools report overlapping activity between the ALU and XMX pipes in ways that could suggest otherwise.

Gaming workloads rely mainly on floating-point operations, so ALU1 takes integer and other work off ALU0, leaving ALU0 free for floating-point math. This offloading matters because the design aims to minimize contention for issue slots between the units.
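A toy model shows why moving integer work to a second port helps. It assumes each port issues one operation per cycle, that floating-point and integer operations can always co-issue, and that there are no dependencies or stalls; none of this is claimed by Intel, and the 80/20 instruction mix is hypothetical.

```python
def issue_cycles(fp_ops: int, int_ops: int, dual_port: bool) -> int:
    """Idealized cycles to issue a mix of FP and integer operations."""
    if dual_port:
        # FP goes to one port and integer to the other, in parallel,
        # so the busier port determines the total.
        return max(fp_ops, int_ops)
    # A single port must issue every operation in turn.
    return fp_ops + int_ops


FP_OPS, INT_OPS = 80, 20  # hypothetical instruction mix

single_port_cycles = issue_cycles(FP_OPS, INT_OPS, dual_port=False)
dual_port_cycles = issue_cycles(FP_OPS, INT_OPS, dual_port=True)

print(f"Single port: {single_port_cycles} cycles")
print(f"Dual port:   {dual_port_cycles} cycles")
```

In this idealized case the integer work is fully hidden behind the floating-point work. Real hardware would fall short of that bound.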

As expected, software support for XMX remains a concern. AI workloads are hard to run effectively on it across the various frameworks, so the conventional vector units often end up doing the work instead.

In benchmarks, Intel’s older A770 and A750 can use their larger shader arrays to achieve higher compute throughput than the B580. The B580, however, delivers its performance more consistently. Alchemist often struggled with FP32 FMA operations; Battlemage fixes this and appears able to reach its theoretical throughput.

For integer work, Battlemage does well on lower-precision INT8 operations. Overall, the architecture uses its resources more efficiently than prior generations.

The memory access path resembles its predecessor’s but shows some notable optimizations. Each Xe core contains eight texture samplers, each with a 32 KB texture cache, much as in the previous design. Texture bandwidth improves little, because the B580’s higher clock speeds do not offset its reduced number of Xe cores.

Global memory access changes as well: Battlemage has a larger 256 KB cache allocation, building on improvements from Intel’s previous generations.
