Cell could offer dramatic boost for scientific computing


A new paper from a group at Lawrence Berkeley National Laboratory, "The Potential of the Cell Processor for Scientific Computing," explores the performance of IBM's Cell processor on some specific types of code commonly found in high-performance computing (HPC) applications. The programs used in the study are essentially smallish code blocks called kernels (see this older article for more on kernels and benchmarking) that implement typical algorithms like FFTs, stencil computations, and matrix multiplication. The paper compares Cell's performance on these kernels to the performance of the Cray X1E, AMD's Opteron, and Intel's Itanium2.

The idea here is that Cell will be a commodity processor (at least that’s what the authors and IBM hope), so it’ll be a viable HPC alternative for the cost-sensitive academic research market. This paper represents the first formal academic attempt to decide if Cell hardware is something that researchers will want to invest in.

So how does Cell stack up in comparison to these three competitors? In a word, it screams.

First, the good news

Take a look at the following results for single-precision dense matrix multiplication, or GEMM (all numbers are Gflop/s):

Cell (pm): 204.7
Cray X1E: 29.5
AMD64: 7.8
Itanium2: 3.0

The “pm” above means “performance model.” Because Cell hardware isn’t generally available for tests like this, the paper’s authors used a combination of performance projections and benchmarks on a cycle-accurate simulation of Cell that IBM has released. Real-world results should be very comparable to those in the paper, if not even better.
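For reference, Gflop/s figures like these come straight from the kernel's arithmetic count: an n×n dense matrix multiply performs 2n³ floating-point operations (n³ multiply-add pairs), so the rate is simply that count divided by the wall-clock time. Here's a minimal, illustrative sketch (not the paper's code, and obviously nowhere near hand-tuned speeds) that times a naive GEMM and reports the rate the same way:

```python
import time

def gemm(A, B, n):
    """Naive dense matrix multiply, C = A * B, for n x n lists of lists."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):          # k in the middle improves row locality
            aik = A[i][k]
            for j in range(n):
                C[i][j] += aik * B[k][j]
    return C

n = 64
A = [[1.0] * n for _ in range(n)]
B = [[2.0] * n for _ in range(n)]

start = time.perf_counter()
C = gemm(A, B, n)
elapsed = time.perf_counter() - start

flops = 2.0 * n ** 3                # 2n^3 floating-point operations for GEMM
print(f"{flops / elapsed / 1e9:.4f} Gflop/s")
```

The 2n³ denominator is the standard convention, which is what makes numbers from different machines and papers directly comparable.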

Note that the above results aren't exactly typical. In some of the other tests, Cell is a mere ten times faster than the competition. I should also mention that the paper looks into power consumption, and Cell still manages to trounce the other guys at performance/watt.

Needless to say, these results are extremely promising, and the authors of the paper clearly believe that Cell could change the HPC game if it is available in quantity and at commodity prices. I personally think that Cell’s “commodity” status outside of the PS3 is a bigger “if” than the paper presumes, but we’ll see soon enough.

Now for the caveats

So now that we’ve seen that Cell blows away the competition for these HPC kernels, that means that it’s going to completely dominate the next-gen console market and kill Itanium, right? Not exactly.

First, single-precision (SP) is the place where Cell really blows the doors off the barn, because SP is what game developers need. IBM made some compromises on double-precision (DP) performance, with the result that such performance is a fraction of what it is for SP. On DP code, Cell merely leads the pack for most of the tests.

The paper’s authors propose a microarchitectural improvement to Cell’s DP capabilities that they call Cell+, and they’re clearly hoping IBM will adopt their suggestion. Cell+ significantly enhances DP throughput with minimal changes, so we’ll see if IBM bites.

Another thing that should be pointed out is that the Cell used in the paper has full access to all eight SPEs, and not the six SPEs of the PS3. (Remember, one SPE is disabled for yield reasons, and the other is reserved for the system.) So keep this in mind when fantasizing about how these results are going to extrapolate to the PS3 hardware.

More important than the eight vs. six SPE issue is the fact that, due to the nature of the kernels used and the way that they were implemented for these tests, taking these results and trying to think about how a future iteration of Gran Turismo will look on the PS3 is a bit like comparing apples to cucumbers. Here’s why.

Programming models and the big picture

To get the kinds of mind-blowing results found in the paper, the Berkeley team took each kernel and custom-fit it to the bare Cell hardware using low-level intrinsics and labor-intensive hand optimization. They didn't rely on IBM's higher-level development tools, and they didn't even code the kernels in plain C. In other words, they were operating at "Tier I" of the Cell programming complexity hierarchy. By taking into account things like the deterministic load latencies at the various levels of the memory hierarchy, this code was tuned and timed, cycle by cycle and word by word, to fit the Cell hardware.

Our first Cell implementation, SpMV, required about a month of learning the programming model, the architecture, the compiler, the tools, and deciding on a final algorithmic strategy. The final implementation required about 600 lines of code. The next code development examined two flavors of double precision stencil-based algorithms. These implementations required one week of work and are each about 250 lines, with an additional 200 lines of common code. The programming overhead of these kernels on Cell required significantly more effort than the scalar version’s 15 lines, due mainly to loop unrolling and intrinsics use. Although the stencils are a simpler kernel, the SpMV learning experience accelerated the coding process.

Having become experienced Cell programmers, the single precision time skewed stencil — although virtually a complete rewrite from the double precision single step version — required only a single day to code, debug, benchmark, and attain spectacular results of over 65 Gflop/s. This implementation consists of about 450 lines, due once again to unrolling and the heavy use of intrinsics.
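To give a flavor of why unrolling inflates line counts the way the authors describe (this is an illustrative sketch, not the paper's code), here is a simple 3-point averaging stencil written twice: once as the short scalar loop, and once with the inner loop unrolled 4x in the hand-tuned style:

```python
def stencil_scalar(u):
    """One step of a 3-point stencil: v[i] = (u[i-1] + u[i] + u[i+1]) / 3."""
    n = len(u)
    v = u[:]                        # boundary values copied unchanged
    for i in range(1, n - 1):
        v[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return v

def stencil_unrolled(u):
    """Same stencil with the inner loop unrolled 4x, mimicking hand tuning."""
    n = len(u)
    v = u[:]
    i = 1
    while i + 3 < n - 1:            # main unrolled body
        v[i]     = (u[i - 1] + u[i]     + u[i + 1]) / 3.0
        v[i + 1] = (u[i]     + u[i + 1] + u[i + 2]) / 3.0
        v[i + 2] = (u[i + 1] + u[i + 2] + u[i + 3]) / 3.0
        v[i + 3] = (u[i + 2] + u[i + 3] + u[i + 4]) / 3.0
        i += 4
    while i < n - 1:                # remainder loop for leftover elements
        v[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
        i += 1
    return v

u = [float(x) for x in range(10)]
assert stencil_scalar(u) == stencil_unrolled(u)
```

The scalar body is three lines; the unrolled-plus-remainder version is several times longer and does exactly the same work. Multiply that by SIMD intrinsics and manual buffer management, and a "15-line" scalar kernel ballooning into hundreds of lines is unsurprising.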

The authors were able to do this kind of custom fit because they picked a programming model based on data parallelism. What this means is that they had the eight SPEs doing identical work on different parts of a highly parallel dataset. When you’ve got all eight SPEs marching in lock-step through a large, parallel dataset, then you can really put all of the hardware on that chip to work in a dramatic way, as the paper indeed shows.

IBM, however, is pushing a task-based approach to parallel programming on Cell, where many individual tasks run concurrently on the different SPEs. This is way harder to code for and optimize than the data parallelism-based approach used in the paper, but it's also where the money's at in the consumer and game markets.

In the end, what the paper demonstrates is that, for HPC kernels amenable to a data-parallel programming model, Cell's particular combination of a software-controlled memory hierarchy (with deterministic load latencies) and an obscene amount of parallel execution hardware is clearly the way to go. This approach is dramatically superior to a general-purpose computing architecture with a hardware-controlled memory hierarchy from both performance and performance/watt perspectives.
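The payoff of a software-controlled memory hierarchy is that the programmer schedules data movement explicitly, most commonly by double-buffering: start fetching the next tile of data while computing on the current one. Here's a conceptual sketch of that pattern (sequential Python standing in for asynchronous SPE DMA, so the fetch/compute overlap is only implied, not real):

```python
def process_tiles(tiles, compute, fetch):
    """Double-buffered tile processing: issue the fetch for the next tile
    before computing on the current one, so that on real hardware the DMA
    transfer and the computation would overlap."""
    results = []
    if not tiles:
        return results
    current = fetch(tiles[0])           # fill buffer A with the first tile
    for i in range(1, len(tiles)):
        nxt = fetch(tiles[i])           # would be an async DMA into buffer B
        results.append(compute(current))
        current = nxt                   # swap buffers for the next iteration
    results.append(compute(current))    # drain the last tile
    return results

# Hypothetical fetch/compute stand-ins for illustration only:
out = process_tiles([1, 2, 3], compute=lambda x: x * 10, fetch=lambda x: x + 1)
```

With deterministic load latencies, the programmer can size the tiles so the fetch always finishes just in time, which is what makes the cycle-by-cycle tuning described above possible at all.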

If Cell doesn’t really catch on as a commodity part outside the PS3, I expect we’ll eventually be posting a news item about a lab somewhere (Iran?) that placed an order for 200 PS3 consoles, with plans to cluster them.

Speaking of the PS3, that’s going to feature mostly task-based programming, which as I just said is a different beast than what was done in the Berkeley paper. Also, the programming will be done at higher levels of abstraction from the hardware. So please, don’t read this and then assume that Cell will administer a similar drubbing to general-purpose architectures like Opteron, Itanium, and Conroe on all game, physics, and AI code.
