
This is probably because you're comparing them to consumer GPUs, which are designed this way - favoring integer and single-precision functional units on the SM cores. That's sort of a market-segmentation strategy by Nvidia. The Teslas have good double-precision performance - but they are priced waaaay higher than the consumer cards.


> The Teslas have good double-precision performance - but they are priced waaaay higher than the consumer cards.

Yup - the Tesla V100 does 7 TFLOPS of double-precision at around $9000

A huge split between consumer & HPC happened in the aftermath of the Fermi (2010) architecture. Fermi was really bad in the consumer space because of all the die area wasted on unused double-precision hardware. It was late, hot, and loud. And barely even faster than the competition.

With Maxwell, Nvidia basically removed all the FP64 support from the architecture itself ( https://www.anandtech.com/show/9059/the-nvidia-geforce-gtx-t... - native FP64 rate is 1/32nd the FP32 rate) - the result was a huge boost to gaming performance. But it also meant that HPC users who want double-precision have to use Tesla cards. The actual architectures behind GeForce & Tesla are different now; it's not "just" a lockout anymore.


Exactly. About 780 Mflops/$ for the Tesla V100 and 460 Mflops/$ (peak) for the Epyc 7742.

I was very surprised it was this close. I thought the accelerator would be an order of magnitude cheaper per double-precision Gflop. And AMD isn't even using AVX-512 yet.
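For concreteness, here's a rough FLOPS-per-dollar sketch. List prices and peak figures are approximate, and the EPYC number depends heavily on which clock (base vs. boost) and price you assume:

```python
# Back-of-the-envelope FP64 throughput per dollar.
# All prices and peak-FLOPS figures are rough list/launch numbers.

def mflops_per_dollar(peak_gflops, price_usd):
    """Peak double-precision MFLOPS per dollar spent."""
    return peak_gflops * 1000 / price_usd

# Tesla V100: ~7 TFLOPS FP64 at ~$9000
v100 = mflops_per_dollar(7000, 9000)

# EPYC 7742: 64 cores x 16 FP64 FLOPS/cycle (2x 256-bit FMA) at ~2.25 GHz base,
# ~$6950 launch price; boost clocks push the figure higher.
epyc = mflops_per_dollar(64 * 16 * 2.25, 6950)

print(round(v100), round(epyc))  # same order of magnitude either way
```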


A different question is how easy it is to use those FLOPS. The memory bandwidth on a GPU is much higher than on a CPU. Higher FLOP count is useless if the CPU is stalled on memory.
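One way to put a number on that: divide peak FLOPS by memory bandwidth to get the "machine balance" - the arithmetic intensity (FLOPs per byte) a kernel needs to avoid being memory-bound. A rough sketch using published peak figures, not measurements:

```python
# Machine balance = peak FLOPS / memory bandwidth (FLOPs per byte moved).
# Kernels with lower arithmetic intensity than this are memory-bound.

def balance(peak_gflops, bw_gb_s):
    return peak_gflops / bw_gb_s

v100 = balance(7000, 900)   # V100: ~7 TFLOPS FP64, ~900 GB/s HBM2
rome = balance(2300, 205)   # EPYC 7742: ~2.3 TFLOPS FP64, ~205 GB/s DDR4

# A double-precision dot product does ~2 FLOPs per 16 bytes loaded
# (0.125 FLOPs/byte), so it's badly memory-bound on either chip -
# but the GPU stalls less in absolute terms thanks to ~4.4x the bandwidth.
print(v100, rome)
```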


However, CPUs have much much better branching performance, and are much easier to fully utilize.


This is a good point. One place I worked tried to move their numerical simulations to GPGPU, and found they had to move from a recursive algorithm to one more suited to GPU architecture.

They did the rewrite, but found the results were 10% less precise in the convergence on a solution.

So the immense parallel nature of a GPU is only useful if your algorithms are the right shape, such as a fixed number of matrix multiplies.


Imagine the very near future when AMD starts using TSMC's 5nm process, which has approximately double the transistor density of the current 7nm process used for the EPYC 7002 series.

They could go to DDR5, PCIe 5, AVX 512, and still have a transistor budget left over for whatever they like.

The 'whatever' is the interesting part. What exactly does a GPU do that a CPU doesn't?

Typical GPUs have crazy high memory bandwidths and good latency hiding from running many thousands of threads.

So if AMD does something like increase the number of memory channels and implement 4-way SMT, they're poised to upset NVIDIA in the HPC space in a big way.

Many people would much rather program for a general-purpose processor than the CUDA platform with all of its quirks and limitations...


The memory bandwidth gap between CPUs and GPUs is absolutely ludicrously massive. GPUs already crossed the 1TB/s mark. Epyc Rome is only 204GB/s with DDR4-3200.

It's been like this for a decade at least, and I don't expect that gap to shrink anytime soon.
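The Rome figure falls straight out of the DDR4 math (assuming the standard 64-bit bus per channel):

```python
# Per-channel DDR bandwidth: transfer rate (MT/s) x 8 bytes per 64-bit transfer.
def ddr_bw_gb_s(mt_s, channels):
    return mt_s * 8 * channels / 1000

rome = ddr_bw_gb_s(3200, 8)  # 8-channel DDR4-3200
print(rome, "GB/s")          # 204.8 GB/s, vs >1000 GB/s for current HBM2 GPUs
```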

But realize also that the Tesla V100 is still on TSMC 12nm. Nvidia is obviously also going to make 7nm and eventually 5nm variants, which will likewise benefit from 2x+ density.


Let's do some maths for my hypothetical EPYC 3:

1) DDR5 is about 2.5x the speed of DDR4: https://www.anandtech.com/show/15699/sk-hynix-ddr5-8400

2) Dual socket roughly doubles the bandwidth. Measurements are showing something like 300GB/s in practice: https://www.anandtech.com/show/14694/amd-rome-epyc-2nd-gen/6

3) AMD could add extra memory channels; a 50% increase is reasonable.

300GB/s x 1.5 for more channels x 2.5 for DDR5 = 1.1 TB/s.
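Spelled out as code, with the same (entirely hypothetical) inputs:

```python
# Hypothetical "EPYC 3" bandwidth estimate - every input here is speculative.
measured_dual_socket = 300  # GB/s, dual-socket Rome measured in practice
channel_factor = 1.5        # 8 -> 12 channels per socket (speculative)
ddr5_factor = 2.5           # DDR5-8000-class vs DDR4-3200 (optimistic)

estimate = measured_dual_socket * channel_factor * ddr5_factor
print(estimate / 1000, "TB/s")  # 1.125 TB/s
```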

Not too shabby! As you said, it would likely be eclipsed by the next-gen NVIDIA accelerator, but... damn, over a terabyte per second for general-purpose compute is just nuts.

In principle, AMD could go even higher if they really tried to optimise the platform for this one metric, but server CPUs tend to be "balanced", so I doubt this will happen. One can dream...


> DDR5 is about 2.5x the speed of DDR4: https://www.anandtech.com/show/15699/sk-hynix-ddr5-8400

No it isn't, it's ~1.5x the speed of DDR4: 4800 vs. 3200.

The 8400 is a hypothetical module that they _plan_ to make, not one they've actually managed to make. And the first generation of CPUs with DDR5 support is unlikely to immediately support the maximum the DDR5 spec plans to achieve - just as CPUs only very recently officially supported DDR4-3200, despite it being on the market for years and years (the 9900K only officially supports up to DDR4-2666, even).

> AMD could add extra memory channels, a 50% increase is reasonable.

Say what now? A 50% increase is reasonable? You're expecting 12-channel memory? The 8 channels in Epyc Rome are already the most of any CPU on the market. I don't see any chance at all of that jumping to 12 in a single generation.

12-channel starts to become a physical packaging problem. Dual-socket with 8 channels already takes up basically the maximum width of a board: https://www.supermicro.com/a_images/products/Aplus/MB/H12DSU...

So you'd have to do a max of 12 DIMMs per CPU instead of the current 16 DIMMs, which cuts your max realistic capacity down by a lot.


That's not going to happen. Extra memory channels are very expensive die-wise. Nvidia and AMD achieve these rates with HBM, which has very wide buses (4096 bits) and short traces from stacking. I can't see any way CPU memory will compete until they move to HBM. Keep in mind GDDR6 is available in GPUs now, and is faster than DDR5, but much slower than HBM.


The i/o die is 14nm right now. Once that shrinks, it leaves room for more channels.

EPYC could use HBM just fine if the advantage is pressing enough.


So can Intel, but they don't. HBM would likely require them to sell a fixed memory capacity, which can be severely limiting for server applications. Not to mention it's extremely power hungry compared to DDR, so you won't get anywhere near the capacities DDR gives without making power consumption go way up.


EPYC is already a modular architecture, literally nothing stops AMD replacing a couple of "compute" dies with HBM2 stacks. They could release CPUs that don't require DIMM sockets at all. E.g.: instead of 2 sockets + a bunch of DIMM sockets, the same motherboard space could be used for 4 sockets with embedded memory.


They could but then you're cutting your FLOPS down to get your memory bandwidth up. And HBM2 doesn't get you much capacity. The 7nm Instinct MI50 has 4 stacks of HBM2 to achieve 32GB in capacity. So other than as a joke toy, what would you do with a 32-core / 64-thread CPU with 32GB of RAM? That's what you'd end up with if you swapped out 4 compute dies for 4 HBM2 stacks.
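The trade-off in numbers, using this thread's assumptions (4 of 8 compute dies swapped for MI50-style HBM2 stacks - a hypothetical product, not anything announced):

```python
# Hypothetical EPYC with half its compute dies replaced by HBM2 stacks.
dies_total = 8        # Rome carries 8 compute chiplets
cores_per_die = 8
stacks = 4
gb_per_stack = 8      # HBM2 density as on the 7nm Instinct MI50 (4 x 8 GB = 32 GB)

capacity_gb = stacks * gb_per_stack                 # 32 GB per socket
cores_left = (dies_total - stacks) * cores_per_die  # 32 cores / 64 threads
print(cores_left, "cores,", capacity_gb, "GB")
```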


That's 32GB per socket, with current technology.

Assume that in 1-2 years HBM capacity doubles, and it's a quad-socket motherboard. You'd have 64GB per socket, or 256GB total.

Remind me how much memory an NVIDIA accelerator has?

To play Devil's advocate, putting HBM2 in the package doesn't magically solve everything. The intra-socket bandwidth could be enormous, but the inter-socket bandwidth would still be whatever it is now, and would be difficult to increase.


> and it's a quad-socket motherboard

Epyc doesn't do quad sockets. Is this just another hypothetical "what if" at this point with no basis in reality?

Because sure, a hypothetical non-existent Epyc re-designed to compete in the double precision floating point space favoring memory bandwidth above all else could be really cool. Then again, so could anything else custom designed exclusively for that use case.

> but the inter-socket bandwidth would still be whatever it is now, and would be difficult to increase.

64 PCI-E 4.0 lanes form the CPU-CPU interconnect currently.

Since we're making up stuff, why not assume that's doubled next generation, along with being PCI-E 5.0? So that'd be 500GB/s, give or take.
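Roughly where that 500GB/s comes from (the doubled lane count and the PCIe 5.0 rate are both pure speculation):

```python
# Hypothetical next-gen socket-to-socket link.
lanes = 64 * 2                 # assume the lane count doubles
pcie4_gb_s_per_lane = 1.969    # PCIe 4.0: 16 GT/s, 128b/130b encoding, per direction
pcie5_gb_s_per_lane = pcie4_gb_s_per_lane * 2  # PCIe 5.0 doubles the rate

link = lanes * pcie5_gb_s_per_lane
print(round(link), "GB/s each way")  # ~504 GB/s
```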


That would be great, but to date HBM has always been part of the board. I'm all for selling motherboards with the RAM already on them if it means higher bandwidth, but it's just never happened before.


AFAIK HBM has always been on the interposer, which is no different from the AMD chiplet approach.


That's not exactly true once you have to deal with vector extensions like AVX-512. They're quite a pain to write by hand (C intrinsics), and many of the ways to abstract them away end up giving you a GPU-like programming model (e.g. Intel ISPC).

Plus, this has largely been tried before with Xeon Phi and it didn't end so well.

Huge vector units like AVX-512 are mainly useful for workloads that need huge amounts of RAM that you just can't get with a GPU, or for workloads that are very latency sensitive and incompatible with GPU task scheduling because they are in some other CPU-bound code.


>Huge vector units like AVX-512 are mainly useful for workloads that need huge amounts of RAM that you just can't get with a GPU, or for workloads that are very latency sensitive and incompatible with GPU task scheduling because they are in some other CPU-bound code.

There are a lot of tasks that a GPU can do faster than a CPU, but they require batching a large amount of work before you gain a speedup. EPYC CPUs do not suffer that limitation. If all you have is an array with 4 elements, you can straight up run the vector instructions and then immediately switch back to scalar code. Meanwhile, with a GPU you probably need an array with at least 10000 elements or more.
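A back-of-the-envelope break-even for that batching cost. Every number here is an assumption for illustration, not a measurement:

```python
# When does offloading to a GPU pay off? The launch/transfer overhead
# has to be amortized over the batch. All figures below are assumed.
launch_overhead_ns = 10_000   # ~10 us kernel launch latency (assumed)
cpu_ns_per_elem = 1.0         # CPU SIMD loop throughput (assumed)
gpu_ns_per_elem = 0.05        # GPU throughput once running (assumed)

# Solve overhead + N * gpu == N * cpu for N:
break_even = launch_overhead_ns / (cpu_ns_per_elem - gpu_ns_per_elem)
print(int(break_even), "elements")  # ~10500: tiny arrays stay on the CPU
```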


> It's quite a pain to write by hand

And we all know that autovectorisation is hit-and-miss at best.

I wonder if there will be a new C-like language that has portable SIMD-like capabilities in the same sense that "C is a portable assembly language".


>The first observation is that in modern compilers, the resulting performance from auto-vectorization optimization is still far from the architectural peak performance. https://dl.acm.org/doi/fullHtml/10.1145/3356842

Maybe we could get more benefit if we invested more resources in optimizing compilers than in inventing yet another JavaScript framework?


Another one is SYCL: https://www.khronos.org/sycl/ - the SIMD part is OpenCL; SYCL provides scheduling/memory transfer on top.


There is - Intel SPMD Program Compiler. https://ispc.github.io/


But does that work on AMD?


That is a problematic metric. You see, you don't buy individual GPUs or CPUs, you buy systems. And typically, you can stick multiple GPUs in the same system (more than CPUs). There's also the question of what those servers cost, how many rack units each one takes up, etc.

And then - it might make more sense to measure power consumption and maintenance costs than up-front price.


I looked at both. Recent consumer cards have terrible double-precision performance (with the exception of the Titan series) of course, but the extreme cost of the non-consumer cards makes their price per double-precision flop approximately no better[1] than Epyc 2's.

[1] EDIT: Tesla V100s are still about 40% lower cost per Gflop than the Epyc 7742, but definitely within the same order of magnitude, unlike my expectation.



