I think even this is overstated: Python via Numba and MATLAB both have excellent JIT compilers now, and for maximum performance you need to hit SIMD. The latter is not reliable in Julia, and as for the former, well, you already have Python and MATLAB.
"SIMD not reliable" is itself an overstatement, for a couple of reasons:
1. It's very easy to inspect the generated code of the kernels where you really need SIMD. For instance:
julia> function my_kernel(xs)
           total = zero(eltype(xs))
           for x in xs
               @fastmath total += x
           end
           total
       end
julia> @code_native debuginfo=:none my_kernel(rand(10))
...
L96:
vaddpd 8(%rcx,%rax,8), %ymm0, %ymm0
vaddpd 40(%rcx,%rax,8), %ymm1, %ymm1
vaddpd 72(%rcx,%rax,8), %ymm2, %ymm2
vaddpd 104(%rcx,%rax,8), %ymm3, %ymm3
addq $16, %rax
cmpq %rax, %rdi
jne L96
...
SIMD convergence is a big issue here: how do you keep those lanes running together? For trivial kernels, sure, but ISPC guarantees convergence across control flow, which isn't even available in CUDA (according to the ISPC docs, at least).
Just to be clear: I want to use Julia, but without having to guess whether my kernels actually use SIMD. Until this works for complex control flow in Julia, it's not a game changer in my opinion.
Edit: that kernel is trivial. I have nested control flow to vectorize.
Right, I see what you're getting at. Taking a slightly less trivial example from Intel's docs and superficially comparing the assembly it generates with what LoopVectorization.jl can do, I'd say there is some hope at least:
julia> function simple!(ys, xs)
           @avx for i in eachindex(xs)
               ys[i] = if xs[i] < 3.0
                   xs[i]^2
               else
                   sqrt(xs[i])
               end
           end
       end
This generates vbroadcastsd, vpcmpgtq, vmulpd, vsqrtpd, vblendvpd and vmaskmovpd AVX2 instructions (I don't have AVX512). But more complex control flow does indeed not work: the `@avx` macro errors, saying it can't handle it at the moment.
Thanks for the example; I am interested in trying these packages in the future, but here's a current snippet of ISPC:
varying float aff[nc], xh[nn], wij[nn], x[nc];
varying int ih[nn];
uniform int t_ = t & (nl - 1);
uniform float k;

for (int j = 0; j < nc; j++)
    aff[j] = 0.0f;
for (int j = 0; j < nn; j++)
    for (int i = 0; i < nc; i++)
        aff[i] += wij[j] * shuffle(xh[j], ih[j]);
for (int i = 0; i < nc; i++)
    x[i] = 0.1f * (x[i] + x[i]*x[i]*x[i]/3.0f + k*aff[i]);
for (uniform int i = 0; i < nn; i++)
    xh[i] = insert(rotate(xh[i], 1), 0, extract(x[i/nl], i & (nl - 1)));
Just being able to absorb SIMD lanes into the notion of varying vs. uniform makes this easy to write, not to mention cross-lane operations like shuffle or rotate. In Julia or Numba or CUDA I have to index into arrays, ensure a compatible data layout, etc. I imagine this could be done with more macrology in Julia, but again, why not use something that already works.
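For what it's worth, a rough sketch of the explicit-lane style that is possible in Julia today, assuming the SIMD.jl package (its `Vec` type and `shufflevector` are real; `rotate1` is a hypothetical helper of my own, not part of the package):

```julia
using SIMD  # registered package providing explicit SIMD vector types

# Hypothetical helper mimicking ISPC's rotate(x, 1): shift lanes left by one.
# shufflevector takes 0-based lane indices, mirroring LLVM's shufflevector.
rotate1(v::Vec{4,Float32}) = shufflevector(v, Val((1, 2, 3, 0)))

xh = Vec{4,Float32}((1f0, 2f0, 3f0, 4f0))
rotate1(xh)  # lanes should come out as (2.0, 3.0, 4.0, 1.0)
```

It's nowhere near ISPC's uniform/varying model, though: you pick the lane width and manage the data layout yourself, rather than the compiler absorbing them for you.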
> but again why not use something which already works
Sure, but there's always somebody crazy enough to try to implement it :p
From the fact that ISPC can generate C++ (with --emit-c++) I would think their compiler is conceptually very high-level. Since so few new concepts are introduced, I wouldn't be surprised if someone implemented the same DSL in Julia at some point, just to see whether they could get similar performance.
> ISPC can generate C++ (with --emit-c++) I would think their compiler is conceptually very high-level
It's a Clang-based frontend for LLVM, so yeah, no reason Julia couldn't make use of their work. I think the main thing to tap into is the uniform vs varying support, since this drives the whole thing.
I'm sorry, but as a regular Matlab user, I have to say that, no, Matlab does not have an "excellent JIT compiler." It has a JIT compiler that is really limited and works only for straight-line code with built-in functions.
For performance you virtually always need to write 'vectorized' code, which is often difficult and memory-heavy.
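To illustrate with a toy example (my own, in Matlab syntax, not taken from any real codebase):

```matlab
% Loop form: readable, but relies entirely on the JIT for speed.
y = zeros(size(x));
for i = 1:numel(x)
    y(i) = x(i)^2 + 3*x(i);
end

% Vectorized form: fast, but x.^2 and 3*x each allocate an intermediate array.
y = x.^2 + 3*x;
```

Even in this trivial case the vectorized form pays for its speed in temporaries; for anything with branches or dependencies between iterations, contorting it into this shape gets much worse.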
The daily churn of handling input parsing, along with contorting code into vectorized form, takes up an extraordinary amount of time when writing Matlab code.
Matlab is certainly not 'good enough' for me! It's just what I'm stuck with at work.
I have also seen Matlab code match Julia for simple loops. If you write your code properly vectorized, with loops that use only calls to 'built-in' operators and C functions, then it can match C/Julia.
But that's just extremely restrictive. The jit does not work on arbitrary user code.