I think even this is overstated: Python via Numba and MATLAB both have excellent JIT compilers now, and for maximum performance you need to hit SIMD. The latter is not reliable in Julia, and as for the former, well, you already have Python and MATLAB.
"SIMD not reliable" is itself an overstatement, for a couple of reasons:
1. It's very easy to inspect the generated code of the kernels where you really need SIMD. For instance:
julia> function my_kernel(xs)
           total = zero(eltype(xs))
           for x in xs
               @fastmath total += x
           end
           total
       end
julia> @code_native debuginfo=:none my_kernel(rand(10))
...
L96:
vaddpd 8(%rcx,%rax,8), %ymm0, %ymm0
vaddpd 40(%rcx,%rax,8), %ymm1, %ymm1
vaddpd 72(%rcx,%rax,8), %ymm2, %ymm2
vaddpd 104(%rcx,%rax,8), %ymm3, %ymm3
addq $16, %rax
cmpq %rax, %rdi
jne L96
...
SIMD convergence is a big issue here: how do you keep those lanes running together? For trivial kernels, sure, but ISPC guarantees convergence across control flow, which isn't even available in CUDA (according to the ISPC docs, at least).
Just to be clear: I want to use Julia, but without having to guess whether my kernels actually use SIMD. Until this works for complex control flow in Julia, it's not a game changer in my opinion.
Edit: that kernel is trivial. I have nested control flow to vectorize.
Right, I see what you're getting at. Taking a slightly less trivial example from Intel's docs and superficially comparing the assembly it generates with what LoopVectorization.jl can do, I'd say there is some hope at least:
julia> function simple!(ys, xs)
           @avx for i in eachindex(xs)
               ys[i] = if xs[i] < 3.0
                   xs[i]^2
               else
                   sqrt(xs[i])
               end
           end
       end
This generates vbroadcastsd, vpcmpgtq, vmulpd, vsqrtpd, vblendvpd and vmaskmovpd AVX2 instructions (I don't have AVX512). But more complex control flow does indeed not work: the `@avx` macro errors, saying it can't handle it at the moment.
Thanks for the example; I am interested in trying these packages in the future, but here's a current snippet of ISPC:
varying float aff[nc], xh[nn], wij[nn], x[nc];
varying int ih[nn];
uniform int t_ = t & (nl - 1);
uniform float k;

for (int j = 0; j < nc; j++)
    aff[j] = 0.0f;
for (int j = 0; j < nn; j++)
    for (int i = 0; i < nc; i++)
        aff[i] += wij[j] * shuffle(xh[j], ih[j]);
for (int i = 0; i < nc; i++)
    x[i] = 0.1f * (x[i] + x[i]*x[i]*x[i]/3.0f + k*aff[i]);
for (uniform int i = 0; i < nn; i++)
    xh[i] = insert(rotate(xh[i], 1), 0, extract(x[i/nl], i & (nl - 1)));
Just being able to absorb SIMD lanes into the notion of varying vs. uniform makes this easy to write, not to mention cross-lane operations like shuffle or rotate. In Julia or Numba or CUDA I have to index into arrays, ensure a compatible data layout, etc. I imagine this could be done with more macrology in Julia, but again, why not use something that already works.
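For what it's worth, a rough sketch of the explicit-lane style that is possible in Julia today, assuming the SIMD.jl package (its `Vec` type and `shufflevector` are real; `rotate1` is a hypothetical helper of my own, not part of the package):

```julia
using SIMD  # registered package providing explicit SIMD vector types

# Hypothetical helper mimicking ISPC's rotate(x, 1): shift lanes left by one.
# shufflevector takes 0-based lane indices, mirroring LLVM's shufflevector.
rotate1(v::Vec{4,Float32}) = shufflevector(v, Val((1, 2, 3, 0)))

xh = Vec{4,Float32}((1f0, 2f0, 3f0, 4f0))
rotate1(xh)  # lanes should come out as (2.0, 3.0, 4.0, 1.0)
```

It's nowhere near ISPC's uniform/varying model, though: you pick the lane width and manage the data layout yourself, rather than the compiler absorbing them for you.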
> but again why not use something which already works
Sure, but there's always somebody crazy enough to try to implement it :p
From the fact that ISPC can generate C++ (with --emit-c++) I would think their compiler is conceptually very high-level. Since so few new concepts are introduced, I wouldn't be surprised if someone implemented the same DSL in Julia at some point, just to see whether they could get similar performance.
> ISPC can generate C++ (with --emit-c++) I would think their compiler is conceptually very high-level
It's a Clang-based frontend for LLVM, so yeah, no reason Julia couldn't make use of their work. I think the main thing to tap into is the uniform vs varying support, since this drives the whole thing.
I'm sorry, but as a regular Matlab user, I have to say that, no, Matlab does not have an "excellent JIT compiler." It has a JIT compiler that is really limited and works only for straight-line code with built-in functions.
For performance you virtually always need to write 'vectorized' code, which is often difficult and memory-heavy.
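To illustrate with a toy example (my own, in Matlab syntax, not taken from any real codebase):

```matlab
% Loop form: readable, but relies entirely on the JIT for speed.
y = zeros(size(x));
for i = 1:numel(x)
    y(i) = x(i)^2 + 3*x(i);
end

% Vectorized form: fast, but x.^2 and 3*x each allocate an intermediate array.
y = x.^2 + 3*x;
```

Even in this trivial case the vectorized form pays for its speed in temporaries; for anything with branches or dependencies between iterations, contorting it into this shape gets much worse.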
The daily churn of handling input parsing, along with contorting code into vectorized form, takes up an extraordinary amount of time when writing Matlab code.
Matlab is certainly not 'good enough' for me! It's just what I'm stuck with at work.
I have also seen Matlab code match Julia for simple loops. If you write your code properly vectorized, with loops that use only calls to 'built-in' operators and C functions, then it can match C/Julia.
But that's just extremely restrictive. The jit does not work on arbitrary user code.