Question for those smarter than me: What is an application for an int128 type anyways? I've never personally needed it, and I laughed at RISC-V for emphasizing that early on rather than... standardizing packed SIMD.
Cryptography would be one application. Many crypto libraries use an arbitrary-size `bigint` type, but the algorithms typically use modular arithmetic on fixed-width types (128-bit, 256-bit, 512-bit, or something in between like 384 bits).
They're typically implemented with arrays of 64-bit or 32-bit unsigned integers, but if 128-bit integers were available in hardware, we could get a performance boost. Any arbitrary-precision integer library would benefit from 128-bit hardware integers.
SIMD is for performing parallel operations on many smaller types. It can help with some cryptography, but it doesn't necessarily help when performing single arithmetic operations on larger types. It does, however, help when performing logic and shift operations on larger types.
If we were performing 128-bit arithmetic in parallel over many values, then a SIMD implementation may help, but without a SIMD equivalent of `addcarry`, there's a limit to how much it can help.
Something like this could potentially be added to AVX-512 for example by utilizing the `k` mask registers for the carries.
The best we have currently is `adcx` and `adox` which let us use two interleaved addcarry chains, where one utilizes the carry flag and the other utilizes the overflow flag, which improves ILP. These instructions are quite niche but are used in bigint libraries to improve performance.
> but it doesn't necessarily help when performing single arithmetic operations on larger types.
For the curious, AFAIU the problem is the dependency chains. For example, for simple bignum addition you can't just naively perform all the adds on each limb in parallel and then apply the carries in parallel; the addition of each limb depends on the carries from the previous limbs. Working around these issues with masking and other tricks typically ends up adding too many additional operations, resulting in lower throughput than non-SIMD approaches.
There are quite a few papers on using SIMD to accelerate bignum arithmetic for single operations, but they all seem quite complicated and heavily qualified. The threshold for eking out any gain is quite high, e.g. a minimum of 512-bit numbers or much larger, depending. And they tend to target complex or specialized operations (not straight addition, multiplication, etc.) where clever algebraic rearrangements can profitably reorder dependency chains for SIMD specifically.
You can find a lot of motivation for 128-bit integers in that paper, such as fixed-point operations, implementing 128-bit (decimal) floating-point, financial calculations, cryptography, etc. However, the proposal has been superseded by P3666, which aims to bring C23's _BitInt type to C++, which wouldn't just allow for 128-bit integers (as _BitInt(128)) but for any other width as well.
I implemented a rational number library for media timestamps (think CMTime, AVRational, etc.) that uses 64-bit numerators and denominators. It uses 128-bit integers for intermediate operations when adding, subtracting, multiplying, etc. It even uses 128-bit floats (represented as 2 doubles and using double-double arithmetic[1]) for some approximation operations, and even 192-bit integers in one spot (IIRC it's multiplying a 128-bit int by a 64-bit int where I just want the high bits, so it shifts back down to 128 bits immediately after the multiplication).
I keep meaning to see if work will let me open source it.
The last time I used one I wanted UNIX timestamps + fractional seconds. Since there was no difference between adding 1 bit or 64, I just gave it 32 bits for the fraction and 32 more bits for the integral part.
I made a time sync library over local network that had to be more precise than NTP and used i128 to make sure the i64 math I was doing couldn't overflow.
i32 didn't cover enough time span, and f64 has edge cases from the nature of floats. This was for Windows (MSVC, not GCC), so I had to roll my own i128.
We use them for exact predicates in our mesh booleans library. To really handle every degenerate case, we even have to go quite a bit higher than 128-bit in 3D.