Nice blog post, though I personally prefer the ridiculousfish post [0] he links to at the end; that one's an instant classic.
He mentions Windows/x86 a couple of times. I only wish it were as simple as "this platform does not reorder." Having done low-level, heavily-multithreaded work on Windows for years: it'll behave like a strongly-ordered architecture 999 times out of 1000 (or more). Then it'll bite you in the ass and do something unexpected. Basically, if you're writing your own synchronization primitives on x86, you pretty much have to rely on visual/theoretical verification, because tests won't error out w/ enough consistency. I've run a test (trying to get away w/ not using certain acquire/release semantics) for an entire week, only to have it error out at the last second (x86_64). Other times, I've shipped code that had been tested and vetted inside out for months, only to get the weirdest bug reports 3 or 4 months down the line in the most sporadic cases.
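To make "theoretical verification" a bit more concrete, here is a toy sketch (not a model of any real CPU) that enumerates every interleaving of the classic store-buffer litmus test. It assumes a simplified machine where each thread has a private FIFO store buffer that drains to memory at arbitrary times, which is roughly the textbook description of x86-style ordering. The names `outcomes`, `SB` and `SB_FENCED` are made up for this example.

```python
def outcomes(programs, use_store_buffers):
    """Return every final register tuple reachable under any interleaving.

    Each program is a list of (op, var, arg) tuples:
      ("store", var, value), ("load", var, register), ("fence", None, None).
    Registers are reported in sorted-name order.
    """
    n = len(programs)
    results = set()
    # All variables start at 0.
    initial = {var: 0 for prog in programs for (op, var, _) in prog if op != "fence"}

    def step(pcs, buffers, memory, regs):
        if all(pcs[t] == len(programs[t]) for t in range(n)):
            results.add(tuple(regs[r] for r in sorted(regs)))
            return
        for t in range(n):
            # Choice 1: thread t's oldest buffered store drains to memory.
            if buffers[t]:
                var, val = buffers[t][0]
                drained = buffers[:t] + (buffers[t][1:],) + buffers[t + 1:]
                step(pcs, drained, {**memory, var: val}, regs)
            # Choice 2: thread t executes its next instruction.
            if pcs[t] < len(programs[t]):
                op, var, arg = programs[t][pcs[t]]
                nxt = pcs[:t] + (pcs[t] + 1,) + pcs[t + 1:]
                if op == "store":
                    if use_store_buffers:
                        queued = buffers[:t] + (buffers[t] + ((var, arg),),) + buffers[t + 1:]
                        step(nxt, queued, memory, regs)
                    else:
                        step(nxt, buffers, {**memory, var: arg}, regs)
                elif op == "load":
                    # A thread sees its own newest buffered store first.
                    pending = [v for (name, v) in buffers[t] if name == var]
                    value = pending[-1] if pending else memory[var]
                    step(nxt, buffers, memory, {**regs, arg: value})
                elif op == "fence" and not buffers[t]:
                    # A full fence may only retire once the buffer is empty.
                    step(nxt, buffers, memory, regs)

    step((0,) * n, ((),) * n, initial, {})
    return results


# Thread 0: x = 1; r1 = y      Thread 1: y = 1; r2 = x
SB = [
    [("store", "x", 1), ("load", "y", "r1")],
    [("store", "y", 1), ("load", "x", "r2")],
]
# Same test with a full fence between the store and the load.
SB_FENCED = [
    [("store", "x", 1), ("fence", None, None), ("load", "y", "r1")],
    [("store", "y", 1), ("fence", None, None), ("load", "x", "r2")],
]

print(sorted(outcomes(SB, use_store_buffers=False)))        # [(0, 1), (1, 0), (1, 1)]
print(sorted(outcomes(SB, use_store_buffers=True)))         # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(sorted(outcomes(SB_FENCED, use_store_buffers=True)))  # [(0, 1), (1, 0), (1, 1)]
```

The extra `(0, 0)` result in the buffered run is the kind of outcome that a stress test may take a week to hit, while exhaustive enumeration of a small model finds it immediately. The fence here is a full barrier, not an acquire/release pair, so treat it as an illustration of the idea and not a recipe for any particular primitive.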
I work for Intel. This is not correct. A lot depends on your cache type. The two basic ones are uncacheable (UC) and write-back (WB).
What you wrote is true for UC. For WB, reads can happen in any order (especially due to cache prefetchers). Writes always happen in program order, unless you are using another cache type such as write-combining (WC). WC is mainly used for graphics memory-mapped pixmaps, where the order doesn't matter.
But don't let this scare you too much. From the viewpoint of a single CPU, everything is in order. It's only when you look at it from the memory bus point-of-view that things get confusing.
Unless we are dealing with a memory-mapped IO device that has read side-effects, in which case you need to carefully choose a cache type.
I don't think you can from a user-mode program. If you are dealing with normal memory, it will typically be WB, and that is the only thing most user-mode programs will encounter.
0: http://ridiculousfish.com/blog/posts/barrier.html