A major difference between x86 and ARM32 is that while x86 generally1 only offers conditional execution of branch instructions (e.g. BNE) ARM32 offers conditional execution for many more instructions (e.g. ALU instructions, like ADD.EQ- add if not equal). The idea was that these can be used to avoid emitting traditional branching instruction sequences which may suffer from pipeline stalls when a branch is mispredicted — instead, straight line code with equivalent semantics can be emitted.
This was actually removed in ARM64. The official quote:
The A64 instruction set does not include the concept of predicated or conditional execution. Benchmarking shows that modern branch predictors work well enough that predicated execution of instructions does not offer sufficient benefit to justify its significant use of opcode space, and its implementation cost in advanced implementations.https://www.element14.com/community/servlet/JiveServlet/previewBody/41836-102-1-229511/ARM.Reference_Manual.pdf
Turns out that the whole pipeline stall problem is generally not a huge issue anymore as branch prediction has gotten so good, while support for the feature still requires allocating valuable instructions bits to encode the conditions. Note that ARM64 still uses 32 bit instructions, so conserving bits is still useful.
What is very interesting is that Intel’s recent APX extensions to x86-64 (whose purpose is to primarily add more general purpose registers) moves closer to this conditional instructions direction.
The performance features introduced so far will have limited impact in workloads that suffer from a large number of conditional branch mispredictions. As out-of-order CPUs continue to become deeper and wider, the cost of mispredictions increasingly dominates performance of such workloads. Branch predictor improvements can mitigate this to a limited extent only as data-dependent branches are fundamentally hard to predict.
To address this growing performance issue, we significantly expand the conditional instruction set of x86, which was first introduced with the Intel® Pentium® Pro in the form of CMOV/SET instructions. These instructions are used quite extensively by today’s compilers, but they are too limited for broader use of if-conversion (a compiler optimization that replaces branches with conditional instructions).”
Including support for conditional loads and stores which is apparently tricky with modern out of order and superscalar architectures.