Just to give an example of what I’m after, this is the same test run on a Xeon (nanasi, I promise this is my last interference with your thread!):

$ perf stat -e task-clock,cycles,instructions,branches,branch-misses -- ./a.out

 Performance counter stats for './a.out':

       2050.547118      task-clock (msec)         #    0.999 CPUs utilized
     6,362,850,183      cycles                    #    3.103 GHz
     4,809,362,376      instructions              #    0.76  insns per cycle
       801,642,789      branches                  #  390.941 M/sec
       199,930,425      branch-misses             #   24.94% of all branches

       2.052832966 seconds time elapsed
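
For anyone wondering where the numbers after the `#` come from: as far as I can tell they are plain ratios of the raw counters, so they can be reproduced by hand. Below is a minimal sketch of that arithmetic using the Xeon figures above. The variable names are mine, and the formulas are my reading of the output rather than anything taken from the perf sources.

```python
# Raw counts copied from the Xeon perf stat output above.
task_clock_ms = 2050.547118
cycles = 6_362_850_183
instructions = 4_809_362_376
branches = 801_642_789
branch_misses = 199_930_425

# Assumption: rates are taken relative to task-clock, not wall-clock time.
task_clock_s = task_clock_ms / 1000.0

ghz = cycles / task_clock_s / 1e9              # effective clock frequency
ipc = instructions / cycles                    # instructions per cycle
branch_rate_m = branches / task_clock_s / 1e6  # branches per second, in millions
miss_pct = 100.0 * branch_misses / branches    # share of branches mispredicted

print(f"{ghz:.3f} GHz")
print(f"{ipc:.2f} insns per cycle")
print(f"{branch_rate_m:.3f} M/sec")
print(f"{miss_pct:.2f}% of all branches")
```

The last ratio is the one that matters here: without a trustworthy total branch count there is no denominator, so the miss percentage cannot be computed no matter how good the miss count is.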

A branch-miss count of about 200M is clearly correct on both platforms, but we still need a reliable way to get the total branch count on the A72.

PS: Fun fact: notice how the ‘RISC’ A72 needs both fewer instructions and fewer cycles for this test than the ‘CISC’ Xeon? ; )

