kernel 4.4.52-armada-17.06.2

    #376
    nanasi
    Participant

    The latest kernel 4.4.52-armada-17.06.2-gcaa3a4f crashes during boot.

    [ 2.881785] Unable to handle kernel NULL pointer dereference at virtual address 00000000
    [ 2.889915] pgd = ffffffc000cdf000
    [ 2.893332] [00000000] *pgd=00000000e3006003, *pud=00000000e3006003, *pmd=00000000e3007003, *pte=00e80000f0210707
    [ 2.903692] Internal error: Oops: 96000005 [#1] PREEMPT SMP

    The crash can be avoided with this patch:

    diff --git a/arch/arm64/boot/dts/marvell/armada-cp110-1.dtsi b/arch/arm64/boot/dts/marvell/armada-cp110-1.dtsi
    index 2d5374e5..0f3402b 100644
    --- a/arch/arm64/boot/dts/marvell/armada-cp110-1.dtsi
    +++ b/arch/arm64/boot/dts/marvell/armada-cp110-1.dtsi
    @@ -35,7 +35,7 @@ cps_syscon0: system-controller@440000 {
            #clock-cells = <2>;
            core-clock-output-names =
                    "cps-apll", "cps-ppv2-core", "cps-eip",
    -               "cps-core", "cps-nand-core";
    +               "cps-core", "cps-nand-core", "cps-emmc";
            gate-clock-output-names =
                    "cps-audio", "cps-communit", "cps-nand",
                    "cps-ppv2", "cps-sdio", "cps-mg-domain",

    At 2 GHz:

    [ 0.220092] xor: using function: 8regs (7114.000 MB/sec)
    [ 1.360607] raid6: using algorithm neonx4 gen() 5284 MB/s
    [ 1.360610] raid6: .... xor() 3296 MB/s, rmw enabled

    But where is the firmware for the EIP197?

    #377
    nanasi
    Participant

    
    cpufrequtils 008: cpufreq-info (C) Dominik Brodowski 2004-2009
    Report errors and bugs to cpufreq@vger.kernel.org, please.
    analyzing CPU 0:
      driver: cpufreq-dt
      CPUs which run at the same hardware frequency: 0 1
      CPUs which need to have their frequency coordinated by software: 0 1
      maximum transition latency: 50.0 us.
      hardware limits: 100.0 MHz - 2.00 GHz
      available frequency steps: 100.0 MHz, 667 MHz, 1000 MHz, 2.00 GHz
      available cpufreq governors: ondemand, userspace, performance
      current policy: frequency should be within 100.0 MHz and 2.00 GHz.
                      The governor "ondemand" may decide which speed to use
                      within this range.
      current CPU frequency is 100.0 MHz (asserted by call to hardware).
      cpufreq stats: 100.0 MHz:95.69%, 667 MHz:0.13%, 1000 MHz:0.01%, 2.00 GHz:4.17%  (153)

    tinymembench v0.4.9 (simple benchmark for memory throughput and latency)
    
    ==========================================================================
    == Memory bandwidth tests                                               ==
    ==                                                                      ==
    == Note 1: 1MB = 1000000 bytes                                          ==
    == Note 2: Results for 'copy' tests show how many bytes can be          ==
    ==         copied per second (adding together read and writen           ==
    ==         bytes would have provided twice higher numbers)              ==
    == Note 3: 2-pass copy means that we are using a small temporary buffer ==
    ==         to first fetch data into it, and only then write it to the   ==
    ==         destination (source -> L1 cache, L1 cache -> destination)    ==
    == Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
    ==         brackets                                                     ==
    ==========================================================================
    
     C copy backwards                                     :   4886.5 MB/s (13.4%)
     C copy backwards (32 byte blocks)                    :   4892.1 MB/s
     C copy backwards (64 byte blocks)                    :   4893.6 MB/s
     C copy                                               :   5151.9 MB/s (0.8%)
     C copy prefetched (32 bytes step)                    :   5143.6 MB/s
     C copy prefetched (64 bytes step)                    :   5143.1 MB/s
     C 2-pass copy                                        :   5236.0 MB/s
     C 2-pass copy prefetched (32 bytes step)             :   5133.2 MB/s
     C 2-pass copy prefetched (64 bytes step)             :   5131.1 MB/s
     C fill                                               :  15085.1 MB/s (0.1%)
     C fill (shuffle within 16 byte blocks)               :  15082.2 MB/s
     C fill (shuffle within 32 byte blocks)               :  15081.9 MB/s
     C fill (shuffle within 64 byte blocks)               :  14915.9 MB/s
     ---
     standard memcpy                                      :   5143.4 MB/s
     standard memset                                      :  15079.6 MB/s (0.1%)
     ---
     NEON LDP/STP copy                                    :   5160.0 MB/s
     NEON LDP/STP copy pldl2strm (32 bytes step)          :   5022.8 MB/s
     NEON LDP/STP copy pldl2strm (64 bytes step)          :   5030.1 MB/s
     NEON LDP/STP copy pldl1keep (32 bytes step)          :   5180.7 MB/s
     NEON LDP/STP copy pldl1keep (64 bytes step)          :   5170.7 MB/s
     NEON LD1/ST1 copy                                    :   5147.3 MB/s
     NEON STP fill                                        :  15085.2 MB/s (0.1%)
     NEON STNP fill                                       :  15063.4 MB/s
     ARM LDP/STP copy                                     :   5160.2 MB/s
     ARM STP fill                                         :  15078.5 MB/s (0.1%)
     ARM STNP fill                                        :  15059.0 MB/s
    
    ==========================================================================
    == Memory latency test                                                  ==
    ==                                                                      ==
    == Average time is measured for random memory accesses in the buffers   ==
    == of different sizes. The larger is the buffer, the more significant   ==
    == are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
    == accesses. For extremely large buffer sizes we are expecting to see   ==
    == page table walk with several requests to SDRAM for almost every      ==
    == memory access (though 64MiB is not nearly large enough to experience ==
    == this effect to its fullest).                                         ==
    ==                                                                      ==
    == Note 1: All the numbers are representing extra time, which needs to  ==
    ==         be added to L1 cache latency. The cycle timings for L1 cache ==
    ==         latency can be usually found in the processor documentation. ==
    == Note 2: Dual random read means that we are simultaneously performing ==
    ==         two independent memory accesses at a time. In the case if    ==
    ==         the memory subsystem can't handle multiple outstanding       ==
    ==         requests, dual random read has the same timings as two       ==
    ==         single reads performed one after another.                    ==
    ==========================================================================
    
    block size : single random read / dual random read, [MADV_NOHUGEPAGE]
          1024 :    0.0 ns          /     0.0 ns
          2048 :    0.0 ns          /     0.0 ns
          4096 :    0.0 ns          /     0.0 ns
          8192 :    0.0 ns          /     0.0 ns
         16384 :    0.0 ns          /     0.0 ns
         32768 :    0.0 ns          /     0.1 ns
         65536 :    3.0 ns          /     4.7 ns
        131072 :    4.6 ns          /     6.4 ns
        262144 :    7.0 ns          /     9.2 ns
        524288 :   10.0 ns          /    14.1 ns
       1048576 :   24.7 ns          /    34.5 ns
       2097152 :   64.5 ns          /    95.9 ns
       4194304 :  102.4 ns          /   140.2 ns
       8388608 :  126.6 ns          /   160.1 ns
      16777216 :  139.8 ns          /   174.2 ns
      33554432 :  147.0 ns          /   179.3 ns
      67108864 :  157.9 ns          /   193.8 ns
    
    block size : single random read / dual random read, [MADV_HUGEPAGE]
          1024 :    0.0 ns          /     0.0 ns
          2048 :    0.0 ns          /     0.0 ns
          4096 :    0.0 ns          /     0.0 ns
          8192 :    0.0 ns          /     0.0 ns
         16384 :    0.0 ns          /     0.0 ns
         32768 :    0.0 ns          /     0.1 ns
         65536 :    3.0 ns          /     4.7 ns
        131072 :    4.5 ns          /     6.3 ns
        262144 :    5.3 ns          /     6.9 ns
        524288 :    7.4 ns          /    10.3 ns
       1048576 :   21.9 ns          /    30.0 ns
       2097152 :   61.6 ns          /    92.0 ns
       4194304 :   99.4 ns          /   136.0 ns
       8388608 :  119.0 ns          /   152.2 ns
      16777216 :  128.3 ns          /   156.2 ns
      33554432 :  133.4 ns          /   159.8 ns
      67108864 :  137.4 ns          /   163.4 ns

    #378
    nanasi
    Participant

    $ openssl speed md5 sha1 sha256 sha512 des des-ede3 aes-128-cbc aes-192-cbc aes-256-cbc rsa2048 dsa2048 
    Doing md5 for 3s on 16 size blocks: 3134725 md5's in 3.00s
    Doing md5 for 3s on 64 size blocks: 3079787 md5's in 3.00s
    Doing md5 for 3s on 256 size blocks: 1963728 md5's in 3.00s
    Doing md5 for 3s on 1024 size blocks: 800008 md5's in 3.00s
    Doing md5 for 3s on 8192 size blocks: 122344 md5's in 3.00s
    Doing sha1 for 3s on 16 size blocks: 3509594 sha1's in 3.00s
    Doing sha1 for 3s on 64 size blocks: 2519881 sha1's in 3.00s
    Doing sha1 for 3s on 256 size blocks: 1376998 sha1's in 3.00s
    Doing sha1 for 3s on 1024 size blocks: 489943 sha1's in 3.00s
    Doing sha1 for 3s on 8192 size blocks: 69806 sha1's in 3.00s
    Doing sha256 for 3s on 16 size blocks: 3451309 sha256's in 3.00s
    Doing sha256 for 3s on 64 size blocks: 2126545 sha256's in 3.00s
    Doing sha256 for 3s on 256 size blocks: 1004309 sha256's in 3.00s
    Doing sha256 for 3s on 1024 size blocks: 320732 sha256's in 3.00s
    Doing sha256 for 3s on 8192 size blocks: 43691 sha256's in 3.00s
    Doing sha512 for 3s on 16 size blocks: 1779557 sha512's in 3.00s
    Doing sha512 for 3s on 64 size blocks: 1782716 sha512's in 3.00s
    Doing sha512 for 3s on 256 size blocks: 738074 sha512's in 3.00s
    Doing sha512 for 3s on 1024 size blocks: 268421 sha512's in 3.00s
    Doing sha512 for 3s on 8192 size blocks: 38734 sha512's in 3.00s
    Doing des cbc for 3s on 16 size blocks: 8007646 des cbc's in 3.00s
    Doing des cbc for 3s on 64 size blocks: 2231277 des cbc's in 3.00s
    Doing des cbc for 3s on 256 size blocks: 573989 des cbc's in 3.00s
    Doing des cbc for 3s on 1024 size blocks: 144630 des cbc's in 3.00s
    Doing des cbc for 3s on 8192 size blocks: 18088 des cbc's in 3.00s
    Doing des ede3 for 3s on 16 size blocks: 3336045 des ede3's in 2.99s
    Doing des ede3 for 3s on 64 size blocks: 861707 des ede3's in 3.00s
    Doing des ede3 for 3s on 256 size blocks: 217172 des ede3's in 3.00s
    Doing des ede3 for 3s on 1024 size blocks: 54420 des ede3's in 3.00s
    Doing des ede3 for 3s on 8192 size blocks: 6803 des ede3's in 3.00s
    Doing aes-128 cbc for 3s on 16 size blocks: 18784805 aes-128 cbc's in 3.00s
    Doing aes-128 cbc for 3s on 64 size blocks: 5063271 aes-128 cbc's in 3.00s
    Doing aes-128 cbc for 3s on 256 size blocks: 1296545 aes-128 cbc's in 3.00s
    Doing aes-128 cbc for 3s on 1024 size blocks: 337906 aes-128 cbc's in 3.00s
    Doing aes-128 cbc for 3s on 8192 size blocks: 42797 aes-128 cbc's in 3.00s
    Doing aes-192 cbc for 3s on 16 size blocks: 17173998 aes-192 cbc's in 3.00s
    Doing aes-192 cbc for 3s on 64 size blocks: 4600986 aes-192 cbc's in 3.00s
    Doing aes-192 cbc for 3s on 256 size blocks: 1127272 aes-192 cbc's in 3.00s
    Doing aes-192 cbc for 3s on 1024 size blocks: 291951 aes-192 cbc's in 3.00s
    Doing aes-192 cbc for 3s on 8192 size blocks: 36872 aes-192 cbc's in 3.00s
    Doing aes-256 cbc for 3s on 16 size blocks: 15177650 aes-256 cbc's in 3.00s
    Doing aes-256 cbc for 3s on 64 size blocks: 4031749 aes-256 cbc's in 3.00s
    Doing aes-256 cbc for 3s on 256 size blocks: 1021850 aes-256 cbc's in 3.00s
    Doing aes-256 cbc for 3s on 1024 size blocks: 257192 aes-256 cbc's in 3.00s
    Doing aes-256 cbc for 3s on 8192 size blocks: 32237 aes-256 cbc's in 3.00s
    Doing 2048 bit private rsa's for 10s: 997 2048 bit private RSA's in 10.00s
    Doing 2048 bit public rsa's for 10s: 40023 2048 bit public RSA's in 10.00s
    Doing 2048 bit sign dsa's for 10s: 3515 2048 bit DSA signs in 10.00s
    Doing 2048 bit verify dsa's for 10s: 3317 2048 bit DSA verify in 10.00s
    OpenSSL 1.0.1t  3 May 2016
    built on: Fri Jan 27 00:08:40 2017
    options:bn(64,64) rc4(ptr,char) des(idx,cisc,16,int) aes(partial) blowfish(ptr)
    compiler: gcc -I. -I.. -I../include  -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -DTERMIO -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -Wl,-z,relro -Wa,--noexecstack -Wall
    The 'numbers' are in 1000s of bytes per second processed.
    type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    md5              16718.53k    65702.12k   167571.46k   273069.40k   334080.68k
    sha1             18717.83k    53757.46k   117503.83k   167233.88k   190616.92k
    des cbc          42707.45k    47600.58k    48980.39k    49367.04k    49392.30k
    des ede3         17851.75k    18383.08k    18532.01k    18575.36k    18576.73k
    aes-128 cbc     100185.63k   108016.45k   110638.51k   115338.58k   116864.34k
    aes-192 cbc      91594.66k    98154.37k    96193.88k    99652.61k   100685.14k
    aes-256 cbc      80947.47k    86010.65k    87197.87k    87788.20k    88028.50k
    sha256           18406.98k    45366.29k    85701.03k   109476.52k   119305.56k
    sha512            9490.97k    38031.27k    62982.31k    91621.03k   105769.64k
                      sign    verify    sign/s verify/s
    rsa 2048 bits 0.010030s 0.000250s     99.7   4002.3
                      sign    verify    sign/s verify/s
    dsa 2048 bits 0.002845s 0.003015s    351.5    331.7
    
    (hash/cipher columns: 1024-byte-block throughput in bytes/s; RSA/DSA columns: operations/s)

    |OpenSSL Version|MD5|SHA-1|SHA-256|SHA-512|DES|3DES|AES-128|AES-192|AES-256|RSA Sign|RSA Verify|DSA Sign|DSA Verify|
    | 1.0.1t | 273069400 | 167233880 | 109476520 | 91621030 | 49367040 | 18575360 | 115338580 | 99652610 | 87788200 | 99.7 | 4002.3 | 351.5 | 331.7 |
    #379
    nanasi
    Participant

    After 42 hours idle,
    a plug-in electricity meter indicated an increase of 0.70 kWh.
    That is about 17 watts.
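
    As a sanity check on that figure:

        0.70 kWh / 42 h = 700 Wh / 42 h ≈ 16.7 W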

    $ cpufreq-info  | egrep stats
      cpufreq stats: 100.0 MHz:98.95%, 667 MHz:0.02%, 1000 MHz:0.00%, 2.00 GHz:1.03%  (627)
      cpufreq stats: 100.0 MHz:98.95%, 667 MHz:0.02%, 1000 MHz:0.00%, 2.00 GHz:1.03%  (627)
      cpufreq stats: 100.0 MHz:99.90%, 667 MHz:0.01%, 1000 MHz:0.01%, 2.00 GHz:0.08%  (82)
      cpufreq stats: 100.0 MHz:99.90%, 667 MHz:0.01%, 1000 MHz:0.01%, 2.00 GHz:0.08%  (82)

    Used interfaces: uSDHC, 1GbE copper, console, DC Jack

    #399
    nanasi
    Participant

    Repeated quadruple tinymembench runs consumed 0.60 kWh in 26 hours,
    i.e. about 23 watts (0.60 kWh / 26 h ≈ 23.1 W).

    #401
    blu
    Participant

    Nice catch of the crash patch, nanasi!

    I haven’t upgraded to the new kernel yet, so I have a request: could you please check whether this new kernel has a certain performance event working via perf? (perf itself can be built from tools/perf in the kernel tree.)

    It appears that under 4.4.8-armada-17.02.2 the hw branch-instructions event is not working:

    $ perf list
    
    List of pre-defined events (to be used in -e):
    
      branch-misses                                      [Hardware event]
      bus-cycles                                         [Hardware event]
      cache-misses                                       [Hardware event]
      cache-references                                   [Hardware event]
      cpu-cycles OR cycles                               [Hardware event]
      instructions                                       [Hardware event]
    
    ...
      LLC-load-misses                                    [Hardware cache event]
      LLC-loads                                          [Hardware cache event]
      LLC-store-misses                                   [Hardware cache event]
      LLC-stores                                         [Hardware cache event]
    
      armv8_cortex_a72/br_immed_retired/                 [Kernel PMU event]
      armv8_cortex_a72/br_mis_pred/                      [Kernel PMU event]
      armv8_cortex_a72/br_mis_pred_retired/              [Kernel PMU event]
      armv8_cortex_a72/br_pred/                          [Kernel PMU event]
      armv8_cortex_a72/br_retired/                       [Kernel PMU event]
    ...
    

    Notice how a hw event for branches is missing? It should be the same event as the last in the list above – armv8_cortex_a72/br_retired/ – and that one invariably returns 0 on 4.4.8-armada-17.02.2, which means branch misprediction rates cannot be measured.

    You can try the event via something like:

    $ perf stat -e task-clock,cycles,instructions,branch-misses,armv8_cortex_a72/inst_retired/,armv8_cortex_a72/br_retired/,armv8_cortex_a72/br_mis_pred/ -- sleep 0
    
     Performance counter stats for 'sleep 0':
    
              0.772920      task-clock (msec)         #    0.622 CPUs utilized          
               990,549      cycles                    #    1.282 GHz                    
               558,333      instructions              #    0.56  insns per cycle        
                 7,793      branch-misses             #   10.083 M/sec                  
               558,333      armv8_cortex_a72/inst_retired/ #  722.368 M/sec                  
                     0      armv8_cortex_a72/br_retired/ #    0.000 K/sec                  
                 7,793      armv8_cortex_a72/br_mis_pred/ #   10.083 M/sec                  
    
           0.001242367 seconds time elapsed
    
    #403
    nanasi
    Participant

    I think the BR_RETIRED event is not implemented.
    Register PMCEID1_EL0 is always zero.
    (See the AArch64 Performance Monitors registers documentation.)

    $ perf stat -e task-clock,cycles,instructions,branch-misses,armv8_cortex_a72/inst_retired/,armv8_cortex_a72/br_retired/,armv8_cortex_a72/br_mis_pred/ -- sleep 0
    
     Performance counter stats for 'sleep 0':
    
              7.109840      task-clock (msec)         #    0.652 CPUs utilized
                699200      cycles                    #    0.098 GHz
                491727      instructions              #    0.70  insns per cycle
                  8059      branch-misses             #    1.133 M/sec
                491727      armv8_cortex_a72/inst_retired/ #   69.161 M/sec
                     0      armv8_cortex_a72/br_retired/ #    0.000 K/sec
                  8059      armv8_cortex_a72/br_mis_pred/ #    1.133 M/sec
    
           0.010912797 seconds time elapsed

    #407
    blu
    Participant

    So it’s still not implemented in the latest kernel. Thank you, nanasi!

    #408
    nanasi
    Participant

    Unfortunately, the Cortex-A72 does not support some events,
    so the kernel cannot report them.
    Newer kernels (>= 4.7) check whether an event is available before exposing it; a rough sketch of such a check is below.
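
    Something like this (a hand-written illustration, not the actual kernel code; PMCEID0/1_EL0 are only readable from kernel context):

        /* hypothetical sketch: probe the PMCEID0/1_EL0 bitmaps to see
         * whether a PMU event number is implemented; BR_RETIRED is
         * event 0x21, i.e. bit 1 of PMCEID1_EL0 */
        static inline int pmu_event_supported(unsigned int ev)
        {
                unsigned long lo, hi;
                asm volatile("mrs %0, pmceid0_el0" : "=r"(lo));
                asm volatile("mrs %0, pmceid1_el0" : "=r"(hi));
                return ev < 32 ? (lo >> ev) & 1 : (hi >> (ev - 32)) & 1;
        }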

    Is it enough to use BR_PRED & BR_MIS_PRED?

    #411
    blu
    Participant

    Using BR_PRED and BR_MIS_PRED is what occurred to me when I first discovered BR_RETIRED did not work, but it’s not so straightforward.

    Consider the following simple example, which reads a buffer of 1M random bytes (obtained in advance from /dev/urandom) and re-interprets those as 8M random bits, for the sake of a dummy loop:

    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>
    
    int main(int, char**) {
        const size_t len = 1 << 20;
        char *buf = (char*) malloc(len);
    
        // 1MB of random bytes, prepared in advance from /dev/urandom
        FILE *f = fopen("rand", "rb");
        if (!f) return 1;
        fread(buf, len, 1, f);
        fclose(f);
    
        const uint64_t reps = uint64_t(1e8) * 4; // 400M iterations
        const uint64_t bpp = sizeof(*buf) * 8;   // bits per byte
    
        for (uint64_t i = 0; i < reps; ++i) {
            // test one random bit per iteration; the empty asm keeps the
            // compiler from optimising the taken path away
            if (buf[i / bpp % len] & (1 << i % bpp)) {
                asm volatile ("" : : : "memory");
                continue;
            }
        }
    
        free(buf);
        return 0;
    }
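
    For completeness, one way to build the test and prepare its input file (assuming the source is saved as test.cpp; the compiler flags here are illustrative – the original was built with clang++-3.5.2):

        $ dd if=/dev/urandom of=rand bs=1M count=1
        $ clang++ -O2 test.cpp -o a.out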
    

    The loop in that example compiles (via clang++-3.5.2) to:

    
      4007c0:       d343590a        ubfx    x10, x8, #3, #20
      4007c4:       386a6a6a        ldrb    w10, [x19,x10]
      4007c8:       1200090b        and     w11, w8, #0x7
      4007cc:       1acb22ab        lsl     w11, w21, w11
      4007d0:       0a0b014a        and     w10, w10, w11
      4007d4:       3400002a        cbz     w10, 4007d8 <main+0x78>
      4007d8:       91000508        add     x8, x8, #0x1
      4007dc:       eb09011f        cmp     x8, x9
      4007e0:       54ffff01        b.ne    4007c0 <main+0x60>
    

    As you can see, the loop contains two conditional branches – at 4007d4 and at 4007e0. So 400M loop iterations times 2 conditional branches per iteration should result in 800M conditional branches. But running the example through perf produces:

    $ perf stat -e task-clock,cycles,instructions,armv8_cortex_a72/inst_retired/,armv8_cortex_a72/br_mis_pred/,armv8_cortex_a72/br_pred/ -- ./a.out 
     Performance counter stats for './a.out':
    
           4515.072280      task-clock (msec)         #    1.000 CPUs utilized          
         5,869,565,057      cycles                    #    1.300 GHz                    
         3,608,087,165      instructions              #    0.61  insns per cycle        
         3,608,087,165      armv8_cortex_a72/inst_retired/ #  799.121 M/sec                  
           200,045,882      armv8_cortex_a72/br_mis_pred/ #   44.306 M/sec                  
           963,788,373      armv8_cortex_a72/br_pred/ #  213.460 M/sec                  
    
           4.515805921 seconds time elapsed
    

    Notice the discrepancy? 960M + 200M = 1,160M, which is way more than the expected 800M. Actually, via more experiments with the loop count, I’ve come to the realisation that the correct expression would be BR_PRED - BR_MIS_PRED, which would produce a slight underestimation (in our case 760M vs 800M). Curiously enough, that underestimation happens only when there are mispredicted branches – if we eliminated the random-predicate branch from within the loop body, we’d get a _very_ accurate result for the predicted branches, namely 400M!
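
    Plugging the actual counts from the run above into that expression:

        963,788,373 - 200,045,882 = 763,742,491, i.e. roughly 760M, versus the 800M branches the loop actually executes.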

    So that method of combining BR_PRED & BR_MIS_PRED is better than nothing, but a more precise counter should still be available somewhere in the system, IMO.

    #412
    blu
    Participant

    Just to give an example of what I’m after, this is the same test run on a Xeon (nanasi, I promise this is my last interference with your thread!):

    $ perf stat -e task-clock,cycles,instructions,branches,branch-misses -- ./a.out
    
     Performance counter stats for './a.out':
    
           2050.547118      task-clock (msec)         #    0.999 CPUs utilized
         6,362,850,183      cycles                    #    3.103 GHz
         4,809,362,376      instructions              #    0.76  insns per cycle
           801,642,789      branches                  #  390.941 M/sec
           199,930,425      branch-misses             #   24.94% of all branches
    
           2.052832966 seconds time elapsed
    

    The 200M of branch misses is clearly correct on both platforms, but we need a reliable way to get the total branch count on the A72.

    ps: Fun fact: notice how the ‘RISC’ A72 needs both fewer instructions and fewer cycles for this test than the ‘CISC’ Xeon? ; )

    #413
    nanasi
    Participant

    ARMv8’s BR_PRED & BR_MIS_PRED count “speculatively executed” operations,
    not “architecturally executed” instructions.

    I have no information on whether a PC-sampling profiler is practicable in that case.

    #414
    blu
    Participant

    Well, on a principle level, the performance-counted events should provide a reliable means to get the total count of predicted branches – that is a fundamental metric. Otherwise essential profiling metrics such as the ratio mispredicted_branches / total_predicted_branches become non-computable.

    I have not given up on the idea that BR_PRED & BR_MIS_PRED could be used as the principal means to get that branch prediction rate, but I am still missing some of the factors needed for a reliable computation. In this simple test the BR_MIS_PRED / (BR_PRED - BR_MIS_PRED) formula yields 200M / (960M - 200M) ≈ 0.263 against the correct misprediction rate of 0.25 – an error of about 5%. That is better than taking the unmodified count from BR_PRED, i.e. BR_MIS_PRED / BR_PRED = 200M / 960M ≈ 0.208 – an error of about 17%. But even 5% is too much for effective performance tuning, where 5% could be the entire gain from a given optimisation.

    #440
    blu
    Participant

    Just an update on branch-prediction PMU counters in kernel 4.4.52-armada-17.06.2-gcaa3a4f + U-Boot 2017.03-armada-17.06.3-ga33ecb8:

    $ perf stat -e task-clock,cycles,instructions,branch-misses,armv8_cortex_a72/inst_retired/,armv8_cortex_a72/br_pred/,armv8_cortex_a72/br_mis_pred/ -- ./a.out 
    
     Performance counter stats for './a.out':
    
           4500.194400      task-clock (msec)         #    1.000 CPUs utilized          
         5,850,215,516      cycles                    #    1.300 GHz                    
         3,205,528,984      instructions              #    0.55  insns per cycle        
           200,035,171      branch-misses             #   44.450 M/sec                  
         3,205,528,984      armv8_cortex_a72/inst_retired/ #  712.309 M/sec                  
         1,001,588,793      armv8_cortex_a72/br_pred/ #  222.566 M/sec                  
           200,035,171      armv8_cortex_a72/br_mis_pred/ #   44.450 M/sec                  
    
           4.501076183 seconds time elapsed
    

    So by the earlier formula:

    armv8_cortex_a72/br_pred/ - armv8_cortex_a72/br_mis_pred/ = 800M

    armv8_cortex_a72/br_mis_pred/ / (armv8_cortex_a72/br_pred/ - armv8_cortex_a72/br_mis_pred/) = 200M / 800M = 0.25
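
    Checking that with the raw counts from this run:

        1,001,588,793 - 200,035,171 = 801,553,622 ≈ 800M
        200,035,171 / 801,553,622 ≈ 0.2496 ≈ 0.25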

    Which is the expected result. I’m happy : )

    #441
    blu
    Participant

    An update to the update: I spoke too soon : /

    It appears there’s an issue with the armv8_cortex_a72/br_pred/ event, depending on what actual branch instructions were used.

    This loop produces an underestimate of armv8_cortex_a72/br_pred/ (960M events):

    
      4007d0:       d343590a        ubfx    x10, x8, #3, #20
      4007d4:       386a6a6a        ldrb    w10, [x19,x10]
      4007d8:       1200090b        and     w11, w8, #0x7
      4007dc:       1acb22ab        lsl     w11, w21, w11
      4007e0:       0a0b014a        and     w10, w10, w11
      4007e4:       3400002a        cbz     w10, 4007e8 <main+0x78>
      4007e8:       91000508        add     x8, x8, #0x1
      4007ec:       eb09011f        cmp     x8, x9
      4007f0:       54ffff01        b.ne    4007d0 <main+0x60>
    

    But this loop does not (it registers 1000M events):

    
      400628:       d3435802        ubfx    x2, x0, #3, #20
      40062c:       12000801        and     w1, w0, #0x7
      400630:       38626a62        ldrb    w2, [x19,x2]
      400634:       1ac12841        asr     w1, w2, w1
      400638:       36000021        tbz     w1, #0, 40063c <main+0x6c>
      40063c:       91000400        add     x0, x0, #0x1
      400640:       eb03001f        cmp     x0, x3
      400644:       54ffff21        b.ne    400628 <main+0x58>
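
    A guess at the source-level difference (an assumption on my part – the shift-then-test form below matches the asr + tbz sequence of the second listing, while the original mask-and-test form produces the lsl/and + cbz sequence of the first):

        // first loop (cbz form): AND with a shifted bit, branch if zero
        if (buf[i / bpp % len] & (1 << i % bpp)) { /* ... */ }

        // second loop (tbz form): shift the byte down, test bit 0
        if ((buf[i / bpp % len] >> (i % bpp)) & 1) { /* ... */ }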
    
