Coherency problem with CCI setup?

    #428
    blu
    Participant

    Hi,

    I came across an ‘Unhandled level 2 translation fault’ while running clang++-3.5 on some trivial code (has not happened with any other compiler so far, clang or gcc). Here’s the relevant dmesg portion:

    
    [  355.152552] clang[2319]: unhandled level 2 translation fault (11) at 0x00000020, esr 0x92000006
    [  355.152568] pgd = ffffffc0e17f6000
    [  355.156060] [00000020] *pgd=00000000e03aa003, *pud=00000000e03aa003, *pmd=0000000000000000
    
    [  355.164496] CPU: 2 PID: 2319 Comm: clang Not tainted 4.4.8-armada-17.02.2-g4126e30 #2
    [  355.164507] Hardware name: Marvell 8040 MACHIATOBin (DT)
    [  355.164519] task: ffffffc0e63fd800 ti: ffffffc0e0354000 task.ti: ffffffc0e0354000
    [  355.164531] PC is at 0x7f8dd72d64
    [  355.164540] LR is at 0x7f8e4f0dec
    [  355.164551] pc : [<0000007f8dd72d64>] lr : [<0000007f8e4f0dec>] pstate: 00000000
    [  355.164559] sp : 0000007febe80150
    [  355.164568] x29: 0000007febe82900 x28: 0000000000000000 
    [  355.164585] x27: 000000001bfdecf0 x26: 00000000000000cf 
    [  355.164600] x25: 0000000000000000 x24: 0000007febe802a8 
    [  355.164615] x23: 0000000000000000 x22: 0000007f8ed6b000 
    [  355.164628] x21: 0000007f8ed6b000 x20: 0000007febe802a8 
    [  355.164642] x19: 000000001bfded20 x18: 0000007f8e8fb35e 
    [  355.164655] x17: 0000007f8d45afb0 x16: 0000007f8ed6c3c8 
    [  355.164668] x15: 0000000000000000 x14: 000000001bfc2730 
    [  355.164682] x13: 000000001bfdff04 x12: 000000001bfc27dc 
    [  355.164695] x11: 0000007f8e4f0c18 x10: 0000007f8ed61918 
    [  355.164709] x9 : 000000001bfc27d0 x8 : 000000001bfd3f90 
    [  355.164722] x7 : 0000000000000000 x6 : 0000000000000000 
    [  355.164735] x5 : 0000000000000002 x4 : 0000000000000001 
    [  355.164748] x3 : 0000007f8edb1cb8 x2 : 0000007febe802a8 
    [  355.164761] x1 : 0000000000000000 x0 : 0000000000000000
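For reference, the esr value in the fault line can be decoded by hand. The sketch below is not from the thread. It assumes the ARMv8-A ESR_ELx layout (EC in bits 31:26, IL in bit 25, and, for data aborts, WnR in bit 6 and DFSC in bits 5:0); the name `decode_esr` is made up for illustration.

```python
# Sketch: decode an AArch64 ESR value reported for a data abort.
# The field layout is assumed from the ARMv8-A architecture (ESR_ELx).

EC_NAMES = {
    0x24: "data abort from a lower exception level",
    0x25: "data abort taken without a change in exception level",
}

# Upper four bits of DFSC select the fault kind; the low two bits give the level.
FAULT_KINDS = {
    0b0001: "translation fault",
    0b0010: "access flag fault",
    0b0011: "permission fault",
}


def decode_esr(esr):
    """Split an ESR value into the fields relevant to a data abort."""
    ec = (esr >> 26) & 0x3F      # exception class
    il = (esr >> 25) & 0x1       # 1 = 32-bit instruction
    wnr = (esr >> 6) & 0x1       # 1 = write, 0 = read
    dfsc = esr & 0x3F            # data fault status code
    kind = FAULT_KINDS.get(dfsc >> 2)
    if kind is not None:
        status = f"{kind}, level {dfsc & 0x3}"
    else:
        status = f"other (DFSC {dfsc:#04x})"
    return {
        "ec": ec,
        "ec_name": EC_NAMES.get(ec, "not a data abort"),
        "il": il,
        "access": "write" if wnr else "read",
        "dfsc": dfsc,
        "status": status,
    }


if __name__ == "__main__":
    for field, value in decode_esr(0x92000006).items():
        print(f"{field}: {value}")
```

For 0x92000006 this gives EC 0x24 (a data abort from user space), a read access, and DFSC 0x06 (translation fault, level 2), which agrees with the kernel's message and with the zero *pmd entry. A fault address of 0x20 with x0 and x1 both zero would be consistent with a load through a null pointer at a small offset, though the log alone cannot confirm that.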
    

    And here’s a discussion that seems related: https://patchwork.kernel.org/patch/8120651/
    The most elaborate quote from there says:

    It looks like the TLB invalidation messages may not get across the CCI
    between clusters. I don’t have the TRMs at hand but make sure all the
    relevant bits in the CPUs and CCI are enabled.

    So it appears to be a coherency problem. One across the clusters, perhaps?

    Has anybody else stepped on this?

    #429
    blu
    Participant

    Forgot to mention that it happens deterministically – I can reproduce that fault at will.

    #438
    blu
    Participant

    Update: same behavior under U-Boot 2017.03-armada-17.06.3-ga33ecb8 and kernel 4.4.52:

    [  301.928075] clang[2630]: unhandled level 2 translation fault (11) at 0x00000020, esr 0x92000006
    [  301.928087] pgd = ffffffc0dd45e000
    [  301.931511] [00000020] *pgd=00000000de715003, *pud=00000000de715003, *pmd=0000000000000000
    
    [  301.939940] CPU: 2 PID: 2630 Comm: clang Not tainted 4.4.52-armada-17.06.2-gcaa3a4f #1
    [  301.939948] Hardware name: Marvell 8040 MACHIATOBin (DT)
    [  301.939956] task: ffffffc0de74a280 ti: ffffffc0dd748000 task.ti: ffffffc0dd748000
    [  301.939964] PC is at 0x7fa83f1d64
    [  301.939970] LR is at 0x7fa8b6fdec
    [  301.939978] pc : [<0000007fa83f1d64>] lr : [<0000007fa8b6fdec>] pstate: 00000000
    [  301.939983] sp : 0000007fdfc1c1e0
    [  301.939990] x29: 0000007fdfc1e990 x28: 0000000000000000 
    [  301.940001] x27: 0000000027029780 x26: 00000000000000cf 
    [  301.940012] x25: 0000000000000000 x24: 0000007fdfc1c338 
    [  301.940022] x23: 0000000000000000 x22: 0000007fa93ea000 
    [  301.940031] x21: 0000007fa93ea000 x20: 0000007fdfc1c338 
    [  301.940040] x19: 00000000270297b0 x18: 0000007fa8f7a35a 
    [  301.940049] x17: 0000007fa7ad9fb0 x16: 0000007fa93eb3c8 
    [  301.940058] x15: 0000000000000000 x14: 0000000000000014 
    [  301.940067] x13: 0000000027013520 x12: 0000000026fd9f98 
    [  301.940076] x11: 0000007fa8b6fc18 x10: 0000007fa93e0688 
    [  301.940085] x9 : 0000000026fd9f90 x8 : 0000000026ffd850 
    [  301.940094] x7 : 0000000000000000 x6 : 0000000000000000 
    [  301.940102] x5 : 0000000000000002 x4 : 0000000000000001 
    [  301.940111] x3 : 0000007fa9430cb8 x2 : 0000007fdfc1c338 
    [  301.940120] x1 : 0000000000000000 x0 : 0000000000000000 
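One way to check that both dumps point at the same code despite address-space randomization is to compare the page offsets of PC and LR and the distance between them. The values below are copied from the two dumps in this thread. The helper name `signature` is invented for illustration, and the 4 KiB page size is an assumption about this kernel's configuration.

```python
# Sketch: compare the faulting PC/LR of the two register dumps.
PAGE_MASK = 0xFFF  # assumes 4 KiB pages

dumps = {
    "4.4.8": {"pc": 0x7F8DD72D64, "lr": 0x7F8E4F0DEC},
    "4.4.52": {"pc": 0x7FA83F1D64, "lr": 0x7FA8B6FDEC},
}


def signature(regs):
    """Return values that should survive ASLR: page offsets and LR-PC distance."""
    return (
        regs["pc"] & PAGE_MASK,
        regs["lr"] & PAGE_MASK,
        regs["lr"] - regs["pc"],
    )


if __name__ == "__main__":
    for kernel, regs in dumps.items():
        print(kernel, [hex(v) for v in signature(regs)])
```

Both dumps yield the same triple, which suggests the fault hits the same instruction through the same call site on both kernels.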
    

    Clang is:

    $ clang++-3.5 --version
    Ubuntu clang version 3.5.2-3ubuntu1 (tags/RELEASE_352/final) (based on LLVM 3.5.2)
    Target: aarch64-unknown-linux-gnu
    Thread model: posix
    

    And the code is a plain hello-world.

    #439
    blu
    Participant

    Another update: the issue does not occur with the pre-built clang-3.5.2 from http://releases.llvm.org/download.html#3.5.2

    So the issue is reproducible with:

    $ clang++-3.5 --version
    Ubuntu clang version 3.5.2-3ubuntu1 (tags/RELEASE_352/final) (based on LLVM 3.5.2)
    Target: aarch64-unknown-linux-gnu
    Thread model: posix

    and not reproducible with:

    $ ~/clang+llvm-3.5.2-aarch64-linux-gnu/bin/clang++ --version
    clang version 3.5.2 (tags/RELEASE_352/final)
    Target: aarch64-unknown-linux-gnu
    Thread model: posix
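
A comparison like the one above can be scripted so that it is easy to repeat on other boards. This is only a sketch: the function name `try_compile` is hypothetical, and the crash check simply looks for a signal exit or the text "Segmentation fault" on stderr, which is how the clang driver reports a crashed frontend in the logs quoted in this thread.

```python
# Sketch: run one compiler on one source file and report whether it crashed.
import subprocess


def try_compile(compiler_argv, source_path, output_path):
    """Invoke the compiler and summarize the outcome."""
    proc = subprocess.run(
        list(compiler_argv) + [source_path, "-o", output_path],
        capture_output=True,
        text=True,
    )
    # A negative return code means the process itself was killed by a signal;
    # the clang driver instead exits non-zero and prints the signal name.
    crashed = proc.returncode < 0 or "Segmentation fault" in proc.stderr
    return {"returncode": proc.returncode, "crashed": crashed}
```

Calling it once with `["clang++-3.5"]` and once with the path to the llvm.org binary, on the same hello-world source, would reproduce the comparison in this post.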
    #443
    nanasi
    Participant

    I can reproduce it.
    (Ubuntu clang version 3.5.2-3ubuntu1/16.04.02/16.04.03/U-Boot 2015.01-devel-17.04.1-gf964c08)

    Since it reproduces reliably and the nosmp boot option (among other things) has no effect,
    I suppose that it is not a kernel bug.

    #444
    blu
    Participant

    Thank you, nanasi. I too think it’s a fw issue. So basically we have reproducibility under fw versions 17.02, 17.04 and 17.06.

    #446
    nanasi
    Participant

    The SEGV occurs with qemu-aarch64 too.
    gcc-5 is one of the suspects.

    0  libLLVM-3.5.so.1 0x00000040019fc1c4 llvm::sys::PrintStackTrace(_IO_FILE*) + 68
    <snip>
    1.      <eof> parser at end of file
    2.      Code generation
    3.      Running pass 'Function Pass Manager' on module 'c.cpp'.
    4.      Running pass 'Fast Register Allocator' on function '@__cxx_global_var_init'
    qemu: uncaught target signal 11 (Segmentation fault) - core dumped
    clang: error: unable to execute command: Segmentation fault
    clang: error: clang frontend command failed due to signal (use -v to see invocation)
    Ubuntu clang version 3.5.2-3ubuntu1 (tags/RELEASE_352/final) (based on LLVM 3.5.2)
    Target: aarch64-unknown-linux-gnu
    Thread model: posix
    <snip>
    #496
    travisg
    Participant

    Any updates on this? I’m seeing a similar rare lockup inside clang, and the default firmware is fairly old (17.02 I believe).

    As far as I can tell the board is stable otherwise, but it seems that this particular case is out of the blue and fairly well reproducible.

    I haven’t updated the firmware yet since it seems fairly perilous and unclear if it’d fix anything.

    #505
    blu
    Participant

    Ok, I’ve stepped on this again, only this time it happens in LLVM 4.0.

    It was to be expected that if such a fundamental issue exists, it would become a serious obstacle one day. For me that day has come, as the issue renders the macchiatobin unusable for my current project.

    #506
    nanasi
    Participant

    Regarding the hello-world SEGV:
    it is also reproducible under Linux for Tegra R24 on a Tegra X1.

    $ uname -a
    Linux tegra-ubuntu 3.10.96 #1 SMP PREEMPT Thu Oct 13 05:30:55 EDT 2016 aarch64 aarch64 aarch64 GNU/Linux
    $ clang-3.5 /tmp/hw.c
    <snip>
    clang: error: unable to execute command: Segmentation fault
    clang: error: clang frontend command failed due to signal (use -v to see invocation)
    Ubuntu clang version 3.5.2-3ubuntu1 (tags/RELEASE_352/final) (based on LLVM 3.5.2)
    Target: aarch64-unknown-linux-gnu
    Thread model: posix
    clang: note: diagnostic msg: PLEASE submit a bug report to http://bugs.debian.org/ 
     and include the crash backtrace, preprocessed source, and associated run script.
    clang: note: diagnostic msg:
    ********************
    
    PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
    Preprocessed source(s) and associated run script(s) are located at:
    clang: note: diagnostic msg: /tmp/c-5f3dd7.c
    clang: note: diagnostic msg: /tmp/c-5f3dd7.sh
    clang: note: diagnostic msg:
    
    ********************
    #508
    blu
    Participant

    Nanasi, can you check if it’s also reproducible with the pre-built clang-3.5.2 from llvm.org?

    Just to give some context to my latest encounter of the issue on the macchiatobin:

    To rule out gcc-5, I’ve built LLVM/Clang 4.0.1 from master with clang-4.0.1 from llvm.org, and subsequently built pocl (an open-source implementation of OpenCL) with the so-built local clang. The issue manifests in a strange pattern: the first time I build and run my app, the OpenCL kernel builds OK at run-time; from then on, every subsequent attempt at running the app ends with an ‘unhandled level 2 translation fault’ during the kernel compile phase.

    #511
    nanasi
    Participant

    The pre-built binaries run normally (Armada & Tegra).
    I have not tested them with complex jobs.

    $ file -L /usr/bin/clang-3.5
    /usr/bin/clang-3.5: ELF 64-bit LSB executable,
     ARM aarch64, version 1 (GNU/Linux), dynamically linked, interpreter /lib/ld-linux-aarch64.so.1,
     for GNU/Linux 3.7.0, BuildID[sha1]=82c1d783725ea6762330f906a6fdaf502e73837a, stripped
    
    $ file ./clang+llvm-3.5.2-aarch64-linux-gnu/bin/clang
    ./clang+llvm-3.5.2-aarch64-linux-gnu/bin/clang: ELF 64-bit LSB executable,
     ARM aarch64, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-aarch64.so.1,
     for GNU/Linux 3.7.0, BuildID[sha1]=64ec2142d8d9d26f4bde7c152525206c9b21c052, stripped
    
    $ ./clang+llvm-3.5.2-aarch64-linux-gnu/bin/clang --version
    clang version 3.5.2 (tags/RELEASE_352/final)
    Target: aarch64-unknown-linux-gnu
    Thread model: posix
    
    $ ./clang+llvm-3.5.2-aarch64-linux-gnu/bin/clang hw.c
    $ ./a.out
    hello
    #516
    blu
    Participant

    Thanks, nanasi.

    Since you mentioned TX1, I too encountered the issue on two more ARM platforms. That rabbit hole goes deeper and deeper.

    #525
    blu
    Participant

    For the record, I’ve moved on to another LLVM, where the issue is not present, so crisis averted. And anyway, this does not look to me any longer like an ARMADA8040 issue as much as a general libc/compiler issue, so clearly finding workarounds seems more productive than waiting on Marvell to address it.

    #536
    travisg
    Participant

    Thanks, yeah I think I’ve settled on the same conclusion. Everything else about the board seems totally stable, so it seems highly unlikely that a board-level problem would manifest only in clang.

    The only reason I’d think that clang was a canary for a deeper problem is that AFAIK it does a bit of multithreading in an early phase of the compiler, and if there were some sort of TLB sync issue that may be a fairly good canary for it. Large process, probably faulting in lots of pages with a bunch of threads running simultaneously would be a good test. Not a lot of apps behave that way.

    But for now this thing seems stable and useful.
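
The kind of test described above (several threads faulting in and releasing many pages at once) could be sketched roughly as follows. This is an illustration, not something anyone in the thread ran: the names `churn` and `run` and all parameter values are made up, and because Python's global interpreter lock serializes the byte writes, it exercises far less concurrency than a native pthread version would.

```python
# Sketch: threads repeatedly map, touch, verify, and unmap anonymous memory.
import mmap
import threading

PAGE = mmap.PAGESIZE


def churn(pages, rounds, errors):
    """Fault in every page of a fresh mapping, check it, then unmap it."""
    try:
        for r in range(rounds):
            buf = mmap.mmap(-1, pages * PAGE)   # anonymous mapping
            marker = (r % 255) + 1              # non-zero, so it differs from fresh pages
            for off in range(0, pages * PAGE, PAGE):
                buf[off] = marker               # first write faults the page in
            for off in range(0, pages * PAGE, PAGE):
                if buf[off] != marker:
                    errors.append((r, off))     # stale or corrupted page contents
            buf.close()                         # unmap; the kernel must drop TLB entries
    except Exception as exc:                    # record instead of killing the thread silently
        errors.append(exc)


def run(threads=4, pages=256, rounds=20):
    """Start the worker threads and return the list of observed errors."""
    errors = []
    workers = [
        threading.Thread(target=churn, args=(pages, rounds, errors))
        for _ in range(threads)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return errors


if __name__ == "__main__":
    print("errors:", run())
```

An empty error list would not prove the hardware is sound; a non-empty one, or a translation fault in dmesg while it runs, would be worth following up.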
