Coherency problem with CCI setup?
Hi,
I came across an ‘Unhandled level 2 translation fault’ while running clang++-3.5 on some trivial code (it has not happened with any other compiler so far, clang or gcc). Here’s the relevant dmesg portion:
[ 355.152552] clang[2319]: unhandled level 2 translation fault (11) at 0x00000020, esr 0x92000006
[ 355.152568] pgd = ffffffc0e17f6000
[ 355.156060] [00000020] *pgd=00000000e03aa003, *pud=00000000e03aa003, *pmd=0000000000000000
[ 355.164496] CPU: 2 PID: 2319 Comm: clang Not tainted 4.4.8-armada-17.02.2-g4126e30 #2
[ 355.164507] Hardware name: Marvell 8040 MACHIATOBin (DT)
[ 355.164519] task: ffffffc0e63fd800 ti: ffffffc0e0354000 task.ti: ffffffc0e0354000
[ 355.164531] PC is at 0x7f8dd72d64
[ 355.164540] LR is at 0x7f8e4f0dec
[ 355.164551] pc : [<0000007f8dd72d64>] lr : [<0000007f8e4f0dec>] pstate: 00000000
[ 355.164559] sp : 0000007febe80150
[ 355.164568] x29: 0000007febe82900 x28: 0000000000000000
[ 355.164585] x27: 000000001bfdecf0 x26: 00000000000000cf
[ 355.164600] x25: 0000000000000000 x24: 0000007febe802a8
[ 355.164615] x23: 0000000000000000 x22: 0000007f8ed6b000
[ 355.164628] x21: 0000007f8ed6b000 x20: 0000007febe802a8
[ 355.164642] x19: 000000001bfded20 x18: 0000007f8e8fb35e
[ 355.164655] x17: 0000007f8d45afb0 x16: 0000007f8ed6c3c8
[ 355.164668] x15: 0000000000000000 x14: 000000001bfc2730
[ 355.164682] x13: 000000001bfdff04 x12: 000000001bfc27dc
[ 355.164695] x11: 0000007f8e4f0c18 x10: 0000007f8ed61918
[ 355.164709] x9 : 000000001bfc27d0 x8 : 000000001bfd3f90
[ 355.164722] x7 : 0000000000000000 x6 : 0000000000000000
[ 355.164735] x5 : 0000000000000002 x4 : 0000000000000001
[ 355.164748] x3 : 0000007f8edb1cb8 x2 : 0000007febe802a8
[ 355.164761] x1 : 0000000000000000 x0 : 0000000000000000
And here’s a discussion that seems related: https://patchwork.kernel.org/patch/8120651/
The most elaborate quote from there says:
It looks like the TLB invalidation messages may not get across the CCI
between clusters. I don’t have the TRMs at hand but make sure all the
relevant bits in the CPUs and CCI are enabled.
So it appears to be a coherency problem. One across the clusters, perhaps?
Has anybody else stepped on this?
Forgot to mention that it happens deterministically – I can reproduce that fault at will.
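For anyone reading the dmesg lines above, the esr value can be unpacked by hand. Here is a minimal sketch in Python, assuming the ESR_ELx field layout from the Arm architecture reference (EC in bits 31:26, IL in bit 25, and for data aborts WnR in bit 6 and DFSC in bits 5:0 of the ISS). The function name decode_esr and the lookup tables are my own and cover only a few encodings.

```python
# Sketch only: decode a few fields of an AArch64 ESR_ELx value.
# The tables are deliberately partial; anything else is reported as "unknown".

EXCEPTION_CLASSES = {
    0x20: "instruction abort from a lower exception level",
    0x21: "instruction abort without a change in exception level",
    0x24: "data abort from a lower exception level",
    0x25: "data abort without a change in exception level",
}

# Upper four bits of DFSC for the fault types that carry a table level
# in the lower two bits.
FAULT_KINDS = {
    0b0000: "address size fault",
    0b0001: "translation fault",
    0b0010: "access flag fault",
    0b0011: "permission fault",
}


def decode_esr(esr):
    """Split an ESR value into EC, IL and the data-abort parts of the ISS."""
    ec = (esr >> 26) & 0x3F      # exception class
    il = (esr >> 25) & 0x1       # instruction length bit
    iss = esr & 0x1FFFFFF        # instruction specific syndrome
    dfsc = iss & 0x3F            # data fault status code
    return {
        "ec": ec,
        "ec_name": EXCEPTION_CLASSES.get(ec, "unknown"),
        "il": il,
        # WnR is only meaningful for data aborts.
        "write": bool((iss >> 6) & 0x1),
        "fault": FAULT_KINDS.get(dfsc >> 2, "unknown"),
        # The level is only meaningful when "fault" is not "unknown".
        "level": dfsc & 0x3,
    }


print(decode_esr(0x92000006))
```

For 0x92000006 this gives EC 0x24 (a data abort taken from a lower exception level, i.e. user space), a level 2 translation fault, on a read. Together with the fault address 0x00000020 and *pmd=0 in the log, that would also be consistent with an ordinary read through a null pointer plus a small offset, so the decode by itself does not point at coherency.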
Update: same behavior under U-Boot 2017.03-armada-17.06.3-ga33ecb8
and kernel 4.4.52:
[ 301.928075] clang[2630]: unhandled level 2 translation fault (11) at 0x00000020, esr 0x92000006
[ 301.928087] pgd = ffffffc0dd45e000
[ 301.931511] [00000020] *pgd=00000000de715003, *pud=00000000de715003, *pmd=0000000000000000
[ 301.939940] CPU: 2 PID: 2630 Comm: clang Not tainted 4.4.52-armada-17.06.2-gcaa3a4f #1
[ 301.939948] Hardware name: Marvell 8040 MACHIATOBin (DT)
[ 301.939956] task: ffffffc0de74a280 ti: ffffffc0dd748000 task.ti: ffffffc0dd748000
[ 301.939964] PC is at 0x7fa83f1d64
[ 301.939970] LR is at 0x7fa8b6fdec
[ 301.939978] pc : [<0000007fa83f1d64>] lr : [<0000007fa8b6fdec>] pstate: 00000000
[ 301.939983] sp : 0000007fdfc1c1e0
[ 301.939990] x29: 0000007fdfc1e990 x28: 0000000000000000
[ 301.940001] x27: 0000000027029780 x26: 00000000000000cf
[ 301.940012] x25: 0000000000000000 x24: 0000007fdfc1c338
[ 301.940022] x23: 0000000000000000 x22: 0000007fa93ea000
[ 301.940031] x21: 0000007fa93ea000 x20: 0000007fdfc1c338
[ 301.940040] x19: 00000000270297b0 x18: 0000007fa8f7a35a
[ 301.940049] x17: 0000007fa7ad9fb0 x16: 0000007fa93eb3c8
[ 301.940058] x15: 0000000000000000 x14: 0000000000000014
[ 301.940067] x13: 0000000027013520 x12: 0000000026fd9f98
[ 301.940076] x11: 0000007fa8b6fc18 x10: 0000007fa93e0688
[ 301.940085] x9 : 0000000026fd9f90 x8 : 0000000026ffd850
[ 301.940094] x7 : 0000000000000000 x6 : 0000000000000000
[ 301.940102] x5 : 0000000000000002 x4 : 0000000000000001
[ 301.940111] x3 : 0000007fa9430cb8 x2 : 0000007fdfc1c338
[ 301.940120] x1 : 0000000000000000 x0 : 0000000000000000
Clang is:
$ clang++-3.5 --version
Ubuntu clang version 3.5.2-3ubuntu1 (tags/RELEASE_352/final) (based on LLVM 3.5.2)
Target: aarch64-unknown-linux-gnu
Thread model: posix
And the code is a hello-world program.
Another update: the issue does not occur with the pre-built clang-3.5.2 from http://releases.llvm.org/download.html#3.5.2
So the issue is reproducible with:
$ clang++-3.5 --version
Ubuntu clang version 3.5.2-3ubuntu1 (tags/RELEASE_352/final) (based on LLVM 3.5.2)
Target: aarch64-unknown-linux-gnu
Thread model: posix
and not reproducible with:
$ ~/clang+llvm-3.5.2-aarch64-linux-gnu/bin/clang++ --version
clang version 3.5.2 (tags/RELEASE_352/final)
Target: aarch64-unknown-linux-gnu
Thread model: posix
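A comparison like the one above can be scripted so that "reproducible" and "not reproducible" become counts. This is a rough sketch, not something anyone in this thread ran: the helper name count_failures is made up, and it only looks at the exit status of each run (non-zero, or negative when the process was killed by a signal).

```python
import subprocess
import sys


def count_failures(cmd, runs=10):
    """Run cmd `runs` times and return how many runs did not exit with status 0."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # returncode is negative if the child died from a signal and
        # positive if it exited with an error; both count as a failure here.
        if result.returncode != 0:
            failures += 1
    return failures


if __name__ == "__main__" and len(sys.argv) > 1:
    # Treat everything after the script name as the command to repeat.
    print(count_failures(sys.argv[1:]))
```

Saved under a hypothetical name such as repro.py, it could be invoked as `python3 repro.py clang++-3.5 -c hello.cpp -o /dev/null` and then again with the llvm.org binary, where hello.cpp stands in for the hello-world source. Note that the clang driver reports a crashed frontend through its own non-zero exit status, so the script cannot tell a segfault from a plain compile error without also reading stderr.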
I can reproduce it.
(Ubuntu clang version 3.5.2-3ubuntu1/16.04.02/16.04.03/U-Boot 2015.01-devel-17.04.1-gf964c08)
Given that it is reproducible, that the nosmp boot option makes no difference, and so on,
I suppose that it is not a kernel bug.
Thank you, nanasi. I too think it’s a fw issue. So basically we have reproducibility under fw versions 17.02, 17.04 and 17.06.
The SEGV occurs with qemu-aarch64 too.
gcc-5 is one of the suspects.
0 libLLVM-3.5.so.1 0x00000040019fc1c4 llvm::sys::PrintStackTrace(_IO_FILE*) + 68
<snip>
1. <eof> parser at end of file
2. Code generation
3. Running pass 'Function Pass Manager' on module 'c.cpp'.
4. Running pass 'Fast Register Allocator' on function '@__cxx_global_var_init'
qemu: uncaught target signal 11 (Segmentation fault) - core dumped
clang: error: unable to execute command: Segmentation fault
clang: error: clang frontend command failed due to signal (use -v to see invocation)
Ubuntu clang version 3.5.2-3ubuntu1 (tags/RELEASE_352/final) (based on LLVM 3.5.2)
Target: aarch64-unknown-linux-gnu
Thread model: posix
<snip>
Any updates on this? I’m seeing a similar rare lockup inside clang, and the default firmware is fairly old (17.02 I believe).
As far as I can tell the board is stable otherwise, but it seems that this particular case is out of the blue and fairly well reproducible.
I haven’t updated the firmware yet, since updating seems fairly perilous and it is unclear whether it would fix anything.
Ok, I’ve stepped on this again, only this time it happens in LLVM 4.0.
It was to be expected that if such a fundamental issue exists, it would become a serious obstacle one day. For me that day has come, as the issue renders the macchiatobin unusable for my current project.
Regarding the hello-world SEGV:
it is also reproducible under Linux for Tegra R24 on a Tegra X1.
$ uname -a
Linux tegra-ubuntu 3.10.96 #1 SMP PREEMPT Thu Oct 13 05:30:55 EDT 2016 aarch64 aarch64 aarch64 GNU/Linux
$ clang-3.5 /tmp/hw.c
<snip>
clang: error: unable to execute command: Segmentation fault
clang: error: clang frontend command failed due to signal (use -v to see invocation)
Ubuntu clang version 3.5.2-3ubuntu1 (tags/RELEASE_352/final) (based on LLVM 3.5.2)
Target: aarch64-unknown-linux-gnu
Thread model: posix
clang: note: diagnostic msg: PLEASE submit a bug report to http://bugs.debian.org/
and include the crash backtrace, preprocessed source, and associated run script.
clang: note: diagnostic msg:
********************
PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
Preprocessed source(s) and associated run script(s) are located at:
clang: note: diagnostic msg: /tmp/c-5f3dd7.c
clang: note: diagnostic msg: /tmp/c-5f3dd7.sh
clang: note: diagnostic msg:
********************
Nanasi, can you check if it’s also reproducible with the pre-built clang-3.5.2 from llvm.org?
Just to give some context to my latest encounter of the issue on the macchiatobin:
To rule out gcc-5, I’ve built LLVM/Clang 4.0.1 from master with clang-4.0.1 from llvm.org, and subsequently built pocl (an OSS implementation of OCL) with the so-built local clang. The issue manifests in a strange pattern: the first time I build and run my app the OCL kernel builds ok at run-time; from then on every subsequent attempt at running the app ends up with an ‘unhandled level 2 translation fault’ during the kernel compile phase.
The pre-built binaries run normally (Armada and Tegra).
Not tested with complex jobs.
$ file -L /usr/bin/clang-3.5
/usr/bin/clang-3.5: ELF 64-bit LSB executable,
ARM aarch64, version 1 (GNU/Linux), dynamically linked, interpreter /lib/ld-linux-aarch64.so.1,
for GNU/Linux 3.7.0, BuildID[sha1]=82c1d783725ea6762330f906a6fdaf502e73837a, stripped
$ file ./clang+llvm-3.5.2-aarch64-linux-gnu/bin/clang
./clang+llvm-3.5.2-aarch64-linux-gnu/bin/clang: ELF 64-bit LSB executable,
ARM aarch64, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-aarch64.so.1,
for GNU/Linux 3.7.0, BuildID[sha1]=64ec2142d8d9d26f4bde7c152525206c9b21c052, stripped
$ ./clang+llvm-3.5.2-aarch64-linux-gnu/bin/clang --version
clang version 3.5.2 (tags/RELEASE_352/final)
Target: aarch64-unknown-linux-gnu
Thread model: posix
$ ./clang+llvm-3.5.2-aarch64-linux-gnu/bin/clang hw.c
$ ./a.out
hello
Thanks, nanasi.
Since you mentioned TX1, I too encountered the issue in two more ARMs. That rabbit hole goes deeper and deeper.
For the record, I’ve moved on to another LLVM, where the issue is not present, so crisis averted. And anyway, this does not look to me any longer like an ARMADA8040 issue as much as a general libc/compiler issue, so clearly finding workarounds seems more productive than waiting on Marvell to address it.
Thanks, yeah I think I’ve settled on the same conclusion. Everything else about the board seems totally stable, so it seems highly unlikely that a board-level problem would manifest only in clang.
The only reason I’d think that clang was a canary for a deeper problem is that AFAIK it does a bit of multithreading in an early phase of the compiler, and if there were some sort of TLB sync issue, that might be a fairly good canary for it. A large process faulting in lots of pages with a bunch of threads running simultaneously would be a good test. Not a lot of apps behave that way.
But for now this thing seems stable and useful.