@@ 0,0 1,117 @@
+---
+title: Investigating a SIGBUS in a Go program
+---
+
+I have been happily using Gokrazy on an old Raspberry Pi.
+
+=> /2022/go-ing-krazy-raspberry-pi-model-b/ Go(ing)krazy on a Raspberry Pi (Model B)
+
+One of the package that I use is a time series database called VictoriaMetrics (similar to Prometheus).
+
+=> https://victoriametrics.com/
+
+This setup runs pretty well and it gathers every minute:
+* temperature in every room (via the FRITZ!DECT 301)
+* electricity on some appliance (via Shelly Plug S)
+* Internet connectivity statistics (via FRITZ!Box)
+
+However when I repeatedly looked at some metrics (using Grafana), VictoriaMetrics sometimes encountered a fatal error:
+
+```
+fatal error: fault
+[signal SIGBUS: bus error code=0x1 addr=0x0 pc=0x6b2168]
+
+goroutine 111 [running]:
+runtime.throw({0xa8fb4f, 0x5})
+ /usr/lib/go/src/runtime/panic.go:1047 +0x4c fp=0x2d8ec78 sp=0x2d8ec64 pc=0x4dcc0
+runtime.sigpanic()
+ /usr/lib/go/src/runtime/signal_unix.go:834 +0xa8 fp=0x2d8ec94 sp=0x2d8ec78 pc=0x67ccc
+math.IsNaN(...)
+ /usr/lib/go/src/math/bits.go:39
+github.com/VictoriaMetrics/VictoriaMetrics/app/vmselect/promql.removeEmptySeries(...)
+ /home/olivier/go/pkg/mod/github.com/!victoria!metrics/!victoria!metrics@v1.88.1/app/vmselect/promql/exec.go:180
+github.com/VictoriaMetrics/VictoriaMetrics/app/vmselect/promql.newBinaryOpFunc.func1(0x237af20)
+ /home/olivier/go/pkg/mod/github.com/!victoria!metrics/!victoria!metrics@v1.88.1/app/vmselect/promql/binary_op.go:95 +0x7e8 fp=0x2d8ed9c sp=0x2d8ec98 pc=0x6b2168
+github.com/VictoriaMetrics/VictoriaMetrics/app/vmselect/promql.evalBinaryOp(0x0, 0x2998280, 0x206f5e0)
+ /home/olivier/go/pkg/mod/github.com/!victoria!metrics/!victoria!metrics@v1.88.1/app/vmselect/promql/eval.go:410 +0x344 fp=0x2d8ee1c sp=0x2d8ed9c pc=0x6b8234
+github.com/VictoriaMetrics/VictoriaMetrics/app/vmselect/promql.evalExprInternal(0x0, 0x2998280, {0xd74640, 0x206f5e0})
+ /home/olivier/go/pkg/mod/github.com/!victoria!metrics/!victoria!metrics@v1.88.1/app/vmselect/promql/eval.go:290 +0x314 fp=0x2d8eefc sp=0x2d8ee1c pc=0x6b6b5c
+github.com/VictoriaMetrics/VictoriaMetrics/app/vmselect/promql.evalExpr(0x0, 0x2998280, {0xd74640, 0x206f5e0})
+...
+```
+
+The program restarted automatically and the next metric retrieval was fine. So there was a bug, but I could live with it (and I did for a time).
+
+## Digging into the code
+
+The panicking part of the code looks at all elements of a slice of []float64 and checks if all of them are not-a-number (NaN). The panic happens in the math.IsNaN function, which is a one-liner: return f != f
+
+Reading a bit about SIGBUS, it means
+
+> that a process is trying to access memory that the CPU cannot physically address
+
+=> https://en.wikipedia.org/wiki/Bus_error Bus error (Wikipedia)
+
+## printf() debugging
+
+Looking up on the Internet, I saw no reported bug that looked like mine and decided to dig further. I made a clone of the VictoriaMetrics repo and tried "The most effective debugging tool" from Brian Kernighan, "Unix for Beginners" (1979).
+
+> The most effective debugging tool is still careful thought, coupled with judiciously placed print statements.
+
+Adding a defer statement did not work, trying to print the value of the float64 also crashed before having the change to have the value printed.
+I also printed the address of the float64: it worked, but I didn't see anything suspect (the address looked legit).
+
+## core dump debugging
+
+Since the bug appeared to be low-level, I thought of generating a core dump. It wasn't trivial, because by default, Gokrazy does not provide shell access out of the box. After installing breakglass and busybox, I was able to get a core dump, by running:
+
+```
+ulimit -c unlimited
+GOTRACEBACK=crash ./victoria-metrics
+```
+
+The hard part was analyzing the core dump. I couldn't easily install a debugger on my Raspberry Pi, so I transferred it to my main machine.
+* Opening it with delve (the debugger of choice for go) did not work, since ARM6 (32 bits) is not a supported platform.
+* Opening it with gdb did not work as well, because of the platform difference
+* Opening it with gdb-multiarch did work! (thanks to the Arch User Repository for an easy installation)
+
+Looking at the stack traces and register values, I wasn't able to find anything interesting (it was even disturbing that the stack trace was not similar to the one printed during the panic).
+
+## dmesg reading
+
+The next day, it occurred to me that the Linux kernel might have more information regarding the crash. I ran dmesg and got a message looking like this (reconstructed from various Internet reports):
+```
+[1141145.918454] Alignment trap: not handling instruction 6b2168 at [<b6fa4b40>]
+[1141145.918455] 8<--- cut here ---
+[1141145.918466] Unhandled fault: alignment exception (0x001) at 0x2eb4dbb
+[1141145.918479] pgd = da1a2e09
+[1141145.918486] [2eb4dbb] *pgd=46c78835, *pte=00000000, *ppte=00000000
+```
+
+I then understood what was going on: the alignment exception was complaining about the address 0x2eb4dbb, which is 48975291 in decimal.
+This number looks odd (pun intended), because a float64 is 8 bytes large, and its address should be a multiple of 8 (which 48975291 is obviously not).
+
+I checked again using printf() debugging and sure enough the panic only happened when the float64 was not a multiple of 8 (i.e, was misaligned).
+
+## golang digging
+
+Now that the bug was identified, I had to understand how it could be that a float64 was misaligned. The Go Specification is very clear, that float64 are guaranteed to be aligned by 8. So I thought that I had found a bug in the Go compiler, how exciting!
+
+=> https://go.dev/ref/spec#Size_and_alignment_guarantees Size and alignment guarantees (The Go Programming Language Specification)
+
+I prepared a detailed bug report and looked around for similar issues. I quickly found another issue for a sigbus fault on ARM.
+
+=> https://github.com/golang/go/issues/38331 cmd/link: sigbus/segfault on ARM using AzCopy
+
+At the end of my bug report, I added a link to this issue and wanted to add a reason why my report was different. I inspected closely the issue and saw that they were using unsafe pointers. Upon writing something like "related issue #38331: unsafe pointers are used", I thought that I should better check that my program wasn't using unsafe pointers...
+
+## VictoriaMetrics debugging
+
+I then saw that the []float64 slice was being constructed using unsafe pointers! I was quite sad that I didn't find a bug in the Go compiler, but quite happy that I could solve the issue myself. Finally I adjusted the code to perform alignment adjustments in case of misalignment, and everything has been running smoothly since then!
+
+=> https://github.com/VictoriaMetrics/VictoriaMetrics/pull/3927 Pull request to fix the alignment panic in VictoriaMetrics
+
+## Conclusion
+
+Next time that I debug a low-level issue, I will remember to look at the kernel messages first (dmesg). It would have spared me a lot of troubles with generating and inspected core dump (which didn't help me in the resolution).
+And be extra careful when using the unsafe package (as the name helpfully reminds).<
\ No newline at end of file
@@ 78,3 78,17 @@ a[href^="gemini:"]:not([data-no-icon])::before {
.flex-auto {
flex: 1 1 auto;
}
+blockquote {
+ font-style: italic;
+ margin-right: 0;
+}
+blockquote::before {
+ content: "\201C";
+ font-size: 4rem;
+ line-height: 0;
+ margin-right: 0.25rem;
+ vertical-align: -1.7rem;
+ font-family: ui-serif, Georgia, Cambria, "Times New Roman", Times, serif;
+ color: #6a6a6a;
+ font-style: normal;
+}
@@ 11,7 11,7 @@
<main class="flex-auto">
<h1>{{ .Title }}</h1>
<small
- >{{ .Date.Format "January 2, 2006" }} on
+ >{{ .Date.Format "2 January 2006" }}, on
<a href="/">Olivier's log</a></small
>
{{ exec "gmnitohtml" .Content | safeHTML }}