46e82e52a96918a11c60df856b039b0ce17b9e28 — oliverpool 12 days ago ccf8f92
sigbus article
3 files changed, 132 insertions(+), 1 deletions(-)

A content/2023-03-09-debugging-sibgus.gmi
M static/html/style.css
M templates/_default/page.html
A content/2023-03-09-debugging-sibgus.gmi => content/2023-03-09-debugging-sibgus.gmi +117 -0
@@ 0,0 1,117 @@
title: Investigating a SIGBUS in a Go program

I have been happily using Gokrazy on an old Raspberry Pi.

=> /2022/go-ing-krazy-raspberry-pi-model-b/ Go(ing)krazy on a Raspberry Pi (Model B)

One of the package that I use is a time series database called VictoriaMetrics (similar to Prometheus).

=> https://victoriametrics.com/

This setup runs pretty well and it gathers every minute:
* temperature in every room (via the FRITZ!DECT 301)
* electricity on some appliance (via Shelly Plug S)
* Internet connectivity statistics (via FRITZ!Box)

However when I repeatedly looked at some metrics (using Grafana), VictoriaMetrics sometimes encountered a fatal error:

fatal error: fault
[signal SIGBUS: bus error code=0x1 addr=0x0 pc=0x6b2168]

goroutine 111 [running]:
runtime.throw({0xa8fb4f, 0x5})
	/usr/lib/go/src/runtime/panic.go:1047 +0x4c fp=0x2d8ec78 sp=0x2d8ec64 pc=0x4dcc0
	/usr/lib/go/src/runtime/signal_unix.go:834 +0xa8 fp=0x2d8ec94 sp=0x2d8ec78 pc=0x67ccc
	/home/olivier/go/pkg/mod/github.com/!victoria!metrics/!victoria!metrics@v1.88.1/app/vmselect/promql/binary_op.go:95 +0x7e8 fp=0x2d8ed9c sp=0x2d8ec98 pc=0x6b2168
github.com/VictoriaMetrics/VictoriaMetrics/app/vmselect/promql.evalBinaryOp(0x0, 0x2998280, 0x206f5e0)
	/home/olivier/go/pkg/mod/github.com/!victoria!metrics/!victoria!metrics@v1.88.1/app/vmselect/promql/eval.go:410 +0x344 fp=0x2d8ee1c sp=0x2d8ed9c pc=0x6b8234
github.com/VictoriaMetrics/VictoriaMetrics/app/vmselect/promql.evalExprInternal(0x0, 0x2998280, {0xd74640, 0x206f5e0})
	/home/olivier/go/pkg/mod/github.com/!victoria!metrics/!victoria!metrics@v1.88.1/app/vmselect/promql/eval.go:290 +0x314 fp=0x2d8eefc sp=0x2d8ee1c pc=0x6b6b5c
github.com/VictoriaMetrics/VictoriaMetrics/app/vmselect/promql.evalExpr(0x0, 0x2998280, {0xd74640, 0x206f5e0})

The program restarted automatically and the next metric retrieval was fine. So there was a bug, but I could live with it (and I did for a time).

## Digging into the code

The panicking part of the code looks at all elements of a slice of []float64 and checks if all of them are not-a-number (NaN). The panic happens in the math.IsNaN function, which is a one-liner: return f != f

Reading a bit about SIGBUS, it means

> that a process is trying to access memory that the CPU cannot physically address

=> https://en.wikipedia.org/wiki/Bus_error Bus error (Wikipedia)

## printf() debugging

Looking up on the Internet, I saw no reported bug that looked like mine and decided to dig further. I made a clone of the VictoriaMetrics repo and tried "The most effective debugging tool" from Brian Kernighan, "Unix for Beginners" (1979).

> The most effective debugging tool is still careful thought, coupled with judiciously placed print statements.

Adding a defer statement did not work, trying to print the value of the float64 also crashed before having the change to have the value printed.
I also printed the address of the float64: it worked, but I didn't see anything suspect (the address looked legit).

## core dump debugging

Since the bug appeared to be low-level, I thought of generating a core dump. It wasn't trivial, because by default, Gokrazy does not provide shell access out of the box. After installing breakglass and busybox, I was able to get a core dump, by running:

ulimit -c unlimited
GOTRACEBACK=crash ./victoria-metrics

The hard part was analyzing the core dump. I couldn't easily install a debugger on my Raspberry Pi, so I transferred it to my main machine.
* Opening it with delve (the debugger of choice for go) did not work, since ARM6 (32 bits) is not a supported platform.
* Opening it with gdb did not work as well, because of the platform difference
* Opening it with gdb-multiarch did work! (thanks to the Arch User Repository for an easy installation)

Looking at the stack traces and register values, I wasn't able to find anything interesting (it was even disturbing that the stack trace was not similar to the one printed during the panic).

## dmesg reading

The next day, it occurred to me that the Linux kernel might have more information regarding the crash. I ran dmesg and got a message looking like this (reconstructed from various Internet reports):
[1141145.918454] Alignment trap: not handling instruction 6b2168 at [<b6fa4b40>]
[1141145.918455] 8<--- cut here ---
[1141145.918466] Unhandled fault: alignment exception (0x001) at 0x2eb4dbb
[1141145.918479] pgd = da1a2e09
[1141145.918486] [2eb4dbb] *pgd=46c78835, *pte=00000000, *ppte=00000000

I then understood what was going on: the alignment exception was complaining about the address 0x2eb4dbb, which is 48975291 in decimal.
This number looks odd (pun intended), because a float64 is 8 bytes large, and its address should be a multiple of 8 (which 48975291 is obviously not).

I checked again using printf() debugging and sure enough the panic only happened when the float64 was not a multiple of 8 (i.e, was misaligned).

## golang digging

Now that the bug was identified, I had to understand how it could be that a float64 was misaligned. The Go Specification is very clear, that float64 are guaranteed to be aligned by 8. So I thought that I had found a bug in the Go compiler, how exciting!

=> https://go.dev/ref/spec#Size_and_alignment_guarantees Size and alignment guarantees (The Go Programming Language Specification)

I prepared a detailed bug report and looked around for similar issues. I quickly found another issue for a sigbus fault on ARM.

=> https://github.com/golang/go/issues/38331 cmd/link: sigbus/segfault on ARM using AzCopy

At the end of my bug report, I added a link to this issue and wanted to add a reason why my report was different. I inspected closely the issue and saw that they were using unsafe pointers. Upon writing something like "related issue #38331: unsafe pointers are used", I thought that I should better check that my program wasn't using unsafe pointers...

## VictoriaMetrics debugging

I then saw that the []float64 slice was being constructed using unsafe pointers! I was quite sad that I didn't find a bug in the Go compiler, but quite happy that I could solve the issue myself. Finally I adjusted the code to perform alignment adjustments in case of misalignment, and everything has been running smoothly since then!

=> https://github.com/VictoriaMetrics/VictoriaMetrics/pull/3927 Pull request to fix the alignment panic in VictoriaMetrics

## Conclusion

Next time that I debug a low-level issue, I will remember to look at the kernel messages first (dmesg). It would have spared me a lot of troubles with generating and inspected core dump (which didn't help me in the resolution).
And be extra careful when using the unsafe package (as the name helpfully reminds).
\ No newline at end of file

M static/html/style.css => static/html/style.css +14 -0
@@ 78,3 78,17 @@ a[href^="gemini:"]:not([data-no-icon])::before {
.flex-auto {
  flex: 1 1 auto;
blockquote {
  font-style: italic;
  margin-right: 0;
blockquote::before {
  content: "\201C";
  font-size: 4rem;
  line-height: 0;
  margin-right: 0.25rem;
  vertical-align: -1.7rem;
  font-family: ui-serif, Georgia, Cambria, "Times New Roman", Times, serif;
  color: #6a6a6a;
  font-style: normal;

M templates/_default/page.html => templates/_default/page.html +1 -1
@@ 11,7 11,7 @@
    <main class="flex-auto">
      <h1>{{ .Title }}</h1>
        >{{ .Date.Format "January 2, 2006" }} on
        >{{ .Date.Format "2 January 2006" }}, on
        <a href="/">Olivier's log</a></small
      {{ exec "gmnitohtml" .Content | safeHTML }}