runtime: increase timeout in TestSpuriousWakeupsNeverHangSemasleep
This change tries increasing the timeout in
TestSpuriousWakeupsNeverHangSemasleep. I'm not entirely sure of the
mechanism, but GODEBUG=gcstoptheworld=2 and GODEBUG=gccheckmark=1 can
cause this test to fail at it's regular timeout. It does not seem to
indicate a deadlock, because bumping the timeout 10x make the problem
go away. I suspect the problem is due to the long STW times these two
modes can induce, plus the fact this test runs in parallel with others.
Let's just bump the timeout. The test is fundamentally sound, and it's
unclear to me how else to test for a deadlock here.
Fixes #71691.
Fixes #71548.
Change-Id: I649531eeec8a8408ba90823ce5223f3a17863124
Reviewed-on: https://go-review.googlesource.com/c/go/+/652756
Auto-Submit: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
net/http: reject newlines in chunk-size lines
Unlike request headers, where we are allowed to leniently accept
a bare LF in place of a CRLF, chunked bodies must always use CRLF
line terminators. We were already enforcing this for chunk-data lines;
do so for chunk-size lines as well. Also reject bare CRs anywhere
other than as part of the CRLF terminator.
Fixes CVE-2025-22871
Fixes #71988
Change-Id: Ib0e21af5a8ba28c2a1ca52b72af8e2265ec79e4a
Reviewed-on: https://go-review.googlesource.com/c/go/+/652998
Reviewed-by: Jonathan Amsterdam <jba@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
math/big: optimize atoi of base 2, 4, 16
Avoid multiplies when converting base 2, 4, 16 inputs,
reducing conversion time from O(N²) to O(N).
The Base8 and Base10 code paths should be unmodified,
but the base-2,4,16 changes tickle the compiler to generate
better (amd64) or worse (arm64) when really it should not.
This is described in detail in #71868 and should be ignored
for the purposes of this CL.
goos: linux
goarch: amd64
pkg: math/big
cpu: Intel(R) Xeon(R) CPU @ 3.10GHz
│ old │ new │
│ sec/op │ sec/op vs base │
Scan/10/Base2-16 324.4n ± 0% 258.7n ± 0% -20.25% (p=0.000 n=15)
Scan/100/Base2-16 2.376µ ± 0% 1.968µ ± 0% -17.17% (p=0.000 n=15)
Scan/1000/Base2-16 23.89µ ± 0% 19.16µ ± 0% -19.80% (p=0.000 n=15)
Scan/10000/Base2-16 311.5µ ± 0% 190.4µ ± 0% -38.86% (p=0.000 n=15)
Scan/100000/Base2-16 10.508m ± 0% 1.904m ± 0% -81.88% (p=0.000 n=15)
Scan/10/Base8-16 138.3n ± 0% 127.9n ± 0% -7.52% (p=0.000 n=15)
Scan/100/Base8-16 886.1n ± 0% 790.2n ± 0% -10.82% (p=0.000 n=15)
Scan/1000/Base8-16 9.227µ ± 0% 8.234µ ± 0% -10.76% (p=0.000 n=15)
Scan/10000/Base8-16 165.8µ ± 0% 155.6µ ± 0% -6.19% (p=0.000 n=15)
Scan/100000/Base8-16 9.044m ± 0% 8.935m ± 0% -1.20% (p=0.000 n=15)
Scan/10/Base10-16 129.9n ± 0% 120.0n ± 0% -7.62% (p=0.000 n=15)
Scan/100/Base10-16 816.3n ± 0% 730.0n ± 0% -10.57% (p=0.000 n=15)
Scan/1000/Base10-16 8.518µ ± 0% 7.628µ ± 0% -10.45% (p=0.000 n=15)
Scan/10000/Base10-16 158.6µ ± 0% 149.4µ ± 0% -5.80% (p=0.000 n=15)
Scan/100000/Base10-16 8.962m ± 0% 8.855m ± 0% -1.20% (p=0.000 n=15)
Scan/10/Base16-16 114.5n ± 0% 108.6n ± 0% -5.15% (p=0.000 n=15)
Scan/100/Base16-16 648.3n ± 0% 525.0n ± 0% -19.02% (p=0.000 n=15)
Scan/1000/Base16-16 7.375µ ± 0% 5.636µ ± 0% -23.58% (p=0.000 n=15)
Scan/10000/Base16-16 171.18µ ± 0% 66.99µ ± 0% -60.87% (p=0.000 n=15)
Scan/100000/Base16-16 9490.9µ ± 0% 682.8µ ± 0% -92.81% (p=0.000 n=15)
geomean 20.11µ 13.69µ -31.94%
goos: linux
goarch: amd64
pkg: math/big
cpu: Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz
│ old │ new │
│ sec/op │ sec/op vs base │
Scan/10/Base2-88 275.4n ± 0% 215.0n ± 0% -21.93% (p=0.000 n=15)
Scan/100/Base2-88 1.869µ ± 0% 1.629µ ± 0% -12.84% (p=0.000 n=15)
Scan/1000/Base2-88 18.56µ ± 0% 15.81µ ± 0% -14.82% (p=0.000 n=15)
Scan/10000/Base2-88 270.0µ ± 0% 157.2µ ± 0% -41.77% (p=0.000 n=15)
Scan/100000/Base2-88 11.518m ± 0% 1.571m ± 0% -86.36% (p=0.000 n=15)
Scan/10/Base8-88 108.9n ± 0% 106.0n ± 0% -2.66% (p=0.000 n=15)
Scan/100/Base8-88 655.2n ± 0% 594.9n ± 0% -9.20% (p=0.000 n=15)
Scan/1000/Base8-88 6.467µ ± 0% 5.966µ ± 0% -7.75% (p=0.000 n=15)
Scan/10000/Base8-88 151.2µ ± 0% 147.4µ ± 0% -2.53% (p=0.000 n=15)
Scan/100000/Base8-88 10.33m ± 0% 10.30m ± 0% -0.25% (p=0.000 n=15)
Scan/10/Base10-88 100.20n ± 0% 98.53n ± 0% -1.67% (p=0.000 n=15)
Scan/100/Base10-88 596.9n ± 0% 543.3n ± 0% -8.98% (p=0.000 n=15)
Scan/1000/Base10-88 5.904µ ± 0% 5.485µ ± 0% -7.10% (p=0.000 n=15)
Scan/10000/Base10-88 145.7µ ± 0% 142.0µ ± 0% -2.55% (p=0.000 n=15)
Scan/100000/Base10-88 10.26m ± 0% 10.24m ± 0% -0.18% (p=0.000 n=15)
Scan/10/Base16-88 90.33n ± 0% 87.60n ± 0% -3.02% (p=0.000 n=15)
Scan/100/Base16-88 506.4n ± 0% 437.7n ± 0% -13.57% (p=0.000 n=15)
Scan/1000/Base16-88 5.056µ ± 0% 4.007µ ± 0% -20.75% (p=0.000 n=15)
Scan/10000/Base16-88 163.35µ ± 0% 65.37µ ± 0% -59.98% (p=0.000 n=15)
Scan/100000/Base16-88 11027.2µ ± 0% 735.1µ ± 0% -93.33% (p=0.000 n=15)
geomean 17.13µ 11.74µ -31.46%
goos: linux
goarch: arm64
pkg: math/big
│ old │ new │
│ sec/op │ sec/op vs base │
Scan/10/Base2-16 324.7n ± 0% 348.4n ± 0% +7.30% (p=0.000 n=15)
Scan/100/Base2-16 2.604µ ± 0% 3.031µ ± 0% +16.40% (p=0.000 n=15)
Scan/1000/Base2-16 26.15µ ± 0% 29.94µ ± 0% +14.52% (p=0.000 n=15)
Scan/10000/Base2-16 334.3µ ± 0% 298.8µ ± 0% -10.64% (p=0.000 n=15)
Scan/100000/Base2-16 10.664m ± 0% 2.991m ± 0% -71.95% (p=0.000 n=15)
Scan/10/Base8-16 144.4n ± 1% 162.2n ± 1% +12.33% (p=0.000 n=15)
Scan/100/Base8-16 917.2n ± 0% 1084.0n ± 0% +18.19% (p=0.000 n=15)
Scan/1000/Base8-16 9.367µ ± 0% 10.901µ ± 0% +16.38% (p=0.000 n=15)
Scan/10000/Base8-16 164.2µ ± 0% 181.2µ ± 0% +10.34% (p=0.000 n=15)
Scan/100000/Base8-16 8.871m ± 1% 9.140m ± 0% +3.04% (p=0.000 n=15)
Scan/10/Base10-16 134.6n ± 1% 148.3n ± 1% +10.18% (p=0.000 n=15)
Scan/100/Base10-16 837.1n ± 0% 986.6n ± 0% +17.86% (p=0.000 n=15)
Scan/1000/Base10-16 8.563µ ± 0% 9.936µ ± 0% +16.03% (p=0.000 n=15)
Scan/10000/Base10-16 156.5µ ± 1% 171.3µ ± 0% +9.41% (p=0.000 n=15)
Scan/100000/Base10-16 8.863m ± 1% 9.011m ± 0% +1.66% (p=0.000 n=15)
Scan/10/Base16-16 115.7n ± 2% 129.1n ± 1% +11.58% (p=0.000 n=15)
Scan/100/Base16-16 708.6n ± 0% 796.8n ± 0% +12.45% (p=0.000 n=15)
Scan/1000/Base16-16 7.314µ ± 0% 7.554µ ± 0% +3.28% (p=0.000 n=15)
Scan/10000/Base16-16 149.05µ ± 0% 74.60µ ± 0% -49.95% (p=0.000 n=15)
Scan/100000/Base16-16 9091.6µ ± 0% 741.5µ ± 0% -91.84% (p=0.000 n=15)
geomean 20.39µ 17.65µ -13.44%
goos: darwin
goarch: arm64
pkg: math/big
cpu: Apple M3 Pro
│ old │ new │
│ sec/op │ sec/op vs base │
Scan/10/Base2-12 193.8n ± 2% 157.3n ± 1% -18.83% (p=0.000 n=15)
Scan/100/Base2-12 1.445µ ± 2% 1.362µ ± 1% -5.74% (p=0.000 n=15)
Scan/1000/Base2-12 14.28µ ± 0% 13.51µ ± 0% -5.42% (p=0.000 n=15)
Scan/10000/Base2-12 177.1µ ± 0% 134.6µ ± 0% -24.04% (p=0.000 n=15)
Scan/100000/Base2-12 5.429m ± 1% 1.333m ± 0% -75.45% (p=0.000 n=15)
Scan/10/Base8-12 75.52n ± 2% 76.09n ± 1% ~ (p=0.010 n=15)
Scan/100/Base8-12 528.4n ± 1% 532.1n ± 1% ~ (p=0.003 n=15)
Scan/1000/Base8-12 5.423µ ± 1% 5.427µ ± 0% ~ (p=0.183 n=15)
Scan/10000/Base8-12 89.26µ ± 1% 89.37µ ± 0% ~ (p=0.237 n=15)
Scan/100000/Base8-12 4.543m ± 2% 4.560m ± 1% ~ (p=0.595 n=15)
Scan/10/Base10-12 69.87n ± 1% 70.51n ± 0% ~ (p=0.002 n=15)
Scan/100/Base10-12 488.4n ± 1% 491.2n ± 0% ~ (p=0.060 n=15)
Scan/1000/Base10-12 5.014µ ± 1% 5.008µ ± 0% ~ (p=0.783 n=15)
Scan/10000/Base10-12 84.90µ ± 0% 85.10µ ± 0% ~ (p=0.109 n=15)
Scan/100000/Base10-12 4.516m ± 1% 4.521m ± 1% ~ (p=0.713 n=15)
Scan/10/Base16-12 59.21n ± 1% 57.70n ± 1% -2.55% (p=0.000 n=15)
Scan/100/Base16-12 380.0n ± 1% 360.7n ± 1% -5.08% (p=0.000 n=15)
Scan/1000/Base16-12 3.775µ ± 0% 3.421µ ± 0% -9.38% (p=0.000 n=15)
Scan/10000/Base16-12 80.62µ ± 0% 34.44µ ± 1% -57.28% (p=0.000 n=15)
Scan/100000/Base16-12 4826.4µ ± 2% 450.9µ ± 2% -90.66% (p=0.000 n=15)
geomean 11.05µ 8.448µ -23.52%
Change-Id: Ifdb2049545f34072aa75cdbb72bed4cf465f0ad7
Reviewed-on: https://go-review.googlesource.com/c/go/+/650640
Auto-Submit: Russ Cox <rsc@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Robert Griesemer <gri@google.com>
math/big: improve scan test and benchmark
Add a few more test cases for scanning (integer conversion),
which were helpful in debugging some upcoming changes.
BenchmarkScan currently times converting the value 10**N
represented in base B back into []Word form.
When B = 10, the text is 1 followed by many zeros, which
could hit a "multiply by zero" special case when processing
many digit chunks, misrepresenting the actual time required
depending on whether that case is optimized.
Change the benchmark to use 9**N, which is about as big and
will not cause runs of zeros in any of the tested bases.
The benchmark comparison below is not showing faster code,
since of course the code is not changing at all here. Instead,
it is showing that the new benchmark work is roughly the same
size as the old benchmark work.
goos: darwin
goarch: arm64
pkg: math/big
cpu: Apple M3 Pro
│ old │ new │
│ sec/op │ sec/op vs base │
ScanPi-12 43.35µ ± 1% 43.59µ ± 1% ~ (p=0.069 n=15)
Scan/10/Base2-12 202.3n ± 2% 193.7n ± 1% -4.25% (p=0.000 n=15)
Scan/100/Base2-12 1.512µ ± 3% 1.447µ ± 1% -4.30% (p=0.000 n=15)
Scan/1000/Base2-12 15.06µ ± 2% 14.33µ ± 0% -4.83% (p=0.000 n=15)
Scan/10000/Base2-12 188.0µ ± 5% 177.3µ ± 1% -5.65% (p=0.000 n=15)
Scan/100000/Base2-12 5.814m ± 3% 5.382m ± 1% -7.43% (p=0.000 n=15)
Scan/10/Base8-12 78.57n ± 2% 75.02n ± 1% -4.52% (p=0.000 n=15)
Scan/100/Base8-12 548.2n ± 2% 526.8n ± 1% -3.90% (p=0.000 n=15)
Scan/1000/Base8-12 5.674µ ± 2% 5.421µ ± 0% -4.46% (p=0.000 n=15)
Scan/10000/Base8-12 94.42µ ± 1% 88.61µ ± 1% -6.15% (p=0.000 n=15)
Scan/100000/Base8-12 4.906m ± 2% 4.498m ± 3% -8.31% (p=0.000 n=15)
Scan/10/Base10-12 73.42n ± 1% 69.56n ± 0% -5.26% (p=0.000 n=15)
Scan/100/Base10-12 511.9n ± 1% 488.2n ± 0% -4.63% (p=0.000 n=15)
Scan/1000/Base10-12 5.254µ ± 2% 5.009µ ± 0% -4.66% (p=0.000 n=15)
Scan/10000/Base10-12 90.22µ ± 2% 84.52µ ± 0% -6.32% (p=0.000 n=15)
Scan/100000/Base10-12 4.842m ± 3% 4.471m ± 3% -7.65% (p=0.000 n=15)
Scan/10/Base16-12 62.28n ± 1% 58.70n ± 1% -5.75% (p=0.000 n=15)
Scan/100/Base16-12 398.6n ± 0% 377.9n ± 1% -5.19% (p=0.000 n=15)
Scan/1000/Base16-12 4.108µ ± 1% 3.782µ ± 0% -7.94% (p=0.000 n=15)
Scan/10000/Base16-12 83.78µ ± 2% 80.51µ ± 1% -3.90% (p=0.000 n=15)
Scan/100000/Base16-12 5.080m ± 3% 4.698m ± 3% -7.53% (p=0.000 n=15)
geomean 12.41µ 11.74µ -5.36%
Change-Id: If3ce290ecc7f38672f11b42fd811afb53dee665d
Reviewed-on: https://go-review.googlesource.com/c/go/+/650639
Reviewed-by: Alan Donovan <adonovan@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Russ Cox <rsc@golang.org>
math/big: replace nat pool with Word stack
In the early days of math/big, algorithms that needed more space
grew the result larger than it needed to be and then used the
high words as extra space. This made results their own temporary
space caches, at the cost that saving a result in a data structure
might hold significantly more memory than necessary.
Specifically, new(big.Int).Mul(x, y) returned a big.Int with a
backing slice 3X as big as it strictly needed to be.
If you are storing many multiplication results, or even a single
large result, the 3X overhead can add up.
This approach to storage for temporaries also requires being able
to analyze the algorithms to predict the exact amount they need,
which can be difficult.
For both these reasons, the implementation of recursive long division,
which came later, introduced a “nat pool” where temporaries could be
stored and reused, or reclaimed by the GC when no longer used.
This avoids the storage and bookkeeping overheads but introduces a
per-temporary sync.Pool overhead. divRecursiveStep takes an array
of cached temporaries to remove some of that overhead.
The nat pool was better but is still not quite right.
This CL introduces something even better than the nat pool
(still probably not quite right, but the best I can see for now):
a sync.Pool holding stacks for allocating temporaries.
Now an operation can get one stack out of the pool and then
allocate as many temporaries as it needs during the operation,
eventually returning the stack back to the pool. The sync.Pool
operations are now per-exported-operation (like big.Int.Mul),
not per-temporary.
This CL converts both the pre-allocation in nat.mul and the
uses of the nat pool to use stack pools instead. This simplifies
some code and sets us up better for more complex algorithms
(such as Toom-Cook or FFT-based multiplication) that need
more temporaries. It is also a little bit faster.
goos: linux
goarch: amd64
pkg: math/big
cpu: Intel(R) Xeon(R) CPU @ 3.10GHz
│ old │ new │
│ sec/op │ sec/op vs base │
Div/20/10-16 23.68n ± 0% 22.21n ± 0% -6.21% (p=0.000 n=15)
Div/40/20-16 23.68n ± 0% 22.21n ± 0% -6.21% (p=0.000 n=15)
Div/100/50-16 56.65n ± 0% 55.53n ± 0% -1.98% (p=0.000 n=15)
Div/200/100-16 194.6n ± 1% 172.8n ± 0% -11.20% (p=0.000 n=15)
Div/400/200-16 232.1n ± 0% 206.7n ± 0% -10.94% (p=0.000 n=15)
Div/1000/500-16 405.3n ± 1% 383.8n ± 0% -5.30% (p=0.000 n=15)
Div/2000/1000-16 810.4n ± 1% 795.2n ± 0% -1.88% (p=0.000 n=15)
Div/20000/10000-16 25.88µ ± 0% 25.39µ ± 0% -1.89% (p=0.000 n=15)
Div/200000/100000-16 931.5µ ± 0% 924.3µ ± 0% -0.77% (p=0.000 n=15)
Div/2000000/1000000-16 37.77m ± 0% 37.75m ± 0% ~ (p=0.098 n=15)
Div/20000000/10000000-16 1.367 ± 0% 1.377 ± 0% +0.72% (p=0.003 n=15)
NatMul/10-16 168.5n ± 3% 164.0n ± 4% ~ (p=0.751 n=15)
NatMul/100-16 6.086µ ± 3% 5.380µ ± 3% -11.60% (p=0.000 n=15)
NatMul/1000-16 238.1µ ± 3% 228.3µ ± 1% -4.12% (p=0.000 n=15)
NatMul/10000-16 8.721m ± 2% 8.518m ± 1% -2.33% (p=0.000 n=15)
NatMul/100000-16 369.6m ± 0% 371.1m ± 0% +0.42% (p=0.000 n=15)
geomean 19.57µ 18.74µ -4.21%
│ old │ new │
│ B/op │ B/op vs base │
NatMul/10-16 192.0 ± 0% 192.0 ± 0% ~ (p=1.000 n=15) ¹
NatMul/100-16 4.750Ki ± 0% 1.751Ki ± 0% -63.14% (p=0.000 n=15)
NatMul/1000-16 48.16Ki ± 0% 16.02Ki ± 0% -66.73% (p=0.000 n=15)
NatMul/10000-16 482.9Ki ± 1% 165.4Ki ± 3% -65.75% (p=0.000 n=15)
NatMul/100000-16 5.747Mi ± 7% 4.197Mi ± 0% -26.97% (p=0.000 n=15)
geomean 41.42Ki 20.63Ki -50.18%
¹ all samples are equal
│ old │ new │
│ allocs/op │ allocs/op vs base │
NatMul/10-16 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹
NatMul/100-16 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹
NatMul/1000-16 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹
NatMul/10000-16 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹
NatMul/100000-16 7.000 ± 14% 7.000 ± 14% ~ (p=0.668 n=15)
geomean 1.476 1.476 +0.00%
¹ all samples are equal
goos: linux
goarch: amd64
pkg: math/big
cpu: Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz
│ old │ new │
│ sec/op │ sec/op vs base │
Div/20/10-88 15.84n ± 1% 13.12n ± 0% -17.17% (p=0.000 n=15)
Div/40/20-88 15.88n ± 1% 13.12n ± 0% -17.38% (p=0.000 n=15)
Div/100/50-88 26.42n ± 0% 25.47n ± 0% -3.60% (p=0.000 n=15)
Div/200/100-88 132.4n ± 0% 114.9n ± 0% -13.22% (p=0.000 n=15)
Div/400/200-88 150.1n ± 0% 135.6n ± 0% -9.66% (p=0.000 n=15)
Div/1000/500-88 275.5n ± 0% 264.1n ± 0% -4.14% (p=0.000 n=15)
Div/2000/1000-88 586.5n ± 0% 581.1n ± 0% -0.92% (p=0.000 n=15)
Div/20000/10000-88 25.87µ ± 0% 25.72µ ± 0% -0.59% (p=0.000 n=15)
Div/200000/100000-88 772.2µ ± 0% 779.0µ ± 0% +0.88% (p=0.000 n=15)
Div/2000000/1000000-88 33.36m ± 0% 33.63m ± 0% +0.80% (p=0.000 n=15)
Div/20000000/10000000-88 1.307 ± 0% 1.320 ± 0% +1.03% (p=0.000 n=15)
NatMul/10-88 140.4n ± 0% 148.8n ± 4% +5.98% (p=0.000 n=15)
NatMul/100-88 4.663µ ± 1% 4.388µ ± 1% -5.90% (p=0.000 n=15)
NatMul/1000-88 207.7µ ± 0% 205.8µ ± 0% -0.89% (p=0.000 n=15)
NatMul/10000-88 8.456m ± 0% 8.468m ± 0% +0.14% (p=0.021 n=15)
NatMul/100000-88 295.1m ± 0% 297.9m ± 0% +0.94% (p=0.000 n=15)
geomean 14.96µ 14.33µ -4.23%
│ old │ new │
│ B/op │ B/op vs base │
NatMul/10-88 192.0 ± 0% 192.0 ± 0% ~ (p=1.000 n=15) ¹
NatMul/100-88 4.750Ki ± 0% 1.758Ki ± 0% -62.99% (p=0.000 n=15)
NatMul/1000-88 48.44Ki ± 0% 16.08Ki ± 0% -66.80% (p=0.000 n=15)
NatMul/10000-88 489.7Ki ± 1% 166.1Ki ± 3% -66.08% (p=0.000 n=15)
NatMul/100000-88 5.546Mi ± 0% 3.819Mi ± 60% -31.15% (p=0.000 n=15)
geomean 41.29Ki 20.30Ki -50.85%
¹ all samples are equal
│ old │ new │
│ allocs/op │ allocs/op vs base │
NatMul/10-88 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹
NatMul/100-88 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹
NatMul/1000-88 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹
NatMul/10000-88 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹
NatMul/100000-88 5.000 ± 20% 6.000 ± 67% ~ (p=0.672 n=15)
geomean 1.380 1.431 +3.71%
¹ all samples are equal
goos: linux
goarch: arm64
pkg: math/big
│ old │ new │
│ sec/op │ sec/op vs base │
Div/20/10-16 15.85n ± 0% 15.23n ± 0% -3.91% (p=0.000 n=15)
Div/40/20-16 15.88n ± 0% 15.22n ± 0% -4.16% (p=0.000 n=15)
Div/100/50-16 29.69n ± 0% 26.39n ± 0% -11.11% (p=0.000 n=15)
Div/200/100-16 149.2n ± 0% 123.3n ± 0% -17.36% (p=0.000 n=15)
Div/400/200-16 160.3n ± 0% 139.2n ± 0% -13.16% (p=0.000 n=15)
Div/1000/500-16 271.0n ± 0% 256.1n ± 0% -5.50% (p=0.000 n=15)
Div/2000/1000-16 545.3n ± 0% 527.0n ± 0% -3.36% (p=0.000 n=15)
Div/20000/10000-16 22.60µ ± 0% 22.20µ ± 0% -1.77% (p=0.000 n=15)
Div/200000/100000-16 889.0µ ± 0% 892.2µ ± 0% +0.35% (p=0.000 n=15)
Div/2000000/1000000-16 38.01m ± 0% 38.12m ± 0% +0.30% (p=0.000 n=15)
Div/20000000/10000000-16 1.437 ± 0% 1.444 ± 0% +0.50% (p=0.000 n=15)
NatMul/10-16 166.4n ± 2% 169.5n ± 1% +1.86% (p=0.000 n=15)
NatMul/100-16 5.733µ ± 1% 5.570µ ± 1% -2.84% (p=0.000 n=15)
NatMul/1000-16 232.6µ ± 1% 229.8µ ± 0% -1.22% (p=0.000 n=15)
NatMul/10000-16 9.039m ± 1% 8.969m ± 0% -0.77% (p=0.000 n=15)
NatMul/100000-16 367.0m ± 0% 368.8m ± 0% +0.48% (p=0.000 n=15)
geomean 16.15µ 15.50µ -4.01%
│ old │ new │
│ B/op │ B/op vs base │
NatMul/10-16 192.0 ± 0% 192.0 ± 0% ~ (p=1.000 n=15) ¹
NatMul/100-16 4.750Ki ± 0% 1.751Ki ± 0% -63.14% (p=0.000 n=15)
NatMul/1000-16 48.33Ki ± 0% 16.02Ki ± 0% -66.85% (p=0.000 n=15)
NatMul/10000-16 536.5Ki ± 1% 165.7Ki ± 3% -69.12% (p=0.000 n=15)
NatMul/100000-16 6.078Mi ± 6% 4.197Mi ± 0% -30.94% (p=0.000 n=15)
geomean 42.81Ki 20.64Ki -51.78%
¹ all samples are equal
│ old │ new │
│ allocs/op │ allocs/op vs base │
NatMul/10-16 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹
NatMul/100-16 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹
NatMul/1000-16 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹
NatMul/10000-16 2.000 ± 50% 1.000 ± 0% -50.00% (p=0.001 n=15)
NatMul/100000-16 9.000 ± 11% 8.000 ± 12% -11.11% (p=0.001 n=15)
geomean 1.783 1.516 -14.97%
¹ all samples are equal
goos: darwin
goarch: arm64
pkg: math/big
cpu: Apple M3 Pro
│ old │ new │
│ sec/op │ sec/op vs base │
Div/20/10-12 9.850n ± 1% 9.405n ± 1% -4.52% (p=0.000 n=15)
Div/40/20-12 9.858n ± 0% 9.403n ± 1% -4.62% (p=0.000 n=15)
Div/100/50-12 16.40n ± 1% 14.81n ± 0% -9.70% (p=0.000 n=15)
Div/200/100-12 88.48n ± 2% 80.88n ± 0% -8.59% (p=0.000 n=15)
Div/400/200-12 107.90n ± 1% 99.28n ± 1% -7.99% (p=0.000 n=15)
Div/1000/500-12 188.8n ± 1% 178.6n ± 1% -5.40% (p=0.000 n=15)
Div/2000/1000-12 399.9n ± 0% 389.1n ± 0% -2.70% (p=0.000 n=15)
Div/20000/10000-12 13.94µ ± 2% 13.81µ ± 1% ~ (p=0.574 n=15)
Div/200000/100000-12 523.8µ ± 0% 521.7µ ± 0% -0.40% (p=0.000 n=15)
Div/2000000/1000000-12 21.46m ± 0% 21.48m ± 0% ~ (p=0.067 n=15)
Div/20000000/10000000-12 812.5m ± 0% 812.9m ± 0% ~ (p=0.061 n=15)
NatMul/10-12 77.14n ± 0% 78.35n ± 1% +1.57% (p=0.000 n=15)
NatMul/100-12 2.999µ ± 0% 2.871µ ± 1% -4.27% (p=0.000 n=15)
NatMul/1000-12 126.2µ ± 0% 126.8µ ± 0% +0.51% (p=0.011 n=15)
NatMul/10000-12 5.099m ± 0% 5.125m ± 0% +0.51% (p=0.000 n=15)
NatMul/100000-12 206.7m ± 0% 208.4m ± 0% +0.80% (p=0.000 n=15)
geomean 9.512µ 9.236µ -2.91%
│ old │ new │
│ B/op │ B/op vs base │
NatMul/10-12 192.0 ± 0% 192.0 ± 0% ~ (p=1.000 n=15) ¹
NatMul/100-12 4.750Ki ± 0% 1.750Ki ± 0% -63.16% (p=0.000 n=15)
NatMul/1000-12 48.13Ki ± 0% 16.01Ki ± 0% -66.73% (p=0.000 n=15)
NatMul/10000-12 483.5Ki ± 1% 163.2Ki ± 2% -66.24% (p=0.000 n=15)
NatMul/100000-12 5.480Mi ± 4% 1.532Mi ± 104% -72.05% (p=0.000 n=15)
geomean 41.03Ki 16.82Ki -59.01%
¹ all samples are equal
│ old │ new │
│ allocs/op │ allocs/op vs base │
NatMul/10-12 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹
NatMul/100-12 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹
NatMul/1000-12 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹
NatMul/10000-12 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹
NatMul/100000-12 5.000 ± 0% 1.000 ± 400% -80.00% (p=0.007 n=15)
geomean 1.380 1.000 -27.52%
¹ all samples are equal
Change-Id: I7efa6fe37971ed26ae120a32250fcb47ece0a011
Reviewed-on: https://go-review.googlesource.com/c/go/+/650638
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Russ Cox <rsc@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Reviewed-by: Alan Donovan <adonovan@google.com>
math/big: clean up GCD a little
The GCD code was setting one *Int to the value of another
by smashing one struct on top of the other, instead of using Set.
That was safe in this one case, but it's not idiomatic in math/big
nor safe in general, so rewrite the code not to do that.
(In one case, by swapping variables around; in another, by calling Set.)
The added Set call does slow down GCDs by a small amount,
since the answer has to be copied out. To compensate for that,
optimize a bit: remove the s, t temporaries entirely and handle
vector x word multiplication directly. The net result is that almost
all GCDs are faster, except for small ones, which are a few
nanoseconds slower.
goos: darwin
goarch: arm64
pkg: math/big
cpu: Apple M3 Pro
│ bench.before │ bench.after │
│ sec/op │ sec/op vs base │
GCD10x10/WithoutXY-12 23.80n ± 1% 31.71n ± 1% +33.24% (p=0.000 n=10)
GCD10x10/WithXY-12 100.40n ± 0% 92.14n ± 1% -8.22% (p=0.000 n=10)
GCD10x100/WithoutXY-12 63.70n ± 0% 70.73n ± 0% +11.05% (p=0.000 n=10)
GCD10x100/WithXY-12 278.6n ± 0% 233.1n ± 1% -16.35% (p=0.000 n=10)
GCD10x1000/WithoutXY-12 153.4n ± 0% 162.2n ± 1% +5.74% (p=0.000 n=10)
GCD10x1000/WithXY-12 456.0n ± 0% 411.8n ± 1% -9.69% (p=0.000 n=10)
GCD10x10000/WithoutXY-12 1.002µ ± 1% 1.036µ ± 0% +3.39% (p=0.000 n=10)
GCD10x10000/WithXY-12 2.330µ ± 1% 2.210µ ± 0% -5.13% (p=0.000 n=10)
GCD10x100000/WithoutXY-12 8.894µ ± 0% 8.889µ ± 1% ~ (p=0.754 n=10)
GCD10x100000/WithXY-12 20.84µ ± 0% 20.24µ ± 0% -2.84% (p=0.000 n=10)
GCD100x100/WithoutXY-12 373.3n ± 3% 314.4n ± 0% -15.76% (p=0.000 n=10)
GCD100x100/WithXY-12 662.5n ± 0% 572.4n ± 1% -13.59% (p=0.000 n=10)
GCD100x1000/WithoutXY-12 641.8n ± 0% 598.1n ± 1% -6.81% (p=0.000 n=10)
GCD100x1000/WithXY-12 1.123µ ± 0% 1.019µ ± 1% -9.26% (p=0.000 n=10)
GCD100x10000/WithoutXY-12 2.870µ ± 0% 2.831µ ± 0% -1.38% (p=0.000 n=10)
GCD100x10000/WithXY-12 4.930µ ± 1% 4.675µ ± 0% -5.16% (p=0.000 n=10)
GCD100x100000/WithoutXY-12 24.08µ ± 0% 23.97µ ± 0% -0.48% (p=0.007 n=10)
GCD100x100000/WithXY-12 43.66µ ± 0% 42.52µ ± 0% -2.61% (p=0.001 n=10)
GCD1000x1000/WithoutXY-12 3.999µ ± 0% 3.569µ ± 1% -10.75% (p=0.000 n=10)
GCD1000x1000/WithXY-12 6.397µ ± 0% 5.534µ ± 0% -13.49% (p=0.000 n=10)
GCD1000x10000/WithoutXY-12 6.875µ ± 0% 6.450µ ± 0% -6.18% (p=0.000 n=10)
GCD1000x10000/WithXY-12 20.75µ ± 1% 19.17µ ± 1% -7.64% (p=0.000 n=10)
GCD1000x100000/WithoutXY-12 36.38µ ± 0% 35.60µ ± 1% -2.13% (p=0.000 n=10)
GCD1000x100000/WithXY-12 172.1µ ± 0% 174.4µ ± 3% ~ (p=0.052 n=10)
GCD10000x10000/WithoutXY-12 79.89µ ± 1% 75.16µ ± 2% -5.92% (p=0.000 n=10)
GCD10000x10000/WithXY-12 160.1µ ± 0% 150.0µ ± 0% -6.33% (p=0.000 n=10)
GCD10000x100000/WithoutXY-12 213.2µ ± 1% 209.0µ ± 1% -1.98% (p=0.000 n=10)
GCD10000x100000/WithXY-12 1.399m ± 0% 1.342m ± 3% -4.08% (p=0.002 n=10)
GCD100000x100000/WithoutXY-12 5.463m ± 1% 5.504m ± 2% ~ (p=0.190 n=10)
GCD100000x100000/WithXY-12 11.36m ± 0% 11.46m ± 1% +0.86% (p=0.000 n=10)
geomean 6.953µ 6.695µ -3.71%
goos: linux
goarch: amd64
pkg: math/big
cpu: AMD Ryzen 9 7950X 16-Core Processor
│ bench.before │ bench.after │
│ sec/op │ sec/op vs base │
GCD10x10/WithoutXY-32 39.66n ± 4% 44.34n ± 4% +11.77% (p=0.000 n=10)
GCD10x10/WithXY-32 156.7n ± 12% 130.8n ± 2% -16.53% (p=0.000 n=10)
GCD10x100/WithoutXY-32 115.8n ± 5% 120.2n ± 2% +3.89% (p=0.000 n=10)
GCD10x100/WithXY-32 465.3n ± 3% 368.1n ± 2% -20.91% (p=0.000 n=10)
GCD10x1000/WithoutXY-32 201.1n ± 1% 210.8n ± 2% +4.82% (p=0.000 n=10)
GCD10x1000/WithXY-32 652.9n ± 4% 605.0n ± 1% -7.32% (p=0.002 n=10)
GCD10x10000/WithoutXY-32 1.046µ ± 2% 1.143µ ± 1% +9.33% (p=0.000 n=10)
GCD10x10000/WithXY-32 3.360µ ± 1% 3.258µ ± 1% -3.04% (p=0.000 n=10)
GCD10x100000/WithoutXY-32 9.391µ ± 3% 9.997µ ± 1% +6.46% (p=0.000 n=10)
GCD10x100000/WithXY-32 27.92µ ± 1% 28.21µ ± 0% +1.04% (p=0.043 n=10)
GCD100x100/WithoutXY-32 443.7n ± 5% 320.0n ± 2% -27.88% (p=0.000 n=10)
GCD100x100/WithXY-32 789.9n ± 2% 690.4n ± 1% -12.60% (p=0.000 n=10)
GCD100x1000/WithoutXY-32 718.4n ± 3% 600.0n ± 1% -16.48% (p=0.000 n=10)
GCD100x1000/WithXY-32 1.388µ ± 4% 1.175µ ± 1% -15.28% (p=0.000 n=10)
GCD100x10000/WithoutXY-32 2.750µ ± 1% 2.668µ ± 1% -2.96% (p=0.000 n=10)
GCD100x10000/WithXY-32 6.016µ ± 1% 5.590µ ± 1% -7.09% (p=0.000 n=10)
GCD100x100000/WithoutXY-32 21.40µ ± 1% 22.30µ ± 1% +4.21% (p=0.000 n=10)
GCD100x100000/WithXY-32 47.02µ ± 4% 48.80µ ± 0% +3.78% (p=0.015 n=10)
GCD1000x1000/WithoutXY-32 3.417µ ± 4% 3.020µ ± 1% -11.65% (p=0.000 n=10)
GCD1000x1000/WithXY-32 5.752µ ± 0% 5.418µ ± 2% -5.81% (p=0.000 n=10)
GCD1000x10000/WithoutXY-32 6.150µ ± 0% 6.246µ ± 1% +1.55% (p=0.000 n=10)
GCD1000x10000/WithXY-32 24.68µ ± 3% 25.07µ ± 1% ~ (p=0.051 n=10)
GCD1000x100000/WithoutXY-32 34.60µ ± 2% 36.85µ ± 1% +6.51% (p=0.000 n=10)
GCD1000x100000/WithXY-32 209.5µ ± 4% 227.4µ ± 0% +8.56% (p=0.000 n=10)
GCD10000x10000/WithoutXY-32 90.69µ ± 0% 88.48µ ± 0% -2.44% (p=0.000 n=10)
GCD10000x10000/WithXY-32 197.1µ ± 0% 200.5µ ± 0% +1.73% (p=0.000 n=10)
GCD10000x100000/WithoutXY-32 239.1µ ± 0% 242.5µ ± 0% +1.42% (p=0.000 n=10)
GCD10000x100000/WithXY-32 1.963m ± 3% 2.028m ± 0% +3.28% (p=0.000 n=10)
GCD100000x100000/WithoutXY-32 7.466m ± 0% 7.412m ± 0% -0.71% (p=0.000 n=10)
GCD100000x100000/WithXY-32 16.10m ± 2% 16.47m ± 0% +2.25% (p=0.000 n=10)
geomean 8.388µ 8.127µ -3.12%
Change-Id: I161dc409bad11bcc553bc8116449905ae5b06742
Reviewed-on: https://go-review.googlesource.com/c/go/+/650636
Reviewed-by: Robert Griesemer <gri@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Alan Donovan <adonovan@google.com>
Auto-Submit: Russ Cox <rsc@golang.org>
cmd/internal/obj/riscv: implement vector load/store instructions
Implement vector unit stride, vector strided, vector indexed and
vector whole register load and store instructions.
The vector unit stride instructions take an optional vector mask
register, which if specified must be register V0. If only two
operands are given, the instruction is encoded as unmasked.
The vector strided and vector indexed instructions also take an
optional vector mask register, which if specified must be register
V0. If only three operands are given, the instruction is encoded as
unmasked.
Cq-Include-Trybots: luci.golang.try:gotip-linux-riscv64
Change-Id: I35e43bb8f1cf6ae8826fbeec384b95ac945da50f
Reviewed-on: https://go-review.googlesource.com/c/go/+/631937
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Mark Ryan <markdryan@rivosinc.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Meng Zhuo <mengzhuo1203@gmail.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Pengcheng Wang <wangpengcheng.pp@bytedance.com>