plan9: fix compilation on 386 and arm
mkfile: use C builtins when not building on amd64
Merge remote-tracking branch 'upstream/master'
arm32: loopfilter: NEON implementation of loopfilter for 16 bpc
This operates on 4 pixels at a time, while the arm64 version
operates on 8 pixels at a time.
As the registers only fit one single 4 pixel wide slice (with one
single set of input parameters and mask bits), the high level
logic for calculating those input parameters is done with GPRs
and scalar instructions instead of SIMD as in the other implementations.
arm64: loopfilter16: Fix conditions for skipping parts of the filtering
As the arm64 16 bpc loopfilter operates on an 8 pixel region at a time,
inspect 2 bits (corresponding to 4 pixels each) from these registers,
as we also shift them down by 2 bits at the end of the loop.
This should allow skipping the loopfilter altogether (or using a
smaller filter) in more cases.
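
The condition described above can be sketched in C. This is only an
illustration, not the actual assembly: the function name, the mask layout
(one bit per 4 pixels, lowest bits first) and the block count are assumptions
made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: count how many 8-pixel iterations need filtering.
 * Assumed layout: each bit of vmask covers 4 pixels, so one 8-pixel
 * iteration corresponds to the 2 lowest bits of the mask. */
static int count_filtered_blocks(uint32_t vmask, int num_blocks)
{
    assert(num_blocks >= 0 && num_blocks <= 16); /* 32 bits / 2 bits each */
    int filtered = 0;
    for (int i = 0; i < num_blocks; i++) {
        /* Inspect only the 2 bits that belong to this 8-pixel region;
         * testing more bits would filter regions that could be skipped. */
        if (vmask & 3)
            filtered++;
        /* Shift down by 2 bits at the end of the loop, matching the
         * 2 bits consumed per iteration. */
        vmask >>= 2;
    }
    return filtered;
}
```

With a mask of 0x0C and two blocks, only the second 8-pixel region is
filtered; the first one is skipped because its two bits are clear.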
arm32: loopfilter: Fix a misindented/aligned operand
arm: loopfilter: Compare L != 0 before doing a splat
x86: Rewrite wiener SSE2/SSSE3/AVX2 asm
The previous implementation did two separate passes in the horizontal
and vertical directions, with the intermediate values being stored
in a buffer on the stack. This caused bad cache thrashing.
By interleaving the horizontal and vertical passes, in combination
with a ring buffer that stores only a few rows at a time, performance
is improved significantly.
Also split the function into 7-tap and 5-tap versions. The latter is
faster and fairly common (always for chroma, sometimes for luma).
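
The interleaving idea can be sketched in C as follows. This is a simplified
illustration and not the asm in this commit: it uses a 3-tap [1 2 1] kernel
instead of the 7-tap/5-tap wiener filters, and the names filter_h, filter_2d,
TAPS and MAX_W are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

#define TAPS  3   /* illustrative kernel size, not the wiener tap count */
#define MAX_W 64  /* assumed maximum row width for the ring buffer */

static int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : v > hi ? hi : v;
}

/* Horizontal pass for one row: [1 2 1] with edge replication. */
static void filter_h(int *dst, const uint8_t *src, int w)
{
    for (int x = 0; x < w; x++) {
        const int l = src[clampi(x - 1, 0, w - 1)];
        const int r = src[clampi(x + 1, 0, w - 1)];
        dst[x] = l + 2 * src[x] + r;
    }
}

/* Interleaved 2D filter: only TAPS intermediate rows are kept alive in a
 * ring buffer, instead of a full-height intermediate buffer. */
static void filter_2d(int *dst, const uint8_t *src, int w, int h)
{
    assert(w > 0 && w <= MAX_W && h > 0);
    int ring[TAPS][MAX_W];
    int next = 0; /* next source row that still needs the horizontal pass */

    for (int y = 0; y < h; y++) {
        const int last = clampi(y + 1, 0, h - 1);
        /* Run the horizontal pass only for rows not yet in the ring.
         * Row y+1 overwrites the slot of row y-2, which is no longer used. */
        for (; next <= last; next++)
            filter_h(ring[next % TAPS], src + next * w, w);

        const int *above = ring[clampi(y - 1, 0, h - 1) % TAPS];
        const int *cur   = ring[y % TAPS];
        const int *below = ring[last % TAPS];
        /* Vertical pass on rows that are still hot in the cache. */
        for (int x = 0; x < w; x++)
            dst[y * w + x] = above[x] + 2 * cur[x] + below[x];
    }
}
```

The intermediate storage here is TAPS rows regardless of the image height,
which is the property that avoids the cache thrashing of the two-pass layout.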
x86: Rename looprestoration_ssse3.asm to looprestoration_sse.asm
It contains both SSE2 and SSSE3 code.
Add miscellaneous minor wiener optimizations
Combine horizontal and vertical filter pointers into a single parameter
when calling the wiener DSP function.
Eliminate the +128 filter coefficient handling where possible.
Use smaller data types for wiener filter coefficients
Reduces memory usage by 96 bytes per sb.
Simplify msac subexp decoding
fuzzer: Test calling dav1d_picture_unref() after dav1d_close()
Covers the use case of keeping a reference to a Dav1dPicture
after closing the decoder.
Fix use of references to buffers after calling dav1d_close()
9057d286 had the side effect of causing references to buffers allocated
using memory pools to no longer be valid after closing the decoder.
Restore this functionality by making buffer pools reference counted.
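
A minimal sketch of the lifetime rule, assuming a much simpler design than
the real one: Pool, Buf, pool_create, pool_alloc, pool_unref, buf_release and
live_pools are hypothetical names, the counter is not atomic, and buffers are
not actually recycled. It only shows that a pool outlives the "close" call
for as long as a buffer still references it.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Pool { int ref_cnt; } Pool;
typedef struct Buf  { Pool *pool; unsigned char *data; } Buf;

static int live_pools; /* test-only bookkeeping for this sketch */

static Pool *pool_create(void)
{
    Pool *p = malloc(sizeof(*p));
    if (!p) return NULL;
    p->ref_cnt = 1; /* reference held by the owner (e.g. the decoder) */
    live_pools++;
    return p;
}

/* Drop one reference; the pool is freed only when the last one is gone. */
static void pool_unref(Pool *p)
{
    assert(p->ref_cnt > 0);
    if (--p->ref_cnt == 0) {
        free(p);
        live_pools--;
    }
}

/* Every buffer handed out takes its own reference on the pool. */
static Buf *pool_alloc(Pool *p, size_t size)
{
    Buf *b = malloc(sizeof(*b));
    if (!b) return NULL;
    b->data = malloc(size);
    if (!b->data) { free(b); return NULL; }
    b->pool = p;
    p->ref_cnt++;
    return b;
}

/* Releasing a buffer drops its pool reference as the final step. */
static void buf_release(Buf *b)
{
    Pool *p = b->pool;
    free(b->data);
    free(b);
    pool_unref(p);
}
```

In this model, closing the decoder corresponds to the owner calling
pool_unref(); a buffer obtained earlier stays valid until buf_release().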
arm32: looprestoration: NEON implementation of SGR for 10 bpc
Checkasm numbers:            Cortex A7          A8         A53         A72         A73
selfguided_3x3_10bpc_neon:    919127.6    717942.8    565717.8    404748.0    372179.8
selfguided_5x5_10bpc_neon:    640310.8    511873.4    370653.3    273593.7    256403.2
selfguided_mix_10bpc_neon:   1533887.0   1252389.5    922111.1    659033.4    613410.6
Corresponding numbers for arm64, for comparison:
                            Cortex A53         A72         A73
selfguided_3x3_10bpc_neon:    500706.0    367199.2    345261.2
selfguided_5x5_10bpc_neon:    361403.3    270550.0    249955.3
selfguided_mix_10bpc_neon:    846172.4    623590.3    578404.8
arm32: looprestoration: Prepare for 16 bpc by splitting code to separate files
looprestoration_common.S contains functions that can be used as is
with one single instantiation of the functions for both 8 and 16 bpc.
This file will be built once, regardless of which bitdepths are enabled.
looprestoration_tmpl.S contains functions where the source can be shared
and templated between 8 and 16 bpc. This will be included by the separate
8/16bpc implementation files.
arm: looprestoration16: Fix comments referring to pixels as bytes
A number of other similar comments were updated to say pixels when
the 16 bpc code was written originally, but these were missed.