The last 3dnow functions have been removed in commit
5ef613bcb0, so don't test
it in checkasm.
(This will affect only one test, namely scalarproduct_and_madd_int16
from lossless_audiodsp: It does not use an SSSE3 function when
the 3dnow flag is set. So for old AMDs (which advertise support for
3dnow), said SSSE3 function is never tested. Now it will.)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Make it possible to run only a subset of the VP9 tests
in addition to all of them (via the vp9dsp test). This
reduces noise and speeds up testing.
FATE continues to use vp9dsp.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Add checkasm coverage for the XYZ12LE to RGB48LE path via the
ctx->xyz12Torgb48 hook. Integrate the test into the build and runner,
exercise a variety of widths/heights, compare against the C reference,
and benchmark when width is multiple of 4.
This improves test coverage for the new function pointer in preparation
for architecture-specific implementations in subsequent commits.
Signed-off-by: Arpad Panyik <Arpad.Panyik@arm.com>
The width 16 epel functions never use four taps in any direction*,
so don't build said functions. Saves 4352B of .text and 89B of
.text.unlikely here.
*: mx and my in vp8_mc_luma() are always even.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Fixes this test under UBSan:
runtime error: call to function dct_unquantize_mpeg1_intra_c through pointer to incorrect function type 'void (*)(struct MpegEncContext *, short *, int, int)'
I don't know how I could forget this.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This adds a test for the mpegvideo unquantize functions.
It has been written in order to be able to easily bench
these functions. It should be noted that the random input
fed to the tested functions is not necessarily representative
of the stuff actually occuring in the wild. So benchmarks should
be taken with a grain of salt; but comparisons between two functions
that do not depend on branch predictions are valid (the usecase
for this is to port the x86 mmx functions to use xmm registers).
During testing I have found a bug in the arm/aarch64 neon optimizations
when using the LIBMPEG2 permutation (used by FF_IDCT_INT): The code
seems to be based on the presumption that the remainder of the number
of coefficients to process is always <= 4 mod 16. The test therefore
sometimes fails for these arches.
Hint: I am not certain that 16 bits are enough for the intermediate
values of all the computations involved; e.g. both FLV and MPEG-4
escape values can go beyond that after the corresponding
multiplications. The input in this test is nevertheless designed
to fit into 16 bits.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
With --disable-asm, ARCH_X86_32 is set to 0, but we still build the
checkasm binary. Update the check so it is config.h agnostic.
Signed-off-by: Kacper Michajłow <kasper93@gmail.com>
It is ABI compliant and gives a tiny speedup here (and is 16B smaller).
Old benchmarks:
h264_luma_dc_dequant_idct_8_c: 33.2 ( 1.00x)
h264_luma_dc_dequant_idct_8_sse2: 16.0 ( 2.07x)
New benchmarks:
h264_luma_dc_dequant_idct_8_c: 33.0 ( 1.00x)
h264_luma_dc_dequant_idct_8_sse2: 15.0 ( 2.20x)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Fixes:
src/tests/checkasm/sw_ops.c:441:34: runtime error: shift exponent 32 is too large for 32-bit type 'int'
src/tests/checkasm/sw_ops.c:591:37: runtime error: shift exponent 32 is too large for 32-bit type 'int'
Signed-off-by: James Almer <jamrial@gmail.com>
It gains a lot because it has to operate on eight words;
it also saves 608B of .text here.
Old benchmarks:
column_fidct_c: 3365.7 ( 1.00x)
column_fidct_mmx: 1784.6 ( 1.89x)
New benchmarks:
column_fidct_c: 3361.5 ( 1.00x)
column_fidct_sse2: 801.1 ( 4.20x)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This fixes an ABI violation, as mul_thrmat did not issue emms.
It seems that this ABI violation could reach the user, namely
if ff_get_video_buffer() fails. Notice that ff_get_video_buffer()
itself could fail because of this, namely if the allocator uses
floating point registers.
On x64 (where GCC already used SSE2 in the C version)
mul_thrmat_c: 4.4 ( 1.00x)
mul_thrmat_mmx: 8.6 ( 0.52x)
mul_thrmat_sse2: 4.4 ( 1.00x)
On 32bit (where SSE2 is not known to be available):
mul_thrmat_c: 56.0 ( 1.00x)
mul_thrmat_sse2: 6.0 ( 9.40x)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
It is only used by mpegvideo decoders (for lowres). It is also only used
for bitdepth == 8, so don't build the bitdepth == 16 function at all any
more.
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Improves performance and no longer breaks the ABI (by forgetting
to call emms).
Old benchmarks:
add_8x8basis_c: 43.6 ( 1.00x)
add_8x8basis_ssse3: 12.3 ( 3.55x)
New benchmarks:
add_8x8basis_c: 43.0 ( 1.00x)
add_8x8basis_ssse3: 6.3 ( 6.79x)
Notice that the output of try_8x8basis_ssse3 changes a bit:
Before this commit, it computes certain values and adds the values
for i,i+1,i+4 and i+5 before right shifting them; now it adds
the values for i,i+1,i+8,i+9. The second pair in these lists
could be avoided (by shifting xmm0 and xmm1 before adding both together
instead of only shifting xmm0 after adding them), but the former
i,i+1 is inherent in using pmaddwd. This is the reason that this
function is not bitexact.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Check only on arches that need said check.
(Btw: I do not see how h_loop_filter benefits from alignment
at all and why h_loop_filter_unaligned exists.)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The old code operated on bytes and did lots of tricks
due to their limited range; it did not completely succeed,
which is why the old versions were not used when bitexact
output was requested.
In contrast, the new version is much simpler: It operates
on signed 16 bit words whose range is more than sufficient.
This means that these functions don't need a check for bitexactness
(and can be used in FATE).
Old benchmarks (for this, the AV_CODEC_FLAG_BITEXACT check has been
removed from checkasm):
h_loop_filter_c: 29.8 ( 1.00x)
h_loop_filter_mmxext: 32.2 ( 0.93x)
h_loop_filter_unaligned_c: 29.9 ( 1.00x)
h_loop_filter_unaligned_mmxext: 31.4 ( 0.95x)
v_loop_filter_c: 39.3 ( 1.00x)
v_loop_filter_mmxext: 14.2 ( 2.78x)
v_loop_filter_unaligned_c: 38.9 ( 1.00x)
v_loop_filter_unaligned_mmxext: 14.3 ( 2.72x)
New benchmarks:
h_loop_filter_c: 29.2 ( 1.00x)
h_loop_filter_sse2: 28.6 ( 1.02x)
h_loop_filter_unaligned_c: 29.0 ( 1.00x)
h_loop_filter_unaligned_sse2: 26.9 ( 1.08x)
v_loop_filter_c: 38.3 ( 1.00x)
v_loop_filter_sse2: 11.0 ( 3.47x)
v_loop_filter_unaligned_c: 35.5 ( 1.00x)
v_loop_filter_unaligned_sse2: 11.2 ( 3.18x)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Blocksize 2 is Snow-only, so move all the code pertaining
to it to snow.c. Also make the put array in H264QpelContext
smaller -- it only needs three sets of 16 function pointers.
This continues 6eb8bc4217
and b0c91c2fba.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The 2x2 put functions are only used by Snow and Snow uses
only the eight bit versions. The rest is dead code. Disabling
it saved 41277B here.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This is required placement by standard [[maybe_unused]] attribute, works
the same for __attribute__((unused)).
Signed-off-by: Kacper Michajłow <kasper93@gmail.com>
If we're invoked with range == UINT_MAX, we end up doing
"rnd() % (UINT_MAX + 1)", which is equal to "rnd() % 0". On
arm (on all platforms) and on MSVC i386, this ends up crashing
at runtime.
This fixes the crash.
Because of the lack of an external ABI on low-level kernels, we cannot
directly test internal functions. Instead, we construct a minimal op chain
consisting of a read, the op to be tested, and a write.
The bigger complication arises from the fact that the backend may generate
arbitrary internal state that needs to be passed back to the implementation,
which means we cannot directly call `func_ref` on the generated chain. To get
around this, always compile the op chain twice - once using the backend to be
tested, and once using the reference C backend.
The actual entry point may also just be a shared wrapper, so we need to
be very careful to run checkasm_check_func() on a pseudo-pointer that will
actually be unique for each combination of backend and active CPU flags.
We split the standard macro into its body (implementation) and declaration,
and use a macro argument in place of the raw `memcmp` call, with the major
difference that we now take the number of pixels to compare instead of the
number of bytes (to match the signature of float_near_ulp_array).
Sometimes, when measuring very small functions, rdtsc is not accurate enough
to get a reliable measurement. This increases the number of runs inside the
inner loop from 4 to 32, which should help a lot. Less important when using
the more precise linux-perf API, but still useful.
There should be no user-visible change since the number of runs is adjusted
to keep the total time spent measuring the same.
It can be useful to know if the alpha plane consists of fully opaque
pixels or not, in which case it can e.g. safely be stripped.
This only requires a very minor modification to the AVX2 routines, adding
an extra AND on the read alpha value with the reference alpha value, and a
single extra cheap test per line.
detect_alpha_8_full_c: 2849.1 ( 1.00x)
detect_alpha_8_full_avx2: 260.3 (10.95x)
detect_alpha_8_full_avx512icl: 130.2 (21.87x)
detect_alpha_8_limited_c: 8349.2 ( 1.00x)
detect_alpha_8_limited_avx2: 756.6 (11.04x)
detect_alpha_8_limited_avx512icl: 364.2 (22.93x)
detect_alpha_16_full_c: 1652.8 ( 1.00x)
detect_alpha_16_full_avx2: 236.5 ( 6.99x)
detect_alpha_16_full_avx512icl: 134.6 (12.28x)
detect_alpha_16_limited_c: 5263.1 ( 1.00x)
detect_alpha_16_limited_avx2: 797.4 ( 6.60x)
detect_alpha_16_limited_avx512icl: 400.3 (13.15x)
This safety margin was motivated by the fact that vf_premultiply sometimes
produces such illegally high values, but this has since been fixed by
603334a043, so there's no more reason to have this safety margin, at
least for our own code. (Of course, other sources may also produce such
broken files, but we shouldn't work around that - garbage in, garbage out.)
See-Also: 603334a043