370 Commits

Author SHA1 Message Date
Andreas Rheinhardt
4fc05c28f4 avfilter/x86/vf_gradfun: Remove MMXEXT func overridden by SSSE3
SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD),
so that the overwhelming majority of our users (particularly those
that actually update their FFmpeg) will be using the SSSE3 version
of filter_line.
This commit therefore removes the overridden MMXEXT version
(which didn't abide by the ABI) which allows us to remove
an emms_c() from vf_gradfun.c, so that users with SSSE3 no longer
pay a price for the mere existence of an MMXEXT version.

Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-09-26 06:21:35 +02:00
Niklas Haas
843920d5d6 avfilter/x86/vf_idetdsp: add AVX2 and AVX512 implementations
The only thing that changes slightly is the horizontal sum at the end.
2025-09-21 11:02:41 +00:00
Niklas Haas
4c067d0778 avfilter/x86/vf_idetdsp: generalize 8-bit macro
This is mostly compatible with AVX as well, so turn it into a macro.
2025-09-21 11:02:41 +00:00
Niklas Haas
326abf359f avfilter/vf_idetdsp: use consistent uint8_t pointer type
Even for 16-bit DSP functions. Instead, cast the pointer inside the
function.
2025-09-21 11:02:41 +00:00
Niklas Haas
60dbcc5321 avfilter/vf_idetdsp: pass actual bit depth
More informative and IMO cleaner; some implementations may want to
differentiate by exact bit depth or support 32 bit down the line.
2025-09-21 11:02:41 +00:00
Niklas Haas
5830743363 avfilter/vf_idet: separate DSP parts
To avoid pulling in the entire libavfilter when using the DSP functions
from checkasm.

The rest of the struct is not needed outside vf_idet.c and was moved there.
2025-09-21 11:02:41 +00:00
Andreas Rheinhardt
a35c91dc14 avfilter/vf_colordetect: Rename header to vf_colordetectdsp.h
It is more in line with our naming conventions.

Reviewed-by: Martin Storsjö <martin@martin.st>
Reviewed-by: Niklas Haas <ffmpeg@haasn.dev>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-09-16 18:22:24 +02:00
Niklas Haas
ba8aa0e7b3 avfilter/x86/vf_overlay: simplify function signature
No reason to pass all the variables again, if we're already passing the
context.
2025-09-02 17:06:25 +02:00
Niklas Haas
6d6bbdaab0 avfilter/vf_overlay: rename variables for clarity
`is_straight`, `alpha_mode` etc. are more consistently named to refer to
either the main image, or the overlay.
2025-09-02 17:06:25 +02:00
Niklas Haas
6f3eddbedd avfilter/vf_overlay: configure alpha mode on the link
And use the link-tagged value instead of the hard-coded parameter.
2025-09-02 17:06:25 +02:00
Niklas Haas
f07c12d806 avfilter/x86/vf_colordetect: fix alpha detect tail handling
This wrapping logic still considered any nonzero return from the ASM function
to be the overall result, but this is not true since the addition of
FF_ALPHA_TRANSPARENT.

Fix it by only early returning if FF_ALPHA_STRAIGHT is detected.

Fixes: 9b8b78a815
See-Also: https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20301#issuecomment-4802
2025-09-01 15:33:43 +00:00
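Note on f07c12d806: the tail-handling fix can be sketched roughly as below. This is only an illustration under assumptions: the enum values, the function names, and the `step` parameter are hypothetical, and the scalar routine stands in for the assembly function. The point is that only the "straight" result may short-circuit, while a weaker bulk result still has to be merged with the tail.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes, ordered by strength. The real FF_ALPHA_*
 * constants are defined by FFmpeg and may have different values. */
enum {
    ALPHA_UNDETERMINED = 0, /* nothing conclusive seen */
    ALPHA_TRANSPARENT  = 1, /* some alpha below the maximum */
    ALPHA_STRAIGHT     = 2, /* some color sample exceeds its alpha */
};

/* Scalar reference; also used here as a stand-in for the SIMD routine. */
static int detect_alpha_c(const uint8_t *color, const uint8_t *alpha, size_t n)
{
    int ret = ALPHA_UNDETERMINED;
    for (size_t i = 0; i < n; i++) {
        if (color[i] > alpha[i])
            return ALPHA_STRAIGHT;   /* impossible for premultiplied data */
        if (alpha[i] != 255)
            ret = ALPHA_TRANSPARENT;
    }
    return ret;
}

/* Wrapper: vector-sized bulk first, then the leftover tail. */
static int detect_alpha_wrapped(const uint8_t *color, const uint8_t *alpha,
                                size_t n, size_t step)
{
    size_t bulk = n - n % step;
    int ret = detect_alpha_c(color, alpha, bulk); /* would be the asm call */
    if (ret == ALPHA_STRAIGHT)
        return ret;                  /* strongest result: safe to stop early */
    /* A merely "transparent" bulk must not hide a straight-alpha tail. */
    int tail = detect_alpha_c(color + bulk, alpha + bulk, n - bulk);
    return tail > ret ? tail : ret;
}
```

Returning on any nonzero bulk result, as the old wrapper did, would report "transparent" for a line whose tail contains a straight-alpha pixel.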
Niklas Haas
9b8b78a815 avfilter/vf_colordetect: detect fully opaque alpha planes
It can be useful to know if the alpha plane consists of fully opaque
pixels or not, in which case it can e.g. safely be stripped.

This only requires a very minor modification to the AVX2 routines, adding
an extra AND on the read alpha value with the reference alpha value, and a
single extra cheap test per line.

detect_alpha_8_full_c:                                2849.1 ( 1.00x)
detect_alpha_8_full_avx2:                              260.3 (10.95x)
detect_alpha_8_full_avx512icl:                         130.2 (21.87x)
detect_alpha_8_limited_c:                             8349.2 ( 1.00x)
detect_alpha_8_limited_avx2:                           756.6 (11.04x)
detect_alpha_8_limited_avx512icl:                      364.2 (22.93x)
detect_alpha_16_full_c:                               1652.8 ( 1.00x)
detect_alpha_16_full_avx2:                             236.5 ( 6.99x)
detect_alpha_16_full_avx512icl:                        134.6 (12.28x)
detect_alpha_16_limited_c:                            5263.1 ( 1.00x)
detect_alpha_16_limited_avx2:                          797.4 ( 6.60x)
detect_alpha_16_limited_avx512icl:                     400.3 (13.15x)
2025-08-18 18:50:00 +00:00
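Note on 9b8b78a815: the "extra AND" idea can be illustrated in scalar C as below. This is a sketch, not the AVX2 code; the function name is hypothetical, and it assumes the maximum alpha value has the form 2^depth - 1 (all bits set) and that no sample exceeds it.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if every alpha sample equals alpha_max, else 0.
 * Assumes alpha_max is all-ones for the bit depth (e.g. 255 or 1023)
 * and that no sample is larger than alpha_max. */
static int alpha_is_opaque(const uint16_t *alpha, size_t n, uint16_t alpha_max)
{
    uint16_t acc = alpha_max;        /* start from the reference value */
    for (size_t i = 0; i < n; i++)
        acc &= alpha[i];             /* a cleared bit never comes back */
    return acc == alpha_max;         /* one cheap test at the end of the line */
}
```

Because the accumulator only ever loses bits, a single comparison after the loop is enough to tell whether any sample was below the maximum.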
Niklas Haas
c96ccd78fc avfilter/vf_colordetect: rename p, q, k variables for clarity
Purely cosmetic.

Motivated in part because I want to depend on the assumption that P
represents the maximum alpha channel value.
2025-08-18 18:50:00 +00:00
James Almer
3f58c9df14 avfilter/x86/vf_bwdif: use the correct preprocessor check
Signed-off-by: James Almer <jamrial@gmail.com>
2025-08-03 19:26:18 -03:00
Niklas Haas
7f00e24d70 vf_bwdif: add AVX512 implementation
I also tried replacing some of the instructions by more elaborate ones
using masks, but I found no performance gain significant enough to be worth
maintaining two code paths, so this implementation merely replaces the AVX2
implementation by drop-in AVX512 equivalents.

bwdif8_c:                                             6362.2 ( 1.00x)
bwdif8_sse2:                                          1004.9 ( 6.33x)
bwdif8_ssse3:                                          946.0 ( 6.73x)
bwdif8_avx2:                                           477.9 (13.31x)
bwdif8_avx512:                                         273.3 (23.28x)

bwdif10_c:                                            6341.5 ( 1.00x)
bwdif10_sse2:                                          872.4 ( 7.27x)
bwdif10_ssse3:                                         803.4 ( 7.89x)
bwdif10_avx2:                                          416.7 (15.22x)
bwdif10_avx512:                                        224.3 (28.27x)

Realtime test at 3840x2160 yuv420p:

avx2:   frame=20000 fps=3370 q=-0.0 Lsize=N/A time=00:06:40.00 bitrate=N/A speed=67.4x elapsed=0:00:05.93
avx512: frame=20000 fps=5077 q=-0.0 Lsize=N/A time=00:06:40.00 bitrate=N/A speed= 102x elapsed=0:00:03.93

The use of this function is gated behind avx512icl so that it doesn't
downclock on Skylake.
2025-08-03 22:13:51 +00:00
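Note on 7f00e24d70: gating on the Ice Lake feature level rather than on plain AVX512 could look roughly like the dispatch below. The flag bits, enum names, and function name are hypothetical and are not FFmpeg's actual CPU flag definitions; the sketch only shows the selection order.

```c
/* Hypothetical CPU feature bits, for illustration only. */
enum {
    CPU_SSE2      = 1 << 0,
    CPU_SSSE3     = 1 << 1,
    CPU_AVX2      = 1 << 2,
    CPU_AVX512    = 1 << 3, /* baseline AVX512, e.g. Skylake-X */
    CPU_AVX512ICL = 1 << 4, /* Ice Lake feature level or newer */
};

enum bwdif_impl { IMPL_C, IMPL_SSE2, IMPL_SSSE3, IMPL_AVX2, IMPL_AVX512 };

/* Later checks override earlier ones, so the best supported version wins. */
static enum bwdif_impl pick_bwdif_impl(unsigned flags)
{
    enum bwdif_impl impl = IMPL_C;
    if (flags & CPU_SSE2)
        impl = IMPL_SSE2;
    if (flags & CPU_SSSE3)
        impl = IMPL_SSSE3;
    if (flags & CPU_AVX2)
        impl = IMPL_AVX2;
    /* Deliberately test the ICL level, not CPU_AVX512: a CPU with only
     * baseline AVX512 keeps the AVX2 version and avoids ZMM downclocking. */
    if (flags & CPU_AVX512ICL)
        impl = IMPL_AVX512;
    return impl;
}
```

With this ordering, a Skylake-class flag set (AVX2 plus baseline AVX512) still resolves to the AVX2 function.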
Timo Rothenpieler
262d41c804 all: fix typos found by codespell
2025-08-03 13:48:47 +02:00
James Almer
a01dc3aa27 avfilter/x86/vf_colordetect: add missing preprocessor checks
Signed-off-by: James Almer <jamrial@gmail.com>
2025-07-21 18:03:22 -03:00
James Almer
c62813a057 avfilter/x86/vf_colordetect: make the AVX512 functions run only on ICL targets or newer
For detect_range, the usage of vpbroadcast{b,w} requires the AVX512BW extension, and for
detect_alpha we don't want ZMM instructions downclocking old CPUs.

Signed-off-by: James Almer <jamrial@gmail.com>
2025-07-21 17:25:28 -03:00
James Almer
70fc4e5909 avfilter/x86/vf_colordetect_init: don't enable ASM functions on targets where it's known they will be slower
Signed-off-by: James Almer <jamrial@gmail.com>
2025-07-21 16:58:51 -03:00
James Almer
fdca209f1f avfilter/x86/vf_colordetect: don't use rax to return a 32bit integer
Fixes compilation on x86_32 targets

Signed-off-by: James Almer <jamrial@gmail.com>
2025-07-21 16:58:36 -03:00
James Almer
14f4478354 avfilter/x86/vf_colordetect: fix use of AVX512 instruction in AVX2 function on non Unix64 targets
Signed-off-by: James Almer <jamrial@gmail.com>
2025-07-21 16:52:46 -03:00
Niklas Haas
8b647b3f8a avfilter/vf_colordetect: add x86 SIMD implementation
alphadetect8_full_c:                                  5658.2 ( 1.00x)
alphadetect8_full_avx2:                                215.1 (26.31x)
alphadetect8_full_avx512:                              133.5 (42.40x)
alphadetect8_limited_c:                               7391.5 ( 1.00x)
alphadetect8_limited_avx2:                             649.3 (11.38x)
alphadetect8_limited_avx512:                           330.5 (22.36x)
alphadetect16_full_c:                                 3027.4 ( 1.00x)
alphadetect16_full_avx2:                               209.4 (14.46x)
alphadetect16_full_avx512:                             141.4 (21.41x)
alphadetect16_limited_c:                              3880.9 ( 1.00x)
alphadetect16_limited_avx2:                            734.9 ( 5.28x)
alphadetect16_limited_avx512:                          349.2 (11.11x)
rangedetect8_c:                                       5854.2 ( 1.00x)
rangedetect8_avx2:                                     138.9 (42.15x)
rangedetect8_avx512:                                   106.2 (55.12x)
rangedetect16_c:                                      4122.0 ( 1.00x)
rangedetect16_avx2:                                    138.6 (29.74x)
rangedetect16_avx512:                                  104.1 (39.60x)
2025-07-21 18:10:25 +02:00
James Almer
85f2911891 avfilter/x86/vf_blackdetect: add missing preprocessor check
Signed-off-by: James Almer <jamrial@gmail.com>
2025-07-18 15:17:02 -03:00
James Almer
ee4ff3f706 avfilter/x86/vf_blackdetect_init: don't enable the ASM functions on targets where it's known they will be slower
Signed-off-by: James Almer <jamrial@gmail.com>
2025-07-18 13:05:44 -03:00
James Almer
f263192f0e avfilter/x86/vf_blackdetect: don't use rax to return a 32bit integer
Fixes compilation on x86_32.

Signed-off-by: James Almer <jamrial@gmail.com>
2025-07-18 13:05:44 -03:00
Niklas Haas
75cd42c48a avfilter/vf_blackdetect: add AVX2 SIMD version
Requested by a user. Even with autovectorization enabled, the compiler
does a rather poor job of optimizing this function, because it is not
able to take advantage of the pmaxub + pcmpeqb trick for counting the number
of pixels less than or equal to a threshold.

blackdetect8_c:                                       4625.0 ( 1.00x)
blackdetect8_avx2:                                     155.1 (29.83x)
blackdetect16_c:                                      2529.4 ( 1.00x)
blackdetect16_avx2:                                    163.6 (15.46x)
2025-07-18 10:47:31 +02:00
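Note on 75cd42c48a: the pmaxub + pcmpeqb trick rests on the identity x <= t exactly when max(x, t) == t, which sidesteps the lack of an unsigned byte compare in SSE2/AVX2 (pcmpgtb is signed). The scalar model below mimics one byte lane per iteration; the function name is hypothetical and this is not the actual assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Counts samples <= thr using only "unsigned max" and "compare equal",
 * the two per-lane operations that pmaxub and pcmpeqb provide. */
static unsigned count_le_threshold(const uint8_t *px, size_t n, uint8_t thr)
{
    unsigned count = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t m  = px[i] > thr ? px[i] : thr; /* models pmaxub */
        uint8_t eq = (m == thr) ? 0xFF : 0x00;  /* models pcmpeqb mask */
        count += eq >> 7;                       /* one hit per all-ones lane */
    }
    return count;
}
```

In vector code the all-ones masks would be accumulated across lanes and summed horizontally at the end rather than counted one by one.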
Niklas Haas
e44a1aaeec avfilter/x86/scene_sad: add high bit depth AVX2/AVX512 version
Since psadbw only exists for 8-bits, we have to emulate it for 16-bit
inputs. The simplest sequence is to use a normal subtraction, which is safe
as long as the inputs do not exceed 32767 - so limit this implementation
to 15-bit inputs and below.

For 16-bit inputs, we could in theory instead use a pminw / pmaxw to ensure
the resulting difference does not overflow, but this is slower, and also
breaks the subsequent use of pmaddwd, so I opted to skip 16-bit SIMD for
now.

scene_sad10_c:                                      114175.6 ( 1.00x)
scene_sad10_avx2:                                     9617.7 (11.87x)
scene_sad10_avx512:                                   5208.8 (21.92x)
scene_sad12_c:                                      114537.8 ( 1.00x)
scene_sad12_avx2:                                     9614.0 (11.91x)
scene_sad12_avx512:                                   5186.3 (22.08x)
scene_sad14_c:                                      114113.9 ( 1.00x)
scene_sad14_avx2:                                     9612.9 (11.87x)
scene_sad14_avx512:                                   5186.0 (22.00x)
scene_sad15_c:                                      114108.9 ( 1.00x)
scene_sad15_avx2:                                     9612.3 (11.87x)
scene_sad15_avx512:                                   5186.4 (22.00x)
scene_sad16_c:                                      114136.0 ( 1.00x)
2025-07-17 12:26:06 +02:00
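Note on e44a1aaeec: the 15-bit limit follows from forming the difference in a signed 16-bit lane. The sketch below models that lane arithmetic in portable C (function names are hypothetical, and this is not the actual SIMD sequence) and compares it with a wide reference.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference SAD with wide arithmetic; always correct. */
static uint64_t sad_reference(const uint16_t *a, const uint16_t *b, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
    return sum;
}

/* SAD where each difference is wrapped to a signed 16-bit lane,
 * as a plain 16-bit vector subtraction would do. */
static uint64_t sad_int16_lanes(const uint16_t *a, const uint16_t *b, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t d = (uint16_t)(a[i] - b[i]); /* wrap modulo 2^16 */
        if (d >= 32768)
            d -= 65536;                      /* reinterpret as signed 16-bit */
        sum += (uint64_t)(d < 0 ? -d : d);   /* absolute value of the lane */
    }
    return sum;
}
```

For inputs up to 32767 the two functions agree; with full 16-bit inputs a difference such as 65535 - 0 wraps to -1 and the lane-based result is wrong, which is the overflow the commit message avoids by skipping 16-bit SIMD.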
Niklas Haas
91f2d146d4 avfilter/x86/scene_sad: add AVX512 implementation
Trivial to add, but a lot faster (on my machine).

scene_sad8_c:                                       114476.4 ( 1.00x)
scene_sad8_sse2:                                      8644.3 (13.24x)
scene_sad8_avx2:                                      4520.1 (25.33x)
scene_sad8_avx512:                                    3153.0 (36.31x)
2025-07-17 12:26:06 +02:00
Niklas Haas
dc61b74c1d avfilter/scene_sad: pass true depth to ff_scene_sad_get_fn()
I need to be able to distinguish between 10/12/14 and 16 bit depths, for
overflow reasons.
2025-07-17 12:26:05 +02:00
James Almer
dbe94e1110 avfilter/x86/f_ebur128: replace AVX2 instruction with AVX equivalent
Using vpbroadcastq in an AVX function will result in SIGILL errors on
pre-Haswell/Zen processors.

Signed-off-by: James Almer <jamrial@gmail.com>
2025-06-22 09:31:44 -03:00
Niklas Haas
daef348574 avfilter/x86/f_ebur128: implement AVX peak calculation
Stereo only, for simplicity. Slightly faster than the C code.
2025-06-21 17:28:39 +02:00
Niklas Haas
53e03ec8af avfilter/x86/f_ebur128: add x86 AVX implementation
Processes two channels in parallel, using 128-bit XMM registers.

In theory, we could go up to YMM registers to process 4 channels, but this is
not a gain except for relatively high channel counts (e.g. 7.1), and also
complicates the sample load/store operations considerably.

I decided to only add an AVX variant, since the C code is not slow
enough to justify a separate function just for ancient CPUs.
2025-06-21 17:21:36 +02:00
Andreas Rheinhardt
0435cd5a62 avfilter/x86/vf_spp: Remove permutation-specific code
The MMX requantize functions have the MMX permutation
(i.e. FF_IDCT_PERM_SIMPLE) hardcoded and therefore
check for the used permutation (namely via a CRC).
Yet this is very ugly and could even lead to misdetection;
furthermore, the permutation used here has been de facto
impossible on x64 since d7246ea9f2 and definitely impossible
since bfb28b5ce8, making this code dead on x64.
So remove it.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-05-31 01:31:09 +02:00
James Almer
362586fcad avfilter/vf_xpsnr: remove duplicated DSP infrastructure
Fully reuse the existing one from vf_psnr, instead of only halfway.

Signed-off-by: James Almer <jamrial@gmail.com>
2024-10-07 09:33:52 -03:00
Christian Helmrich
865cd3c056 avfilter: add XPSNR filter
Add XPSNR video filter
Register new filter xpsnr.

Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2024-09-08 17:51:37 +02:00
Marton Balint
a69a0b689c avfilter/blend: put slice parameters to a single struct
This should make future extensibility easier.

Signed-off-by: Marton Balint <cus@passwd.hu>
2024-05-14 21:07:37 +02:00
Andreas Rheinhardt
9ec928e627 avfilter/x86/Makefile: Fix standalone build of haldclut filter
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2024-05-07 23:53:26 +02:00
Andreas Rheinhardt
c11d7ca2f0 avfilter/x86/Makefile: Add missing dependencies for sobel filter
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2024-05-07 23:53:26 +02:00
Andreas Rheinhardt
790f793844 avutil/common: Don't auto-include mem.h
There are lots of files that don't need it: The number of object
files that actually need it went down from 2011 to 884 here.

Keep it for external users in order to not cause breakages.

Also improve the other headers a bit while just at it.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2024-03-31 00:08:43 +01:00
Henrik Gramner
782c4df28d x86: Avoid using 'd' as an argument name
x86inc.asm adds defines for <argument_name>{b,w,d,q} which clashes with
the nasm d{b,w,d,q} pseudo-instructions for writing initialized data.
2024-03-24 14:53:57 +01:00
Andreas Rheinhardt
fa06f48371 avfilter/bwdifdsp: Constify
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2023-09-28 00:17:47 +02:00
Andreas Rheinhardt
80afcc8539 avfilter/bwdif: Add proper BWDIFDSPContext
This already avoids unnecessary indirectly included headers
in the arch-specific vf_bwdif_init.c files; it is also in
preparation for splitting the actual functions out of vf_bwdif.c.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2023-09-28 00:17:47 +02:00
Paul B Mahol
c5effe7d3d avfilter/x86/af_afir: add FMA3 SIMD
2023-09-17 11:11:24 +02:00
Evgeny Pavlov
cb1479faca avfilter/vf_ssim: Fix x86 assembly code for SSIM calculation
This commit fixes bug #10495

The code had several bugs related to post-loop compensation code:
- The test instruction performs a bitwise AND and sets the flags
used by the following jz branch instruction. A wrong test condition
led to incorrect branching
- Incorrect compensation code for some branches

Signed-off-by: Evgeny Pavlov <lucenticus@gmail.com>
2023-08-21 17:04:51 +02:00
James Almer
aca8ceb870 x86/vf_bwdif_init: limit AVX2 functions using 256bit vectors to cpus known to be fast with it
Signed-off-by: James Almer <jamrial@gmail.com>
2023-03-25 13:27:20 -03:00
James Darnley
073ec3b9da avfilter/bwdif: add avx2 filter_line function
8-bit:
2.24x faster (1925±1.3 vs. 859±2.2 decicycles) compared with ssse3
10-bit:
2.00x faster (1703±1.7 vs. 853±2.0 decicycles) compared with ssse3
2023-03-25 02:38:17 +01:00
James Darnley
b503b5a0cf avfilter/bwdif: move filter_line init to a dedicated function
2023-03-25 02:38:17 +01:00
Lynne
bbe95f7353 x86: replace explicit REP_RETs with RETs
From x86inc:
> On AMD cpus <=K10, an ordinary ret is slow if it immediately follows either
> a branch or a branch target. So switch to a 2-byte form of ret in that case.
> We can automatically detect "follows a branch", but not a branch target.
> (SSSE3 is a sufficient condition to know that your cpu doesn't have this problem.)

x86inc can automatically determine whether to use REP_RET rather than
RET in most of these cases, so impact is minimal. Additionally, a few
REP_RETs were used unnecessarily, despite the return being nowhere near a
branch.

The only CPUs affected were AMD K10s, made between 2007 and 2011, 16
years ago and 12 years ago, respectively.

In the future, everyone involved with x86inc should consider dropping
REP_RETs altogether.
2023-02-01 04:23:55 +01:00
Wang, Bin
459527108a libavfilter/x86/vf_convolution: fix sobel swap issue on WIN64
Reviewed by: James Almer <jamrial@gmail.com>
Signed-off-by: Wang, Bin <bin.wang@intel.com>
2022-11-21 12:28:25 +08:00
bwang30
3ab11dc5bb libavfilter/x86/vf_convolution: add sobel filter optimization and unit test with intel AVX512 VNNI
This commit enables assembly code using Intel AVX512 VNNI and adds a unit test for the sobel filter

sobel_c: 4537
sobel_avx512icl: 2136

Signed-off-by: bwang30 <bin.wang@intel.com>
Signed-off-by: Haihao Xiang <haihao.xiang@intel.com>
2022-11-14 10:04:16 +08:00