122011 Commits

Author SHA1 Message Date
Andreas Rheinhardt
1d47ae65bf avcodec/tableprint_vlc: Unbreak hardcoded tables
Forgotten in d8ffec5bf9.
Fixes issue #21102.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-05 11:31:23 +01:00
Arpad Panyik
1f30ff30fb swscale: Add AArch64 Neon path for xyz12Torgb48 LE
Add optimized Neon code path for the little endian case of the
xyz12Torgb48 function. The innermost loop processes the data in 4x2
pixel blocks using software gathers with the matrix multiplication
and clipping done by Neon.

Relative runtime of micro benchmarks after this patch on some
Cortex and Neoverse CPU cores:

 xyz12le_rgb48le    X1      X3      X4    X925      V2
 16x4_neon:       2.55x   4.34x   3.84x   3.31x   3.22x
 32x4_neon:       2.39x   3.63x   3.22x   3.35x   3.29x
 64x4_neon:       2.37x   3.31x   2.91x   3.33x   3.27x
 128x4_neon:      2.34x   3.28x   2.91x   3.35x   3.24x
 256x4_neon:      2.30x   3.17x   2.91x   3.32x   3.10x
 512x4_neon:      2.26x   3.10x   2.91x   3.30x   3.07x
 1024x4_neon:     2.26x   3.07x   2.96x   3.30x   3.05x
 1920x4_neon:     2.26x   3.06x   2.93x   3.28x   3.04x

 xyz12le_rgb48le   A76     A78    A715    A720    A725
 16x4_neon:       2.33x   2.28x   2.53x   3.33x   3.19x
 32x4_neon:       2.35x   2.18x   2.45x   3.23x   3.24x
 64x4_neon:       2.35x   2.16x   2.42x   3.15x   3.21x
 128x4_neon:      2.35x   2.13x   2.39x   3.00x   3.09x
 256x4_neon:      2.36x   2.12x   2.35x   2.85x   2.99x
 512x4_neon:      2.35x   2.14x   2.35x   2.78x   2.95x
 1024x4_neon:     2.31x   2.09x   2.33x   2.80x   2.91x
 1920x4_neon:     2.30x   2.07x   2.32x   2.81x   2.94x

 xyz12le_rgb48le   A55    A510    A520
 16x4_neon:       2.09x   1.92x   2.36x
 32x4_neon:       2.05x   1.89x   2.38x
 64x4_neon:       2.02x   1.77x   2.35x
 128x4_neon:      1.96x   1.74x   2.25x
 256x4_neon:      1.90x   1.72x   2.19x
 512x4_neon:      1.83x   1.75x   2.16x
 1024x4_neon:     1.83x   1.62x   2.15x
 1920x4_neon:     1.82x   1.60x   2.15x

Signed-off-by: Arpad Panyik <Arpad.Panyik@arm.com>
2025-12-05 10:28:18 +00:00
Arpad Panyik
a13871ae19 checkasm: Add xyz12Torgb48le test
Add checkasm coverage for the XYZ12LE to RGB48LE path via the
ctx->xyz12Torgb48 hook. Integrate the test into the build and runner,
exercise a variety of widths/heights, compare against the C reference,
and benchmark when width is multiple of 4.

This improves test coverage for the new function pointer in preparation
for architecture-specific implementations in subsequent commits.

Signed-off-by: Arpad Panyik <Arpad.Panyik@arm.com>
2025-12-05 10:28:18 +00:00
Arpad Panyik
ef651b84ce swscale: Refactor XYZ+RGB state and add function hooks
Prepare for xyz12Torgb48 architecture-specific optimizations in
subsequent patches by:
 - Grouping XYZ+RGB gamma LUTs and 3x3 matrices into SwsColorXform
   (ctx->xyz2rgb and ctx->rgb2xyz), replacing scattered fields.
 - Dropping the unused last matrix column giving the same or smaller
   SwsInternal size.
 - Renaming ff_xyz12Torgb48 and ff_rgb48Toxyz12 and routing calls via
   the new per-context function pointer (ctx->xyz12Torgb48 and
   ctx->rgb48Toxyz12) in graph.c and swscale.c.
 - Adding ff_sws_init_xyzdsp and invoking it in swscale init paths
   (normal and unscaled).
 - Making fill_xyztables public to ease its setup later in checkasm.

These modifications do not introduce any functional changes.

Signed-off-by: Arpad Panyik <Arpad.Panyik@arm.com>
2025-12-05 10:28:18 +00:00
Andreas Rheinhardt
9e038fd959 swscale/tests/swscale: Fix typo
Reviewed-by: Timo Rothenpieler <timo@rothenpieler.org>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-05 10:42:01 +01:00
James Almer
52c84b06d5 avfilter/f_sidedata: also handle global side data in filter links
Should fix issue #21071

Signed-off-by: James Almer <jamrial@gmail.com>
2025-12-04 13:50:45 -03:00
Andreas Rheinhardt
e0845ec2cf avformat/movenc: Fix leak of IAMFContext on error
Forgotten in 5b87869c09.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-04 16:15:09 +00:00
Lynne
f80addbb07 ffv1enc_vulkan: fix encoding with large contexts
When RGB_LINECACHE == 2, then top2 is not the current line.
2025-12-04 16:53:58 +01:00
Andreas Rheinhardt
4b6e40a298 avcodec/vp8dsp: Don't compile unused functions
The width 16 epel functions never use four taps in any direction*,
so don't build said functions. Saves 4352B of .text and 89B of
.text.unlikely here.

*: mx and my in vp8_mc_luma() are always even.

Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-04 15:17:37 +01:00
Andreas Rheinhardt
9cff236e2f avcodec/riscv/vp8dsp_rvv: Remove unused functions
Only the sixtap functions are used for size 16.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-04 15:17:37 +01:00
Andreas Rheinhardt
050c80a526 avcodec/x86/vp8dsp: Don't use saturated addition when unnecessary
For the epel functions, there can be no overflow as long as the sum
contains only one of the two large central coefficients; for bilinear
functions, there can be no overflow whatsoever.

Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-04 15:17:37 +01:00
Andreas Rheinhardt
575e9e9c08 avcodec/x86/vp8dsp: Reduce number of coefficient tables
By changing the permutations used in the epel8_h{4,6} case
we can simply reuse the coefficient tables from the vertical epel
filters.

Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-04 15:17:37 +01:00
Andreas Rheinhardt
99fb257f58 avcodec/x86/vp8dsp: Don't use MMX registers in ff_put_vp8_epel4_h6_ssse3
Doubling the register width allowed to avoid a pshufb and a pmaddubsw.

Old benchmarks:
vp8_put_epel4_h6_c:                                    115.9 ( 1.00x)
vp8_put_epel4_h6_ssse3:                                 20.2 ( 5.74x)
vp8_put_epel4_h6v4_c:                                  276.3 ( 1.00x)
vp8_put_epel4_h6v4_ssse3:                               58.6 ( 4.71x)
vp8_put_epel4_h6v6_c:                                  363.6 ( 1.00x)
vp8_put_epel4_h6v6_ssse3:                               62.5 ( 5.82x)

New benchmarks:
vp8_put_epel4_h6_c:                                    116.4 ( 1.00x)
vp8_put_epel4_h6_ssse3:                                 16.0 ( 7.29x)
vp8_put_epel4_h6v4_c:                                  280.9 ( 1.00x)
vp8_put_epel4_h6v4_ssse3:                               44.3 ( 6.33x)
vp8_put_epel4_h6v6_c:                                  365.6 ( 1.00x)
vp8_put_epel4_h6v6_ssse3:                               53.1 ( 6.89x)

Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-04 15:17:37 +01:00
Andreas Rheinhardt
3135bc0d3a avcodec/x86/vp8dsp: Don't use MMX registers in ff_put_vp8_epel4_h4_ssse3
Doubling the register width allows to use only one pshufb and pmaddubsw.

Old benchmarks:
vp8_put_epel4_h4_c:                                     82.8 ( 1.00x)
vp8_put_epel4_h4_ssse3:                                 13.9 ( 5.96x)

New benchmarks:
vp8_put_epel4_h4_c:                                     82.7 ( 1.00x)
vp8_put_epel4_h4_ssse3:                                 11.7 ( 7.08x)

Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-04 15:17:37 +01:00
Andreas Rheinhardt
714cbf1c70 avcodec/x86/vp8dsp: Don't use MMX registers in ff_put_vp8_epel4_v4_ssse3
Switching to xmm registers allows to process two rows in parallel,
leading to speedups. It is also ABI compliant (no more missing emms).

Old benchmarks:
vp8_put_epel4_v4_c:                                     96.8 ( 1.00x)
vp8_put_epel4_v4_ssse3:                                 28.2 ( 3.43x)

New benchmarks:
vp8_put_epel4_v4_c:                                     95.1 ( 1.00x)
vp8_put_epel4_v4_ssse3:                                 22.8 ( 4.17x)

Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-04 15:17:37 +01:00
Andreas Rheinhardt
f017806829 avcodec/x86/vp8dsp: Don't use MMX registers in ff_put_vp8_epel4_v6_ssse3
Switching to xmm registers allows to process two rows in parallel,
leading to speedups. It is also ABI compliant (no more missing emms).

Old benchmarks:
vp8_put_epel4_v6_c:                                    132.8 ( 1.00x)
vp8_put_epel4_v6_ssse3:                                 34.3 ( 3.87x)

New benchmarks:
vp8_put_epel4_v6_c:                                    131.5 ( 1.00x)
vp8_put_epel4_v6_ssse3:                                 27.1 ( 4.86x)

Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-04 15:17:37 +01:00
Andreas Rheinhardt
7411998757 avcodec/x86/vp8dsp: Avoid unpacking multiple times
Always pair row i with row i+2 for the vertical four-tap filter
and row i+3 for the vertical six-tap filter (instead of pairing
the first with the sixth, the second with the third and the fourth
and the fifth). This allows to unpack each row only once instead
of (at most) three times.

Old benchmarks:
vp8_put_epel4_v4_c:                                     98.4 ( 1.00x)
vp8_put_epel4_v4_ssse3:                                 28.6 ( 3.44x)
vp8_put_epel4_v6_c:                                    131.6 ( 1.00x)
vp8_put_epel4_v6_ssse3:                                 38.5 ( 3.42x)
vp8_put_epel8_v4_c:                                    362.5 ( 1.00x)
vp8_put_epel8_v4_sse2:                                  63.8 ( 5.68x)
vp8_put_epel8_v4_ssse3:                                 44.4 ( 8.16x)
vp8_put_epel8_v6_c:                                    538.3 ( 1.00x)
vp8_put_epel8_v6_sse2:                                  86.5 ( 6.22x)
vp8_put_epel8_v6_ssse3:                                 57.0 ( 9.44x)
vp8_put_epel16_v6_c:                                  1044.6 ( 1.00x)
vp8_put_epel16_v6_sse2:                                158.0 ( 6.61x)
vp8_put_epel16_v6_ssse3:                               106.7 ( 9.79x)

New benchmarks:
vp8_put_epel4_v4_c:                                    100.0 ( 1.00x)
vp8_put_epel4_v4_ssse3:                                 28.4 ( 3.52x)
vp8_put_epel4_v6_c:                                    131.7 ( 1.00x)
vp8_put_epel4_v6_ssse3:                                 34.3 ( 3.84x)
vp8_put_epel8_v4_c:                                    364.4 ( 1.00x)
vp8_put_epel8_v4_sse2:                                  63.7 ( 5.72x)
vp8_put_epel8_v4_ssse3:                                 43.3 ( 8.42x)
vp8_put_epel8_v6_c:                                    550.2 ( 1.00x)
vp8_put_epel8_v6_sse2:                                  86.4 ( 6.37x)
vp8_put_epel8_v6_ssse3:                                 52.9 (10.40x)
vp8_put_epel16_v6_c:                                  1052.5 ( 1.00x)
vp8_put_epel16_v6_sse2:                                158.3 ( 6.65x)
vp8_put_epel16_v6_ssse3:                                98.9 (10.64x)

Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-04 15:17:37 +01:00
Andreas Rheinhardt
24cdd4100d avcodec/x86/vp8dsp_init: Remove unused macro
Forgotten in 6a551f1405.

Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-04 15:17:37 +01:00
Andreas Rheinhardt
76900089fb avcodec/x86/vp8dsp: Avoid reload
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-04 15:17:37 +01:00
Andreas Rheinhardt
86aa1b81ec avcodec/x86/vp8dsp: Increment src pointer earlier
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-04 15:17:37 +01:00
Andreas Rheinhardt
e59ed3470d avcodec/x86/vp8dsp: Directly use negated stride
There is a register available. No change in benchmarks here.

Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-04 15:17:37 +01:00
Andreas Rheinhardt
8fb6b0c733 avcodec/x86/vp8dsp: Don't use MMX registers in put_vp8_pixels8
Use GPRs on x64 and xmm registers else (using GPRs reduces codesize).
This avoids clobbering the floating point state and therefore no longer
breaks the ABI.
No change in benchmarks here.

Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-04 15:17:36 +01:00
Andreas Rheinhardt
ed5e0f9c68 avcodec/x86/vp8dsp: Remove MMXEXT functions overridden by SSSE3
SSSE3 is already quite old (introduced 2006 for Intel, 2011 for AMD),
so that the overwhelming majority of our users (particularly those
that actually update their FFmpeg) will be using the SSSE3 versions.
This commit therefore removes the MMX(EXT) functions overridden
by them (which don't abide by the ABI) to get closer to a removal
of emms_c.

Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-04 15:17:36 +01:00
Lynne
9b14ea0aa1 vulkan_dpx: fix alignment issue
12-bit images apparently require mod-32 alignment for each line.
Go figure.
2025-12-04 15:08:46 +01:00
Oliver Chang
d6458f6a8b avcodec/aacdec: Fix heap-use-after-free in USAC decoding
A heap-use-after-free vulnerability was identified in
`libavcodec/aac/aacdec.c`.  When `che_configure` frees a
`ChannelElement` (`ac->che[type][id]`), it failed to clear all
references to it in `ac->tag_che_map`.  `ac->tag_che_map` caches
pointers to `ChannelElement`s and can contain cross-type mappings (e.g.,
a `TYPE_SCE` tag mapping to a `TYPE_LFE` element).

In a USAC stream reconfiguration scenario, an LFE element was freed, but
a stale pointer remained in `ac->tag_che_map`. Subsequent calls to
`ff_aac_get_che` returned this dangling pointer, leading to a crash in
`decode_usac_core_coder`.

This commit fixes the issue by iterating over the entire
`ac->tag_che_map` in `che_configure` and clearing any entries that point
to the `ChannelElement` about to be freed, ensuring no dangling pointers
remain.

Fixes: https://issues.oss-fuzz.com/issues/440220467
2025-12-04 09:34:32 +00:00
Xia Tao
7922d4ca7d avcodec/wasm/hevc: fix typo in butterfly macro
Signed-off-by: Xia Tao <xiatao@gmail.com>
2025-12-04 08:40:43 +00:00
stevxiao
7b2ae2ccf7 avcodec/d3d12va_encode: add intra refresh support for d3d12va encode
Intra refresh is a technique that gradually refreshes the video by encoding rows or regions as intra macroblocks/CTUs spread over multiple frames, rather than using periodic I-frames.
This provides better error resilience for video streaming while maintaining more consistent bitrate.

Disable Intra Refresh (This is the default)
ffmpeg -init_hw_device d3d12va -hwaccel d3d12va -hwaccel_output_format d3d12 \
-i input.mp4 \
-c:v h264_d3d12va \
-intra_refresh_mode none \
-intra_refresh_duration 30 \
-g 60 \
output.h264

Enable Intra Refresh
ffmpeg -init_hw_device d3d12va -hwaccel d3d12va -hwaccel_output_format d3d12 \
-i input.mp4 \
-c:v h264_d3d12va \
-intra_refresh_mode row_based \
-intra_refresh_duration 30 \
-g 60 \
output.h264

Parameters
- `-intra_refresh_mode`: Set to `row_based` to enable row-based intra refresh, or `NONE` to disable
- `-intra_refresh_duration`: Number of frames over which to spread the intra refresh (default: 0 = use GOP size)
- `-g`: GOP size (should typically be larger than intra refresh duration)
2025-12-04 08:26:26 +00:00
Michael Niedermayer
12e7d095b1 Revert "avformat/rawdec: set framerate in codec parameters"
Fixes single image videos
this works and creates our single image video
./ffmpeg -i lena.pnm /tmp/file.m2v

this fails after 3d96d83a0a:
./ffmpeg -i /tmp/file.m2v /tmp/file.jpg -y

This reverts commit 3d96d83a0a.
2025-12-04 01:59:04 +00:00
Kacper Michajłow
9ed71a837b avutil/vulkan: fix device memory size truncation
size_t cannot fit VK_WHOLE_SIZE on 32-bit builds.

Fixes: warning: conversion from 'long long unsigned int' to 'size_t' {aka 'unsigned int'} changes value from '18446744073709551615' to '4294967295'

Signed-off-by: Kacper Michajłow <kasper93@gmail.com>
2025-12-03 23:45:44 +00:00
Kacper Michajłow
3cc10b5ff6 fftools/cmdutils: use strcpy directly, the length is computed already
There is no need to scan for NULL, if we inject it ourselves.

Fixes: warning: 'strncat' specified bound 10 equals source length [-Wstringop-overflow=]
Signed-off-by: Kacper Michajłow <kasper93@gmail.com>
2025-12-03 23:45:44 +00:00
Kacper Michajłow
f7b7972f78 avdevice/gdigrab: suppress int to pointer cast warning
Fixes: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
Signed-off-by: Kacper Michajłow <kasper93@gmail.com>
2025-12-03 23:45:44 +00:00
wutno
f4312ea138 avformat/xmv: Handle zero sized packet at end of file
Some XMVs introduce a blank packet at the end of the stream. Previously, we
didn't account for this and returned AVERROR_INVALIDDATA, indicating an issue
with the file. Instead, let's check for this and close out with AVERROR_EOF.
2025-12-03 22:09:20 +00:00
Lynne
a8e8daa276 hwcontext_vulkan: fix final error to let old header files work
........
2025-12-03 22:34:32 +01:00
Jack Lau
c4b050fd67 tests/fate/filter-video: add two feedback tests
- Add fate-filter-feedback-yadif

- add fate-filter-feedback-hflip

Signed-off-by: Jack Lau <jacklau1222gm@gmail.com>
2025-12-03 21:23:51 +00:00
Jack Lau
3f0842294f avfilter/vf_feedback: fix feedback block
Fix #20940

The feedback and its sub-filter both request frame
from each other, casuing block since 4440e499ba

The feedback should only request inputs[1] once
rather than continuously request frame cause blocking.

This patch add check whether feedback already request
inputs[1] via ff_outlink_frame_wanted(ctx->outputs[1]),
if true, then exit and waiting inputs[0] because it means
we need more frames input to proceed.

Signed-off-by: Jack Lau <jacklau1222gm@gmail.com>
2025-12-03 21:23:51 +00:00
Lynne
bce14bb160 hwcontext_vulkan: fix compilation with older header versions 2025-12-03 21:22:54 +01:00
Oliver Chang
041d4f010e libavcodec/prores_raw: Fix heap-buffer-overflow in decode_frame
Fixes a heap-buffer-overflow in `decode_frame` where `header_len` read
from the bitstream was not validated against the remaining bytes in the
input buffer (`gb`). This allowed `gb_hdr` to be initialized with a size
exceeding the actual packet data, leading to an out-of-bounds read.

The fix adds a check to ensure `bytestream2_get_bytes_left(&gb)` is
greater than or equal to `header_len - 2` before initializing `gb_hdr`.

Fixes: https://issues.oss-fuzz.com/issues/439711053
2025-12-03 16:40:02 +00:00
Andreas Rheinhardt
e3e3265034 tests/checkasm/mpegvideo_unquantize: Add missing const
Fixes this test under UBSan:
runtime error: call to function dct_unquantize_mpeg1_intra_c through pointer to incorrect function type 'void (*)(struct MpegEncContext *, short *, int, int)'
I don't know how I could forget this.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-03 14:17:58 +01:00
Martin Storsjö
b98179cec6 avcodec/{arm,neon}/mpegvideo: Readd a missed initialization
This was accidentally removed in
357fc5243c.

This fixes test failures when built with Clang and MSVC;
surprisingly, the checkasm test did seem to pass when built with
GCC. Clang and MSVC also warn about the use of the uninitialized
variable, while GCC didn't.
2025-12-03 13:53:54 +02:00
Andreas Rheinhardt
5d9270df7f libavutil/internal: Remove {SIZE,PTRDIFF}_SPECIFIER
Possible since 222127418b.

Reviewed-by: Kacper Michajłow <kasper93@gmail.com>
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-03 11:52:54 +01:00
Andreas Rheinhardt
c22c2c5e03 avcodec/mpegvideo: Port dct_unquantize_mpeg2_intra_mmx to SSE2
Benefits from wider registers.

Benchmarks:
dct_unquantize_mpeg2_intra_c:                          228.2 ( 1.00x)
dct_unquantize_mpeg2_intra_mmx:                         28.2 ( 8.10x)
dct_unquantize_mpeg2_intra_sse2:                        18.4 (12.37x)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-03 10:23:43 +01:00
Andreas Rheinhardt
6e2153111d avcodec/x86/mpegvideo: Port dct_unquantize_mpeg2_inter_mmx to SSSE3
Benefits from wider registers, pabsw and psignw.

Benchmarks:
dct_unquantize_mpeg2_inter_c:                          131.2 ( 1.00x)
dct_unquantize_mpeg2_inter_mmx:                         50.2 ( 2.62x)
dct_unquantize_mpeg2_inter_ssse3:                       20.5 ( 6.38x)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-03 10:23:43 +01:00
Andreas Rheinhardt
60084b1369 avcodec/x86/mpegvideo: Port MPEG-1 unquantize functions to SSSE3
Benefits from wider registers and pabsw, psignw.

Benchmarks:
dct_unquantize_mpeg1_inter_c:                          343.0 ( 1.00x)
dct_unquantize_mpeg1_inter_mmx:                         50.6 ( 6.78x)
dct_unquantize_mpeg1_inter_ssse3:                       17.2 (19.94x)
dct_unquantize_mpeg1_intra_c:                          352.1 ( 1.00x)
dct_unquantize_mpeg1_intra_mmx:                         48.8 ( 7.22x)
dct_unquantize_mpeg1_intra_ssse3:                       19.5 (18.03x)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-03 10:23:43 +01:00
Andreas Rheinhardt
1cb987d25b avcodec/x86/mpegvideo: Port dct_unquantize_h263_{intra,inter}_mmx to SSSE3
It benefits from wider registers and psignw.

Benchmarks:
dct_unquantize_h263_inter_c:                            88.3 ( 1.00x)
dct_unquantize_h263_inter_mmx:                          24.7 ( 3.58x)
dct_unquantize_h263_inter_ssse3:                         9.3 ( 9.47x)
dct_unquantize_h263_intra_c:                            93.7 ( 1.00x)
dct_unquantize_h263_intra_mmx:                          30.6 ( 3.06x)
dct_unquantize_h263_intra_ssse3:                        16.5 ( 5.69x)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-03 10:23:43 +01:00
Andreas Rheinhardt
a9a23925df avcodec/x86/mpegvideo: Don't duplicate register
Currently several inline ASM blocks used a value as
an input and rax as clobber register. The input value
was just moved into the register which then served as loop
counter. This is wasteful, as one can just use the value's
register directly as loop counter.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-03 10:23:43 +01:00
Andreas Rheinhardt
1fa8ffc1db avcodec/x86/mpegvideo: Improve unquantizing MPEG-2 intra blocks
Unquantizing involves calculating
    (block[j] * qscale * quant_matrix[j]) / 16
where / rounds towards zero. Arithmetic right shifts
naturally round towards -inf, so the earlier code
calculated the absolute value first, then used a right-shift
and then negated the result if necessary.

This commit uses a different procedure: It biases the product
for negative values of block[j] by 0xf. The combination of
this and the arithmetic right shift is the same as rounding
towards zero.

Furthermore, a write-only store to mm7 has been removed.

Benchmarks:
dct_unquantize_mpeg2_intra_c:                          214.3 ( 1.00x)
dct_unquantize_mpeg2_intra_mmx (old):                   43.0 ( 4.98x)
dct_unquantize_mpeg2_intra_mmx (new):                   28.4 ( 7.56x)

(The bitexact flag and the test for correctness have beem removed
from checkasm for the benchmarks.)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-03 10:23:43 +01:00
Andreas Rheinhardt
6d56807a06 avcodec/x86/mpegvideo: Use correct inline assembly constraints
The H.263 unquantize functions modified an input parameter.
(And they did so since this code was added in
7f3f5ec87b. I am surprised
that this didn't cause issues, particularly with the intra function.)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-03 10:23:43 +01:00
Andreas Rheinhardt
0f7cc6aeea avcodec/mpegvideo: Move ff_init_scantable() to mpegvideo_unquantize.c
This is necessary so that the mpegvideo_unquantize checkasm test
does not pull mpegvideo.o and then all of libavcodec into checkasm.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-03 10:23:43 +01:00
Andreas Rheinhardt
357fc5243c avcodec/{arm,neon}/mpegvideo: Fix h263 unquantize functions
These functions currently operate on the assumption that the number
of coefficients to process is always of the form 16k+m with m<=4 or >8.
Yet this is not true when the IDCT permutation is of type FF_IDCT_PERM_LIBMPEG2
(i.e. when FF_IDCT_INT is in use).

Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-03 10:23:39 +01:00
Andreas Rheinhardt
581050a175 tests/checkasm: Add mpegvideo unquantize test
This adds a test for the mpegvideo unquantize functions.

It has been written in order to be able to easily bench
these functions. It should be noted that the random input
fed to the tested functions is not necessarily representative
of the stuff actually occuring in the wild. So benchmarks should
be taken with a grain of salt; but comparisons between two functions
that do not depend on branch predictions are valid (the usecase
for this is to port the x86 mmx functions to use xmm registers).

During testing I have found a bug in the arm/aarch64 neon optimizations
when using the LIBMPEG2 permutation (used by FF_IDCT_INT): The code
seems to be based on the presumption that the remainder of the number
of coefficients to process is always <= 4 mod 16. The test therefore
sometimes fails for these arches.

Hint: I am not certain that 16 bits are enough for the intermediate
values of all the computations involved; e.g. both FLV and MPEG-4
escape values can go beyond that after the corresponding
multiplications. The input in this test is nevertheless designed
to fit into 16 bits.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-12-03 10:23:39 +01:00