We have no use for 14-bit pixel formats for now, so remove support for gray14,
which was broken due to the LSB padding issue.
Similarly, YUVA at 10/12 bit was broken for the same reason.
Add a shader-based Apple ProRes decoder.
It supports all codec features for profiles up to
the 4444 XQ profile, i.e.:
- 4:2:2 and 4:4:4 chroma subsampling
- 10- and 12-bit component depth
- Interlacing
- Alpha
The implementation consists of two shaders: the
VLD kernel does entropy decoding for color/alpha,
and the IDCT kernel performs the inverse transform
on color components.
Benchmarks for a 4k yuv422p10 sample:
- AMD Radeon 6700XT: 178 fps
- Intel i7 Tiger Lake: 37 fps
- NVidia Orin Nano: 70 fps
In preparation for the Vulkan hwaccel.
The existing hwaccel code was designed around
videotoolbox, which ingests the whole frame
bitstream including picture headers.
This adapts the code to accommodate lower-level,
slice-based hwaccels.
Suppresses:
warning C4334: '<<': result of 32-bit shift implicitly converted to 64 bits (was 64-bit shift intended?)
Also drop the L suffix, as the shift will never exceed 31.
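For reference, a minimal illustration of what MSVC's C4334 flags (purely
illustrative, not the code changed here):

    #include <stdint.h>
    uint64_t bit32(unsigned n) { return 1 << n; }           /* 32-bit shift widened to 64 bits: C4334 */
    uint64_t bit64(unsigned n) { return (uint64_t)1 << n; } /* shift already done in 64 bits: no warning */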
Signed-off-by: Kacper Michajłow <kasper93@gmail.com>
This is how images encoded with a specific transfer function should be
viewed. Image viewers that don't support named trc metadata will fall
back to the plain gAMA value, and both cases should produce the same
image appearance for the viewer.
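For illustration (a PNG-spec fact, not a description of this change): the
gAMA chunk stores the encoding gamma times 100000, so a 1/2.2-like sRGB
transfer corresponds to gAMA = 45455, and a viewer that ignores the named
trc metadata and honors only gAMA should then render such an image the
same way.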
Fixes: https://github.com/mpv-player/mpv/issues/13438
Signed-off-by: Kacper Michajłow <kasper93@gmail.com>
The mismatch between the neon and C functions can be reproduced
using the following bitstream and commands on aarch64 devices:
wget https://streams.videolan.org/ffmpeg/incoming/replay_intra_pred_16x16.h264
./ffmpeg -cpuflags 0 -threads 1 -i replay_intra_pred_16x16.h264 -f framemd5 -y md5_ref
./ffmpeg -threads 1 -i replay_intra_pred_16x16.h264 -f framemd5 -y md5_neon
Signed-off-by: Bin Peng <pengbin@visionular.com>
Previously, the LC3 encoder only accepted planar float (AV_SAMPLE_FMT_FLTP).
This change extends support to packed float (AV_SAMPLE_FMT_FLT) by properly
handling channel layout and sample stride.
The PCM data pointer and stride are now calculated based on the sample
format: for planar, use frame->data[ch]; for packed, use frame->data[0]
with a per-channel offset. The stride is set to 1 for the planar layout
and to the number of channels for the packed layout.
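A minimal sketch of that selection (illustrative variable names, assuming
the usual AVFrame float layouts; not the literal code):

    const float *pcm;
    int stride;
    if (frame->format == AV_SAMPLE_FMT_FLTP) {
        pcm    = (const float *)frame->data[ch];      /* one plane per channel */
        stride = 1;                                   /* samples are contiguous */
    } else { /* AV_SAMPLE_FMT_FLT (packed) */
        pcm    = (const float *)frame->data[0] + ch;  /* offset into interleaved buffer */
        stride = avctx->ch_layout.nb_channels;        /* step over the other channels */
    }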
This enables encoding from common packed audio sources without requiring
a prior planar conversion, improving usability and efficiency.
Signed-off-by: cenzhanquan1 <cenzhanquan1@xiaomi.com>
1. Adds support for respecting the requested sample format. Previously,
the decoder always used AV_SAMPLE_FMT_FLTP. Now it checks if the
caller requested a specific format via avctx->request_sample_fmt and
honors that request when supported.
2. Improves planar/interleaved audio buffer handling. The decoding
logic now properly handles both planar and interleaved sample
formats by calculating the correct stride and buffer pointers based
on the actual sample format.
The changes include:
- Added format mapping between AVSampleFormat and lc3_pcm_format (see
  the sketch below).
- Implemented format selection logic in initialization.
- Updated buffer pointer calculation for planar/interleaved data.
- Maintained backward compatibility with existing behavior.
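A rough sketch of the selection logic (illustrative, not the literal code;
LC3_PCM_FORMAT_FLOAT is the assumed liblc3 name for its float PCM format):

    enum AVSampleFormat out = AV_SAMPLE_FMT_FLTP;        /* previous default */
    if (avctx->request_sample_fmt == AV_SAMPLE_FMT_FLT)  /* interleaved requested */
        out = AV_SAMPLE_FMT_FLT;
    avctx->sample_fmt = out;
    /* Both float variants feed liblc3 as float PCM; only the buffer
       pointers and stride differ, as in the encoder change above. */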
Signed-off-by: cenzhanquan1 <cenzhanquan1@xiaomi.com>
When calculating the max size of an output PNG packet, we should
include the size of a possible eXIf chunk that we may write.
This fixes a regression since d3190a64c3
as well as a pre-existing bug in the apng encoder dating back to commit
4a580975d4.
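A rough sketch of the kind of bound adjustment involved (exif_size and
max_packet_size are illustrative names, not the actual variables):

    /* a PNG chunk is 4 bytes length + 4 bytes type + payload + 4 bytes CRC */
    if (exif_size > 0)
        max_packet_size += exif_size + 12;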
Signed-off-by: Leo Izen <leo.izen@gmail.com>
When splitting a 5-line image into 2 slices, one slice will be 3 lines and thus needs more space.
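Worked out for this example (assuming the usual split where slice j covers
lines (H*j)/n to (H*(j+1))/n):
H = 5, n = 2: slice 0 gets lines 0..1 (2 lines), slice 1 gets lines 2..4
(3 lines), so per-slice buffers must be sized for ceil(H/n) = 3 lines
rather than H/n = 2.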
Fixes: Assertion sc->slice_coding_mode == 0 failed at libavcodec/ffv1enc.c:1668
Fixes: 422811239/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_FFV1_fuzzer-4933405139861504
Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
We do not support larger tiles, as we use signed ints.
Alternatively, this could be checked in apv_decode_tile_component() or
init_get_bits*(), or support for bitstreams above 2 GB in length could be added.
Fixes: init_get_bits() failure later
Fixes: 421817631/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_APV_fuzzer-4957386534354944
Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
This has the advantage of not violating the ABI by using
MMX registers without issuing emms; e.g. it allows removing
an emms_c call from bink.c.
This commit uses GP registers on Unix64 (there are not
enough volatile registers to do likewise on Win64), which
reduces codesize and is faster on some CPUs.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Snow calls some of the me_cmp_funcs with insufficient alignment
for the first pointer (see get_block_rd() in snowenc.c);
therefore, SSE2 functions that really need this alignment
don't get set for Snow, and 542765ce3e
consequently didn't remove the MMXEXT functions which are overridden
by these SSE2 functions for normal codecs.
For reference, here is a command line which would segfault
if one simply used the ordinary SSE2 functions for Snow:
./ffmpeg -i mm-short.mpg -an -vcodec snow -t 0.2 -pix_fmt yuv444p \
-vstrict -2 -qscale 2 -flags +qpel -motion_est iter 444iter.avi
This commit adds unaligned SSE2 versions of these functions
and removes the MMXEXT ones. In particular, this implies that
sad 16x16 now never uses MMX, which allows removing an emms_c
call from ac3enc.c.
Benchmarks (u means unaligned version):
sad_0_c: 8.2 ( 1.00x)
sad_0_mmxext: 10.8 ( 0.76x)
sad_0_sse2: 6.2 ( 1.33x)
sad_0_sse2u: 6.7 ( 1.23x)
vsad_0_c: 44.7 ( 1.00x)
vsad_0_mmxext (approx): 12.2 ( 3.68x)
vsad_0_sse2 (approx): 7.8 ( 5.75x)
vsad_4_c: 88.4 ( 1.00x)
vsad_4_mmxext: 7.1 (12.46x)
vsad_4_sse2: 4.2 (21.15x)
vsad_4_sse2u: 5.5 (15.96x)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The SSE2 functions overriding them are currently only set
if the SSE2SLOW flag is not set and if the codec is not Snow.
The former affects only outdated processors (AMDs from
before Barcelona (i.e. before 2007)) and is therefore irrelevant.
Snow does not use the pix_abs function pointers at all,
so this is also no obstacle.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The new functions are faster than the existing exact
functions, yet get beaten by the nonexact functions
(they can avoid unpacking to words and back).
The exact (slow) MMX functions have therefore been
removed, which was actually beneficial size-wise
(416B of new functions, 619B of functions removed).
pix_abs_0_3_c: 216.8 ( 1.00x)
pix_abs_0_3_mmx: 71.8 ( 3.02x)
pix_abs_0_3_mmxext (approximative): 17.6 (12.34x)
pix_abs_0_3_sse2: 23.5 ( 9.23x)
pix_abs_0_3_sse2 (approximative): 9.9 (21.94x)
pix_abs_1_3_c: 98.4 ( 1.00x)
pix_abs_1_3_mmx: 36.9 ( 2.66x)
pix_abs_1_3_mmxext (approximative): 9.2 (10.73x)
pix_abs_1_3_sse2: 14.8 ( 6.63x)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Improves performance and no longer breaks the ABI (the old code
forgot to call emms).
Old benchmarks:
add_8x8basis_c: 43.6 ( 1.00x)
add_8x8basis_ssse3: 12.3 ( 3.55x)
New benchmarks:
add_8x8basis_c: 43.0 ( 1.00x)
add_8x8basis_ssse3: 6.3 ( 6.79x)
Notice that the output of try_8x8basis_ssse3 changes a bit:
Before this commit, it computed certain values and added the values
for i, i+1, i+4 and i+5 before right-shifting them; now it adds
the values for i, i+1, i+8 and i+9. The second pair in these lists
could be avoided (by shifting xmm0 and xmm1 before adding both together
instead of only shifting xmm0 after adding them), but the i, i+1
pair is inherent in using pmaddwd. This is why this
function is not bitexact.
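In formula form (v[] denoting the intermediate products and s the shift,
purely illustrative):
old: (v[i] + v[i+1] + v[i+4] + v[i+5]) >> s
new: (v[i] + v[i+1] + v[i+8] + v[i+9]) >> s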
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The only requirement of this code (and essentially the pmulhrsw
instruction) is that the scaled scale fits into an int16_t.
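For reference, pmulhrsw multiplies signed 16-bit lanes and computes
(a * b + (1 << 14)) >> 15 per lane, so its only precondition on the data
is that each operand, including the scaled scale, is representable as an
int16_t.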
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This loosens the coupling between CBS and the decoder by no longer using
CodedBitstreamH266Context (containing the most recently parsed PSs & PH)
to retrieve the PSs & PH in the decoder. Doing so is beneficial in two
ways:
1. It improves robustness to the case in which an AVPacket doesn't
contain precisely one PU.
2. It allows the decoder parameter set manager to properly handle the
case in which a single PU (erroneously) contains conflicting
parameter sets.
Signed-off-by: Frank Plowman <post@frankplowman.com>
Check only on arches that need said check.
(Btw: I do not see how h_loop_filter benefits from alignment
at all and why h_loop_filter_unaligned exists.)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The old code operated on bytes and did lots of tricks
due to their limited range; it did not completely succeed,
which is why the old versions were not used when bitexact
output was requested.
In contrast, the new version is much simpler: it operates
on signed 16-bit words, whose range is more than sufficient.
This means that these functions don't need a check for bitexactness
(and can be used in FATE).
Old benchmarks (for this, the AV_CODEC_FLAG_BITEXACT check has been
removed from checkasm):
h_loop_filter_c: 29.8 ( 1.00x)
h_loop_filter_mmxext: 32.2 ( 0.93x)
h_loop_filter_unaligned_c: 29.9 ( 1.00x)
h_loop_filter_unaligned_mmxext: 31.4 ( 0.95x)
v_loop_filter_c: 39.3 ( 1.00x)
v_loop_filter_mmxext: 14.2 ( 2.78x)
v_loop_filter_unaligned_c: 38.9 ( 1.00x)
v_loop_filter_unaligned_mmxext: 14.3 ( 2.72x)
New benchmarks:
h_loop_filter_c: 29.2 ( 1.00x)
h_loop_filter_sse2: 28.6 ( 1.02x)
h_loop_filter_unaligned_c: 29.0 ( 1.00x)
h_loop_filter_unaligned_sse2: 26.9 ( 1.08x)
v_loop_filter_c: 38.3 ( 1.00x)
v_loop_filter_sse2: 11.0 ( 3.47x)
v_loop_filter_unaligned_c: 35.5 ( 1.00x)
v_loop_filter_unaligned_sse2: 11.2 ( 3.18x)
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This SSSE3 function uses MMX registers (of course without emms
at the end) and processes eight bytes of input by unpacking
it into two MMX registers. This is very suboptimal given
that one can just use XMM registers to process eight words.
This commit switches them to using XMM registers.
Old benchmarks:
avg_pixels_tab[1][3]_c: 114.5 ( 1.00x)
avg_pixels_tab[1][3]_ssse3: 43.6 ( 2.62x)
put_pixels_tab[1][3]_c: 83.6 ( 1.00x)
put_pixels_tab[1][3]_ssse3: 34.0 ( 2.46x)
New benchmarks:
avg_pixels_tab[1][3]_c: 115.3 ( 1.00x)
avg_pixels_tab[1][3]_ssse3: 24.6 ( 4.69x)
put_pixels_tab[1][3]_c: 83.8 ( 1.00x)
put_pixels_tab[1][3]_ssse3: 19.7 ( 4.24x)
Reviewed-by: Kieran Kunhya <kieran@kunhya.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Given that one has to deal with 16-byte intermediates, it is
unsurprising that SSE2 wins against MMX; the MMX version has
therefore been removed (as well as the now unused inline_asm.h).
The new function is even 32B smaller than the old MMX one.
Old benchmarks:
put_no_rnd_pixels_tab[1][3]_c: 84.1 ( 1.00x)
put_no_rnd_pixels_tab[1][3]_mmx: 41.1 ( 2.05x)
New benchmarks:
put_no_rnd_pixels_tab[1][3]_c: 84.0 ( 1.00x)
put_no_rnd_pixels_tab[1][3]_ssse3: 22.1 ( 3.80x)
Reviewed-by: Kieran Kunhya <kieran@kunhya.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Also remove the now superseded MMX versions (the new functions have the
exact same codesize as the removed ones).
Old benchmarks:
avg_no_rnd_pixels_tab[0][3]_c: 233.7 ( 1.00x)
avg_no_rnd_pixels_tab[0][3]_mmx: 121.5 ( 1.92x)
put_no_rnd_pixels_tab[0][3]_c: 171.4 ( 1.00x)
put_no_rnd_pixels_tab[0][3]_mmx: 82.6 ( 2.08x)
New benchmarks:
avg_no_rnd_pixels_tab[0][3]_c: 233.3 ( 1.00x)
avg_no_rnd_pixels_tab[0][3]_sse2: 45.0 ( 5.18x)
put_no_rnd_pixels_tab[0][3]_c: 172.1 ( 1.00x)
put_no_rnd_pixels_tab[0][3]_sse2: 40.9 ( 4.21x)
Reviewed-by: Kieran Kunhya <kieran@kunhya.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Hint: The parts of this patch in decode_block_progressive()
and decode_block_refinement() rely on the fact that GET_VLC
returns -1 on error, so that it enters the codepaths for
actually coded block coefficients.
Reviewed-by: Ramiro Polla <ramiro.polla@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>