Timo Rothenpieler
262d41c804
all: fix typos found by codespell
2025-08-03 13:48:47 +02:00
Andreas Rheinhardt
9b409ea1e6
configure: Factor mpegvideoencdsp out of mpegvideoenc
...
This will allow relaxing the dependency on mpegvideoenc
for several codecs.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-06-21 22:08:52 +02:00
Andreas Rheinhardt
20ddada2a3
avcodec/pixblockdsp: Improve 8 vs 16 bit check
...
Before this commit, the input in get_pixels and get_pixels_unaligned
has been treated inconsistently:
- The generic code treated 9, 10, 12 and 14 bits as 16bit input
(these bits correspond to what FFmpeg's dsputils supported),
everything with <= 8 bits as 8 bit and everything else as 8 bit
when used via AVDCT (which exposes these functions and purports
to support up to 14 bits).
- AARCH64, ARM, PPC, RISC-V and x86 ignore this AVDCT special case.
- RISC-V also ignored the restriction to 9, 10, 12 and 14 for its
16bit check and treated everything > 8 bits as 16bit.
- The mmi MIPS code treats everything as 8 bit when used via
AVDCT (this is certainly broken); otherwise it checks for <= 8 bits.
The msa MIPS code behaves like the generic code.
This commit changes this to treat 9..16 bits as 16 bit input,
everything else as 8 bit (the former because it makes sense,
the latter to preserve the behaviour for external users*).
*: The only internal user of AVDCT (the spp filter) always
uses 8, 9 or 10 bits.
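The change in rule can be summarised with a small model. This is only a sketch of the check as the message describes it, as seen through AVDCT; the function names are hypothetical and do not appear in FFmpeg.

```python
# Illustrative model of the 8 vs 16 bit decision; names are hypothetical.

def old_generic_is_16bit(bits_per_raw_sample: int) -> bool:
    """Old generic rule via AVDCT: only 9, 10, 12 and 14 bits count as 16 bit input."""
    return bits_per_raw_sample in (9, 10, 12, 14)


def new_is_16bit(bits_per_raw_sample: int) -> bool:
    """New rule: 9..16 bits are 16 bit input, everything else is 8 bit."""
    return 9 <= bits_per_raw_sample <= 16


# Bit depths whose treatment differs between the two rules.
changed = [b for b in range(0, 33)
           if old_generic_is_16bit(b) != new_is_16bit(b)]
print(changed)
```

Under this model only 11, 13, 15 and 16 bits change category, which matches the note that the spp filter (8, 9 or 10 bits) is unaffected.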
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-05-31 01:25:27 +02:00
Zhao Zhili
26752368f0
aarch64/h26x: Add put_hevc_pel_bi_w_pixels
...
On rpi5 (A76):
put_hevc_pel_bi_w_pixels4_8_c: 90.0 ( 1.00x)
put_hevc_pel_bi_w_pixels4_8_neon: 34.1 ( 2.64x)
put_hevc_pel_bi_w_pixels6_8_c: 188.3 ( 1.00x)
put_hevc_pel_bi_w_pixels6_8_neon: 73.5 ( 2.56x)
put_hevc_pel_bi_w_pixels8_8_c: 327.1 ( 1.00x)
put_hevc_pel_bi_w_pixels8_8_neon: 75.8 ( 4.32x)
put_hevc_pel_bi_w_pixels12_8_c: 728.8 ( 1.00x)
put_hevc_pel_bi_w_pixels12_8_neon: 186.1 ( 3.92x)
put_hevc_pel_bi_w_pixels16_8_c: 1288.1 ( 1.00x)
put_hevc_pel_bi_w_pixels16_8_neon: 268.5 ( 4.80x)
put_hevc_pel_bi_w_pixels24_8_c: 2855.5 ( 1.00x)
put_hevc_pel_bi_w_pixels24_8_neon: 723.8 ( 3.95x)
put_hevc_pel_bi_w_pixels32_8_c: 5095.3 ( 1.00x)
put_hevc_pel_bi_w_pixels32_8_neon: 1165.0 ( 4.37x)
put_hevc_pel_bi_w_pixels48_8_c: 11521.5 ( 1.00x)
put_hevc_pel_bi_w_pixels48_8_neon: 2856.0 ( 4.03x)
put_hevc_pel_bi_w_pixels64_8_c: 21020.5 ( 1.00x)
put_hevc_pel_bi_w_pixels64_8_neon: 4699.1 ( 4.47x)
Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2025-04-29 15:24:14 +08:00
Zhao Zhili
39786f8cd5
aarch64/h26x: optimize sao_band_filter
...
int8_t[] is enough for offset_table of 8 bit streams.
On rpi5:
Before After
hevc_sao_band_8_8_c: 252.3 ( 1.00x) 252.3 ( 1.00x)
hevc_sao_band_8_8_neon: 95.8 ( 2.63x) 61.0 ( 4.57x)
hevc_sao_band_16_8_c: 875.2 ( 1.00x) 864.9 ( 1.00x)
hevc_sao_band_16_8_neon: 317.5 ( 2.76x) 150.0 ( 6.26x)
hevc_sao_band_32_8_c: 3853.5 ( 1.00x) 3871.6 ( 1.00x)
hevc_sao_band_32_8_neon: 1222.3 ( 3.15x) 550.6 ( 7.39x)
hevc_sao_band_48_8_c: 8203.6 ( 1.00x) 8182.6 ( 1.00x)
hevc_sao_band_48_8_neon: 2685.7 ( 3.05x) 1185.8 ( 7.36x)
hevc_sao_band_64_8_c: 14023.0 ( 1.00x) 14038.9 ( 1.00x)
hevc_sao_band_64_8_neon: 4783.2 ( 2.93x) 2078.4 ( 7.15x)
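Why int8_t is enough for the offset table can be seen in a rough model of the band filter. This sketch assumes that 8 bit SAO offsets stay within [-7, 7] and that the table has 32 entries indexed by the top five bits of the pixel. The function name and the four-element offset list are simplifications for illustration, not the FFmpeg code.

```python
def sao_band_filter_8bit(src, sao_offset_val, sao_left_class):
    """Apply SAO band offsets to a flat list of 8 bit pixels (illustrative)."""
    shift = 8 - 5                      # 32 bands: band index = top 5 bits
    offset_table = [0] * 32
    for k in range(4):                 # four consecutive bands carry an offset
        offset_table[(k + sao_left_class) & 31] = sao_offset_val[k]
    # Every entry has to fit in int8_t for the narrower table to be safe.
    assert all(-128 <= o <= 127 for o in offset_table)
    # Add the band's offset and clip back to the 8 bit pixel range.
    return [min(255, max(0, p + offset_table[p >> shift])) for p in src]


out = sao_band_filter_8bit([0, 7, 8, 100, 250, 255],
                           [7, -7, 3, -3], sao_left_class=31)
print(out)
```

The band index wraps modulo 32, so a left class of 31 places offsets in bands 31, 0, 1 and 2.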
Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2025-04-29 15:11:45 +08:00
Andreas Rheinhardt
a064d34a32
avcodec/mpegvideoenc: Add MPVEncContext
...
Many of the fields of MpegEncContext (which is also used by decoders)
are actually only used by encoders. Therefore this commit adds
a new encoder-only structure and moves all of the encoder-only
fields to it except for those which require more explicit
synchronisation between the main slice context and the other
slice contexts. This synchronisation is currently mainly provided
by ff_update_thread_context() which simply copies most of
the main slice context over the other slice contexts. Fields
which are moved to the new MPVEncContext no longer participate
in this (which is desired, because it is horrible and for the
fields b) below wasteful) which means that some fields can only
be moved when explicit synchronisation code is added in later commits.
More explicitly, this commit moves the following fields:
a) Fields not copied by ff_update_duplicate_context():
dct_error_sum and dct_count; the former does not need synchronisation,
the latter is synchronised in merge_context_after_encode().
b) Fields which do not change after initialisation (these fields
could also be put into MPVMainEncContext at the cost of
an indirection to access them): lambda_table, adaptive_quant,
{luma,chroma}_elim_threshold, new_pic, fdsp, mpvencdsp, pdsp,
{p,b_forw,b_back,b_bidir_forw,b_bidir_back,b_direct,b_field}_mv_table,
[pb]_field_select_table, mb_{type,var,mean}, mc_mb_var, {min,max}_qcoeff,
{inter,intra}_quant_bias, ac_esc_length, the *_vlc_length fields,
the q_{intra,inter,chroma_intra}_matrix{,16}, dct_offset, mb_info,
mjpeg_ctx, rtp_mode, rtp_payload_size, encode_mb, all function
pointers, mpv_flags, quantizer_noise_shaping,
frame_reconstruction_bitfield, error_rate and intra_penalty.
c) Fields which are already (re)set explicitly: The PutBitContexts
pb, tex_pb, pb2; dquant, skipdct, encoding_error, the statistics
fields {mv,i_tex,p_tex,misc,last}_bits and i_count; last_mv_dir,
esc_pos (reset when writing the header).
d) Fields which are only used by encoders not supporting slice
threading for which synchronisation doesn't matter: esc3_level_length
and the remaining mb_info fields.
e) coded_score: This field is only really used when FF_MPV_FLAG_CBP_RD
is set (which implies trellis) and even then it is only used for
non-intra blocks. For these blocks dct_quantize_trellis_c() either
sets coded_score[n] or returns a last_non_zero value of -1
in which case coded_score will be reset in encode_mb_internal().
Therefore no old values are ever used.
The MotionEstContext has not been moved yet.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-03-26 04:08:33 +01:00
Krzysztof Pyrkosz
f9b8f30680
avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12}
...
This patch replaces integer widening with halving addition, and
multi-step "emulated" rounding shift with a single asm instruction doing
exactly that.
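The arithmetic behind this replacement can be checked with a small model. The sketch below assumes the averaging shift is 7, 5 and 3 for 8, 10 and 12 bit content and ignores the final clip to the pixel range. The function names are hypothetical, and the code models the integer maths only, not the NEON instructions themselves.

```python
def widening_avg(a, b, shift):
    """Reference form: widen, add, add the rounding offset, then shift."""
    return (a + b + (1 << (shift - 1))) >> shift


def halving_avg(a, b, shift):
    """Halving add (floor((a + b) / 2)), then a rounding shift by shift - 1."""
    half = (a + b) >> 1
    return (half + (1 << (shift - 2))) >> (shift - 1)


def mismatches(shift, values):
    """Count input pairs for which the two forms disagree."""
    return sum(widening_avg(a, b, shift) != halving_avg(a, b, shift)
               for a in values for b in values)


# Sparse sweep over the signed 16 bit range, including both end points.
samples = list(range(-32768, 32768, 257))
print({shift: mismatches(shift, samples) for shift in (3, 5, 7)})
```

The two forms agree because flooring a division by two and then by 2^(shift-1) is the same as flooring a single division by 2^shift, so the halving step loses nothing as long as shift is at least 2.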
Benchmarks before and after:
A78
avg_8_64x64_neon: 2686.2 ( 6.12x)
avg_8_128x128_neon: 10734.2 ( 5.88x)
avg_10_64x64_neon: 2536.8 ( 5.40x)
avg_10_128x128_neon: 10079.0 ( 5.22x)
avg_12_64x64_neon: 2548.2 ( 5.38x)
avg_12_128x128_neon: 10133.8 ( 5.19x)
avg_8_64x64_neon: 897.8 (18.26x)
avg_8_128x128_neon: 3608.5 (17.37x)
avg_10_32x32_neon: 444.2 ( 8.51x)
avg_10_64x64_neon: 1711.8 ( 8.00x)
avg_12_64x64_neon: 1706.2 ( 8.02x)
avg_12_128x128_neon: 7010.0 ( 7.46x)
A72
avg_8_64x64_neon: 5823.4 ( 3.88x)
avg_8_128x128_neon: 17430.5 ( 4.73x)
avg_10_64x64_neon: 5228.1 ( 3.71x)
avg_10_128x128_neon: 16722.2 ( 4.17x)
avg_12_64x64_neon: 5379.1 ( 3.51x)
avg_12_128x128_neon: 16715.7 ( 4.17x)
avg_8_64x64_neon: 2006.5 (10.61x)
avg_8_128x128_neon: 9158.7 ( 8.96x)
avg_10_64x64_neon: 3357.7 ( 5.60x)
avg_10_128x128_neon: 12411.7 ( 5.56x)
avg_12_64x64_neon: 3317.5 ( 5.67x)
avg_12_128x128_neon: 12358.5 ( 5.58x)
A53
avg_8_64x64_neon: 8327.8 ( 5.18x)
avg_8_128x128_neon: 31631.3 ( 5.34x)
avg_10_64x64_neon: 8783.5 ( 4.98x)
avg_10_128x128_neon: 32617.0 ( 5.25x)
avg_12_64x64_neon: 8686.0 ( 5.06x)
avg_12_128x128_neon: 32487.5 ( 5.25x)
avg_8_64x64_neon: 6032.3 ( 7.17x)
avg_8_128x128_neon: 22008.5 ( 7.69x)
avg_10_64x64_neon: 7738.0 ( 5.68x)
avg_10_128x128_neon: 27813.8 ( 6.14x)
avg_12_64x64_neon: 7844.5 ( 5.60x)
avg_12_128x128_neon: 26999.5 ( 6.34x)
Signed-off-by: Martin Storsjö <martin@martin.st>
2025-03-07 15:51:20 +02:00
Zhao Zhili
3e9777dc75
aarch64/hevcdsp_idct_neon: Add implementation for idct dc 12
...
Reduce binary size at the same time. The performance compared to clang -O3
is the same.
Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2025-03-04 17:01:58 +08:00
Zhao Zhili
5977bff569
aarch64/hevcdsp_idct_neon: Optimize idct dc
...
clang does better than the assembly code before the patch, especially
for small size:
hevc_idct_4x4_dc_8_c: 11.2 ( 1.00x)
hevc_idct_4x4_dc_8_neon: 15.5 ( 0.73x)
hevc_idct_4x4_dc_10_c: 12.0 ( 1.00x)
hevc_idct_4x4_dc_10_neon: 15.2 ( 0.79x)
hevc_idct_8x8_dc_8_c: 13.2 ( 1.00x)
hevc_idct_8x8_dc_8_neon: 18.2 ( 0.73x)
hevc_idct_8x8_dc_10_c: 13.5 ( 1.00x)
hevc_idct_8x8_dc_10_neon: 17.2 ( 0.78x)
hevc_idct_16x16_dc_8_c: 41.8 ( 1.00x)
hevc_idct_16x16_dc_8_neon: 37.8 ( 1.11x)
hevc_idct_16x16_dc_10_c: 41.8 ( 1.00x)
hevc_idct_16x16_dc_10_neon: 37.8 ( 1.11x)
hevc_idct_32x32_dc_8_c: 130.2 ( 1.00x)
hevc_idct_32x32_dc_8_neon: 132.2 ( 0.98x)
hevc_idct_32x32_dc_10_c: 130.2 ( 1.00x)
hevc_idct_32x32_dc_10_neon: 132.2 ( 0.98x)
This patch basically clones what the compiler does, so the performance
is the same.
Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2025-03-04 17:01:58 +08:00
Krzysztof Pyrkosz
71a91485fa
avcodec/aarch64/vvc: Optimize NEON version of vvc_dmvr
...
This patch replaces blocks of instructions performing rounding and
widening shifts with one-liners achieving the same result.
Before and after on A78
dmvr_8_12x20_neon: 86.2 ( 6.90x)
dmvr_8_20x12_neon: 94.8 ( 5.93x)
dmvr_8_20x20_neon: 141.5 ( 6.50x)
dmvr_12_12x20_neon: 158.0 ( 3.76x)
dmvr_12_20x12_neon: 151.2 ( 3.73x)
dmvr_12_20x20_neon: 247.2 ( 3.71x)
dmvr_hv_8_12x20_neon: 423.2 ( 3.75x)
dmvr_hv_8_20x12_neon: 434.0 ( 3.69x)
dmvr_hv_8_20x20_neon: 706.0 ( 3.69x)
dmvr_8_12x20_neon: 77.2 ( 7.70x)
dmvr_8_20x12_neon: 66.5 ( 8.49x)
dmvr_8_20x20_neon: 92.2 ( 9.90x)
dmvr_12_12x20_neon: 80.2 ( 7.38x)
dmvr_12_20x12_neon: 58.2 ( 9.59x)
dmvr_12_20x20_neon: 90.0 (10.15x)
dmvr_hv_8_12x20_neon: 369.0 ( 4.34x)
dmvr_hv_8_20x12_neon: 355.8 ( 4.49x)
dmvr_hv_8_20x20_neon: 574.2 ( 4.51x)
Signed-off-by: Martin Storsjö <martin@martin.st>
2025-03-04 10:35:31 +02:00
Krzysztof Pyrkosz
e8d4c55987
avcodec/aarch64/ac3dsp_neon.S: Optimize ac3_sum_square_butterfly_int32_neon
...
Instead of calculating a^2, b^2, (a+b)^2 and (a-b)^2, calculate only
a^2, b^2 and 2*a*b in each iteration and derive the latter parts from
these three at the end.
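The identity used here is (a+b)^2 = a^2 + b^2 + 2ab and (a-b)^2 = a^2 + b^2 - 2ab, applied to the accumulated sums. A minimal sketch, with hypothetical function names and Python's unbounded integers standing in for the wide accumulators the real code would need:

```python
def sum_square_butterfly_direct(coef0, coef1):
    """Accumulate all four sums in every iteration."""
    s = [0, 0, 0, 0]
    for a, b in zip(coef0, coef1):
        s[0] += a * a
        s[1] += b * b
        s[2] += (a + b) ** 2
        s[3] += (a - b) ** 2
    return s


def sum_square_butterfly_derived(coef0, coef1):
    """Accumulate a^2, b^2 and 2*a*b only; derive the other two at the end."""
    sa = sb = cross = 0
    for a, b in zip(coef0, coef1):
        sa += a * a
        sb += b * b
        cross += 2 * a * b
    return [sa, sb, sa + sb + cross, sa + sb - cross]


print(sum_square_butterfly_derived([1, -2, 3], [4, 5, -6]))
```

The derived form does three multiply-accumulates per element pair instead of four, plus the additions and subtractions that the direct form needs for a+b and a-b.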
Before and after:
A78
ac3_sum_square_bufferfly_int32_neon: 484.8 ( 2.00x)
ac3_sum_square_bufferfly_int32_neon: 468.2 ( 2.08x)
A72
ac3_sum_square_bufferfly_int32_neon: 793.6 ( 1.26x)
ac3_sum_square_bufferfly_int32_neon: 527.3 ( 1.92x)
Signed-off-by: Martin Storsjö <martin@martin.st>
2025-03-02 01:17:53 +02:00
Krzysztof Pyrkosz
9fb97215df
avcodec/aarch64/opusdsp_neon: Simplify opus_postfilter_neon
...
This change removes one extra floating point operation and simplifies
load operations at the beginning of the loop by using dedicated register
for each of the 5 pointers and interleaving it with calculations. The
first case seems to be a bit slower, but the performance increase is
substantial in the other two.
A78 before:
postfilter_15_neon: 1684.8 ( 4.23x)
postfilter_512_neon: 1395.5 ( 5.10x)
postfilter_1022_neon: 1357.0 ( 5.25x)
After:
postfilter_15_neon: 1742.2 ( 4.09x)
postfilter_512_neon: 1169.8 ( 6.09x)
postfilter_1022_neon: 1160.0 ( 6.12x)
A72 before:
postfilter_15_neon: 3144.8 ( 2.39x)
postfilter_512_neon: 3141.2 ( 2.39x)
postfilter_1022_neon: 3230.0 ( 2.33x)
After:
postfilter_15_neon: 2847.8 ( 2.64x)
postfilter_512_neon: 2877.8 ( 2.61x)
postfilter_1022_neon: 2837.2 ( 2.65x)
x13s before:
postfilter_15_neon: 1615.4 ( 2.61x)
postfilter_512_neon: 963.1 ( 4.39x)
postfilter_1022_neon: 963.6 ( 4.39x)
After:
postfilter_15_neon: 1749.6 ( 2.41x)
postfilter_512_neon: 707.1 ( 5.97x)
postfilter_1022_neon: 706.1 ( 5.99x)
Signed-off-by: Martin Storsjö <martin@martin.st>
2025-02-10 14:55:16 +02:00
Krzysztof Pyrkosz
83e4b068d9
avcodec/aarch64/aacencdsp: NEON implementation
...
This patch supplies handwritten NEON code for AAC.
The benchmarks below were collected by invoking these two commands on
each of my boards, A78, A72 and Thinkpad x13s:
1) ./tests/checkasm/checkasm --test=aacencdsp --bench --runs=12
2) ./ffmpeg -y -t 10:00 -f lavfi -i sine /tmp/foo.aac (the first line is
speed without the patch, second, with)
- A78
abs_pow34_c: 4161.5 ( 1.00x)
abs_pow34_neon: 3586.2 ( 1.16x)
quant_bands_signed_c: 5548.0 ( 1.00x)
quant_bands_signed_neon: 1126.8 ( 4.92x)
quant_bands_unsigned_c: 3979.2 ( 1.00x)
quant_bands_unsigned_neon: 800.2 ( 4.97x)
size= 5251KiB time=00:10:00.00 bitrate= 71.7kbits/s speed=71.6x
size= 5251KiB time=00:10:00.00 bitrate= 71.7kbits/s speed=82.3x
- A72
abs_pow34_c: 15362.2 ( 1.00x)
abs_pow34_neon: 15382.5 ( 1.00x)
quant_bands_signed_c: 9926.5 ( 1.00x)
quant_bands_signed_neon: 2467.8 ( 4.02x)
quant_bands_unsigned_c: 5469.8 ( 1.00x)
quant_bands_unsigned_neon: 2089.5 ( 2.62x)
size= 5251KiB time=00:10:00.00 bitrate= 71.7kbits/s speed=34.3x
size= 5251KiB time=00:10:00.00 bitrate= 71.7kbits/s speed=37.8x
- x13s
abs_pow34_c: 2413.4 ( 1.00x)
abs_pow34_neon: 1796.2 ( 1.34x)
quant_bands_signed_c: 2968.9 ( 1.00x)
quant_bands_signed_neon: 675.6 ( 4.39x)
quant_bands_unsigned_c: 2311.9 ( 1.00x)
quant_bands_unsigned_neon: 477.1 ( 4.85x)
size= 5251KiB time=00:10:00.00 bitrate= 71.7kbits/s speed= 135x
size= 5251KiB time=00:10:00.00 bitrate= 71.7kbits/s speed= 159x
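The two kinds of figures above scale in opposite directions: checkasm reports a time-like count (lower is better), while ffmpeg's speed= field is a throughput (higher is better). A small sketch of how the ratios can be derived from the A78 numbers; the helper names are hypothetical.

```python
def speedup(time_c, time_simd):
    """checkasm reports time, so the ratio is reference time / SIMD time."""
    return time_c / time_simd


def throughput_gain(speed_before, speed_after):
    """ffmpeg's speed= is a throughput, so the ratio is after / before."""
    return speed_after / speed_before


# A78: quant_bands_signed, C versus NEON.
print(round(speedup(5548.0, 1126.8), 2))
# A78: whole-encode speed without and with the patch.
print(round(throughput_gain(71.6, 82.3), 3))
```

On these numbers the per-function gain is close to 5x, while the end-to-end encode gains roughly 15%, since the optimised functions are only part of the encoder's work.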
Signed-off-by: Martin Storsjö <martin@martin.st>
2025-01-28 10:44:40 +02:00
Janne Grunau
430c38f698
aarch64: vp9mc: Load only 12 pixels in the 4 pixel wide horizontal filter
...
This reduces the amount the horizontal filters read beyond the filter
width to a consistent 1 pixel. The data is not used so this is usually
not noticeable. It becomes a problem when the application allocates
frame buffers only for the aligned picture size and the end of it is at
a page boundary. This happens for picture sizes which are a multiple of
the page size like 1280x640. The frame buffer allocation is most likely
done via mmap + MAP_ANONYMOUS, so the start and end of the buffer are
page aligned and the previous and next pages are not necessarily mapped.
Under these conditions, as seen with Firefox, a read beyond the end of the
buffer results in a segfault.
After the over-read is reduced to a single pixel it's reasonable to use
VP9's emulated edge motion compensation for this.
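The numbers in this message can be reproduced with a little arithmetic. The sketch below assumes 8-tap filters, one byte per pixel, a single tightly packed plane and 4 KiB pages; the helper names are hypothetical.

```python
PAGE_SIZE = 4096   # assumed page size; other sizes exist on aarch64
TAPS = 8           # VP9 subpel filters are 8-tap


def overread_pixels(loaded, width, taps=TAPS):
    """Pixels loaded beyond the width + taps - 1 the filter actually uses."""
    return loaded - (width + taps - 1)


def plane_ends_on_page_boundary(width, height, bytes_per_pixel=1,
                                page_size=PAGE_SIZE):
    """True if a tightly packed, page-aligned plane ends at a page boundary."""
    return (width * height * bytes_per_pixel) % page_size == 0


print(overread_pixels(loaded=12, width=4))        # the 4 pixel wide case
print(plane_ends_on_page_boundary(1280, 640))     # size named in the message
print(plane_ends_on_page_boundary(854, 480))      # a size that leaves slack
```

A 4 pixel wide 8-tap filter needs 11 input pixels, so loading 12 leaves a single unused pixel. When the plane ends exactly on a page boundary, even that one pixel can fall on an unmapped page, which is why the edge emulation path is still required.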
Fixes: https://bugzilla.mozilla.org/show_bug.cgi?id=1881185
Signed-off-by: Janne Grunau <janne-ffmpeg@jannau.net>
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2025-01-03 17:53:46 -05:00
Zhao Zhili
952508ae05
aarch64/vvc: Add apply_bdof
...
Test on rpi 5 with gcc 12:
apply_bdof_8_8x16_c: 7315.2 ( 1.00x)
apply_bdof_8_8x16_neon: 1876.8 ( 3.90x)
apply_bdof_8_16x8_c: 7170.5 ( 1.00x)
apply_bdof_8_16x8_neon: 1752.8 ( 4.09x)
apply_bdof_8_16x16_c: 14695.2 ( 1.00x)
apply_bdof_8_16x16_neon: 3490.5 ( 4.21x)
apply_bdof_10_8x16_c: 7371.5 ( 1.00x)
apply_bdof_10_8x16_neon: 1863.8 ( 3.96x)
apply_bdof_10_16x8_c: 7172.0 ( 1.00x)
apply_bdof_10_16x8_neon: 1766.0 ( 4.06x)
apply_bdof_10_16x16_c: 14551.5 ( 1.00x)
apply_bdof_10_16x16_neon: 3576.0 ( 4.07x)
apply_bdof_12_8x16_c: 7236.5 ( 1.00x)
apply_bdof_12_8x16_neon: 1863.8 ( 3.88x)
apply_bdof_12_16x8_c: 7316.5 ( 1.00x)
apply_bdof_12_16x8_neon: 1758.8 ( 4.16x)
apply_bdof_12_16x16_c: 14691.2 ( 1.00x)
apply_bdof_12_16x16_neon: 3480.5 ( 4.22x)
2024-12-21 11:54:44 +08:00
Martin Storsjö
2bb00ef59c
aarch64: vvc: Fix building the dmvr_hv assembly with older MSVC versions
...
Explicitly use ldur for unaligned offsets; newer versions of
armasm64 implicitly convert ldr to ldur as necessary, but older
versions require it explicitly written out.
This fixes these build errors:
ffmpeg\libavcodec\aarch64\vvc\inter.o.asm(2039) :
error A2518: operand 2: Memory offset must be aligned
ldr s5, [x1, #1]
ffmpeg\libavcodec\aarch64\vvc\inter.o.asm(2250) :
error A2518: operand 2: Memory offset must be aligned
ldr d7, [x1, #2]
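The alignment rule behind these errors follows from the instruction encodings. The unsigned-offset form of ldr stores the offset divided by the access size in a 12 bit field, while ldur takes an unscaled signed 9 bit byte offset. A small sketch of the choice the assembler has to make; the helper name is hypothetical.

```python
def load_mnemonic(offset, access_size):
    """Pick the mnemonic able to encode [base, #offset] for a given access size."""
    # ldr (unsigned offset): offset must be a multiple of the access size
    # and the scaled value must fit in 12 bits.
    if offset % access_size == 0 and 0 <= offset // access_size <= 4095:
        return "ldr"
    # ldur: unscaled signed 9 bit byte offset.
    if -256 <= offset <= 255:
        return "ldur"
    raise ValueError("offset not encodable as an immediate")


print(load_mnemonic(1, 4))    # s register (4 bytes) at #1
print(load_mnemonic(2, 8))    # d register (8 bytes) at #2
print(load_mnemonic(16, 16))  # q register (16 bytes) at an aligned offset
```

Both failing lines above use offsets that are not multiples of the access size, so only the ldur form can encode them.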
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-12-18 13:45:09 +02:00
Bin Peng
72a3656e84
lavc/aarch64: Fix ff_pred16x16_plane_neon_10
...
Fix test failure on aarch64:
./tests/checkasm/checkasm --test=h264pred 367840
Signed-off-by: Peng Bin <pengbin@visionular.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-12-17 14:50:29 +02:00
Bin Peng
decc9e643c
lavc/aarch64: Fix ff_pred8x8_plane_neon_10
...
Fix test failure on aarch64:
./tests/checkasm/checkasm --test=h264pred 479612
The mismatch between neon and C functions can also be reproduced using the following bitstream and command line.
wget https://streams.videolan.org/ffmpeg/incoming/intra8x8pred_10bit.264
./ffmpeg -cpuflags 0 -threads 1 -i intra8x8pred_10bit.264 -f framemd5 -y md5_ref
./ffmpeg -threads 1 -i intra8x8pred_10bit.264 -f framemd5 -y md5_neon
Signed-off-by: Bin Peng <pengbin@visionular.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-12-17 14:50:29 +02:00
Zhao Zhili
40feba5f77
aarch64/vvc: Fix clip in alf
...
Fix test failure:
./tests/checkasm/checkasm --test=vvc_alf 3607569773
2024-12-10 21:00:47 +08:00
Zhao Zhili
91436638de
aarch64/vvc: Use faster clip operation
...
Replace sqxtn+smin+smax by sqxtun+umin.
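Both sequences clip a value to [0, max_pixel]. A scalar model of why they agree is sketched below, assuming 32 bit inputs narrowed to 16 bit lanes and a pixel maximum no larger than 32767; the function names are hypothetical.

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))


def clip_old(x, max_pixel):
    """sqxtn + smin + smax: signed saturating narrow, then clamp both ends."""
    v = clamp(x, -32768, 32767)   # sqxtn
    v = min(v, max_pixel)         # smin
    return max(v, 0)              # smax


def clip_new(x, max_pixel):
    """sqxtun + umin: the unsigned saturating narrow already clamps at zero."""
    v = clamp(x, 0, 65535)        # sqxtun
    return min(v, max_pixel)      # umin


def differences(max_pixel, lo=-70000, hi=70000):
    """Count inputs for which the two sequences disagree."""
    return sum(clip_old(x, max_pixel) != clip_new(x, max_pixel)
               for x in range(lo, hi + 1))


print([differences(m) for m in (255, 1023, 4095)])
```

Since sqxtun saturates negative inputs to zero as part of the narrowing, the separate lower-bound instruction is no longer needed.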
2024-12-10 21:00:47 +08:00
Zhao Zhili
bfed5f6b7d
aarch64/vvc: Reuse ff_vvc_put_pel_pixels for chroma
2024-12-10 21:00:47 +08:00
Zhao Zhili
5988a2729b
aarch64/vvc: Add dmvr
...
dmvr_8_12x20_c: 1.5 ( 1.00x)
dmvr_8_12x20_neon: 0.2 ( 6.56x)
dmvr_8_20x12_c: 1.0 ( 1.00x)
dmvr_8_20x12_neon: 0.2 ( 4.33x)
dmvr_8_20x20_c: 1.7 ( 1.00x)
dmvr_8_20x20_neon: 0.5 ( 3.63x)
dmvr_12_12x20_c: 2.2 ( 1.00x)
dmvr_12_12x20_neon: 0.5 ( 4.68x)
dmvr_12_20x12_c: 2.0 ( 1.00x)
dmvr_12_20x12_neon: 0.5 ( 4.16x)
dmvr_12_20x20_c: 3.7 ( 1.00x)
dmvr_12_20x20_neon: 0.7 ( 5.14x)
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-10-01 10:28:54 +08:00
Zhao Zhili
bcd65ebd8f
aarch64/vvc: Add dmvr_hv
...
dmvr_hv_8_12x20_c: 8.0 ( 1.00x)
dmvr_hv_8_12x20_neon: 1.2 ( 6.62x)
dmvr_hv_8_20x12_c: 8.0 ( 1.00x)
dmvr_hv_8_20x12_neon: 0.9 ( 8.37x)
dmvr_hv_8_20x20_c: 12.9 ( 1.00x)
dmvr_hv_8_20x20_neon: 1.7 ( 7.62x)
dmvr_hv_10_12x20_c: 7.0 ( 1.00x)
dmvr_hv_10_12x20_neon: 1.7 ( 4.09x)
dmvr_hv_10_20x12_c: 7.0 ( 1.00x)
dmvr_hv_10_20x12_neon: 1.7 ( 4.09x)
dmvr_hv_10_20x20_c: 11.2 ( 1.00x)
dmvr_hv_10_20x20_neon: 2.7 ( 4.15x)
dmvr_hv_12_12x20_c: 6.5 ( 1.00x)
dmvr_hv_12_12x20_neon: 1.7 ( 3.79x)
dmvr_hv_12_20x12_c: 6.5 ( 1.00x)
dmvr_hv_12_20x12_neon: 1.7 ( 3.79x)
dmvr_hv_12_20x20_c: 10.2 ( 1.00x)
dmvr_hv_12_20x20_neon: 2.2 ( 4.64x)
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-10-01 10:28:54 +08:00
Zhao Zhili
0ba9e8d0d4
aarch64/vvc: Add w_avg
...
w_avg_8_2x2_c: 0.0 ( 0.00x)
w_avg_8_2x2_neon: 0.0 ( 0.00x)
w_avg_8_4x4_c: 0.2 ( 1.00x)
w_avg_8_4x4_neon: 0.0 ( 0.00x)
w_avg_8_8x8_c: 1.2 ( 1.00x)
w_avg_8_8x8_neon: 0.2 ( 5.00x)
w_avg_8_16x16_c: 4.2 ( 1.00x)
w_avg_8_16x16_neon: 0.8 ( 5.67x)
w_avg_8_32x32_c: 16.2 ( 1.00x)
w_avg_8_32x32_neon: 2.5 ( 6.50x)
w_avg_8_64x64_c: 64.5 ( 1.00x)
w_avg_8_64x64_neon: 9.0 ( 7.17x)
w_avg_8_128x128_c: 269.5 ( 1.00x)
w_avg_8_128x128_neon: 35.5 ( 7.59x)
w_avg_10_2x2_c: 0.2 ( 1.00x)
w_avg_10_2x2_neon: 0.2 ( 1.00x)
w_avg_10_4x4_c: 0.2 ( 1.00x)
w_avg_10_4x4_neon: 0.2 ( 1.00x)
w_avg_10_8x8_c: 1.0 ( 1.00x)
w_avg_10_8x8_neon: 0.2 ( 4.00x)
w_avg_10_16x16_c: 4.2 ( 1.00x)
w_avg_10_16x16_neon: 0.8 ( 5.67x)
w_avg_10_32x32_c: 16.2 ( 1.00x)
w_avg_10_32x32_neon: 2.5 ( 6.50x)
w_avg_10_64x64_c: 66.2 ( 1.00x)
w_avg_10_64x64_neon: 10.0 ( 6.62x)
w_avg_10_128x128_c: 277.8 ( 1.00x)
w_avg_10_128x128_neon: 39.8 ( 6.99x)
w_avg_12_2x2_c: 0.0 ( 0.00x)
w_avg_12_2x2_neon: 0.2 ( 0.00x)
w_avg_12_4x4_c: 0.2 ( 1.00x)
w_avg_12_4x4_neon: 0.0 ( 0.00x)
w_avg_12_8x8_c: 1.2 ( 1.00x)
w_avg_12_8x8_neon: 0.5 ( 2.50x)
w_avg_12_16x16_c: 4.8 ( 1.00x)
w_avg_12_16x16_neon: 0.8 ( 6.33x)
w_avg_12_32x32_c: 17.0 ( 1.00x)
w_avg_12_32x32_neon: 2.8 ( 6.18x)
w_avg_12_64x64_c: 64.0 ( 1.00x)
w_avg_12_64x64_neon: 10.0 ( 6.40x)
w_avg_12_128x128_c: 269.2 ( 1.00x)
w_avg_12_128x128_neon: 42.0 ( 6.41x)
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-10-01 10:28:54 +08:00
Martin Storsjö
a3ec1f8c6c
aarch64: h26x: Fix the indentation of one function
...
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-09-26 13:42:11 +03:00
Zhao Zhili
3f84d1d1fb
aarch64/vvc: Add avg
...
avg_8_2x2_c: 0.2 ( 1.00x)
avg_8_2x2_neon: 0.2 ( 1.00x)
avg_8_4x4_c: 0.2 ( 1.00x)
avg_8_4x4_neon: 0.2 ( 1.00x)
avg_8_8x8_c: 0.9 ( 1.00x)
avg_8_8x8_neon: 0.2 ( 5.29x)
avg_8_16x16_c: 3.7 ( 1.00x)
avg_8_16x16_neon: 0.7 ( 5.44x)
avg_8_32x32_c: 14.9 ( 1.00x)
avg_8_32x32_neon: 1.7 ( 8.91x)
avg_8_64x64_c: 59.7 ( 1.00x)
avg_8_64x64_neon: 6.9 ( 8.62x)
avg_8_128x128_c: 254.7 ( 1.00x)
avg_8_128x128_neon: 26.9 ( 9.46x)
avg_10_2x2_c: 0.2 ( 1.00x)
avg_10_2x2_neon: 0.2 ( 1.00x)
avg_10_4x4_c: 0.2 ( 1.00x)
avg_10_4x4_neon: 0.2 ( 1.00x)
avg_10_8x8_c: 0.9 ( 1.00x)
avg_10_8x8_neon: 0.2 ( 5.29x)
avg_10_16x16_c: 3.4 ( 1.00x)
avg_10_16x16_neon: 0.4 ( 8.06x)
avg_10_32x32_c: 13.9 ( 1.00x)
avg_10_32x32_neon: 1.9 ( 7.23x)
avg_10_64x64_c: 54.2 ( 1.00x)
avg_10_64x64_neon: 8.4 ( 6.43x)
avg_10_128x128_c: 232.4 ( 1.00x)
avg_10_128x128_neon: 30.9 ( 7.52x)
avg_12_2x2_c: 0.0 ( 0.00x)
avg_12_2x2_neon: 0.2 ( 0.00x)
avg_12_4x4_c: 0.4 ( 1.00x)
avg_12_4x4_neon: 0.2 ( 2.43x)
avg_12_8x8_c: 0.7 ( 1.00x)
avg_12_8x8_neon: 0.2 ( 3.86x)
avg_12_16x16_c: 3.7 ( 1.00x)
avg_12_16x16_neon: 0.4 ( 8.65x)
avg_12_32x32_c: 13.7 ( 1.00x)
avg_12_32x32_neon: 2.2 ( 6.29x)
avg_12_64x64_c: 53.9 ( 1.00x)
avg_12_64x64_neon: 7.7 ( 7.03x)
avg_12_128x128_c: 270.9 ( 1.00x)
avg_12_128x128_neon: 30.4 ( 8.90x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
1be5a2374f
aarch64/vvc: Add put_epel_hv
...
On Apple M1:
put_chroma_hv_8_4x4_c: 1.7 ( 1.00x)
put_chroma_hv_8_4x4_neon: 0.2 ( 7.67x)
put_chroma_hv_8_8x8_c: 5.5 ( 1.00x)
put_chroma_hv_8_8x8_neon: 0.5 (11.53x)
put_chroma_hv_8_16x16_c: 18.5 ( 1.00x)
put_chroma_hv_8_16x16_neon: 1.5 (12.53x)
put_chroma_hv_8_32x32_c: 72.5 ( 1.00x)
put_chroma_hv_8_32x32_neon: 4.7 (15.34x)
put_chroma_hv_8_64x64_c: 274.0 ( 1.00x)
put_chroma_hv_8_64x64_neon: 18.5 (14.83x)
put_chroma_hv_8_128x128_c: 1058.7 ( 1.00x)
put_chroma_hv_8_128x128_neon: 75.2 (14.07x)
On Android Pixel 8 Pro:
put_chroma_hv_8_4x4_c: 1.2 ( 1.00x)
put_chroma_hv_8_4x4_neon: 0.0 ( 0.00x)
put_chroma_hv_8_4x4_i8mm: 0.2 ( 5.00x)
put_chroma_hv_8_8x8_c: 4.0 ( 1.00x)
put_chroma_hv_8_8x8_neon: 0.5 ( 8.00x)
put_chroma_hv_8_8x8_i8mm: 0.5 ( 8.00x)
put_chroma_hv_8_16x16_c: 15.2 ( 1.00x)
put_chroma_hv_8_16x16_neon: 2.5 ( 6.10x)
put_chroma_hv_8_16x16_i8mm: 2.2 ( 6.78x)
put_chroma_hv_8_32x32_c: 61.0 ( 1.00x)
put_chroma_hv_8_32x32_neon: 9.8 ( 6.26x)
put_chroma_hv_8_32x32_i8mm: 8.5 ( 7.18x)
put_chroma_hv_8_64x64_c: 229.5 ( 1.00x)
put_chroma_hv_8_64x64_neon: 38.5 ( 5.96x)
put_chroma_hv_8_64x64_i8mm: 34.0 ( 6.75x)
put_chroma_hv_8_128x128_c: 919.8 ( 1.00x)
put_chroma_hv_8_128x128_neon: 154.5 ( 5.95x)
put_chroma_hv_8_128x128_i8mm: 140.0 ( 6.57x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
0dcf204e5d
aarch64/vvc: Add put_epel_h i8mm
...
put_chroma_h_8_4x4_c: 0.4 ( 1.00x)
put_chroma_h_8_4x4_neon: 0.0 ( 0.00x)
put_chroma_h_8_4x4_i8mm: 0.1 ( 2.67x)
put_chroma_h_8_8x8_c: 1.6 ( 1.00x)
put_chroma_h_8_8x8_neon: 0.1 (11.00x)
put_chroma_h_8_8x8_i8mm: 0.1 (11.00x)
put_chroma_h_8_16x16_c: 6.9 ( 1.00x)
put_chroma_h_8_16x16_neon: 1.1 ( 6.00x)
put_chroma_h_8_16x16_i8mm: 0.7 (10.62x)
put_chroma_h_8_32x32_c: 27.6 ( 1.00x)
put_chroma_h_8_32x32_neon: 4.7 ( 5.95x)
put_chroma_h_8_32x32_i8mm: 4.4 ( 6.28x)
put_chroma_h_8_64x64_c: 116.2 ( 1.00x)
put_chroma_h_8_64x64_neon: 19.1 ( 6.07x)
put_chroma_h_8_64x64_i8mm: 17.1 ( 6.77x)
put_chroma_h_8_128x128_c: 466.6 ( 1.00x)
put_chroma_h_8_128x128_neon: 81.4 ( 5.73x)
put_chroma_h_8_128x128_i8mm: 71.7 ( 6.51x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
41a1885f7a
aarch64/vvc: Add put_epel_h
...
put_chroma_h_8_4x4_c: 0.2 ( 1.00x)
put_chroma_h_8_4x4_neon: 0.2 ( 1.00x)
put_chroma_h_8_8x8_c: 0.8 ( 1.00x)
put_chroma_h_8_8x8_neon: 0.2 ( 3.00x)
put_chroma_h_8_16x16_c: 3.8 ( 1.00x)
put_chroma_h_8_16x16_neon: 0.8 ( 5.00x)
put_chroma_h_8_32x32_c: 12.5 ( 1.00x)
put_chroma_h_8_32x32_neon: 2.2 ( 5.56x)
put_chroma_h_8_64x64_c: 47.0 ( 1.00x)
put_chroma_h_8_64x64_neon: 8.8 ( 5.37x)
put_chroma_h_8_128x128_c: 200.2 ( 1.00x)
put_chroma_h_8_128x128_neon: 31.8 ( 6.31x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
260e1b4b62
aarch64/vvc: Add sad
...
sad_8x16_c: 0.8 ( 1.00x)
sad_8x16_neon: 0.2 ( 3.00x)
sad_16x8_c: 0.5 ( 1.00x)
sad_16x8_neon: 0.2 ( 2.00x)
sad_16x16_c: 1.5 ( 1.00x)
sad_16x16_neon: 0.2 ( 6.00x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
5ac6925803
aarch64/vvc: Add put_qpel_hv
...
With Apple M1 (no i8mm):
put_luma_hv_8_4x4_c: 2.2 ( 1.00x)
put_luma_hv_8_4x4_neon: 0.8 ( 3.00x)
put_luma_hv_8_8x8_c: 7.0 ( 1.00x)
put_luma_hv_8_8x8_neon: 0.8 ( 9.33x)
put_luma_hv_8_16x16_c: 22.8 ( 1.00x)
put_luma_hv_8_16x16_neon: 2.5 ( 9.10x)
put_luma_hv_8_32x32_c: 84.8 ( 1.00x)
put_luma_hv_8_32x32_neon: 9.5 ( 8.92x)
put_luma_hv_8_64x64_c: 333.0 ( 1.00x)
put_luma_hv_8_64x64_neon: 35.5 ( 9.38x)
put_luma_hv_8_128x128_c: 1294.5 ( 1.00x)
put_luma_hv_8_128x128_neon: 137.8 ( 9.40x)
With Pixel 8 Pro:
put_luma_hv_8_4x4_c: 5.0 ( 1.00x)
put_luma_hv_8_4x4_neon: 0.8 ( 6.67x)
put_luma_hv_8_4x4_i8mm: 0.2 (20.00x)
put_luma_hv_8_8x8_c: 13.2 ( 1.00x)
put_luma_hv_8_8x8_neon: 1.2 (10.60x)
put_luma_hv_8_8x8_i8mm: 1.2 (10.60x)
put_luma_hv_8_16x16_c: 44.2 ( 1.00x)
put_luma_hv_8_16x16_neon: 4.5 ( 9.83x)
put_luma_hv_8_16x16_i8mm: 4.2 (10.41x)
put_luma_hv_8_32x32_c: 160.8 ( 1.00x)
put_luma_hv_8_32x32_neon: 17.5 ( 9.19x)
put_luma_hv_8_32x32_i8mm: 16.0 (10.05x)
put_luma_hv_8_64x64_c: 611.2 ( 1.00x)
put_luma_hv_8_64x64_neon: 68.0 ( 8.99x)
put_luma_hv_8_64x64_i8mm: 62.2 ( 9.82x)
put_luma_hv_8_128x128_c: 2384.8 ( 1.00x)
put_luma_hv_8_128x128_neon: 268.8 ( 8.87x)
put_luma_hv_8_128x128_i8mm: 245.8 ( 9.70x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
a0b52afd32
aarch64/vvc: Add put_qpel_vx
...
put_luma_v_8_4x4_c: 1.0 ( 1.00x)
put_luma_v_8_4x4_neon: 0.0 ( 0.00x)
put_luma_v_8_8x8_c: 3.5 ( 1.00x)
put_luma_v_8_8x8_neon: 0.5 ( 7.00x)
put_luma_v_8_16x16_c: 13.8 ( 1.00x)
put_luma_v_8_16x16_neon: 1.2 (11.00x)
put_luma_v_8_32x32_c: 54.2 ( 1.00x)
put_luma_v_8_32x32_neon: 5.0 (10.85x)
put_luma_v_8_64x64_c: 217.5 ( 1.00x)
put_luma_v_8_64x64_neon: 18.8 (11.60x)
put_luma_v_8_128x128_c: 886.2 ( 1.00x)
put_luma_v_8_128x128_neon: 74.0 (11.98x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
b051bc7cb8
aarch64/h26x: Remove duplicate b.eq instruction
...
b.eq is added by calc_all after each calc.
2024-09-14 16:36:34 +08:00
Zhao Zhili
9f6c8eb412
aarch64/vvc: Add put_qpel_hx i8mm
...
Benchmark on Android pixel 8 with -fno-vectorize
put_luma_h_8_4x4_c: 0.2 ( 1.00x)
put_luma_h_8_4x4_neon: 0.2 ( 1.00x)
put_luma_h_8_4x4_i8mm: 0.0 ( 0.00x)
put_luma_h_8_8x8_c: 1.5 ( 1.00x)
put_luma_h_8_8x8_neon: 0.5 ( 3.00x)
put_luma_h_8_8x8_i8mm: 0.5 ( 3.00x)
put_luma_h_8_16x16_c: 6.2 ( 1.00x)
put_luma_h_8_16x16_neon: 2.0 ( 3.12x)
put_luma_h_8_16x16_i8mm: 1.5 ( 4.17x)
put_luma_h_8_32x32_c: 25.5 ( 1.00x)
put_luma_h_8_32x32_neon: 9.0 ( 2.83x)
put_luma_h_8_32x32_i8mm: 6.8 ( 3.78x)
put_luma_h_8_64x64_c: 99.8 ( 1.00x)
put_luma_h_8_64x64_neon: 35.2 ( 2.83x)
put_luma_h_8_64x64_i8mm: 27.2 ( 3.66x)
put_luma_h_8_128x128_c: 422.0 ( 1.00x)
put_luma_h_8_128x128_neon: 138.5 ( 3.05x)
put_luma_h_8_128x128_i8mm: 109.2 ( 3.86x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
25448d1716
aarch64/vvc: Add put_pel/put_pel_uni/put_pel_uni_w
...
put_luma_pixels_8_4x4_c: 0.2 ( 1.00x)
put_luma_pixels_8_4x4_neon: 0.2 ( 1.00x)
put_luma_pixels_8_8x8_c: 0.7 ( 1.00x)
put_luma_pixels_8_8x8_neon: 0.2 ( 3.22x)
put_luma_pixels_8_16x16_c: 2.2 ( 1.00x)
put_luma_pixels_8_16x16_neon: 0.2 ( 9.89x)
put_luma_pixels_8_32x32_c: 8.2 ( 1.00x)
put_luma_pixels_8_32x32_neon: 1.2 ( 6.71x)
put_luma_pixels_8_64x64_c: 33.7 ( 1.00x)
put_luma_pixels_8_64x64_neon: 2.5 (13.63x)
put_luma_pixels_8_128x128_c: 145.5 ( 1.00x)
put_luma_pixels_8_128x128_neon: 10.2 (14.23x)
put_uni_pixels_luma_8_4x4_c: 0.5 ( 1.00x)
put_uni_pixels_luma_8_4x4_neon: 0.0 ( 0.00x)
put_uni_pixels_luma_8_8x8_c: 0.5 ( 1.00x)
put_uni_pixels_luma_8_8x8_neon: 0.2 ( 2.11x)
put_uni_pixels_luma_8_16x16_c: 1.2 ( 1.00x)
put_uni_pixels_luma_8_16x16_neon: 0.2 ( 5.44x)
put_uni_pixels_luma_8_32x32_c: 3.0 ( 1.00x)
put_uni_pixels_luma_8_32x32_neon: 0.5 ( 6.26x)
put_uni_pixels_luma_8_64x64_c: 3.0 ( 1.00x)
put_uni_pixels_luma_8_64x64_neon: 1.7 ( 1.72x)
put_uni_pixels_luma_8_128x128_c: 6.5 ( 1.00x)
put_uni_pixels_luma_8_128x128_neon: 6.5 ( 1.00x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
20f2bf5530
aarch64/vvc: Add put_qpel_h_* and put_qpel_uni_h_*
...
Just share hevc implementation.
checkasm --test=vvc_mc --benchmark:
put_luma_h_8_4x4_c: 0.2 ( 1.00x)
put_luma_h_8_4x4_neon: 0.2 ( 1.00x)
put_luma_h_8_8x8_c: 1.0 ( 1.00x)
put_luma_h_8_8x8_neon: 0.2 ( 4.33x)
put_luma_h_8_16x16_c: 3.2 ( 1.00x)
put_luma_h_8_16x16_neon: 1.2 ( 2.63x)
put_luma_h_8_32x32_c: 13.7 ( 1.00x)
put_luma_h_8_32x32_neon: 4.0 ( 3.45x)
put_luma_h_8_64x64_c: 48.2 ( 1.00x)
put_luma_h_8_64x64_neon: 15.7 ( 3.07x)
put_luma_h_8_128x128_c: 203.5 ( 1.00x)
put_luma_h_8_128x128_neon: 62.0 ( 3.28x)
put_uni_h_luma_8_4x4_c: 0.2 ( 1.00x)
put_uni_h_luma_8_4x4_neon: 0.2 ( 1.00x)
put_uni_h_luma_8_8x8_c: 1.5 ( 1.00x)
put_uni_h_luma_8_8x8_neon: 0.2 ( 6.56x)
put_uni_h_luma_8_16x16_c: 5.7 ( 1.00x)
put_uni_h_luma_8_16x16_neon: 1.2 ( 4.67x)
put_uni_h_luma_8_32x32_c: 24.0 ( 1.00x)
put_uni_h_luma_8_32x32_neon: 4.7 ( 5.07x)
put_uni_h_luma_8_64x64_c: 90.0 ( 1.00x)
put_uni_h_luma_8_64x64_neon: 17.0 ( 5.30x)
put_uni_h_luma_8_128x128_c: 357.7 ( 1.00x)
put_uni_h_luma_8_128x128_neon: 67.5 ( 5.30x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
46f07ce7d1
aarch64/hevc: Move epel/qpel to h26x directory
...
So vvc can reuse the implementation.
2024-09-14 16:36:34 +08:00
Zhao Zhili
8beafb5656
aarch64/hevc: Simplify function prototypes by macro
2024-09-14 16:36:34 +08:00
Anton Khirnov
3f9ca51015
lavc/opus*: move to opus/ subdir
2024-09-02 11:56:53 +02:00
Ramiro Polla
6aafe61285
avcodec/mpegvideoencdsp: convert stride parameters from int to ptrdiff_t
2024-09-01 13:42:30 +02:00
Zhao Zhili
4c0372281b
aarch64/vvc: Bind h26x/sao filter implementation to vvc
...
Reviewed-by: Martin Storsjö <martin@martin.st>
2024-08-31 16:07:50 +08:00
Zhao Zhili
8cc10298a7
aarch64/hevc: Move sao to h26x directory
...
So vvc can reuse the implementation.
Reviewed-by: Martin Storsjö <martin@martin.st>
2024-08-31 16:07:43 +08:00
Ramiro Polla
8c203ea7c7
avcodec/aarch64/mpegvideoencdsp: add dotprod implementation for pix_norm1
...
A55 A76
pix_norm1_c: 484.3 235.2
pix_norm1_neon: 193.8 ( 2.50x) 44.7 ( 5.26x)
pix_norm1_dotprod: 91.8 ( 5.28x) 21.2 (11.09x)
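pix_norm1 is the sum of squared pixel values over a 16x16 block, which maps naturally onto a dot product of each row with itself. The sketch below models the UDOT instruction on 16 byte vectors in plain Python; it illustrates the idea only and is not a transcription of the assembly, and the function names are hypothetical.

```python
def udot(acc, n, m):
    """Model of UDOT Vd.4S, Vn.16B, Vm.16B.

    Each of the four 32 bit lanes accumulates the sum of four u8 * u8 products.
    """
    return [(acc[i] + sum(n[4 * i + j] * m[4 * i + j] for j in range(4)))
            & 0xFFFFFFFF
            for i in range(4)]


def pix_norm1_dotprod(pix, stride):
    """Sum of squares of a 16x16 block using one dot product per row."""
    acc = [0, 0, 0, 0]
    for y in range(16):
        row = pix[y * stride:y * stride + 16]
        acc = udot(acc, row, row)     # row dotted with itself gives squares
    return sum(acc)                   # final horizontal add across lanes


def pix_norm1_ref(pix, stride):
    """Scalar reference: sum of squared pixels over the 16x16 block."""
    return sum(pix[y * stride + x] ** 2 for y in range(16) for x in range(16))


pixels = [(i * 7) % 256 for i in range(16 * 20)]
print(pix_norm1_dotprod(pixels, 20) == pix_norm1_ref(pixels, 20))
```

The largest possible result is 256 * 255^2, which fits comfortably in the 32 bit lanes, so no widening is needed during accumulation.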
2024-08-26 12:49:04 +02:00
Ramiro Polla
9f68a3712e
avcodec/aarch64/mpegvideoencdsp: add neon implementations for pix_sum and pix_norm1
...
A55 A76
pix_norm1_c: 478.2 234.2
pix_norm1_neon: 188.2 ( 2.54x) 41.2 ( 5.68x)
pix_sum_c: 304.2 244.0
pix_sum_neon: 77.2 ( 3.94x) 21.5 (11.35x)
2024-08-26 12:48:31 +02:00
Ramiro Polla
5c1c0325cd
avcodec/aarch64/me_cmp: add dotprod implementations of sse16 and vsse_intra16
...
checkasm --bench for Raspberry Pi 5 Model B Rev 1.0:
sse_0_c: 241.5
sse_0_neon: 37.2
sse_0_dotprod: 22.2
vsse_4_c: 148.7
vsse_4_neon: 31.0
vsse_4_dotprod: 15.7
2024-08-17 15:31:48 +02:00
Martin Storsjö
4acb9b7d10
aarch64: vvc: Fix unnecessary extra spaces
...
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-07-23 16:04:28 +03:00
Martin Storsjö
99598629e8
aarch64: vvc: Consistently use # for immediate constants
...
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-07-23 15:24:37 +03:00
Martin Storsjö
400843151d
aarch64: vvc: Fix compilation of alf.S with MSVC 2022 17.7 and older
...
Use the "ldur" instruction explicitly, instead of having the
assembler implicitly convert "ldr" instructions to "ldur".
This fixes build errors like these:
libavcodec\aarch64\vvc\alf.o.asm(1023) : error A2518: operand 2: Memory offset must be aligned
ldr q22, [x3, #24]
libavcodec\aarch64\vvc\alf.o.asm(1024) : error A2518: operand 2: Memory offset must be aligned
ldr q24, [x2, #24]
libavcodec\aarch64\vvc\alf.o.asm(1393) : error A2518: operand 2: Memory offset must be aligned
ldr q22, [x3, #24]
libavcodec\aarch64\vvc\alf.o.asm(1394) : error A2518: operand 2: Memory offset must be aligned
ldr q24, [x2, #24]
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-07-23 15:24:33 +03:00
Zhao Zhili
2d4ef304c9
avcodec/vvc: Add aarch64 neon optimization for ALF
...
vvc_alf_filter_chroma_4x4_8_c: 3.0
vvc_alf_filter_chroma_4x4_8_neon: 1.0
vvc_alf_filter_chroma_4x4_10_c: 2.7
vvc_alf_filter_chroma_4x4_10_neon: 1.0
vvc_alf_filter_chroma_4x4_12_c: 2.7
vvc_alf_filter_chroma_4x4_12_neon: 1.0
vvc_alf_filter_chroma_8x8_8_c: 10.2
vvc_alf_filter_chroma_8x8_8_neon: 3.0
vvc_alf_filter_chroma_8x8_10_c: 10.0
vvc_alf_filter_chroma_8x8_10_neon: 2.5
vvc_alf_filter_chroma_8x8_12_c: 10.0
vvc_alf_filter_chroma_8x8_12_neon: 2.5
vvc_alf_filter_chroma_16x16_8_c: 41.7
vvc_alf_filter_chroma_16x16_8_neon: 11.2
vvc_alf_filter_chroma_16x16_10_c: 39.0
vvc_alf_filter_chroma_16x16_10_neon: 10.0
vvc_alf_filter_chroma_16x16_12_c: 40.2
vvc_alf_filter_chroma_16x16_12_neon: 10.2
vvc_alf_filter_chroma_32x32_8_c: 162.0
vvc_alf_filter_chroma_32x32_8_neon: 45.0
vvc_alf_filter_chroma_32x32_10_c: 155.5
vvc_alf_filter_chroma_32x32_10_neon: 39.5
vvc_alf_filter_chroma_32x32_12_c: 155.5
vvc_alf_filter_chroma_32x32_12_neon: 40.0
vvc_alf_filter_chroma_64x64_8_c: 646.0
vvc_alf_filter_chroma_64x64_8_neon: 175.5
vvc_alf_filter_chroma_64x64_10_c: 708.2
vvc_alf_filter_chroma_64x64_10_neon: 166.7
vvc_alf_filter_chroma_64x64_12_c: 619.2
vvc_alf_filter_chroma_64x64_12_neon: 157.2
vvc_alf_filter_chroma_128x128_8_c: 2611.5
vvc_alf_filter_chroma_128x128_8_neon: 698.2
vvc_alf_filter_chroma_128x128_10_c: 2470.0
vvc_alf_filter_chroma_128x128_10_neon: 616.0
vvc_alf_filter_chroma_128x128_12_c: 2531.5
vvc_alf_filter_chroma_128x128_12_neon: 620.2
vvc_alf_filter_luma_8x8_8_c: 25.2
vvc_alf_filter_luma_8x8_8_neon: 4.2
vvc_alf_filter_luma_8x8_10_c: 18.5
vvc_alf_filter_luma_8x8_10_neon: 4.0
vvc_alf_filter_luma_8x8_12_c: 19.0
vvc_alf_filter_luma_8x8_12_neon: 4.0
vvc_alf_filter_luma_16x16_8_c: 106.5
vvc_alf_filter_luma_16x16_8_neon: 16.2
vvc_alf_filter_luma_16x16_10_c: 75.2
vvc_alf_filter_luma_16x16_10_neon: 14.7
vvc_alf_filter_luma_16x16_12_c: 79.7
vvc_alf_filter_luma_16x16_12_neon: 14.7
vvc_alf_filter_luma_32x32_8_c: 400.5
vvc_alf_filter_luma_32x32_8_neon: 63.2
vvc_alf_filter_luma_32x32_10_c: 299.2
vvc_alf_filter_luma_32x32_10_neon: 57.7
vvc_alf_filter_luma_32x32_12_c: 299.2
vvc_alf_filter_luma_32x32_12_neon: 57.7
vvc_alf_filter_luma_64x64_8_c: 1602.5
vvc_alf_filter_luma_64x64_8_neon: 251.7
vvc_alf_filter_luma_64x64_10_c: 1197.0
vvc_alf_filter_luma_64x64_10_neon: 235.5
vvc_alf_filter_luma_64x64_12_c: 1220.2
vvc_alf_filter_luma_64x64_12_neon: 235.7
vvc_alf_filter_luma_128x128_8_c: 6570.2
vvc_alf_filter_luma_128x128_8_neon: 1007.7
vvc_alf_filter_luma_128x128_10_c: 4822.7
vvc_alf_filter_luma_128x128_10_neon: 936.2
vvc_alf_filter_luma_128x128_12_c: 4791.2
vvc_alf_filter_luma_128x128_12_neon: 938.5
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-07-22 21:09:56 +08:00
Anton Khirnov
e4601cc339
lavc/hevc*: move to hevc/ subdir
2024-06-04 11:46:27 +02:00