mirror of
https://git.ffmpeg.org/ffmpeg.git
f5ed254528
The MMX function processes two registers in parallel;
given the larger register size of SSE2, the same amount
of data can be processed in one register, with some speedup.
(Since this function is used for tail processing,
it is important that it not process more data per iteration.)
Switching to SSE2 also fixes a bug introduced in
554c2bc708: since that commit, only half of the dither
values have been used. This does not seem to matter in
practice, because the functions here use dither only in
the following form:
((filtersize-1)*8+dither)>>4. The dither values used
here come from ff_dither_8x8_128 which has the property
that ff_dither_8x8_128[i][j] and ff_dither_8x8_128[i][j+4]
always lead to the same result in the above formula.
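
The arithmetic behind this can be sketched in C. This is a minimal
illustration, not code from the patch: the helper names are
hypothetical, and it only assumes non-negative dither values and
filter sizes of at least 1. It shows that the formula depends on the
dither value only through dither >> 3, so two entries that agree in
those bits give the same result.

```c
/* Hypothetical helpers for illustration; not part of libswscale. */

/* The term quoted above: ((filtersize-1)*8 + dither) >> 4. */
static int dither_term(int filter_size, int dither)
{
    return ((filter_size - 1) * 8 + dither) >> 4;
}

/* Equivalent form: the low three bits of dither never reach the result,
 * because (filter_size - 1) * 8 is a multiple of 8 and the sum of the
 * low bits (at most 7) cannot carry into bit 4. */
static int dither_term_reduced(int filter_size, int dither)
{
    return ((filter_size - 1) + (dither >> 3)) >> 1;
}

/* Returns 1 if both forms agree for every filter size in
 * [1, max_filter_size] and every dither value in [0, 127]. */
static int forms_agree(int max_filter_size)
{
    for (int n = 1; n <= max_filter_size; n++)
        for (int d = 0; d < 128; d++)
            if (dither_term(n, d) != dither_term_reduced(n, d))
                return 0;
    return 1;
}
```

For example, with the illustrative values 36 and 34 (both equal to 4
after >> 3), dither_term(2, 36) and dither_term(2, 34) are equal.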
Old benchmarks:
yuv2yuvX_8_2_0_512_approximate_c: 2309.9 ( 1.00x)
yuv2yuvX_8_2_0_512_approximate_mmxext: 250.2 ( 9.23x)
yuv2yuvX_8_2_0_512_approximate_sse3: 98.8 (23.39x)
yuv2yuvX_8_2_0_512_approximate_avx2: 52.9 (43.63x)
yuv2yuvX_8_2_16_512_approximate_c: 2263.0 ( 1.00x)
yuv2yuvX_8_2_16_512_approximate_mmxext: 245.3 ( 9.22x)
yuv2yuvX_8_2_16_512_approximate_sse3: 114.3 (19.80x)
yuv2yuvX_8_2_16_512_approximate_avx2: 85.6 (26.45x)
yuv2yuvX_8_2_32_512_approximate_c: 2155.8 ( 1.00x)
yuv2yuvX_8_2_32_512_approximate_mmxext: 235.6 ( 9.15x)
yuv2yuvX_8_2_32_512_approximate_sse3: 93.6 (23.04x)
yuv2yuvX_8_2_32_512_approximate_avx2: 78.1 (27.60x)
yuv2yuvX_8_2_48_512_approximate_c: 2084.8 ( 1.00x)
yuv2yuvX_8_2_48_512_approximate_mmxext: 230.2 ( 9.05x)
yuv2yuvX_8_2_48_512_approximate_sse3: 105.0 (19.85x)
yuv2yuvX_8_2_48_512_approximate_avx2: 71.9 (29.00x)
yuv2yuvX_8_4_0_512_approximate_c: 3496.3 ( 1.00x)
yuv2yuvX_8_4_0_512_approximate_mmxext: 455.0 ( 7.68x)
yuv2yuvX_8_4_0_512_approximate_sse3: 157.5 (22.20x)
yuv2yuvX_8_4_0_512_approximate_avx2: 88.4 (39.53x)
yuv2yuvX_8_4_16_512_approximate_c: 3380.9 ( 1.00x)
yuv2yuvX_8_4_16_512_approximate_mmxext: 440.0 ( 7.68x)
yuv2yuvX_8_4_16_512_approximate_sse3: 175.0 (19.32x)
yuv2yuvX_8_4_16_512_approximate_avx2: 134.1 (25.22x)
yuv2yuvX_8_4_32_512_approximate_c: 3277.6 ( 1.00x)
yuv2yuvX_8_4_32_512_approximate_mmxext: 427.2 ( 7.67x)
yuv2yuvX_8_4_32_512_approximate_sse3: 149.7 (21.89x)
yuv2yuvX_8_4_32_512_approximate_avx2: 115.5 (28.37x)
yuv2yuvX_8_4_48_512_approximate_c: 3167.8 ( 1.00x)
yuv2yuvX_8_4_48_512_approximate_mmxext: 414.9 ( 7.63x)
yuv2yuvX_8_4_48_512_approximate_sse3: 164.1 (19.31x)
yuv2yuvX_8_4_48_512_approximate_avx2: 101.2 (31.30x)
yuv2yuvX_8_8_0_512_approximate_c: 5987.5 ( 1.00x)
yuv2yuvX_8_8_0_512_approximate_mmxext: 854.1 ( 7.01x)
yuv2yuvX_8_8_0_512_approximate_sse3: 294.6 (20.32x)
yuv2yuvX_8_8_0_512_approximate_avx2: 144.1 (41.56x)
yuv2yuvX_8_8_16_512_approximate_c: 5848.9 ( 1.00x)
yuv2yuvX_8_8_16_512_approximate_mmxext: 834.4 ( 7.01x)
yuv2yuvX_8_8_16_512_approximate_sse3: 312.1 (18.74x)
yuv2yuvX_8_8_16_512_approximate_avx2: 214.9 (27.22x)
yuv2yuvX_8_8_32_512_approximate_c: 5610.1 ( 1.00x)
yuv2yuvX_8_8_32_512_approximate_mmxext: 811.6 ( 6.91x)
yuv2yuvX_8_8_32_512_approximate_sse3: 277.5 (20.21x)
yuv2yuvX_8_8_32_512_approximate_avx2: 189.8 (29.55x)
yuv2yuvX_8_8_48_512_approximate_c: 5415.8 ( 1.00x)
yuv2yuvX_8_8_48_512_approximate_mmxext: 782.3 ( 6.92x)
yuv2yuvX_8_8_48_512_approximate_sse3: 289.4 (18.72x)
yuv2yuvX_8_8_48_512_approximate_avx2: 165.3 (32.76x)
yuv2yuvX_8_16_0_512_approximate_c: 11100.7 ( 1.00x)
yuv2yuvX_8_16_0_512_approximate_mmxext: 1682.1 ( 6.60x)
yuv2yuvX_8_16_0_512_approximate_sse3: 558.8 (19.86x)
yuv2yuvX_8_16_0_512_approximate_avx2: 280.1 (39.63x)
yuv2yuvX_8_16_16_512_approximate_c: 10772.1 ( 1.00x)
yuv2yuvX_8_16_16_512_approximate_mmxext: 1611.0 ( 6.69x)
yuv2yuvX_8_16_16_512_approximate_sse3: 578.1 (18.63x)
yuv2yuvX_8_16_16_512_approximate_avx2: 418.8 (25.72x)
yuv2yuvX_8_16_32_512_approximate_c: 10381.5 ( 1.00x)
yuv2yuvX_8_16_32_512_approximate_mmxext: 1560.4 ( 6.65x)
yuv2yuvX_8_16_32_512_approximate_sse3: 525.8 (19.74x)
yuv2yuvX_8_16_32_512_approximate_avx2: 370.7 (28.01x)
yuv2yuvX_8_16_48_512_approximate_c: 10046.1 ( 1.00x)
yuv2yuvX_8_16_48_512_approximate_mmxext: 1512.4 ( 6.64x)
yuv2yuvX_8_16_48_512_approximate_sse3: 546.0 (18.40x)
yuv2yuvX_8_16_48_512_approximate_avx2: 315.0 (31.89x)
New benchmarks:
yuv2yuvX_8_2_0_512_approximate_c: 2302.5 ( 1.00x)
yuv2yuvX_8_2_0_512_approximate_sse2: 184.4 (12.49x)
yuv2yuvX_8_2_0_512_approximate_sse3: 100.1 (23.01x)
yuv2yuvX_8_2_0_512_approximate_avx2: 54.9 (41.98x)
yuv2yuvX_8_2_16_512_approximate_c: 2224.6 ( 1.00x)
yuv2yuvX_8_2_16_512_approximate_sse2: 180.0 (12.36x)
yuv2yuvX_8_2_16_512_approximate_sse3: 109.5 (20.31x)
yuv2yuvX_8_2_16_512_approximate_avx2: 81.3 (27.35x)
yuv2yuvX_8_2_32_512_approximate_c: 2165.3 ( 1.00x)
yuv2yuvX_8_2_32_512_approximate_sse2: 176.6 (12.26x)
yuv2yuvX_8_2_32_512_approximate_sse3: 93.7 (23.11x)
yuv2yuvX_8_2_32_512_approximate_avx2: 73.1 (29.61x)
yuv2yuvX_8_2_48_512_approximate_c: 2088.0 ( 1.00x)
yuv2yuvX_8_2_48_512_approximate_sse2: 170.7 (12.23x)
yuv2yuvX_8_2_48_512_approximate_sse3: 103.4 (20.20x)
yuv2yuvX_8_2_48_512_approximate_avx2: 69.4 (30.10x)
yuv2yuvX_8_4_0_512_approximate_c: 3496.8 ( 1.00x)
yuv2yuvX_8_4_0_512_approximate_sse2: 320.3 (10.92x)
yuv2yuvX_8_4_0_512_approximate_sse3: 158.8 (22.02x)
yuv2yuvX_8_4_0_512_approximate_avx2: 86.4 (40.49x)
yuv2yuvX_8_4_16_512_approximate_c: 3443.5 ( 1.00x)
yuv2yuvX_8_4_16_512_approximate_sse2: 325.3 (10.59x)
yuv2yuvX_8_4_16_512_approximate_sse3: 171.9 (20.03x)
yuv2yuvX_8_4_16_512_approximate_avx2: 123.6 (27.85x)
yuv2yuvX_8_4_32_512_approximate_c: 3272.2 ( 1.00x)
yuv2yuvX_8_4_32_512_approximate_sse2: 302.7 (10.81x)
yuv2yuvX_8_4_32_512_approximate_sse3: 148.9 (21.98x)
yuv2yuvX_8_4_32_512_approximate_avx2: 110.6 (29.58x)
yuv2yuvX_8_4_48_512_approximate_c: 3166.3 ( 1.00x)
yuv2yuvX_8_4_48_512_approximate_sse2: 291.0 (10.88x)
yuv2yuvX_8_4_48_512_approximate_sse3: 162.9 (19.44x)
yuv2yuvX_8_4_48_512_approximate_avx2: 102.3 (30.95x)
yuv2yuvX_8_8_0_512_approximate_c: 5967.6 ( 1.00x)
yuv2yuvX_8_8_0_512_approximate_sse2: 691.2 ( 8.63x)
yuv2yuvX_8_8_0_512_approximate_sse3: 294.2 (20.28x)
yuv2yuvX_8_8_0_512_approximate_avx2: 154.9 (38.52x)
yuv2yuvX_8_8_16_512_approximate_c: 5780.2 ( 1.00x)
yuv2yuvX_8_8_16_512_approximate_sse2: 606.2 ( 9.53x)
yuv2yuvX_8_8_16_512_approximate_sse3: 309.3 (18.69x)
yuv2yuvX_8_8_16_512_approximate_avx2: 208.7 (27.69x)
yuv2yuvX_8_8_32_512_approximate_c: 5604.3 ( 1.00x)
yuv2yuvX_8_8_32_512_approximate_sse2: 592.3 ( 9.46x)
yuv2yuvX_8_8_32_512_approximate_sse3: 281.1 (19.94x)
yuv2yuvX_8_8_32_512_approximate_avx2: 185.4 (30.23x)
yuv2yuvX_8_8_48_512_approximate_c: 5413.7 ( 1.00x)
yuv2yuvX_8_8_48_512_approximate_sse2: 570.4 ( 9.49x)
yuv2yuvX_8_8_48_512_approximate_sse3: 294.9 (18.36x)
yuv2yuvX_8_8_48_512_approximate_avx2: 166.5 (32.51x)
yuv2yuvX_8_16_0_512_approximate_c: 11099.4 ( 1.00x)
yuv2yuvX_8_16_0_512_approximate_sse2: 1213.6 ( 9.15x)
yuv2yuvX_8_16_0_512_approximate_sse3: 563.0 (19.72x)
yuv2yuvX_8_16_0_512_approximate_avx2: 294.8 (37.65x)
yuv2yuvX_8_16_16_512_approximate_c: 10718.1 ( 1.00x)
yuv2yuvX_8_16_16_512_approximate_sse2: 1121.2 ( 9.56x)
yuv2yuvX_8_16_16_512_approximate_sse3: 563.7 (19.01x)
yuv2yuvX_8_16_16_512_approximate_avx2: 389.5 (27.51x)
yuv2yuvX_8_16_32_512_approximate_c: 10373.3 ( 1.00x)
yuv2yuvX_8_16_32_512_approximate_sse2: 1096.2 ( 9.46x)
yuv2yuvX_8_16_32_512_approximate_sse3: 526.7 (19.70x)
yuv2yuvX_8_16_32_512_approximate_avx2: 354.7 (29.24x)
yuv2yuvX_8_16_48_512_approximate_c: 10066.9 ( 1.00x)
yuv2yuvX_8_16_48_512_approximate_sse2: 1055.8 ( 9.53x)
yuv2yuvX_8_16_48_512_approximate_sse3: 527.9 (19.07x)
yuv2yuvX_8_16_48_512_approximate_avx2: 313.7 (32.09x)
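
To compare the old mmxext timings with the new sse2 timings directly,
one can divide the two. The sketch below does this for the four
offset-0 rows, with the numbers copied from the listings above; the
struct and function names are hypothetical. Since the two listings
come from separate runs (the C baselines differ slightly), the ratios
are only approximate.

```c
#include <stddef.h>

/* Hypothetical container for one old/new pair from the listings above. */
struct bench_pair {
    const char *name;
    double old_mmxext; /* timing from "Old benchmarks" */
    double new_sse2;   /* timing from "New benchmarks" */
};

static const struct bench_pair offset0_pairs[] = {
    { "yuv2yuvX_8_2_0_512_approximate",   250.2,  184.4 },
    { "yuv2yuvX_8_4_0_512_approximate",   455.0,  320.3 },
    { "yuv2yuvX_8_8_0_512_approximate",   854.1,  691.2 },
    { "yuv2yuvX_8_16_0_512_approximate", 1682.1, 1213.6 },
};

/* Ratio > 1 means the new version is faster. */
static double speedup(double old_time, double new_time)
{
    return old_time / new_time;
}

/* Smallest old/new ratio over a set of pairs (requires count > 0). */
static double min_speedup(const struct bench_pair *pairs, size_t count)
{
    double min = speedup(pairs[0].old_mmxext, pairs[0].new_sse2);
    for (size_t i = 1; i < count; i++) {
        double s = speedup(pairs[i].old_mmxext, pairs[i].new_sse2);
        if (s < min)
            min = s;
    }
    return min;
}
```

Calling min_speedup(offset0_pairs, 4) gives the weakest improvement
among these four rows, which is still above 1.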
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>