mirror of
https://github.com/game-stop/veejay.git
synced 2025-12-18 13:49:58 +01:00
98 lines
3.7 KiB
Plaintext
98 lines
3.7 KiB
Plaintext
fastarm
|
|
|
|
This toolkit contains a set of fast memcpy/memset variants for ARM
|
|
platforms. They either use the standard register file, or optionally
|
|
NEON instructions,
|
|
|
|
Several basic families of variants are provided; the current ones are
|
|
the "new memcpy" variants which are the default for memcpy replacement,
|
|
which generally do not overfetch beyond the source region and can be
|
|
configured to use unaligned memory access for small sizes, or to use
|
|
strictly aligned memory access. This family can also be configured to
|
|
include a fast path for smaller sizes (this is the default), disabling
|
|
this results in smaller code size at the expense of worse performance
|
|
for small sizes. NEON optimized versions, which are generally faster
|
|
with reduced code size, are also provided.
|
|
|
|
To compile the benchmark program, run 'make'. This will compile in a
|
|
plethora of variants with different preload strategies, block sizes,
|
|
alignment etc.
|
|
|
|
A benchmark program to compare various memcpy variants is provided. Try
|
|
something like "./benchmark --memcpy ad --all". (Use --memcpy al on the
|
|
Raspberry Pi platform).
|
|
|
|
To compile a memcpy replacement library, set PLATFORM to one of the
|
|
values described at the beginning of the Makefile. This selects the
|
|
cache line size to use and whether to use NEON versions.
|
|
|
|
Optionally disable Thumb2 mode compilation by commenting out the THUMBFLAGS
|
|
definition. It must be disabled on the Raspberry Pi.
|
|
|
|
Then run:
|
|
|
|
sudo make install_memcpy_replacement
|
|
|
|
The replacement memcpy/memset shared library will be installed into
|
|
/usr/lib/arm-linux-gnueabihf/ as libfastarm.so.
|
|
|
|
To enable the use of the replacement memcpy in applications, create or edit
|
|
the file /etc/ld.so.preload so that it contains the line:
|
|
|
|
/usr/lib/arm-linux-gnueabihf/libfastarm.so
|
|
|
|
On the RPi platform, references to libcofi_rpi.so should be commented out
|
|
or deleted. The new memcpy should now be activated for newly launched
|
|
programs. To be sure, reboot or run:
|
|
|
|
sudo ldconfig
|
|
|
|
To revert to the default optimized memcpy on the RPi platform,
|
|
edit /etc/ld.so.preload so that it contains the line:
|
|
|
|
/usr/lib/arm-linux-gnueabihf/libcofi_rpi.so
|
|
|
|
instead of the one using libfastarm.so.
|
|
|
|
Note on cache line size:
|
|
|
|
Although assuming a preload line size 64 bytes is a little faster on several
|
|
Cortex platforms for small to moderate sizes, when accessing DRAM
|
|
with larger sizes assuming 32 byte preloads seems to be faster. On earlier
|
|
Cortex A9 models, 32 byte preloads are required for good performance in all
|
|
cases.
|
|
|
|
Notes on performance with and without NEON:
|
|
|
|
For NEON-based memcpy, a significant benefit is seen on the tested Cortex A8
|
|
platform for unaligned copies in cache memory and for aligned and unaligned
|
|
copies in DRAM. Performance for aligned copies in cache memory is relatively
|
|
similar to the optimized non-NEON function.
|
|
|
|
Results in MB/s on a Cortex A8, with Thumb2 mode enabled, of
|
|
standard libc (Debian unstable), armv7 and NEON optimized memcpy
|
|
variants with line size of 32 bytes:
|
|
|
|
libc armv7 NEON
|
|
test 0 522 549 567
|
|
test 1 329 377 378
|
|
test 2 434 430 513
|
|
test 28 351 361 458
|
|
test 29 246 248 358
|
|
test 43 467 512 581
|
|
|
|
Test 0 in the benchmark program tests word-aligned requests with
|
|
sizes that are a power of 2 up to 4096 bytes distributed according
|
|
to a power law.
|
|
Test 1 in the benchmark program tests word-aligned requests with
|
|
sizes up to 1024 that are a multiple of 4, distributed according
|
|
to a power law.
|
|
Test 2 in the benchmark program tests unaligned requests with sizes
|
|
up to 1023 bytes.
|
|
Test 28 in the benchmark program tests word aligned requests in DRAM
|
|
with sizes up to 1024 bytes.
|
|
Test 29 in the benchmark program tests word aligned requests in DRAM
|
|
with sizes up to 256 bytes.
|
|
Test 43 in the benchmark program tests page aligned requests in DRAM
|
|
of size 4096 (copying a memory page).
|