fastarm

This toolkit contains a set of fast memcpy/memset variants for ARM
platforms. They use either the standard register file or, optionally,
NEON instructions.

Several basic families of variants are provided. The current ones are the
"new memcpy" variants, which are the default for memcpy replacement. They
generally do not overfetch beyond the source region and can be configured
either to use unaligned memory access for small sizes or to use strictly
aligned memory access. This family can also be configured to include a
fast path for smaller sizes (this is the default); disabling it results
in smaller code size at the expense of worse performance for small sizes.
NEON-optimized versions, which are generally faster with reduced code
size, are also provided.

A benchmark program to compare the various memcpy variants is provided.
To compile it, run 'make'. This compiles in a plethora of variants with
different preload strategies, block sizes, alignment, etc. Try something
like:

    ./benchmark --memcpy ad --all

(Use --memcpy al on the Raspberry Pi platform.)

To compile a memcpy replacement library, set PLATFORM to one of the
values described at the beginning of the Makefile. This selects the cache
line size to use and whether to use NEON versions. Optionally, disable
Thumb2 mode compilation by commenting out the THUMBFLAGS definition;
Thumb2 mode must be disabled on the Raspberry Pi. Then run:

    sudo make install_memcpy_replacement

The replacement memcpy/memset shared library will be installed into
/usr/lib/arm-linux-gnueabihf/ as libfastarm.so.

To enable the use of the replacement memcpy in applications, create or
edit the file /etc/ld.so.preload so that it contains the line:

    /usr/lib/arm-linux-gnueabihf/libfastarm.so

On the RPi platform, references to libcofi_rpi.so should be commented out
or deleted. The new memcpy should now be activated for newly launched
programs.
To be sure, reboot or run:

    sudo ldconfig

To revert to the default optimized memcpy on the RPi platform, edit
/etc/ld.so.preload so that it contains the line:

    /usr/lib/arm-linux-gnueabihf/libcofi_rpi.so

instead of the one using libfastarm.so.

Note on cache line size:

Although assuming a preload line size of 64 bytes is a little faster on
several Cortex platforms for small to moderate sizes, assuming 32-byte
preloads seems to be faster when accessing DRAM with larger sizes. On
earlier Cortex A9 models, 32-byte preloads are required for good
performance in all cases.

Notes on performance with and without NEON:

For NEON-based memcpy, a significant benefit is seen on the tested Cortex
A8 platform for unaligned copies in cache memory and for both aligned and
unaligned copies in DRAM. Performance for aligned copies in cache memory
is similar to that of the optimized non-NEON function.

Results in MB/s on a Cortex A8, with Thumb2 mode enabled, of the standard
libc (Debian unstable), armv7 and NEON-optimized memcpy variants with a
line size of 32 bytes:

            libc    armv7   NEON
test 0      522     549     567
test 1      329     377     378
test 2      434     430     513
test 28     351     361     458
test 29     246     248     358
test 43     467     512     581

Test 0 in the benchmark program tests word-aligned requests with sizes
that are a power of 2 up to 4096 bytes, distributed according to a power
law.

Test 1 in the benchmark program tests word-aligned requests with sizes up
to 1024 bytes that are a multiple of 4, distributed according to a power
law.

Test 2 in the benchmark program tests unaligned requests with sizes up to
1023 bytes.

Test 28 in the benchmark program tests word-aligned requests in DRAM with
sizes up to 1024 bytes.

Test 29 in the benchmark program tests word-aligned requests in DRAM with
sizes up to 256 bytes.

Test 43 in the benchmark program tests page-aligned requests in DRAM of
size 4096 bytes (copying a memory page).
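
As a rough illustration of how an MB/s figure like those above can be
obtained, the following is a minimal sketch of a throughput measurement
for whichever memcpy the process is linked against (including one
activated through /etc/ld.so.preload). It is not the code of the
benchmark program. The function name copy_throughput_mbs is
hypothetical, and the sketch assumes a POSIX system with clock_gettime,
a GCC-compatible compiler, and 1 MB = 1,000,000 bytes; the benchmark
program may define these differently.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper (not part of fastarm): copy `size` bytes between two
 * malloc'ed (hence at least word-aligned) buffers `iterations` times and
 * return the throughput in MB/s, assuming 1 MB = 1,000,000 bytes.
 * Returns -1.0 on invalid arguments, allocation failure, a copy mismatch,
 * or an elapsed time too short to measure.
 */
double copy_throughput_mbs(size_t size, size_t iterations)
{
    if (size == 0 || iterations == 0)
        return -1.0;

    unsigned char *src = malloc(size);
    unsigned char *dst = malloc(size);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0xA5, size);
    memset(dst, 0, size);

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (size_t i = 0; i < iterations; i++) {
        memcpy(dst, src, size);
        /* Compiler barrier (GCC/Clang syntax) so the copy is not elided. */
        __asm__ volatile("" : : "r"(dst) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double seconds = (double)(end.tv_sec - start.tv_sec)
                   + (double)(end.tv_nsec - start.tv_nsec) / 1e9;

    /* Sanity check that the destination really holds the source data. */
    int ok = (memcmp(src, dst, size) == 0);
    free(src);
    free(dst);

    if (!ok || seconds <= 0.0)
        return -1.0;
    return ((double)size * (double)iterations) / (seconds * 1e6);
}
```

Calling copy_throughput_mbs(4096, 100000), for example, times repeated
4096-byte copies. Buffers this small stay in cache, so the result is only
loosely comparable to the in-cache tests, and not to the DRAM tests (28,
29 and 43), which need a working set larger than the cache.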