feat(crc): optimize crc32_iscsi with SVE2 fusion and improved PMULL path #410

Open: zwtao40 wants to merge 1 commit into intel:master from zwtao40:dev_crc_aarch64_optimize

zwtao40 commented Apr 29, 2026

This patch optimizes the CRC32 iSCSI implementation for ARM aarch64 platforms by adding two new accelerated code paths.

  1. Optimized PMULL Path (crc32_iscsi_x6)

For platforms with PMULL support (CRC32+PMULL but no SVE2):

  • Extended parallel folding from 3-way to 6-way computation
  • Improved loop structure and register utilization
  • Better throughput on PMULL-capable CPUs
  2. SVE2 Fusion Path (crc32_iscsi_fusion_p8_c10_asm)

For platforms with SVE2 support:

  • Hybrid scalar-vector parallel computation approach
  • Scalar path: 10-way parallel using crc32cx instructions
  • Vector path: 8-way parallel using pmullt/pmullb instructions
  • Results merged and folded using SVE2 eor3 instruction
  • Optimal for CPUs with SVE2 support
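
Both paths rely on splitting the buffer into independent streams whose partial CRCs are merged afterwards. The following is a minimal sketch of that idea in portable C, reduced to two streams for brevity. It is not the code in this patch: the function names are hypothetical, the CRC is computed bit by bit rather than with `crc32cx`, and the merge step advances the first partial CRC over zero bytes where an accelerated path would instead multiply by a precomputed constant using carry-less multiplication.

```c
#include <stddef.h>
#include <stdint.h>

/* Raw CRC-32C update (reflected polynomial 0x82F63B78).
 * No initial or final inversion is applied here; the caller picks the seed. */
static uint32_t crc32c_raw(uint32_t crc, const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Advance a CRC state as if `len` zero bytes had been appended.
 * Assumption: an accelerated path would replace this loop with one
 * carry-less multiply by a constant precomputed for the stream length. */
static uint32_t crc32c_shift(uint32_t crc, size_t len)
{
    static const uint8_t zero = 0;
    while (len--)
        crc = crc32c_raw(crc, &zero, 1);
    return crc;
}

/* Hypothetical two-way split: both halves can be processed independently,
 * then merged using the linearity of the CRC over GF(2). */
static uint32_t crc32c_two_way(uint32_t seed, const uint8_t *buf, size_t len)
{
    size_t half = len / 2;
    uint32_t a = crc32c_raw(seed, buf, half);           /* stream 0, seeded   */
    uint32_t b = crc32c_raw(0, buf + half, len - half); /* stream 1, seed = 0 */
    return crc32c_shift(a, len - half) ^ b;             /* merge              */
}
```

For any seed and buffer, `crc32c_two_way` returns the same value as a single `crc32c_raw` pass, which makes a sketch like this usable as a reference when checking a wider split. Seeding with `0xFFFFFFFF` and inverting the result gives the standard CRC-32C value; the seed convention of the library function may differ.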

Performance Results:

Kunpeng-920 (SVE, no SVE2):
+------------+------------+------------+---------+
| Data Size  | Before     | After      | Speedup |
+------------+------------+------------+---------+
| 4KB        | 33 GB/s    | 37 GB/s    | 1.12x   |
| 8KB        | 33 GB/s    | 38 GB/s    | 1.15x   |
+------------+------------+------------+---------+

Kunpeng-950 (SVE2):
+------------+------------+------------+---------+
| Data Size  | Before     | After      | Speedup |
+------------+------------+------------+---------+
| 4KB        | 24 GB/s    | 45 GB/s    | 1.88x   |
| 8KB        | 23 GB/s    | 57 GB/s    | 2.48x   |
+------------+------------+------------+---------+

Measured using crc32_iscsi_perf tool on Huawei Kunpeng platforms.

Files modified:

  • crc/aarch64/crc_aarch64_dispatcher.c
  • crc/aarch64/crc32_iscsi_p8c10.S (new)
  • crc/aarch64/crc32_iscsi_p8c10_clmul_const.S (new)
  • crc/aarch64/crc32_iscsi_x6.S (new)
  • include/aarch64_multibinary.h

zwtao40 force-pushed the dev_crc_aarch64_optimize branch 6 times, most recently from d1da083 to ab23426, on May 6, 2026 00:53
zwtao40 force-pushed the dev_crc_aarch64_optimize branch 2 times, most recently from 5e013ee to 00f5151, on May 7, 2026 01:13

Optimize CRC32 iSCSI implementation for ARM platforms with two new
accelerated code paths:

1. SVE2 Fusion Path (crc32_iscsi_fusion_p8_c10_asm):
   - Leverages SVE2 EOR3 instruction for parallel CRC computation
   - Processes 8 data lanes simultaneously with carry-less multiplication
   - Optimal for CPUs with SVE2 support (Kunpeng-950, etc.)

2. Optimized PMULL Path (crc32_iscsi_x6):
   - Replaces previous crc32_iscsi_3crc_fold implementation
   - Improved loop structure and register utilization
   - Better performance on CPUs with CRC32+PMULL but no SVE2

Dispatcher Logic:
- Priority: SVE2 > PMULL(x6) > PMULL(base) > NEON
- SVE2 path selected when HWCAP_CRC32 and HWCAP2_SVE2 are both present
- PMULL path uses new x6 implementation for better throughput
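
The priority order above can be sketched as a pure selection function. This is an illustration under assumptions, not the dispatcher code from this patch: the enum, the function name, and the `SKETCH_` constants are hypothetical (the bit positions mirror the Linux aarch64 `HWCAP_PMULL`, `HWCAP_CRC32`, and `HWCAP2_SVE2` definitions), and the lower tiers (PMULL base and NEON) are collapsed into a single fallback because the conditions separating them are not stated here.

```c
/* Assumed bit positions, mirroring the Linux aarch64 hwcap definitions. */
#define SKETCH_HWCAP_PMULL (1UL << 4)
#define SKETCH_HWCAP_CRC32 (1UL << 7)
#define SKETCH_HWCAP2_SVE2 (1UL << 1)

/* Hypothetical identifiers for the code paths named in this commit. */
enum crc_impl {
    IMPL_FALLBACK,    /* PMULL(base) / NEON tiers, not distinguished here */
    IMPL_PMULL_X6,    /* crc32_iscsi_x6 */
    IMPL_SVE2_FUSION  /* crc32_iscsi_fusion_p8_c10_asm */
};

/* Pick the highest-priority path the reported capabilities allow. */
static enum crc_impl select_crc32_iscsi(unsigned long hwcap, unsigned long hwcap2)
{
    /* SVE2 path: requires both the CRC32 extension and SVE2. */
    if ((hwcap & SKETCH_HWCAP_CRC32) && (hwcap2 & SKETCH_HWCAP2_SVE2))
        return IMPL_SVE2_FUSION;

    /* x6 path: CRC32 + PMULL, reached only when SVE2 is absent. */
    if ((hwcap & SKETCH_HWCAP_CRC32) && (hwcap & SKETCH_HWCAP_PMULL))
        return IMPL_PMULL_X6;

    return IMPL_FALLBACK;
}
```

On Linux the two arguments would typically come from `getauxval(AT_HWCAP)` and `getauxval(AT_HWCAP2)` in `<sys/auxv.h>`. Keeping the selection separate from the capability query makes the priority order testable on any host.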

Performance Results:

Kunpeng-920 (SVE, no SVE2):
+------------+------------+------------+--------+
| Data Size  | Before     | After      | Speedup|
+------------+------------+------------+--------+
| 4KB        | 33 GB/s    | 37 GB/s    | 1.12x  |
| 8KB        | 33 GB/s    | 38 GB/s    | 1.15x  |
+------------+------------+------------+--------+

Kunpeng-950 (SVE2):
+------------+------------+------------+--------+
| Data Size  | Before     | After      | Speedup|
+------------+------------+------------+--------+
| 4KB        | 24 GB/s    | 45 GB/s    | 1.88x  |
| 8KB        | 23 GB/s    | 57 GB/s    | 2.48x  |
+------------+------------+------------+--------+

Measured using crc32_iscsi_perf tool on Huawei Kunpeng platforms.

Files modified:
- crc/aarch64/Makefile.am
- crc/aarch64/crc_aarch64_dispatcher.c
- crc/aarch64/crc32_iscsi_p8c10.S (new, with macOS compatibility)
- crc/aarch64/crc32_iscsi_p8c10_clmul_const.S (new, with macOS compatibility)
- crc/aarch64/crc32_iscsi_x6.S (new, with macOS compatibility)

Signed-off-by: Chenxuqiang <chenxuqiang3@hisilicon.com>
Signed-off-by: Enigmo <guotaowei4@huawei.com>
zwtao40 force-pushed the dev_crc_aarch64_optimize branch from 00f5151 to ea6a414 on May 14, 2026 06:34