Makefile + hash.h: remove PPC_SHA1 implementation · freshlybakedca.ke/git@9dc523a

Makefile + hash.h: remove PPC_SHA1 implementation

Remove the PPC_SHA1 implementation added in a6ef3518f9a ([PATCH] PPC
assembly implementation of SHA1, 2005-04-22). When this was added
Apple consumer hardware used the PPC architecture, and the
implementation was intended to improve SHA-1 speed there.

Since it was added we've moved to using sha1collisiondetection by
default, and anyone wanting a hand-rolled non-DC SHA-1 implementation
can use OpenSSL's via the OPENSSL_SHA1 knob.

The PPC_SHA1 implementation originally targeted 32-bit PPC, and later
the 64-bit PPC 970 (a.k.a. Apple PowerPC G5). See 926172c5e48
(block-sha1: improve code on large-register-set machines, 2009-08-10)
for a reference about the performance on G5 (a comment in
block-sha1/sha1.c being removed here).

I can't get it to do anything but segfault on both the BE and LE POWER
machines in the GCC compile farm[1]. Anyone who's concerned about
performance on PPC these days is likely to be using the IBM POWER
processors.

There have been proposals to entirely remove non-sha1collisiondetection
implementations from the tree[2]. I think per [3] that would be a bit
overzealous. I.e. there are various set-ups where git's speed is going
to be more important than the relatively implausible SHA-1 collision
attack, or where such attacks are entirely mitigated by other means
(e.g. by incoming objects being checked with DC_SHA1).

But that really doesn't apply to PPC_SHA1 in particular, which seems
to have outlived its usefulness.

As this gets rid of the only in-tree *.S assembly file, we can remove
the small bits of logic from the Makefile needed to build objects
from *.S (as opposed to *.c).

The code being removed here was also throwing warnings with the
"-pedantic" flag. It could have been fixed as 544d93bc3b4 (block-sha1:
remove use of obsolete x86 assembly, 2022-03-10) did for block-sha1/*,
but as noted above let's remove it instead.

1. https://cfarm.tetaneutral.net/machines/list/
Tested on gcc{110,112,135,203}, a mixture of POWER [789] ppc64 and
ppc64le. All segfault in anything needing object
hashing (e.g. t/t1007-hash-object.sh) when compiled with
PPC_SHA1=Y.
2. https://lore.kernel.org/git/20200223223758.120941-1-mh@glandium.org/
3. https://lore.kernel.org/git/20200224044732.GK1018190@coredump.intra.peff.net/

Acked-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>

Authored by Ævar Arnfjörð Bjarmason and committed by Junio C Hamano 3 years ago
9dc523aa d42b38df

+8 -347

8 changed files

INSTALL
Makefile
block-sha1/sha1.c
configure.ac
hash.h
ppc/sha1.c
ppc/sha1.h
ppc/sha1ppc.S

+1 -2

INSTALL

@@ -135,8 +135,7 @@
 
   By default, git uses OpenSSL for SHA1 but it will use its own
   library (inspired by Mozilla's) with either NO_OPENSSL or
-  BLK_SHA1. Also included is a version optimized for PowerPC
-  (PPC_SHA1).
+  BLK_SHA1.
 
 - "libcurl" library is used for fetching and pushing
   repositories over http:// or https://, as well as by

+5 -13

Makefile

@@ -155,9 +155,6 @@
 # Define BLK_SHA1 environment variable to make use of the bundled
 # optimized C SHA1 routine.
 #
-# Define PPC_SHA1 environment variable when running make to make use of
-# a bundled SHA1 routine optimized for PowerPC.
-#
 # Define DC_SHA1 to unconditionally enable the collision-detecting sha1
 # algorithm. This is slower, but may detect attempted collision attacks.
 # Takes priority over other *_SHA1 knobs.
@@ -1802,6 +1799,10 @@
 	SHA1_MAX_BLOCK_SIZE = 1024L*1024L*1024L
 endif
 
+ifdef PPC_SHA1
+$(error the PPC_SHA1 flag has been removed along with the PowerPC-specific SHA-1 implementation.)
+endif
+
 ifdef OPENSSL_SHA1
 	EXTLIBS += $(LIB_4_CRYPTO)
 	BASIC_CFLAGS += -DSHA1_OPENSSL
@@ -1809,10 +1810,6 @@
 ifdef BLK_SHA1
 	LIB_OBJS += block-sha1/sha1.o
 	BASIC_CFLAGS += -DSHA1_BLK
-else
-ifdef PPC_SHA1
-	LIB_OBJS += ppc/sha1.o ppc/sha1ppc.o
-	BASIC_CFLAGS += -DSHA1_PPC
 else
 ifdef APPLE_COMMON_CRYPTO
 	COMPAT_CFLAGS += -DCOMMON_DIGEST_FOR_OPENSSL
@@ -1843,7 +1840,6 @@
 	-DSHA1DC_INIT_SAFE_HASH_DEFAULT=0 \
 	-DSHA1DC_CUSTOM_INCLUDE_SHA1_C="\"cache.h\"" \
 	-DSHA1DC_CUSTOM_INCLUDE_UBC_CHECK_C="\"git-compat-util.h\""
-endif
 endif
 endif
 endif
@@ -2594,13 +2590,9 @@
 compdb_args =
 endif
 
-ASM_SRC := $(wildcard $(OBJECTS:o=S))
-ASM_OBJ := $(ASM_SRC:S=o)
-C_OBJ := $(filter-out $(ASM_OBJ),$(OBJECTS))
+C_OBJ := $(OBJECTS)
 
 $(C_OBJ): %.o: %.c GIT-CFLAGS $(missing_dep_dirs) $(missing_compdb_dir)
-	$(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(compdb_args) $(ALL_CFLAGS) $(EXTRA_CPPFLAGS) $<
-$(ASM_OBJ): %.o: %.S GIT-CFLAGS $(missing_dep_dirs) $(missing_compdb_dir)
 	$(QUIET_CC)$(CC) -o $*.o -c $(dep_args) $(compdb_args) $(ALL_CFLAGS) $(EXTRA_CPPFLAGS) $<
 
 %.s: %.c GIT-CFLAGS FORCE

-4

block-sha1/sha1.c

@@ -28,10 +28,6 @@
  * try to do the silly "optimize away loads" part because it won't
  * see what the value will be).
  *
- * Ben Herrenschmidt reports that on PPC, the C version comes close
- * to the optimized asm with this (ie on PPC you don't want that
- * 'volatile', since there are lots of registers).
- *
  * On ARM we get the best code generation by forcing a full memory barrier
  * between each SHA_ROUND, otherwise gcc happily get wild with spilling and
  * the stack frame size simply explode and performance goes down the drain.

-3

configure.ac

@@ -237,9 +237,6 @@
 # tests. These tests take up a significant amount of the total test time
 # but are not needed unless you plan to talk to SVN repos.
 #
-# Define PPC_SHA1 environment variable when running make to make use of
-# a bundled SHA1 routine optimized for PowerPC.
-#
 # Define NO_OPENSSL environment variable if you do not have OpenSSL.
 #
 # Define OPENSSLDIR=/foo/bar if your openssl header and library files are in

+2 -4

hash.h

@@ -4,9 +4,7 @@
 #include "git-compat-util.h"
 #include "repository.h"
 
-#if defined(SHA1_PPC)
-#include "ppc/sha1.h"
-#elif defined(SHA1_APPLE)
+#if defined(SHA1_APPLE)
 #include <CommonCrypto/CommonDigest.h>
 #elif defined(SHA1_OPENSSL)
 #include <openssl/sha.h>
@@ -32,7 +30,7 @@
  * platform's underlying implementation of SHA-1; could be OpenSSL,
  * blk_SHA, Apple CommonCrypto, etc... Note that the relevant
  * SHA-1 header may have already defined platform_SHA_CTX for our
- * own implementations like block-sha1 and ppc-sha1, so we list
+ * own implementations like block-sha1, so we list
  * the default for OpenSSL compatible SHA-1 implementations here.
  */
 #define platform_SHA_CTX SHA_CTX
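The comment in the second hash.h hunk describes a layering in which a bundled backend header may define the platform_SHA* macros first, and the OpenSSL-compatible names are only a default. The following is a minimal sketch of that pattern, not git's actual code: the toy_* backend and count_bytes_via_platform_api() are hypothetical names invented for illustration, and the #ifndef guard around the fallback is an assumption about how such a default is usually protected.

```c
#include <assert.h>
#include <stddef.h>

/* What a bundled backend header might provide (toy_* names are hypothetical). */
typedef struct { size_t bytes_seen; } toy_SHA_CTX;

static int toy_SHA1_Init(toy_SHA_CTX *c)
{
	c->bytes_seen = 0;
	return 0;
}

static int toy_SHA1_Update(toy_SHA_CTX *c, const void *p, size_t n)
{
	(void)p;		/* a real backend would hash these bytes */
	c->bytes_seen += n;
	return 0;
}

/* The backend claims the platform_* names, as the removed ppc/sha1.h did. */
#define platform_SHA_CTX	toy_SHA_CTX
#define platform_SHA1_Init	toy_SHA1_Init
#define platform_SHA1_Update	toy_SHA1_Update

/* Generic layer: only fall back to OpenSSL-style names if nothing claimed them. */
#ifndef platform_SHA_CTX
#define platform_SHA_CTX	SHA_CTX
#define platform_SHA1_Init	SHA1_Init
#define platform_SHA1_Update	SHA1_Update
#endif

/* Callers only ever spell the platform_* names. */
static size_t count_bytes_via_platform_api(const void *buf, size_t len)
{
	platform_SHA_CTX ctx;

	assert(buf != NULL || len == 0);
	platform_SHA1_Init(&ctx);
	platform_SHA1_Update(&ctx, buf, len);
	return ctx.bytes_seen;
}
```

With this shape, dropping a backend means deleting its header and its branch in the include chain, while callers of the platform_* names stay untouched. That matches how small the hash.h change in this commit is.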

-72

ppc/sha1.c

@@ -1,72 +0,0 @@
-/*
- * SHA-1 implementation.
- *
- * Copyright (C) 2005 Paul Mackerras <paulus@samba.org>
- *
- * This version assumes we are running on a big-endian machine.
- * It calls an external sha1_core() to process blocks of 64 bytes.
- */
-#include <stdio.h>
-#include <string.h>
-#include "sha1.h"
-
-void ppc_sha1_core(uint32_t *hash, const unsigned char *p,
-		   unsigned int nblocks);
-
-int ppc_SHA1_Init(ppc_SHA_CTX *c)
-{
-	c->hash[0] = 0x67452301;
-	c->hash[1] = 0xEFCDAB89;
-	c->hash[2] = 0x98BADCFE;
-	c->hash[3] = 0x10325476;
-	c->hash[4] = 0xC3D2E1F0;
-	c->len = 0;
-	c->cnt = 0;
-	return 0;
-}
-
-int ppc_SHA1_Update(ppc_SHA_CTX *c, const void *ptr, unsigned long n)
-{
-	unsigned long nb;
-	const unsigned char *p = ptr;
-
-	c->len += (uint64_t) n << 3;
-	while (n != 0) {
-		if (c->cnt || n < 64) {
-			nb = 64 - c->cnt;
-			if (nb > n)
-				nb = n;
-			memcpy(&c->buf.b[c->cnt], p, nb);
-			if ((c->cnt += nb) == 64) {
-				ppc_sha1_core(c->hash, c->buf.b, 1);
-				c->cnt = 0;
-			}
-		} else {
-			nb = n >> 6;
-			ppc_sha1_core(c->hash, p, nb);
-			nb <<= 6;
-		}
-		n -= nb;
-		p += nb;
-	}
-	return 0;
-}
-
-int ppc_SHA1_Final(unsigned char *hash, ppc_SHA_CTX *c)
-{
-	unsigned int cnt = c->cnt;
-
-	c->buf.b[cnt++] = 0x80;
-	if (cnt > 56) {
-		if (cnt < 64)
-			memset(&c->buf.b[cnt], 0, 64 - cnt);
-		ppc_sha1_core(c->hash, c->buf.b, 1);
-		cnt = 0;
-	}
-	if (cnt < 56)
-		memset(&c->buf.b[cnt], 0, 56 - cnt);
-	c->buf.l[7] = c->len;
-	ppc_sha1_core(c->hash, c->buf.b, 1);
-	memcpy(hash, c->hash, 20);
-	return 0;
-}

-25

ppc/sha1.h

@@ -1,25 +0,0 @@
-/*
- * SHA-1 implementation.
- *
- * Copyright (C) 2005 Paul Mackerras <paulus@samba.org>
- */
-#include <stdint.h>
-
-typedef struct {
-	uint32_t hash[5];
-	uint32_t cnt;
-	uint64_t len;
-	union {
-		unsigned char b[64];
-		uint64_t l[8];
-	} buf;
-} ppc_SHA_CTX;
-
-int ppc_SHA1_Init(ppc_SHA_CTX *c);
-int ppc_SHA1_Update(ppc_SHA_CTX *c, const void *p, unsigned long n);
-int ppc_SHA1_Final(unsigned char *hash, ppc_SHA_CTX *c);
-
-#define platform_SHA_CTX ppc_SHA_CTX
-#define platform_SHA1_Init ppc_SHA1_Init
-#define platform_SHA1_Update ppc_SHA1_Update
-#define platform_SHA1_Final ppc_SHA1_Final

-224

ppc/sha1ppc.S

@@ -1,224 +0,0 @@
-/*
- * SHA-1 implementation for PowerPC.
- *
- * Copyright (C) 2005 Paul Mackerras <paulus@samba.org>
- */
-
-/*
- * PowerPC calling convention:
- * %r0 - volatile temp
- * %r1 - stack pointer.
- * %r2 - reserved
- * %r3-%r12 - Incoming arguments & return values; volatile.
- * %r13-%r31 - Callee-save registers
- * %lr - Return address, volatile
- * %ctr - volatile
- *
- * Register usage in this routine:
- * %r0 - temp
- * %r3 - argument (pointer to 5 words of SHA state)
- * %r4 - argument (pointer to data to hash)
- * %r5 - Constant K in SHA round (initially number of blocks to hash)
- * %r6-%r10 - Working copies of SHA variables A..E (actually E..A order)
- * %r11-%r26 - Data being hashed W[].
- * %r27-%r31 - Previous copies of A..E, for final add back.
- * %ctr - loop count
- */
-
-
-/*
- * We roll the registers for A, B, C, D, E around on each
- * iteration; E on iteration t is D on iteration t+1, and so on.
- * We use registers 6 - 10 for this. (Registers 27 - 31 hold
- * the previous values.)
- */
-#define RA(t) (((t)+4)%5+6)
-#define RB(t) (((t)+3)%5+6)
-#define RC(t) (((t)+2)%5+6)
-#define RD(t) (((t)+1)%5+6)
-#define RE(t) (((t)+0)%5+6)
-
-/* We use registers 11 - 26 for the W values */
-#define W(t) ((t)%16+11)
-
-/* Register 5 is used for the constant k */
-
-/*
- * The basic SHA-1 round function is:
- * E += ROTL(A,5) + F(B,C,D) + W[i] + K; B = ROTL(B,30)
- * Then the variables are renamed: (A,B,C,D,E) = (E,A,B,C,D).
- *
- * Every 20 rounds, the function F() and the constant K changes:
- * - 20 rounds of f0(b,c,d) = "bit wise b ? c : d" = (^b & d) + (b & c)
- * - 20 rounds of f1(b,c,d) = b^c^d = (b^d)^c
- * - 20 rounds of f2(b,c,d) = majority(b,c,d) = (b&d) + ((b^d)&c)
- * - 20 more rounds of f1(b,c,d)
- *
- * These are all scheduled for near-optimal performance on a G4.
- * The G4 is a 3-issue out-of-order machine with 3 ALUs, but it can only
- * *consider* starting the oldest 3 instructions per cycle. So to get
- * maximum performance out of it, you have to treat it as an in-order
- * machine. Which means interleaving the computation round t with the
- * computation of W[t+4].
- *
- * The first 16 rounds use W values loaded directly from memory, while the
- * remaining 64 use values computed from those first 16. We preload
- * 4 values before starting, so there are three kinds of rounds:
- * - The first 12 (all f0) also load the W values from memory.
- * - The next 64 compute W(i+4) in parallel. 8*f0, 20*f1, 20*f2, 16*f1.
- * - The last 4 (all f1) do not do anything with W.
- *
- * Therefore, we have 6 different round functions:
- * STEPD0_LOAD(t,s) - Perform round t and load W(s). s < 16
- * STEPD0_UPDATE(t,s) - Perform round t and compute W(s). s >= 16.
- * STEPD1_UPDATE(t,s)
- * STEPD2_UPDATE(t,s)
- * STEPD1(t) - Perform round t with no load or update.
- *
- * The G5 is more fully out-of-order, and can find the parallelism
- * by itself. The big limit is that it has a 2-cycle ALU latency, so
- * even though it's 2-way, the code has to be scheduled as if it's
- * 4-way, which can be a limit. To help it, we try to schedule the
- * read of RA(t) as late as possible so it doesn't stall waiting for
- * the previous round's RE(t-1), and we try to rotate RB(t) as early
- * as possible while reading RC(t) (= RB(t-1)) as late as possible.
- */
-
-/* the initial loads. */
-#define LOADW(s) \
-	lwz W(s),(s)*4(%r4)
-
-/*
- * Perform a step with F0, and load W(s). Uses W(s) as a temporary
- * before loading it.
- * This is actually 10 instructions, which is an awkward fit.
- * It can execute grouped as listed, or delayed one instruction.
- * (If delayed two instructions, there is a stall before the start of the
- * second line.) Thus, two iterations take 7 cycles, 3.5 cycles per round.
- */
-#define STEPD0_LOAD(t,s) \
-add RE(t),RE(t),W(t); andc %r0,RD(t),RB(t); and W(s),RC(t),RB(t); \
-add RE(t),RE(t),%r0; rotlwi %r0,RA(t),5; rotlwi RB(t),RB(t),30; \
-add RE(t),RE(t),W(s); add %r0,%r0,%r5; lwz W(s),(s)*4(%r4); \
-add RE(t),RE(t),%r0
-
-/*
- * This is likewise awkward, 13 instructions. However, it can also
- * execute starting with 2 out of 3 possible moduli, so it does 2 rounds
- * in 9 cycles, 4.5 cycles/round.
- */
-#define STEPD0_UPDATE(t,s,loadk...) \
-add RE(t),RE(t),W(t); andc %r0,RD(t),RB(t); xor W(s),W((s)-16),W((s)-3); \
-add RE(t),RE(t),%r0; and %r0,RC(t),RB(t); xor W(s),W(s),W((s)-8); \
-add RE(t),RE(t),%r0; rotlwi %r0,RA(t),5; xor W(s),W(s),W((s)-14); \
-add RE(t),RE(t),%r5; loadk; rotlwi RB(t),RB(t),30; rotlwi W(s),W(s),1; \
-add RE(t),RE(t),%r0
-
-/* Nicely optimal. Conveniently, also the most common. */
-#define STEPD1_UPDATE(t,s,loadk...) \
-add RE(t),RE(t),W(t); xor %r0,RD(t),RB(t); xor W(s),W((s)-16),W((s)-3); \
-add RE(t),RE(t),%r5; loadk; xor %r0,%r0,RC(t); xor W(s),W(s),W((s)-8); \
-add RE(t),RE(t),%r0; rotlwi %r0,RA(t),5; xor W(s),W(s),W((s)-14); \
-add RE(t),RE(t),%r0; rotlwi RB(t),RB(t),30; rotlwi W(s),W(s),1
-
-/*
- * The naked version, no UPDATE, for the last 4 rounds. 3 cycles per.
- * We could use W(s) as a temp register, but we don't need it.
- */
-#define STEPD1(t) \
-add RE(t),RE(t),W(t); xor %r0,RD(t),RB(t); \
-rotlwi RB(t),RB(t),30; add RE(t),RE(t),%r5; xor %r0,%r0,RC(t); \
-add RE(t),RE(t),%r0; rotlwi %r0,RA(t),5; /* spare slot */ \
-add RE(t),RE(t),%r0
-
-/*
- * 14 instructions, 5 cycles per. The majority function is a bit
- * awkward to compute. This can execute with a 1-instruction delay,
- * but it causes a 2-instruction delay, which triggers a stall.
- */
-#define STEPD2_UPDATE(t,s,loadk...) \
-add RE(t),RE(t),W(t); and %r0,RD(t),RB(t); xor W(s),W((s)-16),W((s)-3); \
-add RE(t),RE(t),%r0; xor %r0,RD(t),RB(t); xor W(s),W(s),W((s)-8); \
-add RE(t),RE(t),%r5; loadk; and %r0,%r0,RC(t); xor W(s),W(s),W((s)-14); \
-add RE(t),RE(t),%r0; rotlwi %r0,RA(t),5; rotlwi W(s),W(s),1; \
-add RE(t),RE(t),%r0; rotlwi RB(t),RB(t),30
-
-#define STEP0_LOAD4(t,s) \
-	STEPD0_LOAD(t,s); \
-	STEPD0_LOAD((t+1),(s)+1); \
-	STEPD0_LOAD((t)+2,(s)+2); \
-	STEPD0_LOAD((t)+3,(s)+3)
-
-#define STEPUP4(fn, t, s, loadk...) \
-	STEP##fn##_UPDATE(t,s,); \
-	STEP##fn##_UPDATE((t)+1,(s)+1,); \
-	STEP##fn##_UPDATE((t)+2,(s)+2,); \
-	STEP##fn##_UPDATE((t)+3,(s)+3,loadk)
-
-#define STEPUP20(fn, t, s, loadk...) \
-	STEPUP4(fn, t, s,); \
-	STEPUP4(fn, (t)+4, (s)+4,); \
-	STEPUP4(fn, (t)+8, (s)+8,); \
-	STEPUP4(fn, (t)+12, (s)+12,); \
-	STEPUP4(fn, (t)+16, (s)+16, loadk)
-
-	.globl ppc_sha1_core
-ppc_sha1_core:
-	stwu %r1,-80(%r1)
-	stmw %r13,4(%r1)
-
-	/* Load up A - E */
-	lmw %r27,0(%r3)
-
-	mtctr %r5
-
-1:
-	LOADW(0)
-	lis %r5,0x5a82
-	mr RE(0),%r31
-	LOADW(1)
-	mr RD(0),%r30
-	mr RC(0),%r29
-	LOADW(2)
-	ori %r5,%r5,0x7999 /* K0-19 */
-	mr RB(0),%r28
-	LOADW(3)
-	mr RA(0),%r27
-
-	STEP0_LOAD4(0, 4)
-	STEP0_LOAD4(4, 8)
-	STEP0_LOAD4(8, 12)
-	STEPUP4(D0, 12, 16,)
-	STEPUP4(D0, 16, 20, lis %r5,0x6ed9)
-
-	ori %r5,%r5,0xeba1 /* K20-39 */
-	STEPUP20(D1, 20, 24, lis %r5,0x8f1b)
-
-	ori %r5,%r5,0xbcdc /* K40-59 */
-	STEPUP20(D2, 40, 44, lis %r5,0xca62)
-
-	ori %r5,%r5,0xc1d6 /* K60-79 */
-	STEPUP4(D1, 60, 64,)
-	STEPUP4(D1, 64, 68,)
-	STEPUP4(D1, 68, 72,)
-	STEPUP4(D1, 72, 76,)
-	addi %r4,%r4,64
-	STEPD1(76)
-	STEPD1(77)
-	STEPD1(78)
-	STEPD1(79)
-
-	/* Add results to original values */
-	add %r31,%r31,RE(0)
-	add %r30,%r30,RD(0)
-	add %r29,%r29,RC(0)
-	add %r28,%r28,RB(0)
-	add %r27,%r27,RA(0)
-
-	bdnz 1b
-
-	/* Save final hash, restore registers, and return */
-	stmw %r27,0(%r3)
-	lmw %r13,4(%r1)
-	addi %r1,%r1,80
-	blr
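For readers who want to see what the removed assembly computed without reading PowerPC mnemonics, here is a rough portable C sketch of the round structure described in the sha1ppc.S comments: the F functions f0/f1/f2, the four K constants, and the padding done by ppc_SHA1_Final. It is not git's block-sha1 code and not a translation of the removed files. The names sketch_sha1_block() and sketch_sha1() are hypothetical, the interface is one-shot rather than Init/Update/Final, and words are assembled byte by byte so the sketch does not depend on host endianness (the removed ppc/sha1.c states that it assumes a big-endian machine).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint32_t rotl32(uint32_t x, unsigned n)
{
	return (x << n) | (x >> (32 - n));
}

/* Process one 64-byte block, updating the five-word state h[]. */
static void sketch_sha1_block(uint32_t h[5], const unsigned char *p)
{
	uint32_t w[16], a = h[0], b = h[1], c = h[2], d = h[3], e = h[4];
	int t;

	for (t = 0; t < 80; t++) {
		uint32_t f, k, tmp;

		if (t < 16)	/* first 16 W values come straight from memory */
			w[t] = (uint32_t)p[4 * t] << 24 | (uint32_t)p[4 * t + 1] << 16 |
			       (uint32_t)p[4 * t + 2] << 8 | p[4 * t + 3];
		else		/* the rest are derived; w[t & 15] still holds W[t-16] */
			w[t & 15] = rotl32(w[(t - 3) & 15] ^ w[(t - 8) & 15] ^
					   w[(t - 14) & 15] ^ w[t & 15], 1);

		if (t < 20) {		/* f0: bitwise b ? c : d */
			f = (b & c) | (~b & d);
			k = 0x5a827999;
		} else if (t < 40) {	/* f1: parity */
			f = b ^ c ^ d;
			k = 0x6ed9eba1;
		} else if (t < 60) {	/* f2: majority, in the form the asm comment gives */
			f = (b & d) + ((b ^ d) & c);
			k = 0x8f1bbcdc;
		} else {		/* f1 again */
			f = b ^ c ^ d;
			k = 0xca62c1d6;
		}

		/* E += ROTL(A,5) + F(B,C,D) + W[t] + K; B = ROTL(B,30); then rename. */
		tmp = rotl32(a, 5) + f + e + k + w[t & 15];
		e = d;
		d = c;
		c = rotl32(b, 30);
		b = a;
		a = tmp;
	}
	h[0] += a; h[1] += b; h[2] += c; h[3] += d; h[4] += e;
}

/* One-shot SHA-1 of data[0..len), written to out[20]. */
void sketch_sha1(const void *data, size_t len, unsigned char out[20])
{
	uint32_t h[5] = { 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0 };
	const unsigned char *p = data;
	unsigned char tail[128] = { 0 };
	uint64_t bits = (uint64_t)len << 3;
	size_t rem = len & 63, tail_len;
	int i;

	assert(data != NULL || len == 0);
	for (; len >= 64; len -= 64, p += 64)
		sketch_sha1_block(h, p);

	/* Padding: 0x80, zeros, then the bit length as a big-endian 64-bit value. */
	if (rem)
		memcpy(tail, p, rem);
	tail[rem] = 0x80;
	tail_len = rem < 56 ? 64 : 128;
	for (i = 0; i < 8; i++)
		tail[tail_len - 1 - i] = (unsigned char)(bits >> (8 * i));
	sketch_sha1_block(h, tail);
	if (tail_len == 128)
		sketch_sha1_block(h, tail + 64);

	for (i = 0; i < 5; i++) {
		out[4 * i] = (unsigned char)(h[i] >> 24);
		out[4 * i + 1] = (unsigned char)(h[i] >> 16);
		out[4 * i + 2] = (unsigned char)(h[i] >> 8);
		out[4 * i + 3] = (unsigned char)h[i];
	}
}
```

Calling sketch_sha1("abc", 3, out) should yield the well-known digest a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d. A plain loop like this leaves instruction scheduling to the compiler, which is the trade-off the commit message accepts by pointing users at the C and OpenSSL implementations instead of hand-scheduled assembly.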