Fast implementation of Git in pure Go

research: Add packfile bloom filter RFC

runxiyu.tngl.sh 17342072 94011e37

+130
research/packfile_bloom.txt
Packfile bloom filter RFC
=========================

Problem
-------

Especially for server-side usage, repacking is extremely expensive, and
creating multi-pack-indexes is still rather expensive. Incremental MIDX
partially solves this, but defeats the purpose of MIDX when there are too
many layers, as Git would still have to walk the MIDXes in order while
performing an expensive index query against each one.

Idea
----

Each MIDX layer, and each non-MIDX index, comes with a bloom filter. MIDXes and
ordinary .idx files are still traversed in their usual order, but the first
step when traversing each of them is to check, through its bloom filter,
whether that index could possibly contain the desired object.

We want the filters to be mmapped, and we want the lookup cost to be dominated
by one cache-line read rather than many scattered reads. Therefore, a blocked
bloom filter is likely the right direction here. The steps are as follows:

1. Split the filter into 64-octet buckets, since 64 octets is the most common
   cache-line size.
2. Use some bits of the object ID to choose the bucket.
3. Use further bits of the object ID to choose several bit positions inside
   that bucket.
4. A lookup thus reads one 64-octet bucket and checks whether all required bits
   are set.

Note on Object IDs
------------------

Git object IDs are cryptographic hashes (currently either SHA-1 or SHA-256),
and are thus uniformly distributed in non-pathological scenarios.
See also the "Security considerations" section.

Definitions
-----------

Let:

    B := number of buckets
    K := number of bits set and tested per object ID

* All integers here are big-endian.
* The OID is to be interpreted as a big-endian bitstring, where bit offset 0
  is the most significant bit of octet 0.
* log2(B) + 9K <= hash length in bits.

File layout
-----------

* 4-octet signature: {'I', 'D', 'B', 'L'}
* 4-octet version identifier (= 1)
* 4-octet object hash algorithm identifier (= 1 for SHA-1, 2 for SHA-256)
* 4-octet B (number of buckets)
* 2-octet K (number of bits set and tested per object ID)
* 6-octet padding (must be all zeros)
* B buckets of 64 octets each

Validation
----------

* Signature must match
* Version must be supported (the remaining rules apply to this version)
* Hash algorithm identifier must be recognized
* B must be nonzero and a power of two
* K must be nonzero
* log2(B) + 9K <= hash length in bits
* Padding must be all zeros
* File size must be 24 + 64 * B octets

Lookup procedure
----------------

1. Let b be the unsigned integer encoded by the most significant log2(B) bits
   of the OID. Since B is a power of two, 0 <= b < B.
2. Select and read bucket b.
3. For each 0 <= i < K:
   1. Starting immediately after the most significant log2(B) bits of the
      OID, let the i-th 9-bit field be the bits at offsets 9 * i through
      9 * i + 8 within the next 9 * K bits of the OID.
   2. Let pi be the unsigned integer encoded by that 9-bit field.
      Then, 0 <= pi < 512.
   3. Compute wi := pi >> 6, and bi := pi & 63.
      Thus, wi identifies one of the 8 64-bit words in bucket b, and bi
      identifies one bit within that word.
   4. Test whether bit bi is set in word wi of bucket b. (Within each 64-bit
      word, bit index 0 denotes the most significant bit, and bit index 63
      denotes the least significant bit.)

If any test fails, the OID is definitely not in the relevant idx.
If all tests succeed, the OID may be in the relevant idx.

Note that two or more of the K 9-bit fields can decode to the same pi, which
means an insertion may set fewer than K distinct bits.
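
The lookup procedure above can be sketched in Go as follows. This is only an
illustrative sketch, not the project's implementation: the names Filter,
bitsAt, visit, MayContain and Add are hypothetical, the header is assumed to
have been parsed and validated already, and Buckets is assumed to hold exactly
the 64 * B octets that follow the 24-octet header. Insertion is not specified
above; Add assumes that it sets the same bits that a lookup tests.

```go
package main

import "encoding/binary"

// Filter is a hypothetical view of an already validated filter file.
type Filter struct {
	Log2B   uint   // log2(B); B is a power of two
	K       uint   // number of 9-bit fields taken from each OID
	Buckets []byte // the 64 * B octets following the 24-octet header
}

// bitsAt returns the n-bit unsigned integer (n <= 32) that starts at bit
// offset off in oid, where offset 0 is the most significant bit of octet 0.
func bitsAt(oid []byte, off, n uint) uint32 {
	var v uint32
	for i := off; i < off+n; i++ {
		bit := (oid[i/8] >> (7 - i%8)) & 1
		v = v<<1 | uint32(bit)
	}
	return v
}

// visit selects the bucket for oid and calls fn once per 9-bit field with
// the 8 octets of the selected word and the mask of the selected bit.
// It stops early and returns false as soon as fn returns false.
func (f *Filter) visit(oid []byte, fn func(word []byte, mask uint64) bool) bool {
	b := uint(bitsAt(oid, 0, f.Log2B)) // bucket index, 0 <= b < B
	bucket := f.Buckets[64*b : 64*b+64]
	for i := uint(0); i < f.K; i++ {
		p := bitsAt(oid, f.Log2B+9*i, 9) // 0 <= p < 512
		w, bi := p>>6, p&63
		mask := uint64(1) << (63 - bi) // bit index 0 is the most significant bit
		if !fn(bucket[8*w:8*w+8], mask) {
			return false
		}
	}
	return true
}

// MayContain reports false if oid is definitely absent from the index,
// and true if it may be present.
func (f *Filter) MayContain(oid []byte) bool {
	return f.visit(oid, func(word []byte, mask uint64) bool {
		return binary.BigEndian.Uint64(word)&mask != 0
	})
}

// Add sets the bits that MayContain tests for oid (assumed insertion rule).
func (f *Filter) Add(oid []byte) {
	f.visit(oid, func(word []byte, mask uint64) bool {
		binary.BigEndian.PutUint64(word, binary.BigEndian.Uint64(word)|mask)
		return true
	})
}
```

With Log2B = 15 and K = 8, adding an OID whose octets are all 0xFF selects
bucket 32767 and decodes every field to pi = 511, so all eight fields collapse
onto the least significant bit of word 7. This is the "fewer than K distinct
bits" case noted above.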

Worked example
--------------

Let:

    B = 1 << 15 = 32768
    K = 8

Then, log2(B) = 15. Each lookup thus uses 15 bits to choose the bucket
and 8 * 9 = 72 bits to choose the in-bucket positions, for a total of
87 bits taken from the object ID. This fits within the 160 bits of a SHA-1
object ID.

1. Read the first 15 bits of the OID and interpret them as b, where
   0 <= b < 32768.
2. Read bucket b.
3. For each 0 <= i < 8:
   1. Read the i-th 9-bit field from the next 72 bits of the OID and
      interpret it as pi, where 0 <= pi < 512.
   2. Compute: wi = pi >> 6, bi = pi & 63.
   3. Test whether bit bi is set in word wi of bucket b.

Security considerations
-----------------------

An adversary could construct a packfile whose objects share the same object ID
prefix under the relevant hash algorithm, although doing so is computationally
expensive even for SHA-1, as weak as it is. Such a packfile could saturate some
buckets of the bloom filter, rendering those buckets useless. In the worst
case, if the adversary somehow saturates every bucket of every filter, the
optimization in this proposal yields no benefit, but this would not be a
significant DoS vector.
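
To make "rendering some buckets useless" concrete, the Go sketch below
estimates how often a single bucket lets an absent object through. It is an
illustration under stated assumptions, not part of the proposal: the names
bucket, setBits and passRate are hypothetical, and the estimate assumes that
the object IDs being looked up are uniformly distributed, as argued in the
note on object IDs. Under that assumption each of the K 9-bit fields hits a
set bit with probability s / 512, where s is the number of set bits in the
bucket, and the fields are independent.

```go
package main

import (
	"math"
	"math/bits"
)

// bucket is one 64-octet bucket viewed as eight 64-bit words.
type bucket [8]uint64

// setBits returns the number of bits set in the bucket (0 through 512).
func setBits(b *bucket) int {
	n := 0
	for _, w := range b {
		n += bits.OnesCount64(w)
	}
	return n
}

// passRate returns the probability that all k tests succeed for a uniformly
// random OID that maps to this bucket: (s / 512) raised to the power k.
func passRate(b *bucket, k int) float64 {
	return math.Pow(float64(setBits(b))/512, float64(k))
}
```

A fully saturated bucket has a pass rate of 1, so every lookup that maps to it
falls through to the ordinary index query. The answer stays correct and only
the shortcut is lost, which is consistent with the claim above that saturation
is not a significant DoS vector.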