Packfile Bloom filter RFC
=========================

Problem
-------

Repacking is extremely expensive, especially in server-side use, and
creating multi-pack indexes (MIDX) is still rather expensive. Incremental MIDX
partially solves this, but it defeats the purpose of MIDX when there are too
many layers, as Git would still have to walk the MIDX layers in order while
performing an expensive index query against each one.

Idea
----

Each MIDX layer, and each non-MIDX index, comes with a Bloom filter. MIDXes and
ordinary .idx files are still traversed in their usual order, but the first
step when traversing each of them is to check, through its Bloom filter,
whether that index could possibly contain the desired object.

We want the filters to be mmapped, and we want the lookup cost to be
dominated by one cache-line read rather than many scattered reads.
Therefore, a blocked Bloom filter is likely the right direction here. It works
as follows:

1. Split the filter into 64-octet buckets, since 64 octets is the most common
   cache-line size.
2. Use some bits of the object ID to choose the bucket.
3. Use further bits of the object ID to choose several bit positions inside
   that bucket.
4. A lookup thus reads one 64-octet bucket and checks whether all required bits
   are set.
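
The traversal described above can be sketched as follows. This is only an
illustration in Python, not Git code: `Layer`, `may_contain`, `index_lookup`,
and `find_object` are hypothetical names, and the filter check is passed in as
a plain callable rather than being a real blocked Bloom filter.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Layer:
    """Hypothetical stand-in for one MIDX layer or one ordinary .idx file."""
    name: str
    objects: Dict[bytes, int]              # OID -> offset; models the expensive index
    may_contain: Callable[[bytes], bool]   # models the cheap filter check
    index_queries: int = 0                 # counts expensive lookups actually made

    def index_lookup(self, oid: bytes) -> Optional[int]:
        self.index_queries += 1
        return self.objects.get(oid)


def find_object(layers: List[Layer], oid: bytes) -> Optional[Tuple[str, int]]:
    """Walk the layers in their usual order, consulting each filter first."""
    for layer in layers:
        if not layer.may_contain(oid):
            continue                # definitely absent: skip the index query
        offset = layer.index_lookup(oid)
        if offset is not None:      # a filter hit may still be a false positive
            return layer.name, offset
    return None
```

The point of the sketch is that a negative filter answer skips the index query
for that layer entirely, while a positive answer still has to be confirmed by
the index.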

Note on Object IDs
------------------

Git object IDs are cryptographic hashes (currently either SHA-1 or SHA-256)
and are thus uniformly distributed in non-pathological scenarios.
See also the "Security considerations" section.

Definitions
-----------

Let:

    B := number of buckets
    K := number of bits set and tested per object ID

* All integers here are big-endian.
* The OID is to be interpreted as a big-endian bitstring, where bit offset 0
  is the most significant bit of octet 0.
* log2(B) + 9K <= hash length in bits, so that the bucket index and the K
  9-bit fields can all be taken from distinct bits of the OID.
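
As an illustration of the bit-offset convention and of the size constraint,
here is a minimal Python sketch. The helper names `extract_bits` and `max_k`
are hypothetical and not part of the proposal.

```python
def extract_bits(oid: bytes, offset: int, width: int) -> int:
    """Return `width` bits of `oid` starting at bit `offset`.

    Bit offset 0 is the most significant bit of octet 0.
    """
    total = len(oid) * 8
    if offset < 0 or width < 0 or offset + width > total:
        raise ValueError("bit range lies outside the OID")
    value = int.from_bytes(oid, "big")
    return (value >> (total - offset - width)) & ((1 << width) - 1)


def max_k(num_buckets: int, hash_bits: int) -> int:
    """Largest K that satisfies log2(B) + 9K <= hash length in bits."""
    log2_b = num_buckets.bit_length() - 1   # exact when B is a power of two
    return (hash_bits - log2_b) // 9
```

For example, with B = 1 << 15 the constraint allows K up to 16 for SHA-1
(160 bits) and up to 26 for SHA-256 (256 bits).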

File layout
-----------

* 4-octet signature: {'I', 'D', 'B', 'L'}
* 4-octet version identifier (= 1)
* 4-octet object hash algorithm identifier (= 1 for SHA-1, 2 for SHA-256)
* 4-octet B (number of buckets)
* 2-octet K (number of bits set and tested per object ID)
* 6-octet padding (must be all zeros)
* B buckets of 64 octets each
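
The layout can be expressed with Python's `struct` module as a quick sanity
check of the field sizes. This is a sketch only: `HEADER`, `write_filter`, and
`read_header` are hypothetical names, and no validation is done here.

```python
import struct

# Big-endian (">") also disables implicit alignment padding, so the fields
# occupy exactly 4 + 4 + 4 + 4 + 2 + 6 = 24 octets.
HEADER = struct.Struct(">4sIIIH6s")
BUCKET_SIZE = 64


def write_filter(hash_id: int, k: int, buckets: bytes) -> bytes:
    """Serialize a version-1 filter: the header followed by the raw buckets."""
    if len(buckets) % BUCKET_SIZE:
        raise ValueError("bucket data must be a multiple of 64 octets")
    num_buckets = len(buckets) // BUCKET_SIZE
    return HEADER.pack(b"IDBL", 1, hash_id, num_buckets, k, bytes(6)) + buckets


def read_header(data: bytes):
    """Return (signature, version, hash_id, B, K, padding), unvalidated."""
    return HEADER.unpack_from(data, 0)
```

Since the header is 24 octets, every bucket starts at file offset 24 + 64 * b,
which is 8-octet aligned but not 64-octet aligned relative to the start of the
file.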

Validation
----------

A filter file is valid only if all of the following hold:

* The signature matches
* The version is supported (the remaining rules apply to version 1)
* The hash algorithm identifier is recognized
* B is nonzero and a power of two
* K is nonzero
* log2(B) + 9K <= hash length in bits
* The padding is all zeros
* The file size is exactly 24 + 64 * B octets
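
The rules above translate fairly directly into code. The following Python
sketch is one possible reading of them; `validate` and `HASH_BITS` are
hypothetical names, and raising `ValueError` is an assumption about how a
reader might report an invalid file.

```python
import struct

# Hash algorithm identifiers from the file layout, mapped to hash length in bits.
HASH_BITS = {1: 160, 2: 256}   # 1 = SHA-1, 2 = SHA-256


def validate(data: bytes):
    """Apply the version-1 rules; return (hash_id, B, K) or raise ValueError."""
    if len(data) < 24:
        raise ValueError("file too short for the header")
    sig, version, hash_id, b, k, pad = struct.unpack_from(">4sIIIH6s", data)
    if sig != b"IDBL":
        raise ValueError("bad signature")
    if version != 1:
        raise ValueError("unsupported version")
    if hash_id not in HASH_BITS:
        raise ValueError("unrecognized hash algorithm identifier")
    if b == 0 or b & (b - 1):
        raise ValueError("B must be a nonzero power of two")
    if k == 0:
        raise ValueError("K must be nonzero")
    if (b.bit_length() - 1) + 9 * k > HASH_BITS[hash_id]:
        raise ValueError("log2(B) + 9K exceeds the hash length")
    if pad != bytes(6):
        raise ValueError("padding must be all zeros")
    if len(data) != 24 + 64 * b:
        raise ValueError("unexpected file size")
    return hash_id, b, k
```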

Lookup procedure
----------------

1. Let b be the unsigned integer encoded by the most significant log2(B) bits
   of the OID. Since B is a power of two, 0 <= b < B.
2. Select and read bucket b.
3. For each 0 <= i < K:
   1. Starting immediately after the most significant log2(B) bits of the
      OID, let the i-th 9-bit field be the bits at offsets 9 * i through
      9 * i + 8 within the next 9 * K bits of the OID.
   2. Let pi be the unsigned integer encoded by that 9-bit field.
      Then, 0 <= pi < 512.
   3. Compute wi := pi >> 6, and bi := pi & 63.
      Thus, wi identifies one of the 8 64-bit words in bucket b, and bi
      identifies one bit within that word.
   4. Test whether bit bi is set in word wi of bucket b. (Within each 64-bit
      word, bit index 0 denotes the most significant bit, and bit index 63
      denotes the least significant bit.)

If any test fails, the OID is definitely not in the relevant idx.
If all tests succeed, the OID may be in the relevant idx.

Note that two or more of the K 9-bit fields can decode to the same pi, which
means an insertion may set fewer than K distinct bits.

Worked example
--------------

Let:

    B = 1 << 15 = 32768
    K = 8

Then, log2(B) = 15. Each lookup thus uses 15 bits to choose the bucket
and 8 * 9 = 72 bits to choose the in-bucket positions, for a total of
87 bits taken from the object ID.

1. Read the first 15 bits of the OID and interpret them as b, where
   0 <= b < 32768.
2. Read bucket b.
3. For each 0 <= i < 8:
   1. Read the i-th 9-bit field from the next 72 bits of the OID and interpret
      it as pi, where 0 <= pi < 512.
   2. Compute: wi = pi >> 6, bi = pi & 63.
   3. Test whether bit bi is set in word wi of bucket b.
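
To make the numbers concrete, the following Python sketch runs these steps on
a sample OID. The choice of OID is an assumption made for illustration: it is
the SHA-1 object ID of the empty blob, which is not mentioned in the proposal
itself.

```python
# Sample input (an assumption for illustration): SHA-1 OID of the empty blob.
oid = bytes.fromhex("e69de29bb2d1d6434b8b29ae775ad8c2e48c5391")
LOG2_B, K = 15, 8

value = int.from_bytes(oid, "big")
total_bits = len(oid) * 8                     # 160 for SHA-1

# Step 1: the first 15 bits select the bucket.
b = value >> (total_bits - LOG2_B)

# Step 3: the next 72 bits are split into eight 9-bit fields.
fields = []
for i in range(K):
    shift = total_bits - LOG2_B - 9 * (i + 1)
    p = (value >> shift) & 0x1FF              # pi
    fields.append((p, p >> 6, p & 63))        # (pi, wi, bi)

print("b =", b)
for i, (p, w, bit) in enumerate(fields):
    print(f"i={i}: pi={p} wi={w} bi={bit}")
```

For this OID, b = 29518, and the first field is pi = 482, giving wi = 7 and
bi = 34. Fields 0 and 6 have different pi values (482 and 226) but the same
bi, which is fine because they fall in different words.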

Security considerations
-----------------------

An adversarial packfile whose objects are constructed to share the same object
ID prefix under the relevant hash algorithm (which is computationally
intensive, even for SHA-1, as vulnerable as it is) could be used to fill up
the Bloom filters, rendering some buckets useless. In the worst case, if an
attacker somehow fills all filters, this proposal's optimization becomes
useless and every lookup falls through to the ordinary index query, but this
would not be a significant DoS vector.