Content-Defined Chunking, Rolling Hash, Deduplication Demo

How it Works (Leading Zeros Method):

  1. A sliding window moves over the data. At each step, a rolling hash is calculated for the window's content.
  2. The tool converts the hash value into its 32-bit binary representation.
  3. It counts the number of **leading zeros** in this binary string.
  4. If the Count of Leading Zeros is greater than or equal to your chosen Target Leading Zeros (n), a **chunk boundary** is created.
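  5. This is a probabilistic method. If the hash bits are uniformly distributed, the probability that any given position has at least $n$ leading zeros, and therefore becomes a boundary, is $1/2^n$. The expected average chunk size is therefore about $2^n$ bytes. For example, $n = 10$ gives chunks of roughly 1024 bytes on average.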
  6. Deduplication: When the second stream is processed, the hash of each new chunk is compared against a list of hashes from the first stream. If a match is found, the chunk is marked as a **duplicate** and doesn't need to be stored again, saving space.
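
The steps above can be sketched in Python. This is a minimal illustration, not the demo's actual code. The polynomial rolling hash, the `WINDOW` and `BASE` values, the use of SHA-256 as the per-chunk hash, and the definition of "space saved" as duplicate bytes divided by total bytes in the second stream are all assumptions made for the example.

```python
import hashlib

WINDOW = 4          # assumed sliding-window size in bytes
BASE = 257          # assumed base for the polynomial rolling hash
MASK = 0xFFFFFFFF   # keep hash values within 32 bits


def leading_zeros32(h: int) -> int:
    """Count the leading zero bits of h viewed as a 32-bit value."""
    return 32 - h.bit_length()


def chunk(data: bytes, target: int, window: int = WINDOW) -> list[bytes]:
    """Cut data wherever the window hash has at least `target` leading zeros."""
    top = pow(BASE, window - 1, 1 << 32)  # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) & MASK  # drop the oldest byte
        h = (h * BASE + b) & MASK                    # add the newest byte
        # Only test for a boundary once the window is full.
        if i + 1 >= window and leading_zeros32(h) >= target:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes form the last chunk
    return chunks


def dedup_stats(stream1: bytes, stream2: bytes, target: int) -> dict:
    """Compare chunk hashes of stream2 against those seen in stream1."""
    known = {hashlib.sha256(c).digest() for c in chunk(stream1, target)}
    chunks2 = chunk(stream2, target)
    dupes = [c for c in chunks2 if hashlib.sha256(c).digest() in known]
    total_bytes = sum(len(c) for c in chunks2)
    saved = 100.0 * sum(len(c) for c in dupes) / total_bytes if total_bytes else 0.0
    return {
        "total": len(chunks2),
        "unique": len(chunks2) - len(dupes),
        "duplicate": len(dupes),
        "space_saved_pct": saved,
    }


if __name__ == "__main__":
    s1 = b"the quick brown fox jumps over the lazy dog " * 20
    s2 = b"EXTRA " + s1  # same content shifted by a short prefix
    print(dedup_stats(s1, s2, target=4))
```

Because the hash depends only on the bytes currently inside the window, a boundary is decided by local content rather than by absolute offset. That is why a prefix inserted at the start of the second stream can leave later chunks unchanged, which fixed-size chunking would not achieve.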