How deduplication works using file hashes

💻 Environment/Context

  • Deduplication
  • Hashing
  • Collection

 

 

❓ Issue/Question

  • How does the ScorePlay platform detect and merge duplicate files based on their file hash?

 

👌 Resolution/Answer

  • A hash function (like SHA-256, SHA-1, or MD5) takes any input — in this case, the file's binary data — and produces a fixed-length string called a hash value or digest.
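    • Example (illustrative digests, shortened for readability; a full SHA-256 digest is 64 hexadecimal characters):
      • File A → SHA-256 → "9a0364b9e99bb480dd25e1f0284c8555"
      • File B → SHA-256 → "9a0364b9e99bb480dd25e1f0284c8555"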
  • If two files produce the same hash, they're assumed to have identical content.
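
To make the idea concrete, here is a minimal Python sketch of hashing file contents with SHA-256. It is an illustration only, not ScorePlay's implementation: the helper names `file_sha256` and `write_temp` and the sample byte strings are assumptions made for the demo.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so a large file is never loaded fully into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_temp(data):
    """Write bytes to a temporary file and return its path (demo helper)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path


file_a = write_temp(b"same video bytes")
file_b = write_temp(b"same video bytes")  # identical content, separate file
file_c = write_temp(b"same video bytez")  # one byte differs

hash_a, hash_b, hash_c = (file_sha256(p) for p in (file_a, file_b, file_c))
print(hash_a == hash_b)  # True: identical content gives an identical digest
print(hash_a == hash_c)  # False: a one-byte change alters the digest
print(len(hash_a))       # 64: SHA-256 digests are 64 hex characters

for p in (file_a, file_b, file_c):
    os.remove(p)
```

Note that the file name plays no part in the result: only the bytes inside the file are hashed.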

 

  • Hash functions are designed so that:
    • Even a tiny change in the file (one bit) gives a completely different hash.
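    • With a strong hash such as SHA-256, it's computationally infeasible to find two different files with the same hash (a "collision"). Older functions such as MD5 and SHA-1 no longer offer this guarantee.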
  • When a new file is uploaded or saved:
    • The system computes its hash (e.g., SHA-256).
    • It checks whether that hash already exists in the database or storage index.
  • If the hash is found:
    • The system knows this file already exists.
    • It avoids storing the duplicate again (it may just reference the existing file).
  • If the hash is new:
    • The system stores the file and records its hash.
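
The upload flow above can be sketched as a small in-memory store. This is a hedged illustration under simplifying assumptions, not the platform's actual code: the `DedupStore` class, its `blobs` and `references` dictionaries, and the file names are all hypothetical, and a real system would use a database or storage index instead of Python dictionaries.

```python
import hashlib


class DedupStore:
    """Toy in-memory store that keeps one copy per distinct content hash."""

    def __init__(self):
        self.blobs = {}       # digest -> file bytes (the single stored copy)
        self.references = {}  # file name -> digest (metadata pointing at a blob)

    def upload(self, name, data):
        """Record the upload; return (digest, True if the bytes were newly stored)."""
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            # Hash not seen before: store the file and record its hash.
            self.blobs[digest] = data
        # In both cases the name references the stored copy.
        self.references[name] = digest
        return digest, is_new


store = DedupStore()
_, first_is_new = store.upload("match.mp4", b"video bytes")
_, second_is_new = store.upload("match_copy.mp4", b"video bytes")
print(first_is_new, second_is_new)  # True False
print(len(store.blobs))             # 1
```

The lookup is a single dictionary access on the digest, which is why hash-based detection is much cheaper than comparing the new file against every stored file byte by byte.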

 

 

  • Benefits:
    • Saves storage: Identical files are stored once.
    • Improves performance: Fast lookups instead of comparing file contents byte by byte.
    • Integrity checks: Hashes can detect data corruption or tampering.
  • Caveats:
    • Hash collisions: Accidental collisions are extremely rare, but deliberately crafted collisions are practical against weaker hashes like MD5 and SHA-1.
    • Performance cost: Computing hashes on very large files takes CPU time.
    • Partial deduplication: Some systems use block-level hashing (dividing files into chunks) for more efficient deduplication of large, similar files.
  • Exceptions in ScorePlay logic:
    • For files larger than 20 GB, hashing does not work properly.
    • Files are deduplicated only when the same hash is found within the same collection; deduplication does not apply across collections.
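
As a rough illustration of the two ScorePlay exceptions, the sketch below scopes the duplicate lookup to a collection by keying the index on the pair (collection, hash). Everything in it is an assumption made for the example: the `CollectionDedupIndex` class, the `register` method, the identifiers, and in particular the choice to skip files over the size limit, since the source only states that hashing does not work properly for them.

```python
import hashlib

# Assumed limit, reading the "20G" exception as 20 GiB.
MAX_HASHABLE_BYTES = 20 * 1024**3


class CollectionDedupIndex:
    """Hypothetical index that only detects duplicates within one collection."""

    def __init__(self):
        self.seen = {}  # (collection_id, digest) -> id of the first file seen

    def register(self, collection_id, file_id, data, size=None):
        """Return the id of an existing duplicate in the same collection, or None."""
        size = len(data) if size is None else size
        if size > MAX_HASHABLE_BYTES:
            # Assumption: oversized files are left out of deduplication entirely.
            return None
        digest = hashlib.sha256(data).hexdigest()
        # setdefault stores file_id on first sight and returns the stored id.
        existing = self.seen.setdefault((collection_id, digest), file_id)
        return None if existing == file_id else existing


index = CollectionDedupIndex()
r1 = index.register("collection-1", "file-1", b"photo")
r2 = index.register("collection-1", "file-2", b"photo")  # same collection
r3 = index.register("collection-2", "file-3", b"photo")  # other collection
# The size argument simulates an oversized file without allocating 20 GiB.
r4 = index.register("collection-1", "file-4", b"photo", size=MAX_HASHABLE_BYTES + 1)
print(r1, r2, r3, r4)  # None file-1 None None
```

Because the collection is part of the lookup key, the same file uploaded to two different collections is kept in both.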

 

  • Example in practice:
    • In a cloud storage system:
      • User A uploads photo.png → hash = 123abc
      • User B uploads the same photo.png → same hash 123abc
      • The system recognizes the duplicate and stores just one copy, pointing both users' metadata to the same storage object.

 

 
