💻 Environment/Context
- Deduplication
- Hashing
- Collection
❓ Issue/Question
- How does the ScorePlay platform merge duplicate files based on their file hash?
👌 Resolution/Answer
- A hash function (like SHA-256, SHA-1, or MD5) takes any input — in this case, the file's binary data — and produces a fixed-length string called a hash value or digest.
- Example (illustrative digests, shown shortened; a real SHA-256 digest is 64 hexadecimal characters):
- File A → SHA-256 → "9a0364b9e99bb480dd25e1f0284c8555"
- File B → SHA-256 → "9a0364b9e99bb480dd25e1f0284c8555"
- If two files produce the same hash, they're assumed to have identical content.
- Hash functions are designed so that:
- Even a tiny change in the file (one bit) gives a completely different hash.
- With a modern hash such as SHA-256, it's computationally infeasible to find two different files with the same hash (a "collision"). Practical collisions are known for MD5 and SHA-1.
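The two properties above can be checked with a minimal Python sketch. It uses the standard-library `hashlib` module; the helper name `sha256_hex` and the sample bytes are made up for illustration and are not part of ScorePlay.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as 64 hexadecimal characters."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical file contents, used only for this illustration.
original = b"match-highlights-frame-0001"
identical_copy = bytes(original)

# Flip the lowest bit of the first byte to simulate a one-bit change.
modified = bytes([original[0] ^ 0x01]) + original[1:]

print(sha256_hex(original) == sha256_hex(identical_copy))  # True
print(sha256_hex(original) == sha256_hex(modified))        # False
print(len(sha256_hex(original)))                           # 64
```

Identical bytes always give the same digest, while the one-bit change gives an unrelated one.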
- When a new file is uploaded or saved:
- The system computes its hash (e.g., SHA-256).
- It checks whether that hash already exists in the database or storage index.
- If the hash is found:
- The system knows this file already exists.
- It avoids storing the duplicate again (it may just reference the existing file).
- If the hash is new:
- The system stores the file and records its hash.
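The upload flow above could be sketched roughly as follows. This is a simplified in-memory illustration, not ScorePlay's actual implementation: the class `DedupStore`, its methods, and the file names are all hypothetical, and a real system would use a database or storage index instead of dictionaries.

```python
import hashlib


class DedupStore:
    """Hypothetical in-memory stand-in for a storage index keyed by content hash."""

    def __init__(self):
        self._blobs = {}  # digest -> stored content (one copy per digest)
        self._names = {}  # file name -> digest of the content it references

    def save(self, name: str, data: bytes) -> bool:
        """Register name for data; return True only if new content was stored."""
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self._blobs
        if is_new:
            # Hash not seen before: store the content and record its hash.
            self._blobs[digest] = data
        # In both cases the name simply references the stored content.
        self._names[name] = digest
        return is_new

    def stored_copies(self) -> int:
        """Number of distinct contents physically stored."""
        return len(self._blobs)


store = DedupStore()
print(store.save("photo.png", b"pixels"))       # True
print(store.save("photo-copy.png", b"pixels"))  # False
print(store.stored_copies())                    # 1
```

The second upload is recognized by its hash, so only a reference is added and the content is kept once.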
- Benefits:
- Saves storage: Identical files are stored once.
- Improves performance: Fast lookups instead of comparing file contents byte by byte.
- Integrity checks: Hashes can detect data corruption or tampering.
- Caveats:
- Hash collisions: Rare but possible (especially with weaker hashes like MD5).
- Performance cost: Computing hashes on very large files takes CPU time.
- Partial deduplication: Some systems use block-level hashing (dividing files into chunks) for more efficient deduplication of large, similar files.
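As a rough illustration of the block-level idea mentioned in the last caveat, the sketch below hashes fixed-size chunks and counts how many chunks two similar inputs share. The function name `chunk_digests` and the tiny 4-byte chunk size are assumptions chosen for readability; real systems use much larger chunks and sometimes content-defined chunk boundaries.

```python
import hashlib


def chunk_digests(data: bytes, chunk_size: int = 4) -> list:
    """Split data into fixed-size chunks and return the SHA-256 digest of each."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


# Two hypothetical files that differ only in their last chunk.
first = b"AAAABBBBCCCC"
second = b"AAAABBBBDDDD"

shared = set(chunk_digests(first)) & set(chunk_digests(second))
print(len(chunk_digests(first)))  # 3
print(len(shared))                # 2
```

A whole-file hash would treat these two inputs as entirely different, whereas chunk hashes show that two of the three chunks could be stored once.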
- Exceptions in ScorePlay logic:
- For files larger than 20 GB, hash-based deduplication does not work reliably.
- Files are deduplicated only when the same hash is found within the same collection; deduplication does not apply across collections.
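One way to picture the collection scope is to key the duplicate check on the pair (collection, hash) rather than on the hash alone. The sketch below is only a model of that rule under stated assumptions: `CollectionDedupIndex`, `is_duplicate`, and the collection identifiers are hypothetical names, not ScorePlay's API or data model.

```python
import hashlib


class CollectionDedupIndex:
    """Hypothetical index that tracks file hashes separately per collection."""

    def __init__(self):
        self._seen = set()  # holds (collection_id, digest) pairs

    def is_duplicate(self, collection_id: str, data: bytes) -> bool:
        """Return True if the same content was already seen in this collection."""
        key = (collection_id, hashlib.sha256(data).hexdigest())
        if key in self._seen:
            return True
        self._seen.add(key)
        return False


index = CollectionDedupIndex()
print(index.is_duplicate("collection-a", b"same bytes"))  # False
print(index.is_duplicate("collection-a", b"same bytes"))  # True
print(index.is_duplicate("collection-b", b"same bytes"))  # False
```

The same bytes count as a duplicate only on the second upload to the same collection; the upload to a different collection is treated as new.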
- Example in practice:
- In a cloud storage system:
- User A uploads photo.png → hash = 123abc
- User B uploads the same photo.png → same hash 123abc
- System recognizes the duplicate and stores just one copy, pointing both users' metadata to the same storage object.