videobeaux/docs/_site/programs/utilities/hash_fingerprint.md
2025-12-07 22:04:44 -05:00

hash_fingerprint

Description

Creates unique fingerprints of video files using checksums, perceptual hashing, or frame-level hashing.
Useful for deduplication, archival identification, similarity matching, and database cataloging.

Purpose

hash_fingerprint allows Videobeaux users to generate machine-identifiable signatures from media files.
This supports:

  • duplicate detection,
  • content-based indexing,
  • frame-level comparison,
  • large-scale archival workflows,
  • catalog metadata generation,
  • cross-system verification of assets.

How It Works

  1. File System Scanning
    • recursive allows walking entire folder trees.
    • exts restricts scanning to specific file types.
  2. Hash Types
    The tool can generate several forms of fingerprints:
    • file_hashes → whole-file digests (MD5, SHA1, etc.)
    • stream_hash → stream-level checksums from container metadata
    • framemd5 → per-frame MD5s for high-precision comparison
    • phash → perceptual hash used for similarity matching rather than byte-exact comparison
  3. Perceptual Hashing Controls
    • phash_fps determines how many frames per second are sampled.
    • phash_size sets the resolution of the perceptual hash grid.
  4. Catalog Output
    A fingerprint catalog can be generated for long-term storage, search systems, or dataset builds.
  5. Stream Selection
    stream_kind allows selecting video/audio/subtitle streams depending on the hashing method.
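
The scanning and whole-file hashing steps above can be sketched in Python with only the standard library. This is an illustration of the idea, not videobeaux's actual implementation; the function names iter_media_files and file_digests are hypothetical, and the default extension and algorithm lists are assumptions.

```python
import hashlib
from pathlib import Path


def iter_media_files(root, exts=("mp4", "mov", "mkv"), recursive=True):
    """Yield files under root whose extension is in exts (case-insensitive)."""
    wanted = {"." + e.lower().lstrip(".") for e in exts}
    root = Path(root)
    # rglob walks the whole tree; glob stays in the top-level folder.
    candidates = root.rglob("*") if recursive else root.glob("*")
    for path in sorted(candidates):
        if path.is_file() and path.suffix.lower() in wanted:
            yield path


def file_digests(path, algorithms=("md5", "sha1"), chunk_size=1 << 20):
    """Return {algorithm: hex digest} for the whole file, read in chunks."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

Reading in chunks keeps memory use flat for large video files, and feeding each chunk to every hasher means the file is read only once however many digests are requested.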

Program Template

videobeaux -P hash_fingerprint \
  -i input.mp4 \
  -o output.mp4 \
  --recursive VALUE \
  --exts VALUE \
  --file_hashes VALUE \
  --stream_hash VALUE \
  --framemd5 VALUE \
  --phash VALUE \
  --phash_fps VALUE \
  --phash_size VALUE \
  --catalog VALUE \
  --stream_kind VALUE

Arguments

  • recursive — Enables recursive folder scanning for batch fingerprinting.
  • exts — Comma-separated extensions to include (e.g., mp4,mov,mkv).
  • file_hashes — Generates byte-level file digests for exact-match identification.
  • stream_hash — Computes hash digests for individual media streams.
  • framemd5 — Produces an MD5 hash for every decoded frame; extremely precise but large.
  • phash — Enables perceptual hashing for similarity comparisons.
  • phash_fps — Number of frames per second to sample for phash generation.
  • phash_size — Resolution of the phash grid (larger = more detail).
  • catalog — Outputs results into a catalog file for later lookup or indexing.
  • stream_kind — Specifies which stream type to fingerprint (e.g., v, a, s).
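
To make phash_fps and phash_size concrete, here is a minimal Python sketch of frame sampling and a grid-based perceptual hash. It uses a simple average hash as a stand-in; the algorithm videobeaux actually applies is not documented here and may differ (for example, a DCT-based pHash). The names sample_indices and average_hash are hypothetical, and frames are assumed to be grayscale lists of rows at least size pixels wide and tall.

```python
def sample_indices(frame_count, source_fps, phash_fps):
    """Pick frame indices so that roughly phash_fps frames are kept per second."""
    step = max(1, round(source_fps / phash_fps))
    return list(range(0, frame_count, step))


def average_hash(gray, size):
    """Average hash of a grayscale frame (list of rows of 0-255 ints).

    The frame is reduced to a size x size grid of block means; each bit is 1
    when its block is brighter than the mean of all blocks.
    """
    height, width = len(gray), len(gray[0])
    cells = []
    for gy in range(size):
        y0, y1 = gy * height // size, (gy + 1) * height // size
        for gx in range(size):
            x0, x1 = gx * width // size, (gx + 1) * width // size
            block = [gray[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (value > mean)
    return bits


# A 4x4 frame that is dark on the left and bright on the right.
frame = [[0, 0, 255, 255] for _ in range(4)]
print(format(average_hash(frame, 2), "04b"))  # 0101
```

In this sketch a grid of size 32 yields a 1024-bit hash per sampled frame, which is why larger phash_size values carry more detail and cost more to compute and store.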

Real World Example

videobeaux -P hash_fingerprint \
  -i myvideo.mp4 \
  -o hash_fingerprint_styled.mp4 \
  --recursive false \
  --exts mp4,mov \
  --file_hashes true \
  --stream_hash true \
  --framemd5 false \
  --phash true \
  --phash_fps 1 \
  --phash_size 32 \
  --catalog true \
  --stream_kind v
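
For scripted or batch use, the same invocation can be assembled in Python. This is a hedged sketch: build_command is a hypothetical helper, it uses only the flags shown in this document, and it assumes boolean options are passed as the literal strings true and false, as in the example above.

```python
import shlex


def build_command(input_path, output_path, **options):
    """Assemble a videobeaux hash_fingerprint argv list from keyword options."""
    argv = ["videobeaux", "-P", "hash_fingerprint",
            "-i", input_path, "-o", output_path]
    for name, value in options.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            value = ",".join(value)  # e.g. exts=["mp4", "mov"] -> "mp4,mov"
        argv += [f"--{name}", str(value)]
    return argv


command = build_command(
    "myvideo.mp4",
    "hash_fingerprint_styled.mp4",
    recursive=False,
    exts=["mp4", "mov"],
    file_hashes=True,
    stream_hash=True,
    framemd5=False,
    phash=True,
    phash_fps=1,
    phash_size=32,
    catalog=True,
    stream_kind="v",
)
print(shlex.join(command))
```

The resulting list can be handed to subprocess.run(command, check=True); passing a list rather than a single string avoids shell quoting problems with file names that contain spaces.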

Technical Notes

  • File hashes identify byte-identical files exactly, but cannot detect visually similar variants.
  • Perceptual hashing (phash) is ideal for detecting duplicates that differ by transcoding, compression, or scaling.
  • framemd5 is extremely accurate but produces large files; best for forensic comparison.
  • Catalog files allow large-scale search across thousands of items.
  • Adjust phash_fps and phash_size to balance accuracy against performance.

Use Cases

  • Archival fingerprinting for media libraries.
  • Deduplication of large video collections.
  • Detecting alternate encodes of the same content.
  • Forensic verification and tamper detection.
  • Preparing similarity datasets for AI/ML workflows.
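
Detecting alternate encodes depends on comparing perceptual hashes by distance rather than by equality. The Python sketch below contrasts the two approaches; it assumes perceptual hashes are stored as integers, and the names hamming_distance and is_similar as well as the 10% threshold are illustrative assumptions, not videobeaux settings.

```python
import hashlib


def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two integer hashes."""
    return bin(hash_a ^ hash_b).count("1")


def is_similar(hash_a, hash_b, bits, max_fraction=0.1):
    """Treat two hashes as similar when at most max_fraction of bits differ."""
    return hamming_distance(hash_a, hash_b) <= bits * max_fraction


# Exact digests: changing a single byte yields a different digest.
original = bytes(range(256)) * 4
altered = original[:-1] + b"\x00"
exact_match = hashlib.md5(original).digest() == hashlib.md5(altered).digest()
print(exact_match)  # False

# Perceptual hashes: a small change flips only a few bits.
hash_a = 0b1011_0110_0101_1100
hash_b = 0b1011_0110_0101_1000  # one bit differs
print(hamming_distance(hash_a, hash_b))     # 1
print(is_similar(hash_a, hash_b, bits=16))  # True
```

A suitable threshold depends on the hash size and the material, so it is best tuned against known duplicate and non-duplicate pairs.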

Quality Tips

  • Use phash_fps between 1 and 3 for good coverage without heavy overhead.
  • Higher phash_size (e.g., 32 to 64) improves discrimination of similar videos.
  • Use framemd5 only when exact frame-level matching is required.
  • Always include catalog=true for batch processing or long-term reference.
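
A catalog makes batch results searchable later. The format videobeaux writes is not described in this document, so the Python sketch below assumes a JSON Lines file with one record per media file; the field names path and sha1 and the helper names are hypothetical.

```python
import json
from collections import defaultdict


def write_catalog(records, catalog_path):
    """Write one JSON object per line (assumed catalog layout)."""
    with open(catalog_path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, sort_keys=True) + "\n")


def load_catalog(catalog_path):
    """Read the catalog back into a list of dicts, skipping blank lines."""
    with open(catalog_path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def find_exact_duplicates(records, key="sha1"):
    """Group paths by digest and keep groups with more than one member."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["path"])
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Keeping one record per line lets a catalog be appended to and streamed without loading thousands of entries into memory at once.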