videobeaux - hash_fingerprint
=================================

## Description

`hash_fingerprint` is a fast, flexible hashing cataloger for media libraries within videobeaux. It computes deterministic hashes and fingerprints to ensure data integrity, verify exports, detect duplicates, and measure perceptual similarity.

### Features

- File-level hashes: `md5`, `sha1`, `sha256` (streamed, low RAM)
- Stream-level hash: FFmpeg-based hash of decoded content
- Frame-level checksum: `framemd5` per frame
- Perceptual hash: aHash over sampled frames (Pillow required)
- Works on single files or entire directories (recursive)
- Outputs to JSON or CSV

---

## Why Use It

### 1. Integrity & Provenance

Ensure the exact same content is delivered or archived; even a one-bit change is detected.

### 2. Duplicate & Version Control

Detect duplicates and content drift across export iterations.

### 3. Codec-Level Comparison

FFmpeg’s stream hash reveals content changes even when metadata or bitrates differ.

### 4. Frame-Accurate Verification

`framemd5` provides true frame-level checksum comparison.

### 5. Perceptual Matching

Find visually similar clips using aHash to detect re-encodes or near-duplicates.

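
To make the perceptual-matching idea concrete, here is a minimal sketch of the generic average-hash (aHash) technique. It is not taken from the videobeaux source: the function names `average_hash` and `hamming_distance` are hypothetical, the "brighter than the mean gives a 1 bit" convention is one common choice, and the input is assumed to be a frame already converted to grayscale and downscaled to `size × size` (with Pillow this would typically be done via `Image.convert("L")` and `Image.resize`).

```python
def average_hash(gray, size=8):
    """Return a hex aHash for a size x size grid of 0-255 grayscale values.

    Assumption: the caller has already downscaled the frame; this sketch
    only covers the thresholding step of the technique.
    """
    pixels = [value for row in gray for value in row]
    if len(gray) != size or len(pixels) != size * size:
        raise ValueError(f"expected a {size}x{size} grid")
    mean = sum(pixels) / len(pixels)
    bits = 0
    for value in pixels:
        # One bit per pixel: 1 if brighter than the mean, else 0.
        bits = (bits << 1) | int(value > mean)
    width = size * size // 4  # hex digits needed for size*size bits
    return f"{bits:0{width}x}"


def hamming_distance(hash_a, hash_b):
    """Count differing bits between two hex hashes of equal length."""
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


# Illustrative frames (made-up data): left half dark, right half bright.
half = [[0] * 4 + [255] * 4 for _ in range(8)]
# Same layout with reduced contrast, standing in for a re-encode.
dimmer = [[10] * 4 + [245] * 4 for _ in range(8)]
# Mirrored layout, standing in for unrelated content.
inverted = [[255] * 4 + [0] * 4 for _ in range(8)]

near = hamming_distance(average_hash(half), average_hash(dimmer))
far = hamming_distance(average_hash(half), average_hash(inverted))
```

A small Hamming distance (`near`) suggests visually similar frames, while a large one (`far`) suggests different content. A size of 8 yields a 64-bit hash and a size of 16 yields a 256-bit hash, matching the `--phash-size` description.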

---

## Use Cases

- Library audits for media integrity
- Delivery verification (QC workflows)
- Regression testing for re-exports
- Duplicate detection
- Visual similarity clustering (phash)

---

## Inputs & Outputs

**Inputs**

- `-i/--input`: file or directory
- `--recursive`: traverse directories
- `--exts`: filter by extensions

**Outputs**

- `--catalog`: JSON or CSV catalog path

### Example JSON Record

```json
{
  "path": "/abs/path/to/media/bbb.mov",
  "size_bytes": 12345678,
  "file_md5": "…",
  "file_sha256": "…",
  "stream_sha256": "…",
  "framemd5": ["stream, pts, checksum…"],
  "phash_algo": "aHash",
  "phash_frames": 124,
  "phash_list": ["f3a1…", "9b7c…"]
}
```

---

## Key Flags

| Flag | Description |
|------|-------------|
| `--file-hashes` | File-level algorithms: `md5`, `sha1`, `sha256` (default: `md5 sha256`) |
| `--stream-hash` | Compute stream hash using FFmpeg |
| `--framemd5` | Generate per-frame checksums |
| `--phash` | Enable perceptual hashing |
| `--phash-fps` | Sample frequency for phash |
| `--phash-size` | Hash matrix size (8 → 64-bit, 16 → 256-bit) |
| `--catalog` | Output catalog path (`.json` or `.csv`) |

---

## Example Commands

**Default file hash**

```bash
videobeaux -P hash_fingerprint -i ./media/bbb.mov --catalog ./out/outbbb_hashes.json -F
```

**Directory recursive hash**

```bash
videobeaux -P hash_fingerprint -i ./media --recursive --exts .mp4 .mov --catalog ./out/outdir_hashes.json -F
```

**Add stream hash**

```bash
videobeaux -P hash_fingerprint -i ./media/bbb.mov --stream-hash sha256 --stream-kind video --catalog ./out/outbbb_streamsha.json -F
```

**Frame checksum**

```bash
videobeaux -P hash_fingerprint -i ./media/bbb.mov --framemd5 --catalog ./out/outbbb_framemd5.json -F
```

**Perceptual hash**

```bash
videobeaux -P hash_fingerprint -i ./media/bbb.mov --phash --phash-fps 1.0 --phash-size 16 --catalog ./out/outbbb_phash.json -F
```

**Compare exports**

```bash
videobeaux -P hash_fingerprint -i ./out/v1 --recursive --file-hashes sha256 --catalog ./out/v1_hashes.json -F
videobeaux -P hash_fingerprint -i ./out/v2 --recursive --file-hashes sha256 --catalog ./out/v2_hashes.json -F
```

---

## Performance Notes

- File hashes: fastest, limited by I/O.
- Stream hash / framemd5: CPU-intensive, because the media must be decoded.
- Perceptual hashing: cost scales with `--phash-fps` and `--phash-size`.
- Always prefer local disk for large scans.

---

## Best Practices

- **Ingest Audit:** `--file-hashes sha256` on daily ingest.
- **QC Re-exports:** Add `--stream-hash sha256`.
- **Forensic Accuracy:** Use `--framemd5` for exact match.
- **Similarity:** Use `--phash --phash-fps 0.5 --phash-size 8` for clustering.

---

## Troubleshooting

- Ensure FFmpeg is installed and in PATH.
- Install Pillow for `--phash` (`pip install Pillow`).
- Create parent directories for output paths before running.

---

## Security & Determinism

- File-level hashes are deterministic and consistent across systems.
- md5 is fast and adequate for duplicate detection; sha256 is collision-resistant and the better choice for integrity checks.
- Stream and frame hashes depend on the FFmpeg decoding path, so they can differ between FFmpeg builds or versions.

---

## Future Enhancements

- `--verify` mode to compare current files vs stored catalog.
- Duplicate-grouping report in JSON/CSV.
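
Until something like `--verify` exists, the two catalogs written by the "Compare exports" commands could be diffed with a short script. The sketch below rests on assumptions the source does not confirm: that a JSON catalog is an array of records shaped like the example record (with `path` and `file_sha256` keys), and that files in the two export folders can be matched by base name. The helper names `index_by_name` and `compare_catalogs` are hypothetical.

```python
import json
from pathlib import Path


def index_by_name(records, key="file_sha256"):
    """Map each record's file name to its hash value.

    Assumption: base names are unique within a catalog; recursive scans
    with repeated names would need a relative-path key instead.
    """
    return {Path(record["path"]).name: record.get(key) for record in records}


def compare_catalogs(old_records, new_records, key="file_sha256"):
    """Report files whose hash changed, disappeared, or newly appeared."""
    old = index_by_name(old_records, key)
    new = index_by_name(new_records, key)
    shared = old.keys() & new.keys()
    return {
        "changed": sorted(name for name in shared if old[name] != new[name]),
        "missing": sorted(old.keys() - new.keys()),
        "added": sorted(new.keys() - old.keys()),
    }


def load_records(catalog_path):
    """Read a JSON catalog, assumed here to be a list of record objects."""
    return json.loads(Path(catalog_path).read_text(encoding="utf-8"))


# Made-up records standing in for two catalogs; the hash values are
# placeholders, not real digests.
v1 = [
    {"path": "/out/v1/a.mov", "file_sha256": "aa"},
    {"path": "/out/v1/b.mov", "file_sha256": "bb"},
    {"path": "/out/v1/c.mov", "file_sha256": "cc"},
]
v2 = [
    {"path": "/out/v2/a.mov", "file_sha256": "aa"},
    {"path": "/out/v2/b.mov", "file_sha256": "b2"},
    {"path": "/out/v2/d.mov", "file_sha256": "dd"},
]
report = compare_catalogs(v1, v2)
```

With real catalogs, `compare_catalogs(load_records("./out/v1_hashes.json"), load_records("./out/v2_hashes.json"))` would apply the same comparison, provided the catalog layout matches the assumption above. Passing `key="stream_sha256"` would compare decoded content rather than file bytes, if that field is present in both catalogs.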