videobeaux/program-docs/docs-hash_fingerprint.txt
2025-11-28 22:26:57 -05:00

videobeaux — hash_fingerprint
=================================
## Description
`hash_fingerprint` is a fast, flexible hashing cataloger for media libraries within videobeaux.
It computes deterministic hashes and fingerprints to ensure data integrity, verify exports, detect duplicates, and measure perceptual similarity.
### Features
- File-level hashes: `md5`, `sha1`, `sha256` (streamed, low RAM)
- Stream-level hash: FFmpeg-based hash of decoded content
- Frame-level checksum: `framemd5` per frame
- Perceptual hash: aHash over sampled frames (Pillow required)
- Works on single files or entire directories (recursive)
- Outputs to JSON or CSV
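The "streamed, low RAM" file hashing listed above can be sketched as follows. This is a minimal illustration using Python's standard `hashlib`, not the tool's actual implementation; the function name `file_hashes` and the 1 MiB chunk size are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def file_hashes(path, algos=("md5", "sha256"), chunk_size=1024 * 1024):
    """Return {algorithm: hex digest}, reading the file in fixed-size chunks.

    Hypothetical helper for illustration; not part of videobeaux.
    """
    hashers = {name: hashlib.new(name) for name in algos}
    with Path(path).open("rb") as handle:
        # Read one chunk at a time so memory use stays flat for large files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            # Feed the same chunk to every hasher: one pass over the file.
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

Updating all hashers in a single pass means the file is read from disk only once, however many algorithms are requested.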
---
## Why Use It
### 1. Integrity & Provenance
Confirm that exactly the same bytes were delivered or archived; a file-level hash changes if even one bit differs.
### 2. Duplicate & Version Control
Detect duplicates and content drift across export iterations.
### 3. Codec-Level Comparison
FFmpeg's stream hash covers the decoded content rather than the container bytes, so it reveals content changes even when metadata or bitrates differ.
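One way to obtain such a hash is FFmpeg's `hash` muxer, which hashes decoded audio and video. The sketch below is an assumption about how a stream hash could be produced; this document does not show how videobeaux actually invokes FFmpeg, and the helper names are hypothetical.

```python
import subprocess


def stream_hash_command(path, algo="sha256", kind="video"):
    """Build an FFmpeg command that hashes decoded frames, not container bytes."""
    selector = {"video": "0:v", "audio": "0:a"}[kind]
    return [
        "ffmpeg", "-v", "error", "-i", str(path),
        "-map", selector,   # keep only the chosen stream type
        "-f", "hash",       # FFmpeg's hash muxer
        "-hash", algo,      # for example sha256 or md5
        "-",                # write the result to stdout
    ]


def parse_hash_output(text):
    """Split a line such as 'SHA256=ab12' into (algorithm, hex digest)."""
    name, _, digest = text.strip().partition("=")
    return name.lower(), digest


def stream_hash(path, algo="sha256", kind="video"):
    """Run FFmpeg (must be on PATH) and return the hex digest."""
    result = subprocess.run(
        stream_hash_command(path, algo, kind),
        capture_output=True, text=True, check=True,
    )
    return parse_hash_output(result.stdout)[1]
```

Two files that differ only in container metadata should yield the same digest here, while their file-level hashes differ.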
### 4. Frame-Accurate Verification
`framemd5` records one checksum per frame, so two files can be compared frame by frame.
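A frame-by-frame comparison can be sketched as finding the first line where two checksum lists diverge. The code assumes each list holds one text line per frame, like the `framemd5` field of a catalog record, and that lines starting with `#` are FFmpeg header comments; `first_frame_mismatch` is a hypothetical name.

```python
def _frame_lines(lines):
    """Strip whitespace and drop blank lines and '#' header lines."""
    cleaned = (line.strip() for line in lines)
    return [line for line in cleaned if line and not line.startswith("#")]


def first_frame_mismatch(frames_a, frames_b):
    """Return the index of the first differing frame, or None if identical."""
    a, b = _frame_lines(frames_a), _frame_lines(frames_b)
    for index, (left, right) in enumerate(zip(a, b)):
        if left != right:
            return index
    # One list is a prefix of the other: report the first extra frame.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None
```

Reporting an index rather than a yes/no answer shows where a re-export started to drift.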
### 5. Perceptual Matching
Find visually similar clips using aHash to detect re-encodes or near-duplicates.
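Perceptual hashes are usually compared by Hamming distance: the number of bits that differ. The sketch below assumes hashes are stored as equal-length hex strings, as the `phash_list` field in the example record suggests; the function names are hypothetical, and any similarity threshold would need tuning on real material.

```python
def hamming_distance(hex_a, hex_b):
    """Count differing bits between two equal-length hex-encoded hashes."""
    if len(hex_a) != len(hex_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 bit wherever the two hashes disagree.
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


def similarity(hex_a, hex_b):
    """Return the fraction of matching bits, from 0.0 to 1.0."""
    total_bits = len(hex_a) * 4  # each hex digit encodes 4 bits
    return 1.0 - hamming_distance(hex_a, hex_b) / total_bits
```

A small distance relative to the hash length suggests a re-encode or near-duplicate; a distance near half the bit count suggests unrelated frames.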
---
## Use Cases
- Library audits for media integrity
- Delivery verification (QC workflows)
- Regression testing for re-exports
- Duplicate detection
- Visual similarity clustering (phash)
---
## Inputs & Outputs
**Inputs**
- `-i/--input`: file or directory
- `--recursive`: traverse directories
- `--exts`: filter by extensions
**Outputs**
- `--catalog`: JSON or CSV catalog path
### Example JSON Record
```json
{
"path": "/abs/path/to/media/bbb.mov",
"size_bytes": 12345678,
"file_md5": "…",
"file_sha256": "…",
"stream_sha256": "…",
"framemd5": ["stream, pts, checksum…"],
"phash_algo": "aHash",
"phash_frames": 124,
"phash_list": ["f3a1…", "9b7c…"]
}
```
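Records like the one above can be grouped by hash to find exact duplicates. This sketch assumes the JSON catalog is a top-level list of such records, which this document does not confirm; `load_records` and `duplicate_groups` are hypothetical helpers.

```python
import json
from collections import defaultdict


def load_records(catalog_path):
    """Load a catalog, assuming the file holds a JSON list of records."""
    with open(catalog_path, encoding="utf-8") as handle:
        return json.load(handle)


def duplicate_groups(records, key="file_sha256"):
    """Map each hash seen more than once to the sorted paths that share it."""
    groups = defaultdict(list)
    for record in records:
        digest = record.get(key)
        if digest:  # skip records where this hash was not computed
            groups[digest].append(record["path"])
    return {
        digest: sorted(paths)
        for digest, paths in groups.items()
        if len(paths) > 1
    }
```

Passing `key="stream_sha256"` instead would group files whose decoded content matches even if their containers differ.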
---
## Key Flags
| Flag | Description |
|------|--------------|
| `--file-hashes` | File-level algorithms to compute: `md5`, `sha1`, `sha256` (default: `md5 sha256`) |
| `--stream-hash` | Compute stream hash using FFmpeg |
| `--framemd5` | Generate per-frame checksums |
| `--phash` | Enable perceptual hashing |
| `--phash-fps` | Sampling rate for perceptual hashing, in frames per second |
| `--phash-size` | Hash matrix size (8 → 64-bit, 16 → 256-bit) |
| `--catalog` | Output catalog path (.json or .csv) |
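The relationship between `--phash-size` and hash length follows from how aHash works: one bit per pixel of a `size` x `size` grayscale thumbnail. The sketch below takes an already downscaled grid of brightness values (with Pillow this could come from `Image.convert("L").resize((size, size))`). The bit order and the strict `>` comparison are assumptions, not necessarily what videobeaux does.

```python
def average_hash(gray, size=8):
    """Compute aHash for a size x size grid of grayscale values.

    Returns a hex string of size*size bits: 16 hex digits for size 8,
    64 hex digits for size 16.
    """
    if len(gray) != size or any(len(row) != size for row in gray):
        raise ValueError("expected a size x size grid")
    pixels = [value for row in gray for value in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for value in pixels:
        # Set the bit when the pixel is brighter than the average.
        bits = (bits << 1) | (1 if value > mean else 0)
    return format(bits, "0{}x".format(size * size // 4))
```

A larger size captures finer detail but makes hashes more sensitive to small changes such as compression noise.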
---
## Example Commands
**Default file hash**
```bash
videobeaux -P hash_fingerprint -i ./media/bbb.mov --catalog ./out/outbbb_hashes.json -F
```
**Directory recursive hash**
```bash
videobeaux -P hash_fingerprint -i ./media --recursive --exts .mp4 .mov --catalog ./out/outdir_hashes.json -F
```
**Add stream hash**
```bash
videobeaux -P hash_fingerprint -i ./media/bbb.mov --stream-hash sha256 --stream-kind video --catalog ./out/outbbb_streamsha.json -F
```
**Frame checksum**
```bash
videobeaux -P hash_fingerprint -i ./media/bbb.mov --framemd5 --catalog ./out/outbbb_framemd5.json -F
```
**Perceptual hash**
```bash
videobeaux -P hash_fingerprint -i ./media/bbb.mov --phash --phash-fps 1.0 --phash-size 16 --catalog ./out/outbbb_phash.json -F
```
**Compare exports**
```bash
videobeaux -P hash_fingerprint -i ./out/v1 --recursive --file-hashes sha256 --catalog ./out/v1_hashes.json -F
videobeaux -P hash_fingerprint -i ./out/v2 --recursive --file-hashes sha256 --catalog ./out/v2_hashes.json -F
```
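The two catalogs can then be diffed. This sketch assumes each catalog is a JSON list of records with `path` and `file_sha256` fields, and matches files by their path relative to each export root; the helper names are hypothetical.

```python
import os


def index_by_relative_path(records, root, key="file_sha256"):
    """Map each record's path (relative to root) to its hash."""
    return {os.path.relpath(r["path"], root): r[key] for r in records}


def diff_catalogs(old, new):
    """Compare two {relative path: hash} mappings."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(p for p in shared if old[p] != new[p]),
    }
```

Indexing by relative path lets `./out/v1/clip.mov` line up with `./out/v2/clip.mov` even though the absolute paths in the two catalogs differ.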
---
## Performance Notes
- File hashes: fastest; throughput is limited by disk I/O.
- Stream hash / framemd5: CPU-intensive, because every frame must be decoded.
- Perceptual hashing: cost scales with `--phash-fps` and `--phash-size`; lower either to speed up scans.
- Always prefer local disk for large scans.
---
## Best Practices
- **Ingest Audit:** `--file-hashes sha256` on daily ingest.
- **QC Re-exports:** Add `--stream-hash sha256`.
- **Forensic Accuracy:** Use `--framemd5` for exact frame-by-frame matching.
- **Similarity:** Use `--phash --phash-fps 0.5 --phash-size 8` for clustering.
---
## Troubleshooting
- Ensure FFmpeg is installed and in PATH.
- Install Pillow for `--phash` (`pip install Pillow`).
- Create the parent directories of the `--catalog` path before running.
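The checks above can be automated before a long scan. This is a minimal pre-flight sketch using only the standard library; `preflight` is a hypothetical helper, not a videobeaux command.

```python
import importlib.util
import shutil
from pathlib import Path


def preflight(catalog_path, need_phash=False):
    """Return a list of problems found; an empty list means ready to run."""
    problems = []
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    # Pillow is imported under the module name 'PIL'.
    if need_phash and importlib.util.find_spec("PIL") is None:
        problems.append("Pillow not installed (pip install Pillow)")
    # Create the catalog's parent directory if it does not exist yet.
    Path(catalog_path).parent.mkdir(parents=True, exist_ok=True)
    return problems
```

For example, `preflight("./out/bbb_hashes.json", need_phash=True)` creates `./out` if needed and reports any missing dependency.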
---
## Security & Determinism
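- File-level hashes are deterministic and consistent across systems.
- md5 is fast and adequate for duplicate detection but is not collision-resistant; prefer sha256 where tamper resistance matters.
- Stream and frame hashes depend on the FFmpeg decoding path, so they may differ between FFmpeg versions or builds.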
---
## Future Enhancements
- `--verify` mode to compare current files vs stored catalog.
- Duplicate-grouping report in JSON/CSV.
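Until a `--verify` mode exists, a stored catalog can be checked with a short script. The sketch below is purely illustrative: it assumes records are dictionaries with `path` and `file_sha256` fields, and `verify_records` is a hypothetical name.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large media does not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_records(records):
    """Yield (path, status), where status is 'ok', 'changed', or 'missing'."""
    for record in records:
        path = record["path"]
        if not os.path.isfile(path):
            yield path, "missing"
        elif sha256_of(path) != record["file_sha256"]:
            yield path, "changed"
        else:
            yield path, "ok"
```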