mirror of
https://github.com/vondas-network/videobeaux.git
synced 2025-12-05 23:40:04 +01:00
162 lines
4.3 KiB
Plaintext
videobeaux — hash_fingerprint
=================================

## Description

`hash_fingerprint` is a fast, flexible hashing cataloger for media libraries within videobeaux.
It computes deterministic hashes and fingerprints to ensure data integrity, verify exports, detect duplicates, and measure perceptual similarity.

### Features
- File-level hashes: `md5`, `sha1`, `sha256` (streamed, low RAM)
- Stream-level hash: FFmpeg-based hash of decoded content
- Frame-level checksums: `framemd5`, one checksum per frame
- Perceptual hash: aHash (average hash) over sampled frames (Pillow required)
- Works on single files or entire directories (recursive)
- Outputs to JSON or CSV

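
The "streamed, low RAM" file hashing listed above can be sketched roughly as follows. This is a minimal illustration using Python's standard `hashlib`, not the plugin's actual code; the function name `file_hashes` and the 1 MiB chunk size are assumptions made for the example.

```python
import hashlib


def file_hashes(path, algos=("md5", "sha256"), chunk_size=1024 * 1024):
    """Return {algorithm: hex digest} for one file.

    Hypothetical helper: the file is read in fixed-size chunks, so memory use
    stays near chunk_size no matter how large the media file is.
    """
    hashers = {name: hashlib.new(name) for name in algos}
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            # Feed the same chunk to every hasher: one pass over the file.
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```
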
---

## Why Use It

### 1. Integrity & Provenance
Confirm that delivered or archived files are bit-identical to the source; a file hash changes if even one bit changes.

### 2. Duplicate & Version Control
Detect duplicates and content drift across export iterations.

### 3. Codec-Level Comparison
Because FFmpeg's stream hash is computed over decoded content, it shows whether the actual audio or video changed, independent of container metadata or reported bitrate.

### 4. Frame-Accurate Verification
`framemd5` records a checksum for every frame, so two exports can be compared frame by frame.

### 5. Perceptual Matching
Find visually similar clips using aHash to detect re-encodes or near-duplicates.

---

## Use Cases

- Library audits for media integrity
- Delivery verification (QC workflows)
- Regression testing for re-exports
- Duplicate detection
- Visual similarity clustering (phash)

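
For the similarity use case, an average hash (aHash) can be sketched in plain Python as below. This shows one common aHash variant and is not necessarily what the plugin does. It assumes each sampled frame has already been reduced to a `size` x `size` grayscale grid (for example with Pillow), and the helper names `ahash_bits`, `ahash_hex`, and `hamming` are hypothetical. A grid of side `size` yields `size * size` bits, which matches the 8 → 64-bit and 16 → 256-bit sizes documented for `--phash-size`.

```python
def ahash_bits(gray):
    """Average hash of a square grayscale grid (rows of ints in 0..255).

    Each pixel contributes one bit: 1 if it is brighter than the mean of the
    whole grid, else 0. Bits are packed row by row, most significant first.
    """
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def ahash_hex(gray):
    """Hex form of the hash, zero-padded to one hex digit per 4 pixels."""
    n_pixels = sum(len(row) for row in gray)
    return format(ahash_bits(gray), "0{}x".format((n_pixels + 3) // 4))


def hamming(a, b):
    """Number of differing bits between two hashes; smaller means more alike."""
    return bin(a ^ b).count("1")
```
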
---

## Inputs & Outputs

**Inputs**
- `-i/--input`: file or directory
- `--recursive`: traverse directories
- `--exts`: filter by file extension (e.g. `.mp4 .mov`)

**Outputs**
- `--catalog`: JSON or CSV catalog path

### Example JSON Record
```json
{
  "path": "/abs/path/to/media/bbb.mov",
  "size_bytes": 12345678,
  "file_md5": "…",
  "file_sha256": "…",
  "stream_sha256": "…",
  "framemd5": ["stream, pts, checksum…"],
  "phash_algo": "aHash",
  "phash_frames": 124,
  "phash_list": ["f3a1…", "9b7c…"]
}
```

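
Records shaped like the one above lend themselves to simple duplicate detection. The sketch below groups records by `file_sha256` and keeps only digests shared by more than one path. It assumes the records are available as a list of dictionaries with the `path` and `file_sha256` fields shown above; the function name `group_duplicates` is hypothetical.

```python
from collections import defaultdict


def group_duplicates(records, key="file_sha256"):
    """Map each digest to the sorted paths that share it (duplicates only)."""
    groups = defaultdict(list)
    for record in records:
        digest = record.get(key)
        if digest:  # skip records where this hash was not computed
            groups[digest].append(record["path"])
    # A digest seen once is not a duplicate, so drop single-entry groups.
    return {d: sorted(paths) for d, paths in groups.items() if len(paths) > 1}
```
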
---

## Key Flags

| Flag | Description |
|------|--------------|
| `--file-hashes` | File-level algorithms: `md5`, `sha1`, `sha256` (default: `md5 sha256`) |
| `--stream-hash` | Compute stream hash using FFmpeg |
| `--framemd5` | Generate per-frame checksums |
| `--phash` | Enable perceptual hashing |
| `--phash-fps` | Sampling rate for perceptual hashing, in frames per second |
| `--phash-size` | Hash matrix size (8 → 64-bit, 16 → 256-bit) |
| `--catalog` | Output catalog path (.json or .csv) |

---

## Example Commands

**Default file hash**
```bash
videobeaux -P hash_fingerprint -i ./media/bbb.mov --catalog ./out/outbbb_hashes.json -F
```

**Directory recursive hash**
```bash
videobeaux -P hash_fingerprint -i ./media --recursive --exts .mp4 .mov --catalog ./out/outdir_hashes.json -F
```

**Add stream hash**
```bash
videobeaux -P hash_fingerprint -i ./media/bbb.mov --stream-hash sha256 --stream-kind video --catalog ./out/outbbb_streamsha.json -F
```

**Frame checksum**
```bash
videobeaux -P hash_fingerprint -i ./media/bbb.mov --framemd5 --catalog ./out/outbbb_framemd5.json -F
```

**Perceptual hash**
```bash
videobeaux -P hash_fingerprint -i ./media/bbb.mov --phash --phash-fps 1.0 --phash-size 16 --catalog ./out/outbbb_phash.json -F
```

**Compare exports**
```bash
videobeaux -P hash_fingerprint -i ./out/v1 --recursive --file-hashes sha256 --catalog ./out/v1_hashes.json -F
videobeaux -P hash_fingerprint -i ./out/v2 --recursive --file-hashes sha256 --catalog ./out/v2_hashes.json -F
```

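
Once both catalogs from the "Compare exports" commands exist, they still need to be diffed. The following is a rough sketch of that step, not part of the plugin. It assumes each JSON catalog is a list of records with `path` and `file_sha256` fields (the top-level layout is not documented here), and it matches files by base name because the two exports live in different directories. The names `load_index` and `diff_catalogs` are hypothetical.

```python
import json
from pathlib import Path


def load_index(catalog_path, key="file_sha256"):
    """Read a catalog (assumed: a JSON list of records) into {file name: digest}."""
    records = json.loads(Path(catalog_path).read_text(encoding="utf-8"))
    # Keyed by base name; files with the same name in different
    # subdirectories would collide, so use relative paths if that matters.
    return {Path(record["path"]).name: record.get(key) for record in records}


def diff_catalogs(old, new):
    """Compare two {file name: digest} indexes."""
    return {
        "changed": sorted(n for n in old.keys() & new.keys() if old[n] != new[n]),
        "removed": sorted(old.keys() - new.keys()),
        "added": sorted(new.keys() - old.keys()),
    }
```
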
---

## Performance Notes

- File hashes: fastest, limited by I/O.
- Stream hash / framemd5: CPU-intensive, because the media must be decoded.
- Perceptual hashing: cost is adjustable via `--phash-fps` and `--phash-size`.
- Prefer local disk over network storage for large scans.

---

## Best Practices

- **Ingest Audit:** Run `--file-hashes sha256` on daily ingest.
- **QC Re-exports:** Add `--stream-hash sha256`.
- **Forensic Accuracy:** Use `--framemd5` for exact frame-level matching.
- **Similarity:** Use `--phash --phash-fps 0.5 --phash-size 8` for clustering.

---

## Troubleshooting

- Ensure FFmpeg is installed and in PATH.
- Install Pillow for `--phash` (`pip install Pillow`).
- Create parent directories for output paths.

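
The three checks above can be automated before a long scan. The sketch below is an illustrative pre-flight helper built only on the Python standard library; the function name `preflight` and its parameters are assumptions, not part of videobeaux.

```python
import importlib.util
import shutil
from pathlib import Path


def preflight(catalog_path, need_phash=False):
    """Return a list of problems found; an empty list means ready to run."""
    problems = []
    # shutil.which() searches PATH the same way the shell would.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    # Pillow is imported as "PIL"; find_spec() checks without importing it.
    if need_phash and importlib.util.find_spec("PIL") is None:
        problems.append("Pillow not installed (pip install Pillow)")
    # Create the catalog's parent directory so writing the output cannot fail.
    Path(catalog_path).parent.mkdir(parents=True, exist_ok=True)
    return problems
```
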
---

## Security & Determinism

- File hashes are deterministic and consistent across systems.
- `md5` is fast and adequate for duplicate detection; `sha256` is more secure and the better choice where tampering is a concern.
- Stream and frame hashes depend on the FFmpeg decoding path, so they may differ between FFmpeg versions or builds.

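
To see why stream and frame hashes depend on FFmpeg, it helps to look at the kind of invocation involved. The sketch below builds command lines for FFmpeg's `hash` and `framemd5` muxers. Whether videobeaux issues exactly these commands is an assumption, and the helper names `stream_hash_cmd` and `framemd5_cmd` are hypothetical. Each list can be passed to `subprocess.run(cmd, capture_output=True, text=True, check=True)` to collect the result from stdout.

```python
def stream_hash_cmd(src, algo="sha256", kind="video"):
    """Build an ffmpeg command that hashes the decoded streams of one kind."""
    # "0:v" / "0:a" select all video / audio streams of the first input.
    selector = {"video": "0:v", "audio": "0:a"}[kind]
    return [
        "ffmpeg", "-v", "error",
        "-i", str(src),
        "-map", selector,
        "-f", "hash", "-hash", algo,  # hash muxer: one digest over decoded data
        "-",                          # write the result to stdout
    ]


def framemd5_cmd(src):
    """Build an ffmpeg command that emits one MD5 line per decoded frame."""
    return ["ffmpeg", "-v", "error", "-i", str(src), "-f", "framemd5", "-"]
```
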
---

## Future Enhancements

- `--verify` mode to compare current files vs stored catalog.
- Duplicate-grouping report in JSON/CSV.