Deterministic PDF Normalization & Canonical Hashing API
A production-ready API for turning inconsistent PDF files into stable, comparable outputs. PDFCanon removes risky active content, rebuilds structure, and returns hashes your system can trust.
Problem and Solution
PDFs are structurally chaotic
Two visually identical PDFs can produce different SHA-256 hashes because their internal structure, metadata, and save history differ. That makes ordinary file hashing unreliable for deduplication, tamper checks, archival verification, and audit evidence.
- Embedded JavaScript and active content hidden inside the file
- Incremental update history alters the hash on every re-save
- Hidden metadata and object ordering differences between producers
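To make the problem concrete, the short Python sketch below hashes two byte strings that stand in for the same visible document saved by two different producers. The byte strings are made-up placeholders, not valid PDFs, and only the producer field differs between them.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of raw bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical stand-ins for two saves of the same visible content.
# Real PDFs differ in many more places (object order, XRef layout, update history).
save_a = b"%PDF-1.7\n/Producer (Tool A)\n(Hello, world)\n%%EOF\n"
save_b = b"%PDF-1.7\n/Producer (Tool B)\n(Hello, world)\n%%EOF\n"

# A single differing byte is enough to change the digest completely.
hashes_differ = sha256_hex(save_a) != sha256_hex(save_b)
print(hashes_differ)  # prints True
```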
A strict, deterministic normalization pipeline
PDFCanon applies a deterministic normalization pipeline before hashing. The same input produces the same normalized output, giving your application a stable value to compare and store.
- Active content stripped — no JavaScript, no embedded files
- Structure rebuilt — XRef table, object ordering, incremental updates collapsed
- Stable SHA-256 hash produced — idempotent, auditable, trustworthy
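One way to read "idempotent" here: normalizing an already normalized file should not change its hash. The sketch below shows how such a property could be checked for any normalizer passed in as a function. `toy_normalize` is a hypothetical stand-in used only to exercise the check; it is not PDFCanon's pipeline and does not understand PDF structure.

```python
import hashlib
from typing import Callable


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of raw bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_idempotent(normalize: Callable[[bytes], bytes], data: bytes) -> bool:
    """True if a second normalization pass leaves the hash unchanged."""
    once = normalize(data)
    twice = normalize(once)
    return sha256_hex(once) == sha256_hex(twice)


def toy_normalize(data: bytes) -> bytes:
    """Toy stand-in: drop lines that start with /Producer. Not a real PDF normalizer."""
    kept = [line for line in data.split(b"\n") if not line.startswith(b"/Producer")]
    return b"\n".join(kept)
```

A check like this could run in a test suite against a corpus of sample files, so that a regression in the normalizer shows up as a failed idempotence assertion.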
Key Capabilities
The core pieces needed to make PDF integrity checks repeatable.
Active Content Removal
JavaScript, embedded files, rich media, and AcroForms are stripped or flattened during normalization.
Structural Canonicalization
Object ordering, XRef rebuilds, incremental update collapse, and metadata stripping reduce producer-specific differences.
Stable SHA-256 Hashing
Returns both original and normalized SHA-256 values so your system can compare source files and canonical outputs separately.
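As a rough illustration of why both values are useful, the sketch below groups results by `normalized_sha256` and lists the distinct `original_sha256` values that collapse to the same canonical output. The field names come from the API response shown in this document; the hash strings are shortened placeholders, and the helper names are hypothetical.

```python
from collections import defaultdict


def group_by_normalized_hash(results):
    """Map each normalized hash to the original hashes that produced it."""
    groups = defaultdict(list)
    for result in results:
        groups[result["normalized_sha256"]].append(result["original_sha256"])
    return dict(groups)


def find_duplicates(results):
    """Keep only the normalized hashes reached from more than one upload."""
    grouped = group_by_normalized_hash(results)
    return {norm: origs for norm, origs in grouped.items() if len(origs) > 1}


# Placeholder values standing in for real 64-character hex digests.
results = [
    {"original_sha256": "orig-a", "normalized_sha256": "norm-1"},
    {"original_sha256": "orig-b", "normalized_sha256": "norm-1"},
    {"original_sha256": "orig-c", "normalized_sha256": "norm-2"},
]

duplicates = find_duplicates(results)
print(duplicates)  # prints {'norm-1': ['orig-a', 'orig-b']}
```

Keeping the original hash alongside the normalized one preserves evidence of exactly which bytes were uploaded, while the normalized hash serves as the deduplication key.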
Compliance-Ready Reports
A machine-readable JSON report documents what was detected and changed during normalization.
Multi-Tenant API
Organization isolation, API key management, webhook support, and billing support for production SaaS usage.
Usage-Based Pricing
Pay for the volume you process, with a free tier so teams can test the API before committing production traffic.
Who Uses PDFCanon
PDFCanon is built for teams that accept, store, compare, or audit PDFs from outside their own systems.
- SaaS platforms: Accepting PDF uploads from untrusted sources
- E-sign companies: Verifying document integrity before and after signing
- Fintech & HR SaaS: Onboarding pipelines with document deduplication
- Legal tech: Tamper detection and chain-of-custody requirements
- Government contractors: Document submission audit trail requirements
- Compliance-driven teams: SOC 2, ISO 27001, and regulatory audit evidence
Technical Details
Implementation details for technical evaluators.
API example
POST https://api.pdfcanon.com/v1/normalize
Authorization: Bearer pk_live_••••••••
Content-Type: multipart/form-data

// Response 200 OK
{
  "id": "nrm_01j8x7k...",
  "status": "success",
  "original_sha256": "a3f4b7c2...",
  "normalized_sha256": "e8c1d290...",
  "report": { ... }
}
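A minimal Python sketch of a client for the request above, using only the standard library. The endpoint, headers, and response field names are taken from the example; the multipart field name `file`, the `input.pdf` filename, and the helper names are assumptions, since the example does not show the request body. Check the API documentation for the actual upload field.

```python
import json
import urllib.request
import uuid
from dataclasses import dataclass

API_URL = "https://api.pdfcanon.com/v1/normalize"


@dataclass
class NormalizeResult:
    id: str
    status: str
    original_sha256: str
    normalized_sha256: str
    report: dict


def build_request(pdf_bytes: bytes, api_key: str,
                  field_name: str = "file") -> urllib.request.Request:
    """Build (but do not send) a multipart POST. `field_name` is an assumption."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        (f'Content-Disposition: form-data; name="{field_name}"; '
         'filename="input.pdf"\r\n').encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return urllib.request.Request(API_URL, data=body, headers=headers, method="POST")


def parse_response(raw: bytes) -> NormalizeResult:
    """Parse the JSON body of a 200 response into a typed result."""
    payload = json.loads(raw)
    return NormalizeResult(
        id=payload["id"],
        status=payload["status"],
        original_sha256=payload["original_sha256"],
        normalized_sha256=payload["normalized_sha256"],
        report=payload.get("report", {}),
    )


# Sending is left to the caller, for example:
#   with urllib.request.urlopen(build_request(data, key)) as resp:
#       result = parse_response(resp.read())
```

Separating request construction from sending keeps the sketch testable without network access or a live API key.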
Technology stack
- High-performance cloud backend: High-throughput normalization with minimal latency
- qpdf toolchain: Industry-standard PDF structural transformation
- Schema-level tenant isolation: Each organization's data is fully partitioned
- Cloud object storage: Scalable, redundant storage for normalized outputs
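For evaluators who want a feel for what a qpdf-based structural rewrite looks like, here is a hedged local sketch. It is not PDFCanon's pipeline: it only invokes the qpdf command-line tool with two of its documented options, `--deterministic-id` and `--object-streams=disable`, and it does not remove active content. The function names are hypothetical, and qpdf must be installed for `rewrite` to run.

```python
import subprocess
from pathlib import Path


def qpdf_rewrite_args(src: Path, dst: Path) -> list[str]:
    """Build the argv for a structural rewrite with a content-derived /ID."""
    return [
        "qpdf",
        "--deterministic-id",        # derive the file ID from content, not time
        "--object-streams=disable",  # write objects out uncompressed-by-stream
        str(src),
        str(dst),
    ]


def rewrite(src: Path, dst: Path) -> None:
    """Run qpdf; it rewrites the file as a single revision with a fresh XRef."""
    result = subprocess.run(qpdf_rewrite_args(src, dst))
    # qpdf exits with 0 on success and 3 when it succeeded with warnings.
    if result.returncode not in (0, 3):
        raise RuntimeError(f"qpdf failed with exit code {result.returncode}")
```

Rewriting a file this way produces a fresh single-revision output, which is why a tool like qpdf is a natural fit for collapsing incremental updates before hashing.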
Evaluate PDFCanon with your own files
Start at pdfcanon.com, review the current free tier, and test the API against the PDF workflows your product already handles.