PDF files can range from a few kilobytes to hundreds of megabytes, depending on content and compression. Understanding compression algorithms helps you make informed decisions about file size, quality, and compatibility. This guide explains the science behind PDF compression and how to choose the right approach for your needs.

The Two Categories of Compression

Lossless Compression

Lossless compression reduces file size without losing any data. When decompressed, the file is bit-for-bit identical to the original. This is essential for text, line art, and any content where perfect accuracy matters.

Common lossless algorithms in PDFs:

Flate (Deflate/ZIP): The most common PDF compression method, based on the same Deflate algorithm used in ZIP files and PNG images. Excellent for text and simple graphics. Typically achieves 2:1 to 3:1 compression ratios on such content, and more when the data is highly repetitive.
LZW (Lempel-Ziv-Welch): An older algorithm that was popular before Flate. Less efficient than Flate but still supported for compatibility.
Run-Length Encoding (RLE): Simple algorithm that works well for images with large areas of solid color, like screenshots or diagrams.
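
To make the simplest of these concrete, here is a minimal Python sketch of the byte-oriented run-length scheme that PDF's RunLengthDecode filter expects: a length byte of 0 to 127 means "copy the next length + 1 bytes", 129 to 255 means "repeat the next byte 257 - length times", and 128 marks the end of the data. The function names rle_encode and rle_decode are illustrative, not part of any library, and the encoder is a simple greedy one rather than an optimal one.

```python
def rle_encode(data: bytes) -> bytes:
    """Greedy encoder for the run-length format described above."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        # Measure the run of identical bytes starting at i (at most 128).
        run = 1
        while i + run < n and run < 128 and data[i + run] == data[i]:
            run += 1
        if run >= 2:
            out.append(257 - run)      # 129..255: repeat the next byte
            out.append(data[i])
            i += run
        else:
            # Gather literal bytes until a run starts or 128 bytes are collected.
            start = i
            i += 1
            while (i < n and i - start < 128
                   and not (i + 1 < n and data[i] == data[i + 1])):
                i += 1
            out.append(i - start - 1)  # 0..127: copy the next length + 1 bytes
            out.extend(data[start:i])
    out.append(128)                    # end-of-data marker
    return bytes(out)


def rle_decode(encoded: bytes) -> bytes:
    """Inverse of rle_encode; stops at the end-of-data marker."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        length = encoded[i]
        i += 1
        if length == 128:
            break
        if length < 128:
            out.extend(encoded[i:i + length + 1])
            i += length + 1
        else:
            out.extend(encoded[i:i + 1] * (257 - length))
            i += 1
    return bytes(out)


# A mostly white scanline followed by three distinct bytes.
row = b"\xff" * 200 + b"abc"
packed = rle_encode(row)
print(len(row), "->", len(packed), "bytes")  # 203 -> 9 bytes
```

The long run of identical bytes collapses to four bytes, while the three distinct bytes actually grow by one. That is why RLE only pays off on images dominated by solid areas.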

Lossy Compression

Lossy compression achieves much smaller file sizes by permanently discarding some data. The key is discarding information that humans won't notice. This is appropriate for photographs and complex graphics, but never for text or technical drawings.

Common lossy algorithms in PDFs:

JPEG (DCT): The standard for photographic images. Can achieve 10:1 to 20:1 compression with minimal visible quality loss. However, repeated compression degrades quality (generation loss).
JPEG2000: A more advanced image codec that offers better quality at the same file size as JPEG. Supports both lossy and lossless modes. Not as widely supported as standard JPEG.
JBIG2: Designed specifically for black-and-white (1-bit) scanned documents. It supports both lossless and lossy modes; the lossy mode can achieve 20:1 to 100:1 compression on text-heavy scans. However, aggressive lossy JBIG2 compression can substitute one character for another, which has legal implications.

How PDF Compression Actually Works

Text Compression

PDF text is stored as character codes plus positioning information. Flate compression works by finding repeated patterns in this data and replacing them with shorter codes.

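Example: Suppose a contract repeats the phrase "terms and conditions" dozens of times. Instead of storing all 20 characters each time, Flate stores the phrase once and replaces later occurrences with a short back-reference meaning "copy 20 bytes from this far back." A Huffman coding stage then assigns the shortest bit codes to the most frequent symbols. Matching looks back at most 32 KB and each PDF stream is compressed separately, so repetition is found within a stream rather than across the whole document, but that still captures repeated words, phrases, and formatting operators.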
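
The effect is easy to observe with Python's standard zlib module, which produces the same zlib-wrapped Deflate data that a FlateDecode stream holds. This is only a sketch: the sample clause and the helper name flate_roundtrip are made up for illustration, and repeating one sentence 50 times yields a far better ratio than real prose would.

```python
import zlib


def flate_roundtrip(raw: bytes):
    """Compress with zlib at maximum effort, then decompress again."""
    compressed = zlib.compress(raw, 9)
    restored = zlib.decompress(compressed)
    return compressed, restored


clause = b"The parties agree to the terms and conditions set out below. "
raw = clause * 50  # artificially repetitive sample text
compressed, restored = flate_roundtrip(raw)

print(f"{len(raw)} bytes -> {len(compressed)} bytes")
print("identical after decompression:", restored == raw)
```

The second line of output is the lossless guarantee in action: the decompressed bytes match the input exactly.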

Image Compression

Images in PDFs can be compressed using different algorithms depending on image type:

Photographs: JPEG compression divides the image into 8x8 pixel blocks, converts each block to the frequency domain using the Discrete Cosine Transform (DCT), then quantizes the resulting coefficients so that high-frequency detail humans barely perceive is rounded away.
Screenshots and diagrams: Flate compression works better than JPEG because these images have sharp edges and solid colors. JPEG's block-based approach creates visible artifacts around text and lines.
Scanned documents: JBIG2 identifies similar-looking characters (like all the letter "e"s) and stores a single template, then references it throughout the document. This is why JBIG2 achieves such dramatic compression on text.
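
The JPEG step for photographs can be sketched in plain Python. The code below computes a 2-D DCT of one 8x8 block and then quantizes it. It is a simplified illustration under stated assumptions: real JPEG encoders shift pixel values, use a quantization table with a different step per frequency, and entropy-code the result, whereas this sketch uses a single uniform step. The names dct_2d, quantize, and nonzero_count are illustrative.

```python
import math

N = 8  # JPEG works on 8x8 blocks


def dct_2d(block):
    """Orthonormal 2-D DCT-II of an 8x8 block (list of 8 rows of 8 numbers)."""
    def alpha(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = alpha(u) * alpha(v) * total
    return out


def quantize(coeffs, step):
    """Divide by a step size and round; small coefficients become zero."""
    return [[round(c / step) for c in row] for row in coeffs]


def nonzero_count(quantized):
    """Number of coefficients that survive quantization."""
    return sum(1 for row in quantized for c in row if c != 0)


flat = [[50] * N for _ in range(N)]                       # solid area
gradient = [[8 * y for y in range(N)] for _ in range(N)]  # smooth ramp

for name, block in (("flat", flat), ("gradient", gradient)):
    kept = nonzero_count(quantize(dct_2d(block), step=16))
    print(f"{name} block keeps {kept} of 64 coefficients")
```

Smooth blocks need only a handful of coefficients, which is where JPEG's savings come from. A block containing a sharp text edge spreads energy across many high-frequency coefficients, so quantizing them produces the visible artifacts mentioned above.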

Object Stream Compression

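PDF 1.5 introduced object streams, which pack many small non-stream objects (dictionaries, arrays, numbers) into a single stream that is compressed as one unit. Before PDF 1.5, only stream data such as page content and images could be compressed, and these small objects were stored uncompressed. Grouping them also lets the compression algorithm find patterns that repeat across objects.

Real-world impact: The gain varies by document. As a rough illustration, a PDF with 1,000 small objects might shrink to about 60% of its original size with object streams, versus about 80% without them.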
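
The benefit of grouping can be approximated with zlib by compressing many small, similar objects one at a time and then as a single block. This is a sketch rather than a real PDF writer: the annotation dictionaries are invented sample data, and the helper name compressed_sizes is hypothetical.

```python
import zlib


def compressed_sizes(objects):
    """Return (total size compressed one by one, size compressed together)."""
    separate = sum(len(zlib.compress(obj, 9)) for obj in objects)
    together = len(zlib.compress(b"".join(objects), 9))
    return separate, together


# 1,000 small dictionary-like objects that differ only in a few numbers.
objects = [
    f"<< /Type /Annot /Subtype /Link /Rect [0 {i} 100 {i + 12}] >>\n".encode("ascii")
    for i in range(1000)
]

raw_total = sum(len(obj) for obj in objects)
separate, together = compressed_sizes(objects)
print("uncompressed:         ", raw_total, "bytes")
print("compressed one by one:", separate, "bytes")
print("compressed together:  ", together, "bytes")
```

Each object on its own is too short for the compressor to find much repetition, and every compressed unit carries its own header and checksum. Compressed together, the shared structure is stored once and referenced afterwards.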

Downsampling: The Hidden Compression Technique

Downsampling reduces image resolution before compression. This is often the most effective way to reduce PDF file size.

When to downsample:

Screen viewing: 150 DPI is sufficient for most monitors. Higher resolution just wastes file size.
Office printing: 300 DPI is the standard for laser printers. Higher resolution rarely produces a visible improvement.
Professional printing: 300-600 DPI depending on the printing process. Consult your print shop.

Example: A scanned document at 600 DPI might be 50 MB. Halving the resolution to 300 DPI (still print-quality) cuts the pixel count to a quarter, reducing it to 12.5 MB before any compression is applied. Applying JBIG2 compression might then bring it down to 2 MB, a 96% reduction with no visible quality loss for the intended use.
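
The arithmetic behind this example follows from the fact that pixel count scales with the square of the resolution. A small sketch, assuming uncompressed image data whose size is proportional to pixel count (the function names are illustrative):

```python
def downsampled_size_mb(size_mb, from_dpi, to_dpi):
    """Estimate raw image size after downsampling; size scales with DPI squared."""
    if to_dpi >= from_dpi:
        return size_mb  # never upsample
    return size_mb * (to_dpi / from_dpi) ** 2


def reduction_percent(original_mb, final_mb):
    """Overall size reduction as a percentage of the original."""
    return (1 - final_mb / original_mb) * 100


after_downsampling = downsampled_size_mb(50, 600, 300)
print(after_downsampling)                  # 12.5
print(round(reduction_percent(50, 2), 1))  # 96.0
```

The same function shows why going from 600 DPI to 150 DPI for screen use is so effective: the estimate drops to one sixteenth of the original.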

Font Subsetting and Embedding

Fonts can significantly impact PDF file size. A complete font file might be 200 KB, but most documents only use 50-100 characters from that font.

Font subsetting includes only the characters actually used in the document. This reduces a 200 KB font to perhaps 10 KB.

Trade-off: A subsetted font contains glyphs only for the characters already in the document, so text added later may use characters the embedded font cannot render. If you might need to add text later, embed the full font. For final, read-only documents, always subset.
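
The trade-off can be modelled in a few lines. This sketch treats a subset as the set of distinct characters in the document, which is a simplification: real subsetting operates on glyphs inside the font file and also has to handle ligatures and composite glyphs. The names subset_characters and can_render are hypothetical.

```python
def subset_characters(document_text):
    """Characters a subsetted font would need to cover for this text."""
    return set(document_text)


def can_render(subset, new_text):
    """True if every character of new_text is already in the subset."""
    return set(new_text) <= subset


subset = subset_characters("Invoice total: 42")
print(len(subset), "distinct characters needed")
print(can_render(subset, "total"))  # True
print(can_render(subset, "Zebra"))  # False
```

The last call fails because "Z", "b", and "r" never appeared in the original text, which is exactly the situation an editor runs into when adding new wording to a PDF with subsetted fonts.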

Compression Strategies by Document Type

Text-Heavy Documents (Reports, Contracts)

Use Flate compression for all content
Subset and embed fonts
Enable object stream compression
Expected result: 70-90% size reduction

Scanned Documents

Downsample to 300 DPI for mixed content, 600 DPI for line art
Use JBIG2 for black-and-white scans (prefer lossless mode; if you use lossy mode, verify text accuracy)
Use JPEG at quality 80-85 for color scans
Apply OCR to make searchable
Expected result: 90-98% size reduction

Photo-Heavy Documents (Portfolios, Catalogs)

Downsample images to appropriate resolution (150 DPI for screen, 300 DPI for print)
Use JPEG compression at quality 80-90
Consider JPEG2000 for critical images where quality is paramount
Compress text and vector elements with Flate
Expected result: 60-85% size reduction

Technical Drawings (CAD, Engineering)

Never use lossy compression: it can blur thin lines and make dimension text and fine annotations ambiguous or illegible
Use Flate compression only
Keep vector graphics as vectors (don't rasterize)
Embed fonts fully if dimensions or annotations might be edited
Expected result: 30-60% size reduction
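
The four strategies above can be captured as a small lookup table that a script could consult before compressing. This is only a sketch of one possible encoding of this guide's recommendations: the Strategy class, its field names, the document-type keys, and pick_strategy are all hypothetical and do not correspond to any particular tool's settings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    image_codec: str           # codec for raster images
    target_dpi: Optional[int]  # None means "do not downsample"
    subset_fonts: bool         # False means "embed fonts fully"
    lossy_allowed: bool


STRATEGIES = {
    "text": Strategy("Flate", None, True, False),
    "scan_bw": Strategy("JBIG2", 300, True, False),  # lossless mode by default
    "scan_color": Strategy("JPEG", 300, True, True),
    "photo_screen": Strategy("JPEG", 150, True, True),
    "photo_print": Strategy("JPEG", 300, True, True),
    "drawing": Strategy("Flate", None, False, False),
}


def pick_strategy(document_type):
    """Look up the recommended settings, failing loudly on unknown types."""
    try:
        return STRATEGIES[document_type]
    except KeyError:
        raise ValueError(f"unknown document type: {document_type!r}") from None


print(pick_strategy("photo_screen"))
print(pick_strategy("drawing"))
```

Keeping the recommendations in data rather than in scattered conditionals makes it easy to review them against the lists above and to adjust a single value after testing.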

The Danger of Over-Compression

Aggressive compression can have unintended consequences:

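JBIG2 text substitution: In 2013, computer scientist David Kriesel showed that certain Xerox WorkCentre scanners using lossy JBIG2 pattern matching silently changed numbers in scanned documents. A "6" might become an "8" because the algorithm judged the two glyphs "similar enough" to share one template. The output still looks sharp, so the error is hard to spot, which has serious legal implications.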
JPEG artifacts: Compressing text or line art with JPEG creates visible "mosquito noise" around edges. Always use lossless compression for text.
Generation loss: Each time a JPEG image is decoded and re-encoded, quality degrades a little further. If you need to edit and re-save a PDF multiple times, start with high-quality images and apply lossy compression only once, at the end.
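
The JBIG2 substitution risk is easier to see in a toy model. The sketch below is not the JBIG2 algorithm itself: it imitates the idea of a symbol dictionary by reusing an existing template whenever a new glyph differs from it by at most max_diff pixels. The 5x7 bitmaps, the threshold, and the function names are all made up for illustration.

```python
EIGHT = (".###.",
         "#...#",
         "#...#",
         ".###.",
         "#...#",
         "#...#",
         ".###.")

SIX = (".###.",
       "#....",
       "#....",
       "####.",
       "#...#",
       "#...#",
       ".###.")


def pixel_difference(a, b):
    """Count the pixels that differ between two same-sized bitmaps."""
    return sum(pa != pb
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))


def encode_symbols(glyphs, max_diff):
    """Build a template dictionary; reuse any template within max_diff pixels."""
    dictionary, refs = [], []
    for glyph in glyphs:
        for index, template in enumerate(dictionary):
            if pixel_difference(glyph, template) <= max_diff:
                refs.append(index)  # reuse: this glyph is drawn as the template
                break
        else:
            dictionary.append(glyph)
            refs.append(len(dictionary) - 1)
    return dictionary, refs


page = [EIGHT, SIX, EIGHT]
print(pixel_difference(SIX, EIGHT))          # 3
print(encode_symbols(page, max_diff=0)[1])   # [0, 1, 0]
print(encode_symbols(page, max_diff=3)[1])   # [0, 0, 0]
```

With exact matching, the "6" gets its own template. With a tolerance of three pixels, it is stored as a reference to the "8" template, so the decoded page shows "888" in crisp, plausible-looking digits.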

Tools and Settings

Our PDF Compression tool uses intelligent algorithms to automatically select the best compression method for each element in your document. It:

Detects image types and applies appropriate compression
Preserves text quality with lossless compression
Subsets fonts automatically
Enables object stream compression
Provides a quality slider for user control

Future of PDF Compression

Emerging technologies are pushing PDF compression further:

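AVIF and WebP in PDFs: These modern image formats offer better compression than JPEG at similar quality. However, PDF 2.0 (ISO 32000-2) does not define filters for either format, so they cannot be embedded natively today; PDF tools must convert such images to a supported codec such as JPEG or JPEG2000. Native support would require a future revision of the standard.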
AI-powered compression: Machine learning models can predict which image details are perceptually important, achieving better quality at lower file sizes.
Adaptive compression: Future tools might analyze how a PDF will be used (screen vs. print) and optimize accordingly.

Understanding compression algorithms empowers you to make informed decisions about file size and quality. The best compression strategy depends on your content type, intended use, and distribution method. When in doubt, test different settings and compare the results visually before committing to a compression approach.

PDF Compression Algorithms Explained: Lossless vs Lossy in 2026