Use Case - PDF to Image

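Sometimes we want to draw on a PDF for human readers, for example by putting a box around an important element. To do that, we can first convert the PDF to PNG and then modify the image with the Python Pillow library. This article provides sample code.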

Sample Document (W2 form from IRS):

https://github.com/MacHu-GWU/learn_pylib-project/assets/6800411/25807794-fa7a-4a71-b27b-be001096a69f
# -*- coding: utf-8 -*-

"""
Convert PDF to Image in Python

We use the `pdf2image <https://pypi.org/project/pdf2image/>`_ library.

How to install ``pdf2image``:

First, run ``pip install pdf2image``.

Then install the poppler CLI tools, which ``pdf2image`` uses under the hood.

Mac:

- Install `poppler for Mac <https://macappstore.org/poppler/>`_
- Run ``brew install poppler``
- Use ``brew list poppler`` to find the poppler bin folder; on my computer it is ``/opt/homebrew/Cellar/poppler/22.08.0/bin/``

Linux (Redhat):

- Install poppler for Linux with ``sudo yum install poppler-utils``
- Check that it is installed with ``yum list poppler-utils``
"""

import typing as T
from pathlib import Path
from pdf2image import convert_from_path


def pdf_to_image(
    path_pdf: Path,
    dir_images: Path,
    dpi: int = 144,
    fmt: str = "png",
    poppler_path: T.Optional[str] = None,
) -> T.List[Path]:
    """
    Convert every page of a PDF into one image file per page.

    :param path_pdf: the path of input PDF
    :param dir_images: the directory of output images
    :param dpi: the rendering resolution in dots per inch
    :param fmt: the output image format, also used as the file extension
    :param poppler_path: the poppler bin folder; None means poppler is
        looked up on the system PATH
    :return: the paths of the saved images, in page order
    """
    images = convert_from_path(
        str(path_pdf),
        dpi=dpi,
        fmt=fmt,
        poppler_path=str(poppler_path) if poppler_path else None,
    )
    # Create the output directory (and any missing parents) if needed.
    dir_images.mkdir(parents=True, exist_ok=True)
    output_paths = []
    # Page numbers in the file names start at 1.
    for page_num, image in enumerate(images, start=1):
        path_image = dir_images / f"page-{page_num}.{fmt}"
        output_paths.append(path_image)
        image.save(str(path_image))
    return output_paths


if __name__ == "__main__":
    # Sample PDF, W2 form: https://www.irs.gov/pub/irs-pdf/fw2.pdf
    dir_here = Path(__file__).absolute().parent
    path_pdf = dir_here / "w2.pdf"
    dir_images = dir_here / "output"
    dpi = 150
    fmt = "png"
    poppler_path = None
    pdf_to_image(
        path_pdf=path_pdf,
        dir_images=dir_images,
        dpi=dpi,
        fmt=fmt,
        poppler_path=poppler_path,
    )
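
The introduction mentions drawing a box around an important element once the page is a PNG. The script above only covers the conversion, so here is a minimal sketch of that second step with Pillow. The function names `pdf_box_to_pixels` and `draw_box` are hypothetical helpers, not part of the original script. The coordinate conversion assumes an unrotated page that uses the default PDF coordinate system (origin at the bottom-left corner, 72 points per inch) and an image rendered at the same ``dpi`` that was passed to ``convert_from_path``.

```python
from pathlib import Path
from typing import Tuple, Union

from PIL import Image, ImageDraw

PathLike = Union[str, Path]


def pdf_box_to_pixels(
    box_pt: Tuple[float, float, float, float],
    page_height_pt: float,
    dpi: int,
) -> Tuple[int, int, int, int]:
    """
    Hypothetical helper: convert a box given in PDF points to pixels.

    ``box_pt`` is (x0, y0, x1, y1) in PDF space, where the origin is the
    bottom-left corner of the page and y0 < y1. The result is
    (left, top, right, bottom) in image space, where the origin is the
    top-left corner. Assumes an unrotated page rendered at ``dpi``.
    """
    scale = dpi / 72.0  # one PDF point is 1/72 inch
    x0, y0, x1, y1 = box_pt
    left = round(x0 * scale)
    right = round(x1 * scale)
    # Flip the vertical axis: PDF y grows upward, image y grows downward.
    top = round((page_height_pt - y1) * scale)
    bottom = round((page_height_pt - y0) * scale)
    return left, top, right, bottom


def draw_box(
    path_in: PathLike,
    path_out: PathLike,
    box_px: Tuple[int, int, int, int],
    color: str = "red",
    width: int = 3,
) -> None:
    """
    Hypothetical helper: draw a rectangle outline on an image and save
    the result to ``path_out``. ``box_px`` is (left, top, right, bottom).
    """
    with Image.open(path_in) as original:
        # Work on an RGB copy so a named color renders as expected.
        image = original.convert("RGB")
    draw = ImageDraw.Draw(image)
    draw.rectangle(box_px, outline=color, width=width)
    image.save(path_out)
```

For example, `draw_box("output/page-1.png", "output/page-1-boxed.png", (100, 200, 400, 260))` would outline a region given directly in pixels; those coordinates are made up for illustration. If the box comes from a PDF parser in points instead, pass it through `pdf_box_to_pixels` first, together with the page height in points (792 for a US Letter page).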