Use Case - PDF to Image
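Sometimes we want to draw on a PDF for human readers, for example by putting a box around important elements. To do this, we can first convert the PDF to PNG images and then modify those images with the Python Pillow library. This document provides sample code.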
Sample Document (W2 form from IRS):
# -*- coding: utf-8 -*-

"""
Convert PDF to Image in Python

We use the `pdf2image <https://pypi.org/project/pdf2image/>`_ library.

How to install ``pdf2image``:

First, run ``pip install pdf2image``.

Then install the poppler CLI; ``pdf2image`` uses poppler under the hood.

Mac:

- Install `poppler for Mac <https://macappstore.org/poppler/>`_
- Run ``brew install poppler``
- Use ``brew list poppler`` to find the poppler bin folder; on my computer it is ``/opt/homebrew/Cellar/poppler/22.08.0/bin/``

Linux (Redhat):

- Install poppler for Linux: ``sudo yum install poppler-utils``
- Check that it is installed: ``yum list poppler-utils``
"""

import typing as T
from pathlib import Path
from pdf2image import convert_from_path


def pdf_to_image(
    path_pdf: Path,
    dir_images: Path,
    dpi: int = 144,
    fmt: str = "png",
    poppler_path: T.Optional[str] = None,
) -> T.List[Path]:
    """
    Convert each page of a PDF to an image file.

    :param path_pdf: the path of input PDF
    :param dir_images: the directory of output images
    :param dpi: resolution of the rendered images
    :param fmt: output image format, also used as the file extension
    :param poppler_path: folder containing the poppler binaries; None means
        they are looked up on PATH
    :return: the list of output image paths, one per page
    """
    # Render every page of the PDF into an in-memory PIL image
    images = convert_from_path(
        f"{path_pdf}",
        dpi=dpi,
        fmt=fmt,
        poppler_path=str(poppler_path) if poppler_path else None,
    )
    # Create the output directory if it does not exist yet
    dir_images.mkdir(parents=True, exist_ok=True)
    output_paths = list()
    # Save one file per page: page-1.png, page-2.png, ...
    for page_num, image in enumerate(images, start=1):
        path_image = dir_images / f"page-{page_num}.{fmt}"
        output_paths.append(path_image)
        image.save(f"{path_image}")
    return output_paths


if __name__ == "__main__":
    # Sample PDF, W2 form: https://www.irs.gov/pub/irs-pdf/fw2.pdf
    dir_here = Path(__file__).absolute().parent
    path_pdf = dir_here / "w2.pdf"
    dir_images = dir_here / "output"
    dpi = 150
    fmt = "png"
    poppler_path = None
    pdf_to_image(
        path_pdf=path_pdf,
        dir_images=dir_images,
        dpi=dpi,
        fmt=fmt,
        poppler_path=poppler_path,
    )
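
Once a page has been saved as an image, the drawing step mentioned at the top can be done with Pillow. The following is a minimal sketch, not part of the original script: the helper name ``draw_box``, the box coordinates, and the output file name are all hypothetical and chosen only for illustration. It assumes Pillow is installed (``pip install Pillow``).

```python
import typing as T
from pathlib import Path

from PIL import Image, ImageDraw


def draw_box(
    path_in: Path,
    path_out: Path,
    box: T.Tuple[int, int, int, int],
    color: str = "red",
    width: int = 3,
) -> Path:
    """
    Hypothetical helper: draw an unfilled rectangle on an image.

    :param path_in: the path of the input image
    :param path_out: the path of the annotated output image
    :param box: (left, top, right, bottom) in pixels; the origin is the
        top-left corner of the image
    :param color: outline color
    :param width: outline thickness in pixels
    """
    # Load the image and convert to RGB so a named color can be drawn
    # regardless of the original image mode
    with Image.open(path_in) as source:
        image = source.convert("RGB")
    draw = ImageDraw.Draw(image)
    # Only the outline is drawn; the inside of the box stays untouched
    draw.rectangle(box, outline=color, width=width)
    image.save(path_out)
    return path_out


# Example call (coordinates are made up, adjust them to your page):
# draw_box(
#     path_in=Path("output/page-1.png"),
#     path_out=Path("output/page-1-annotated.png"),
#     box=(100, 200, 400, 260),
# )
```

Pixel coordinates depend on the ``dpi`` used during conversion, so a box measured at 150 DPI will not line up with an image rendered at a different resolution.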