Time and again, you may come across a PDF with a pesky watermark on every other page that makes the text difficult to read. You can use Adobe Acrobat Pro or PDF-XChange Editor (Windows only) to edit the PDF manually and remove each watermark.
However, if you have more than twenty pages, a programmatic approach is far more efficient. The first step is determining whether the watermark is “text-based” or “image-based.” To do this, you can inspect your PDF by reviewing its content streams: walls of text holding the drawing instructions for everything on each page.
I recommend programmatically inspecting the watermark in the original document or in a slice of that document. Try not to export a single page from a viewer. Why, you may ask? Exporting can re-encode the page, and the result depends on your viewer, so the stream you inspect may no longer match the original.
from PyPDF2 import PdfReader


def inspect_pdf(input_pdf):
    reader = PdfReader(input_pdf)
    for page_num, page in enumerate(reader.pages):
        print(f"\n--- Page {page_num + 1} ---")
        if "/Contents" in page:
            contents = page["/Contents"]
            if isinstance(contents, list):  # Multiple streams
                for content in contents:
                    stream_data = content.get_object().get_data().decode("utf-8", errors="ignore")
                    print(stream_data)
            else:  # Single stream
                stream_data = contents.get_object().get_data().decode("utf-8", errors="ignore")
                print(stream_data)


# Replace with your PDF path
inspect_pdf("path/to/file/here.pdf")
The code above will print the content stream for each page. I recommend sampling a few pages and adjusting the code accordingly. As you inspect the stream, look for your watermark by searching for its text. If it’s text-based, it will appear next to the Tj (or TJ) operator.
BT
/F1 12 Tf
1 0 0 1 100 700 Tm
(MyGreatWatermark) Tj
ET
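If the streams are long, searching by eye gets tedious. Here is a minimal sketch that scans decoded stream text for Tj and TJ operations containing the watermark. It assumes the text is stored as a literal string in parentheses; watermarks written as hex strings or with a custom font encoding will not be found this way. The function name and the sample stream are made up for illustration.

```python
import re

# "(some text) Tj" or "[(some) -20 (text)] TJ".
# Assumption: strings are literal (parenthesised) and contain no "]".
TEXT_OP = re.compile(r"\((?:\\.|[^\\()])*\)\s*Tj|\[[^\]]*\]\s*TJ")
STRING = re.compile(r"\(((?:\\.|[^\\()])*)\)")


def find_watermark_ops(stream_text, watermark):
    """Return the text-showing operations whose visible text contains `watermark`."""
    hits = []
    for match in TEXT_OP.finditer(stream_text):
        op = match.group(0)
        # TJ arrays split the text around kerning numbers, so join the pieces.
        shown = "".join(STRING.findall(op))
        if watermark in shown:
            hits.append(op)
    return hits


# Hypothetical stream text, as printed by inspect_pdf
sample = """BT
/F1 12 Tf
1 0 0 1 100 700 Tm
(MyGreatWatermark) Tj
ET
BT
[(Body) -20 (text)] TJ
ET"""

print(find_watermark_ops(sample, "MyGreatWatermark"))
```

An empty result on every page is a hint that the watermark may be image-based or encoded differently.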
On the other hand, if the watermark is image-based, there are a few additional steps. PDFs list images as XObjects in the /XObject entry of the /Resources dictionary. You’ll have to identify the watermark in the /Resources dictionary and then find its identifier in the content stream by examining the “q” blocks.
q
456 0 0 342.75 74.77 421.4599 cm
/Im1 Do
Q
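To list those blocks without reading the whole stream, something like the sketch below can help. It only recognises the simple shape shown above (q, six numbers and cm, a name and Do, then Q), and the function name is hypothetical. In the simple unrotated case, the first and fourth numbers are the drawn width and height and the last two are the position of the lower-left corner.

```python
import re

# q, six numbers followed by cm, /Name Do, then Q.
# Assumption: nothing else sits inside the block.
IMAGE_DRAW = re.compile(
    r"(?<!\S)q\s+((?:[-+]?\d*\.?\d+\s+){6})cm\s+/(\S+)\s+Do\s+Q(?!\S)"
)


def find_image_draws(stream_text):
    """Return (name, matrix) pairs for each simple q/cm/Do/Q block."""
    draws = []
    for match in IMAGE_DRAW.finditer(stream_text):
        matrix = [float(number) for number in match.group(1).split()]
        draws.append((match.group(2), matrix))
    return draws


sample = """q
456 0 0 342.75 74.77 421.4599 cm
/Im1 Do
Q"""

for name, matrix in find_image_draws(sample):
    print(name, matrix)
```

A name that shows up with the same matrix on page after page is a good watermark candidate.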
Once you have determined whether your watermark is text-based or image-based, you can start making changes programmatically using the PyPDF2 library.
I used ChatGPT to help me come up with the code, including inspect_pdf.py. In my case, the watermark was text-based, so the code in the gist removes text-based watermarks. The approach is straightforward: once you identify the line to remove, rewrite the PDF without that line. That’s it.
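To make the idea concrete, here is a minimal sketch of just the filtering step, working on decoded stream text. It is not the code from the gist. It assumes the watermark sits on its own line ending in Tj or TJ, which is how it looked in my streams but is not guaranteed in general. The function name and sample are made up.

```python
def strip_watermark_lines(stream_text, watermark):
    """Drop every line that shows text (Tj/TJ) and contains `watermark`."""
    kept = []
    for line in stream_text.splitlines():
        is_text_op = line.rstrip().endswith(("Tj", "TJ"))
        if is_text_op and watermark in line:
            continue  # skip the operator that paints the watermark
        kept.append(line)
    return "\n".join(kept)


# Hypothetical stream text with one watermark and one line of body text
sample = """BT
/F1 12 Tf
1 0 0 1 100 700 Tm
(MyGreatWatermark) Tj
ET
BT
/F1 10 Tf
1 0 0 1 72 650 Tm
(Body text stays) Tj
ET"""

cleaned = strip_watermark_lines(sample, "MyGreatWatermark")
print(cleaned)
```

The cleaned text still has to be encoded and written back into each page's content stream with your PDF library before saving; that part is not shown here.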
For me, the excitement came from doing detective work to determine how the watermark was created. The rest was a matter of execution.
Here’s the gist of the watermark removal code.