Is there a way to digitally mark up a PDF so it’s not OCR-readable?

cheese_greater@lemmy.world · 3 months ago

Is there a way to digitally mark up a PDF so it’s not OCR-readable?

cannedtuna@lemmy.world · 3 months ago

OCR cannot scan documents that have been certified or digitally signed.

Note that once you certify a document it can no longer be edited, combined with another PDF, or have pages inserted or extracted.

Once a PDF has been digitally signed it is locked and you can no longer add pages, delete pages, or read it via OCR.

MystikIncarnate@lemmy.ca · 3 months ago

This works, right up until you introduce PDF-compatible software that doesn’t give a shit about your rules, of which there’s plenty.

You can also print/scan, or even print to PDF to get around such limitations. The original document cannot be altered since that would invalidate the digital signature on the file, but you can create a perfect digital copy, omitting the signature, and modify it however you want.
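To illustrate the point about signatures: a signature is basically a check over the file’s bytes, so it reveals tampering but does nothing to hide the content. Here is a toy sketch in Python, assuming an HMAC-SHA256 tag as a stand-in for a real PDF signature (actual PDF signatures are certificate-based and cover a byte range of the file; the key, function names, and sample text below are made up for illustration).

```python
import hashlib
import hmac

# Hypothetical signing key; a real PDF signature uses a certificate's private key.
SIGNING_KEY = b"demo-key-not-a-real-certificate"


def sign(document: bytes) -> bytes:
    """Return a tag computed over the document bytes (stand-in for a signature)."""
    return hmac.new(SIGNING_KEY, document, hashlib.sha256).digest()


def is_valid(document: bytes, tag: bytes) -> bool:
    """Check that the document still matches the tag it was signed with."""
    return hmac.compare_digest(sign(document), tag)


original = b"Quarterly report: revenue up 4%"
tag = sign(original)

# The signed content is still plain, readable bytes: signing does not encrypt it.
readable_text = original.decode("ascii")

# Editing the signed document is detected...
tampered = original.replace(b"4%", b"40%")

# ...but an unsigned copy carries no tag to invalidate and can be edited freely.
unsigned_copy = bytes(original)
edited_copy = unsigned_copy.replace(b"up", b"down")

print(is_valid(original, tag))  # the untouched original still verifies
print(is_valid(tampered, tag))  # the tampered version does not
print(readable_text)
```

The tag only answers "is this the same file that was signed?". Anything that can read the bytes can still read the content.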

Online systems that skim documents for their contents don’t give a shit about the signature. They simply take a copy and OCR it to train an AI, amalgamate the information for data harvesting, or use it for other purposes.

I get what you’re saying, and in concept it should be fine. The problem is that it’s a software lock/restriction on a file type that isn’t closed source or undocumented, and the PDF format wasn’t built to be secure from the ground up. So we’re applying security to a system that wasn’t built for it.

GBU_28@lemm.ee · 3 months ago

What? If the document is accessible and human-readable, it’s parsable by OCR.

cannedtuna@lemmy.world · 3 months ago

I don’t know what to tell you, dude. A certified or digitally signed document can’t/won’t be read by OCR. A digitally signed document legally certifies that the document has not been modified. PDF editors such as Bluebeam or Adobe will not or cannot process a certified or digitally signed document.

I’m not sure if that limitation is due to the process by which the document is certified or if it is a feature of software conforming for legal reasons. I’m not going to research this for OP; I’m just providing a simple answer that’s as accurate as I can make it.

Maybe current AI has better abilities to process document text? I’m not sure, maybe. But you’d think this would be a shared concern with groups wanting to protect documents for the same reason and therefore encryption would match.

If it’s just the legality of it stopping a company from providing the feature, you would think most companies would want to keep out of legal hot water and would then disallow OCR processing. In this case sure there could be software that doesn’t conform, but for most application purposes I don’t think you’d have to worry too much.

WolfLink@sh.itjust.works · 3 months ago

It’s 100% a software limitation and you absolutely can screen capture and OCR it.
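A rough sketch of the "render it, then OCR the picture" route, assuming PyMuPDF (imported as `fitz`), Pillow, and `pytesseract` are installed along with the Tesseract binary. The function names and the `signed.pdf` path are hypothetical. Nothing here looks at the signature: the page is rasterised to pixels, and the OCR engine only ever sees an image.

```python
import io
from typing import Callable, Iterable, List


def render_pdf_pages(path: str, dpi: int = 300) -> Iterable[bytes]:
    """Yield each page of the PDF as PNG bytes (assumes PyMuPDF is installed)."""
    import fitz  # PyMuPDF; imported lazily so the sketch loads without it

    with fitz.open(path) as doc:
        for page in doc:
            yield page.get_pixmap(dpi=dpi).tobytes("png")


def tesseract_ocr(png_bytes: bytes) -> str:
    """Run Tesseract on one PNG image (assumes pytesseract and Pillow)."""
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(io.BytesIO(png_bytes)))


def ocr_pages(pages: Iterable[bytes], recognize: Callable[[bytes], str]) -> List[str]:
    """Apply an OCR function to every rendered page image, in page order."""
    return [recognize(png) for png in pages]


# Hypothetical usage; any PDF that a viewer can display will do:
#   texts = ocr_pages(render_pdf_pages("signed.pdf"), tesseract_ocr)
#   for number, text in enumerate(texts, start=1):
#       print(f"--- page {number} ---")
#       print(text)
```

The renderer and the OCR engine are passed in separately on purpose: swap in a screenshot tool or a different OCR library and the rest stays the same.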

BCsven@lemmy.ca · 3 months ago

Lots of software can manipulate PDFs. Open the PDF in LibreOffice Draw, change pages, then print as PDF or export as PDF. A system that skims content is purposely going to bypass any signed restriction.

GBU_28@lemm.ee · 3 months ago

Many alternative OCR tools now simply screenshot the page. This is a cracked issue.