Copyright issues aside (e.g. if your thing is public domain), the galaxy-brain approach is to upload your raw scanned PDF to the Internet Archive (archive.org), fill in the appropriate metadata, wait about 24 hours for their post-upload format-conversion tasks to run automatically, and then download the size-optimized and OCR-ized PDF from them.
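If you want to script that round trip, the ia command-line tool from the internetarchive Python package can handle both halves. The identifier and metadata below are made up for illustration, and you'd need to run ia configure with your archive.org credentials first:

    # upload the raw scan as a "texts" item so the derive tasks run on it
    ia upload my-scanned-document scan.pdf \
      --metadata="mediatype:texts" \
      --metadata="title:My Scanned Document"

    # a day or so later, see what got derived, then pull down the PDFs
    ia list my-scanned-document
    ia download my-scanned-document --glob="*.pdf"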
I've done this with a few documents from the French and Spanish national archives, which were originally provided as enormous non-OCRed PDFs but shrank to 10% of their original size (or less) after passing through archive.org, and incidentally became full-text-searchable.
Last time I checked, a few months ago, LLMs were more accurate than the OCR the archive is using. The archive.org version wasn't using context to figure out that, for example, “in the garden was a trge” should be “in the garden was a tree”. LLMs, depending on the prompt, do this.
Perhaps. My perhaps-curmudgeonly take on that is that it sounds a bit like "Xerox scanners/photocopiers randomly alter numbers in scanned documents" ( https://news.ycombinator.com/item?id=29223815 ). I'd much rather deal with "In the garden was a trge" than "In the garden was a tree," for example, if what the page actually said was "In the garden was a tiger." That said, of course you're right that context is useful for OCRing. See for example https://history.stackexchange.com/questions/50249/why-does-n...
Another, perhaps-leftpaddish argument is that by outsourcing the job to archive.org I'm allowing them to worry about the "best" way to OCR things, rather than spending my own time figuring it out. Wikisource, for example, seems to have gotten markedly better at OCRing pages over the past few years, and I assume that's because they're swapping out components behind the scenes.
Fair enough. Very valid points. I guess it boils down to “test both systems and see what works best for the task at hand”. I can indeed imagine cases where your approach would be the better option for sure.
The PDFs this process creates use MRC (Mixed Raster Content), which separates each page into multiple layers: a black and white foreground layer for text/line art, a color background layer for images/colors, and a binary mask layer that controls how they're combined. This smart layering is why you can get such small file sizes while maintaining crisp text and reasonable image quality.
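You can see that structure for yourself with poppler's pdfimages, assuming poppler-utils is installed; the foreground masks and the colour background images show up as separate rows per page, with different values in the 'type' and 'enc' columns:

    # list every embedded image, one row per layer per page
    pdfimages -list in.pdf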
If you want purely black and white output (e.g. if the PDF has yellowing pages and/or not-quite-black text, but doesn't have many illustrations), you can extract just the monochrome foreground layer from each page and ignore the color layers entirely.
First, extract the images using mutool extract in.pdf
Then delete the sRGB images.
Then combine the remaining images with the ImageMagick command line: convert -negate *.png out.pdf
This gives you a clean black and white PDF without any of the color information or artifacts from the background layer.
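Strung together, those three steps look roughly like this. It's an untested sketch: the .png glob, the sRGB check via ImageMagick's identify, and the working-directory layout are all assumptions about how your mutool version names and encodes its output, so adjust to what it actually emits:

    #!/bin/sh
    # 1. dump every embedded image from the PDF into a scratch directory
    mkdir work && cd work
    mutool extract ../in.pdf

    # 2. delete the colour (sRGB) background images, keeping the monochrome masks
    for f in *.png; do
      [ "$(identify -format '%[colorspace]' "$f")" = "sRGB" ] && rm "$f"
    done

    # 3. negate the remaining masks and bundle them into a single PDF
    convert -negate *.png ../out.pdf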
Here's a script that does all that. It worked with two different PDFs from IA. I haven't tested it with other sources of MRC PDFs. The script depends on mutool and ImageMagick.
https://gist.github.com/rahimnathwani/44236eaeeca10398942d2c...