
This Free Tool Can Help You Search and Copy (Nearly) Any PDF

There’s nothing worse than opening a PDF and realizing you can’t use the search function or even highlight text. This typically happens when a PDF was created by scanning a paper document, so the file is just a series of images. Most modern scanning software runs Optical Character Recognition (OCR) so that words are both searchable and selectable, but sometimes you’ll run into documents where this didn’t happen.

In those cases, the free and open source OCRmyPDF is perfect to have around. This is a command line application that converts nearly any PDF file into a PDF/A file with a recognized text layer, meaning you’ll be able to search and copy the text.

Installing the application is best done using your package manager on Linux devices and using Homebrew on Mac. Windows users can technically install the application by installing Python and a few other dependencies; look into that if you’re willing to do some digging.
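As a rough sketch of what that looks like in practice, the snippet below only prints a likely install command for your machine; it doesn't install anything. The package name ocrmypdf is what Homebrew, Debian/Ubuntu, and Fedora use as far as I know, but treat that as an assumption and check your distribution if the command fails.

```shell
# Pick a likely install command based on which package manager is present.
# Nothing is installed here; the command is only printed.
if command -v brew >/dev/null 2>&1; then
  install_cmd="brew install ocrmypdf"              # Homebrew on Mac
elif command -v apt-get >/dev/null 2>&1; then
  install_cmd="sudo apt-get install ocrmypdf"      # Debian, Ubuntu
elif command -v dnf >/dev/null 2>&1; then
  install_cmd="sudo dnf install ocrmypdf"          # Fedora
else
  install_cmd="(no known package manager found; see the OCRmyPDF installation docs)"
fi

echo "$install_cmd"
```

Copy the printed line into your terminal to run the actual install.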

Once the application is set up, you can use it by typing ocrmypdf followed by the name of the document you want to add OCR to, and then the name of the document you’d like to create. So, for example, ocrmypdf before.pdf after.pdf would take “before.pdf”, add character recognition, then create a new document called “after.pdf”.
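Here is that same command wrapped in a few safety checks, as a minimal sketch. The file names before.pdf and after.pdf are placeholders from the example above, and the status variable is just something added here for illustration.

```shell
input="before.pdf"    # the scanned PDF you already have (placeholder name)
output="after.pdf"    # the searchable copy to create (placeholder name)

if ! command -v ocrmypdf >/dev/null 2>&1; then
  status="ocrmypdf is not installed"
elif [ ! -f "$input" ]; then
  status="$input not found"
elif ocrmypdf "$input" "$output"; then
  status="wrote $output"
else
  status="ocrmypdf reported an error"
fi

echo "$status"
```

Because the output goes to a separate file, the original before.pdf is left untouched.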

The process will take a while, depending on the size of the document, and it might not be entirely accurate if the image quality is low. Even so, I found it did a pretty good job with even the most ancient and poorly compressed PDFs I could dig up.

An image from an old history textbook shown here with copyable text.

Credit: Justin Pot

And there’s more you can do here: the Cookbook in the OCRmyPDF documentation outlines a bunch of options. You can compress the images in the PDF, for example, by adding --pdfa-image-compression jpeg to your command. You can automatically re-orient any pages with sideways text by adding --rotate-pages to the command. Or maybe the PDF you’re processing already has OCR that you think is poor quality. In that case you can add --redo-ocr to the command; this will strip out existing OCR information and start over.
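To show how those options slot into the command, here is a small sketch that combines two of them. The file names and the opts variable are placeholders chosen for illustration; the flags themselves are the ones named above.

```shell
input="before.pdf"    # placeholder input file
output="after.pdf"    # placeholder output file

# Options from the paragraph above; delete the ones you don't need.
opts="--rotate-pages --pdfa-image-compression jpeg"

# Print the full command so you can check it before anything runs.
echo "ocrmypdf $opts $input $output"

if command -v ocrmypdf >/dev/null 2>&1 && [ -f "$input" ]; then
  # $opts is left unquoted on purpose so the shell splits it into separate options.
  ocrmypdf $opts "$input" "$output"
fi
```

To redo poor-quality OCR instead, set opts to "--redo-ocr". It is kept as its own run here because I'm not certain it combines cleanly with every other option.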

You get the idea: There’s a lot here. Check out the documentation for the full list of options.


