Using OCR to help the Guardian convert PDFs to text
What do you do if someone shares a PDF with you and the PDF contains scanned images of text? How do you get at that text if you want to copy and paste it, search it, or even edit it? In short, how do you liberate it?
The Guardian Newspaper’s unsearchable PDFs …
Last Friday (22nd February 2013) the Guardian newspaper found itself in just such an unenviable position after the BBC released a slew of PDF files (relating to an independent review of the BBC’s handling of the Jimmy Savile scandal) containing scanned, unsearchable text. Not exactly the most helpful format for journalists looking to make use of the files in a hurry!
Optical Character Recognition & Zamzar to the rescue
You can read more about our assistance in a story by the Guardian’s technology editor Charles Arthur: “BBC Pollard inquiry: why is it so hard to search the documents?”
OCR (Optical Character Recognition) technology has a reputation for being costly and difficult to use, so we are pleased to say that we’re currently working hard to make it available on the main Zamzar site so that we can help to liberate more documents! Do let us know if this might be of interest to you.