Using OCR to help the Guardian convert PDFs to text

What do you do if someone shares a PDF with you and the PDF contains scanned images of text? How do you get at that text if you want to copy and paste it, search it, or even edit it? In short, how do you liberate it?

The Guardian Newspaper’s unsearchable PDFs …

Last Friday (22nd February 2013) the Guardian newspaper found itself in just such an unenviable position after the BBC released a slew of PDF files (relating to an independent review of the BBC’s handling of the Jimmy Savile scandal) containing scanned, unsearchable text. Not exactly the most helpful format for journalists looking to make use of the files in a hurry!

Optical Character Recognition & Zamzar to the rescue

Fortunately, Zamzar was able to step in and help. We used specialist OCR (Optical Character Recognition) technology to analyse the 30+ PDF files and produce readable, searchable text.

You can read more about our assistance in a story by the Guardian’s technology editor Charles Arthur: “BBC Pollard inquiry: why is it so hard to search the documents?”

OCR technology has a reputation for being costly and difficult to use, so we are pleased to say that we’re currently working hard to make it available on the main Zamzar site, so that we can help to liberate more documents! Do let us know if this might be of interest to you.

Happy Converting!
The Zamzar Team.