reCAPTCHA Helping to Digitize Books

25 June, 2010

Professor Luis von Ahn at Thinking Digital

The annual Thinking Digital conference took place at the Sage last month: two days packed full of innovative talks and great networking opportunities. Seminars covered everything from Creative Commons and citizen journalism to storytelling and the impact of sound.

One talk in particular grabbed my attention because it dealt with two subjects close to my heart – websites and books. Professor Luis von Ahn of Carnegie Mellon University took to the stage to talk about reCAPTCHA.

CAPTCHA codes are those pesky codes that you copy out when you’re filling in forms online. Around 200 million of them are typed every day. Their function is to ascertain whether you’re a human being or a bot: bots and automated programmes struggle to read distorted or obscured text, whereas humans can.


Professor von Ahn worked out that it takes an average of ten seconds to type each code. At 200 million codes a day, that comes to roughly 2 billion seconds, or more than 500,000 hours, spent typing them every day. It was this colossal perceived ‘waste of time’ that led Professor von Ahn and the team at Carnegie Mellon University to come up with reCAPTCHA.
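
As a rough check on that arithmetic, here is a minimal Python sketch using only the two figures quoted in the talk (200 million codes a day, ten seconds each). The constant and function names are made up for illustration.

```python
# Figures quoted in the talk; the names below are illustrative only.
CAPTCHAS_PER_DAY = 200_000_000
SECONDS_PER_CAPTCHA = 10
SECONDS_PER_HOUR = 3600


def hours_spent_per_day(captchas: int = CAPTCHAS_PER_DAY,
                        seconds_each: int = SECONDS_PER_CAPTCHA) -> float:
    """Return the total human hours spent typing CAPTCHA codes in one day."""
    return captchas * seconds_each / SECONDS_PER_HOUR


if __name__ == "__main__":
    # Prints: 555,556 hours per day
    print(f"{hours_spent_per_day():,.0f} hours per day")
```

Either way you round it, that is several decades of human effort every single day.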

Each time you type a reCAPTCHA code you are helping to digitize books. Here’s how it works. Books printed before the digital age are scanned, and the page images are converted into digital text using a technology known as OCR (optical character recognition). Unfortunately, in many books the print has either deteriorated or is too obscure to be read by a computer. Each of these illegible words is embedded in an image and used as a CAPTCHA code. So when you copy the words in a reCAPTCHA, you’re deciphering the ones that OCR could not.

But if the computer can’t read these words, how does it know that you have typed them correctly? Here’s how, explained rather succinctly on the reCAPTCHA website:

‘Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.’ Accuracy is estimated at .1 / .2 %.
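
The scheme described in that quote can be sketched in a few lines of Python. This is only an illustration of the idea, not reCAPTCHA's actual code: the class and function names are hypothetical, and the three-vote threshold is an assumption, since the quote only says the word goes to "a number of other people".

```python
from collections import Counter
from typing import Optional


class UnknownWord:
    """A scanned word that OCR could not read (hypothetical structure)."""

    def __init__(self, image_id: str) -> None:
        self.image_id = image_id
        self.votes: Counter = Counter()  # typed reading -> number of users


def check_challenge(control_answer: str, typed_control: str,
                    unknown: UnknownWord, typed_unknown: str) -> bool:
    """Pass the user only if the known (control) word was typed correctly.

    When the control word matches, the reading of the unknown word is
    trusted enough to be recorded as one vote.
    """
    if typed_control.strip().lower() != control_answer.strip().lower():
        return False  # control word wrong: ignore the unknown-word guess
    unknown.votes[typed_unknown.strip().lower()] += 1
    return True


def accepted_reading(unknown: UnknownWord, min_votes: int = 3) -> Optional[str]:
    """Return the most common reading once enough users agree on it.

    min_votes is an assumed threshold, chosen only for this sketch.
    """
    if not unknown.votes:
        return None
    word, count = unknown.votes.most_common(1)[0]
    return word if count >= min_votes else None
```

In use, each person who gets the control word right adds one vote for the unknown word, and a reading is only accepted once enough people agree:

```python
word = UnknownWord("scan-001")
check_challenge("morning", "morning", word, "upon")  # passes, vote recorded
check_challenge("morning", "mornign", word, "opon")  # fails, guess ignored
check_challenge("morning", "morning", word, "upon")
check_challenge("morning", "morning", word, "upon")
accepted_reading(word)  # "upon", now that three users agree
```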


Top image courtesy of the Thinking Digital Flickr group.

2 Responses to reCAPTCHA Helping to Digitize Books

  1. SallyF says:

    Wow Jenny, that’s amazing. I love that! Wouldn’t it be great if it told you what book you were digitising too? Then we could all proudly share the details of ‘our’ book and spread more of the love of books in the process.

  2. Jenny Hudson says:

    That would be cool.
