I’m helping digitize part of The New York Times!

Captcha or reCaptcha … using Captcha security to correct OCR mistakes.

OK … we all know what a captcha is … that little box with numbers and letters you have to fill out in registration forms on websites to prove you aren’t a robot with malicious intent. If you didn’t know … Captcha stands for “Completely Automated Public Turing test to tell Computers and Humans Apart”.

Yes … hackers have been beating the Captcha for a while now … fact is I’ve noticed they are getting harder to read and some of them have 2 sets of words.

So all of you who have seen and used this 2-word captcha, give yourself a big pat on the back, because you’ve been helping to digitize libraries of ancient books and newspapers.

The new project called reCaptcha was the idea of Carnegie Mellon University Assistant Professor of Computer Science Luis von Ahn in conjunction with the New York Times, which is digitizing newspapers going back to 1851, and a nonprofit called the Internet Archive, which is digitizing thousands of books.

If you’ve ever used optical character recognition (OCR) you’ve seen that there are some words that just don’t get recognized correctly. It becomes even harder when the text is old newsprint or ancient books with faded and yellowed text.

That’s where reCaptcha comes in. More than 40,000 web sites, including large volume sites like Ticketmaster, Facebook, Craigslist … are using the 2-word reCaptcha system. One of the words is the actual keyword that gets you verified, but the other word … which is hard to read … is from a newspaper or book that OCR couldn’t decide on. How it works is they accept whatever you type for that second word, and the answer that gets typed in most often becomes the OCR correction.
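If you like to see ideas in code … here is a rough Python sketch of that "most popular answer wins" idea. This is only my illustration, not reCaptcha's actual code: the names `check_and_record` and `ocr_correction` are made up, and I'm assuming a guess for the mystery word only gets counted when the known word was typed correctly.

```python
from collections import Counter


def check_and_record(known_typed, known_truth, mystery_typed, votes):
    """Verify the user against the known word and record their mystery-word guess.

    Hypothetical helper: the guess is counted only if the known word matches
    (an assumption of this sketch, not something the real service documents here).
    """
    if known_typed.strip().lower() != known_truth.strip().lower():
        return False  # failed the real test, so the guess is ignored
    votes[mystery_typed.strip().lower()] += 1
    return True


def ocr_correction(votes):
    """Return the most frequently typed guess, or None if nobody has voted yet."""
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Three visitors see the same pair of words; "morning" is the known word.
votes = Counter()
check_and_record("morning", "morning", "upon", votes)
check_and_record("morning", "morning", "upen", votes)
check_and_record("Morning", "morning", "upon", votes)
print(ocr_correction(votes))
```

Two of the three visitors typed "upon", so that is the word that would go into the digitized text. A real system would presumably want a minimum number of matching answers before trusting the result, but that detail is beyond what I know.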

According to Marc Frons, CTO of the Times, they are digitizing about 2 years’ worth of old newspapers every month.

So far they have been able to digitize 1.3 BILLION words.

Although we hate having to type in anything on those websites … reCaptcha is less irritating than Captcha as it is easier to type real words than random sets of numbers and letters.

Cool huh? Now we know our time wasn’t totally wasted!

It’s estimated that 200 million Captchas are typed in every day.

