The Telegraph
Tech triumphs as you tinker
Imaging: M Iqbal Shaikh

The next time you buy tickets online or sign up for an e-mail account and are asked to type in a puzzling string of letters or numerals, you may be taking part in a large worldwide exercise. Websites ask people to identify distorted characters to make sure that they are human and not spammers’ computers trying to sign up for countless e-mail accounts. Millions of people around the world type in such words, probably a few hundred million of them a year. Now a professor at Carnegie Mellon University in the US is putting this otherwise wasted effort to good use, without the users ever knowing it: to convert old text into digital form.

Take The New York Times, the largest and most influential newspaper in the US. It is also one of the oldest, having been established in 1851, and contains invaluable reference material for researchers; the newspaper has covered the Civil War, the World Wars, the Great Depression, presidential elections for over 150 years, and many other major world events. Scanned copies of the newspaper are available from 1851 onwards, but a scanned image is not searchable. You can search for keywords only in editions from 1981 onwards, when the newspaper started digitising its content. But not for long. “By the end of next year, we would have digitised all content,” says Marc Frons, chief technology officer of The New York Times digital.

The Carnegie Mellon professor, Luis von Ahn, is using hundreds of millions of people to digitise content in The New York Times from 1851. Von Ahn’s main research topic is human computing, the process of using hundreds of millions of people over the Internet to do some task not yet possible for computers. Digitising text is only part of his objective. He is also trying to caption images, hundreds of millions of them, using people power over the Internet, and to transcribe old radio programmes. Those who participate in such activity will not know what they are achieving, but they have other incentives to keep attempting such tasks.

The statistics of ‘useless activity’ over the Internet are truly mind-boggling. Here are some that Von Ahn is fond of quoting. Over nine billion human hours were spent playing Solitaire in 2003. The Panama Canal was built using 20 million human hours and the Empire State Building using seven million. If all the people who played Solitaire in 2003 had been available to build the Panama Canal, it could have been built in a day. Can the lure of online games, or a necessary chore like figuring out distorted characters, be used to do useful work for humanity?
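The Panama Canal comparison can be checked with simple arithmetic. The short Python sketch below uses only the figures quoted above; the function name and the assumption that Solitaire play was spread evenly over 365 days are ours, made purely for illustration.

```python
# Figures quoted in the article, in human hours.
SOLITAIRE_HOURS_2003 = 9_000_000_000
PANAMA_CANAL_HOURS = 20_000_000
EMPIRE_STATE_HOURS = 7_000_000


def days_of_solitaire(project_hours, yearly_hours=SOLITAIRE_HOURS_2003, days_per_year=365):
    """Days of worldwide Solitaire play that add up to project_hours.

    Assumption (not from the article): play is spread evenly over the year.
    """
    hours_per_day = yearly_hours / days_per_year
    return project_hours / hours_per_day


panama_days = days_of_solitaire(PANAMA_CANAL_HOURS)
empire_days = days_of_solitaire(EMPIRE_STATE_HOURS)

print(f"Panama Canal: {panama_days:.2f} days of Solitaire")
print(f"Empire State Building: {empire_days * 24:.1f} hours of Solitaire")
```

Under these assumptions the Panama Canal works out to a little under one day of Solitaire, which is consistent with the claim above, and the Empire State Building to roughly seven hours.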

Von Ahn and his colleagues wrote the first computer program that uses distorted words to distinguish between a human being and a computer. Yahoo had asked them to write a program to prevent spammers from getting millions of free e-mail accounts. Several companies around the world now use the program, called CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart), to weed out automated programs masquerading as humans. Von Ahn has since written a similar program, called reCAPTCHA, that helps digitise text while companies make sure that their users are human.

Digitising text is a complicated activity for a computer. The most widely used method is to scan texts and then use Optical Character Recognition (OCR) software to recognise the text. “OCR is prone to errors,” says Von Ahn. “We have found that it cannot recognise 20 per cent of the words in old faded text.” Reading such words, however, is easy for a human being, and this human ability is what Von Ahn is harnessing with reCAPTCHA.

In CAPTCHA, the user is asked to type in one distorted word. In reCAPTCHA, the user is asked to type in two. The first is a word the program already knows, while the second comes from the text to be digitised. If the user gets the first word right, the program accepts the user as human, and the answer to the second word goes towards the transcribed text. The program has a method to eliminate errors and produce a text that is as accurate as possible. In fact, the accuracy matches that of a professional transcriber. “Using people to transcribe text is expensive,” says Marc Frons.
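The two-word check described above can be sketched in a few lines of Python. This is an illustration only, not Von Ahn's program: the class name, the identifiers, and in particular the error-elimination rule (accept a transcription once three users who passed the known word agree on it) are assumptions, since the article does not say how errors are actually removed.

```python
from collections import Counter, defaultdict

# Hypothetical setting: matching answers needed before a transcription is
# accepted. The article does not describe the real rule.
AGREEMENT_THRESHOLD = 3


class WordPairChallenge:
    """Toy model of a two-word challenge: one known word, one unknown word."""

    def __init__(self, agreement_threshold=AGREEMENT_THRESHOLD):
        self.agreement_threshold = agreement_threshold
        self.votes = defaultdict(Counter)  # unknown word id -> answer counts
        self.transcribed = {}              # unknown word id -> accepted text

    def submit(self, known_answer, known_truth, unknown_id, unknown_answer):
        """Return True if the user passes the human check."""
        if known_answer.strip().lower() != known_truth.strip().lower():
            return False  # failed the known word: ignore the other answer
        guess = unknown_answer.strip().lower()
        self.votes[unknown_id][guess] += 1
        top_answer, count = self.votes[unknown_id].most_common(1)[0]
        if count >= self.agreement_threshold:
            self.transcribed[unknown_id] = top_answer
        return True


# Usage: "scan-17" is a made-up id for a word the OCR software could not read.
challenge = WordPairChallenge()
bot_passed = challenge.submit("xqzv", "morning", "scan-17", "xxxx")
human_results = [
    challenge.submit("morning", "morning", "scan-17", answer)
    for answer in ["upon", "upon", "upor", "Upon"]
]
print(bot_passed, human_results, challenge.transcribed)
```

In this sketch a user who fails the known word is rejected and contributes nothing, while a single human typo ("upor") is outvoted by the three matching answers.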

Von Ahn is now using reCAPTCHA and thousands of websites to transcribe text for The New York Times and the Internet Archive. The New York Times work is proceeding at the rate of a year’s editions a month, but will speed up soon; about 70 years’ editions will be transcribed over the course of next year.

Von Ahn’s next job is to look at old radio programmes. He has also devised clever computer games that people enjoy playing, and has been using them to caption images. Google has licensed one such game to improve its image search. The next time you do a boring job on the Internet or get hooked on a game, there is a good chance that someone is using you to do some useful work.
