Google Now Indexes Scanned Documents

Google has announced that it is now indexing scanned documents. It can now perform Optical Character Recognition (OCR) on any scanned document stored in Adobe's PDF format. OCR technology converts a picture (of a thousand words) into a thousand words -- words that can be searched and indexed, so that these valuable documents are more easily found.

This is a small but important step forward in making all the world's information accessible and useful.

Scanning is the reverse of printing. Printing turns digital words into text on paper, while scanning makes a digital picture of the physical paper (and text) so you can store and view it on a computer. The scanned picture of the text is not quite the same as the original digital words, however -- it is a picture of the printed words. Often you can see telltale signs: the ring of a coffee cup, ink smudges, or even fold creases in the pages.

To people reading these documents, the distinction between words and pictures of words makes little difference, but for a computer the picture is almost unintelligible. Consider a circle. Should it be read as a zero, the letter 'O', just a circle, or the ring from my coffee cup? People learn to answer this kind of question very quickly, but for the computer it is a painstaking and error-prone process.
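To make the circle example concrete, here is a toy sketch in Python of one way a program might guess what a circular shape means by looking at the characters next to it. The function name `read_circle` and its rules are made up for illustration; this is not how Google's system works, and real OCR engines rely on far richer evidence such as shape, size, fonts, and dictionaries.

```python
def read_circle(left: str, right: str) -> str:
    """Guess how to read a circular glyph from its neighbouring characters.

    Hypothetical helper for illustration only. Returns "0", "O", or "?"
    when the context does not settle the question.
    """
    # Ignore missing neighbours and whitespace on either side.
    neighbours = [c for c in (left, right) if c and not c.isspace()]

    if not neighbours:
        # Nothing nearby: could be a plain circle or a coffee ring.
        return "?"
    if all(c.isdigit() for c in neighbours):
        # Surrounded by digits, so a zero is the likelier reading.
        return "0"
    if all(c.isalpha() for c in neighbours):
        # Surrounded by letters, so the letter 'O' is the likelier reading.
        return "O"
    # Mixed context, for example a part number: still ambiguous.
    return "?"


print(read_circle("1", "5"))  # digits on both sides
print(read_circle("C", "D"))  # letters on both sides
print(read_circle("", ""))    # an isolated circle
```

Even this simple rule has to give up on mixed or missing context, which hints at why the process is error-prone for a computer.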

To see the new system at work, click on these search queries. Note the document excerpt in the search results, along with the full text presented after the 'View as HTML' link:

[repairing aluminum wiring]
[spin lock performance]

Read the complete post at Official Google Blog: A picture of a thousand words?
