
Michael has a large pile of handwritten notepads that he would like to convert into Microsoft Word documents


Your handwritten notes would be more useful in Microsoft Word format because you could do lots of things with them.


I have many A4 pads of handwritten notes, which I would like to convert into Microsoft Word documents. To type them all in would take a very long time. I've noticed that Google's ability to read text from photos has vastly improved in recent months. Are you aware of a tool from Google or anyone else that can do a good job of this, please? Michael


The idea of converting written or printed text into digital text is generally called OCR, for optical character recognition, and it has similar problems to speech recognition. That is to say, if the input is close to perfect, the output can also be close to perfect.


But in practice, it works best when dealing with restricted inputs and/or limited domains. For example, it's possible to recognise the English names for numbers and the names of major UK cities, especially if you can get people to write each letter in its own little box. The same software wouldn't have the domain expertise to cope with a Russian-speaking coroner who liked to include Sanskrit quotations in his handwritten autopsies.


Handwriting matters

OCR works best with high-quality printed materials and worst of all with handwriting, so you're not starting from the best position. In my experience, you can only get handwriting recognition to work well enough by doing it in real time. That enables you to train the software to recognise your input, while the software also trains you to write characters in ways that it can understand. I've had some success with this approach, starting more than a decade ago with Microsoft OneNote (which can also record your voice in sync) running on Windows XP Tablet Edition, and more recently with a Livescribe Echo digital pen and MyScript software. However, all this has more to do with keyboard replacement strategies than with OCR.


It's generally agreed that the best OCR programs are Abbyy FineReader and Nuance's OmniPage Ultimate, though neither is suitable for cursive handwriting recognition. Both companies offer free trial versions so you can test them before you splash out. There's also CharacTell's SoftWriting, which the company says is for students taking notes in class and professionals taking notes in meetings. But it also says it is designed "for recognising non-connected handwriting and machine-printed text" (their emphasis), so I wouldn't bet on it reading your handwritten notes.


Like most if not all the programs in this field, SoftWriting has to be trained to recognise your handwriting. When it is processing a document, it will present you with words it doesn't recognise, so that you can tell it what they are. If you have 250 words on a page and the program miraculously gets 90% of them right, you will still have to correct 25 words.
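To see how quickly those corrections add up, here is a minimal Python sketch of the arithmetic above. The function name `words_to_correct` is made up for illustration, and the 500-page figure in the usage note is only an example taken from the upper end of the page range mentioned later in this article.

```python
def words_to_correct(words_per_page: int, accuracy_percent: int, pages: int = 1) -> int:
    """Estimate how many words need manual correction.

    accuracy_percent is the share of words the OCR program gets right,
    as a whole number from 0 to 100. The result is rounded up, because
    a partly wrong word still has to be fixed.
    """
    if not 0 <= accuracy_percent <= 100:
        raise ValueError("accuracy_percent must be between 0 and 100")
    if words_per_page < 0 or pages < 0:
        raise ValueError("words_per_page and pages must not be negative")
    # Work in whole numbers to avoid floating-point rounding surprises.
    wrong_hundredths = words_per_page * pages * (100 - accuracy_percent)
    return (wrong_hundredths + 99) // 100  # ceiling division by 100


if __name__ == "__main__":
    print(words_to_correct(250, 90))             # one page
    print(words_to_correct(250, 90, pages=500))  # a 500-page pile
```

With the article's numbers, one page gives 25 corrections. If every one of 500 pages behaved the same way, that would be 12,500 words to fix by hand.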


If you want to try a few pages as an experiment, then you can download FreeOCR for Windows, though be careful not to install any crapware that may be included. FreeOCR is based on the widely used Tesseract OCR engine, which was originally developed by Hewlett-Packard in England in the 1980s. HP made it open source in 2005, and Google now maintains the source code.


You can also use FreeOCR online by uploading PDF files to free-ocr.com. Google Docs and various other services also use the same Tesseract OCR engine.


Wikipedia warns that "Tesseract's output will be very poor quality if the input images are not preprocessed to suit it: Images (especially screenshots) must be scaled up such that the text x-height is at least 20 pixels, any rotation or skew must be corrected or no text will be recognized, low-frequency changes in brightness must be high-pass filtered, or Tesseract's binarization stage will destroy much of the page, and dark borders must be manually removed, or they will be misinterpreted as characters".
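The first requirement in that warning, an x-height of at least 20 pixels, comes down to a simple scale factor. The Python sketch below covers only that step. The helper names are hypothetical, and you would have to measure the x-height (the height of a lowercase "x") in your own scans. Deskewing, high-pass filtering and border removal need an image-processing library and are not shown.

```python
import math

MIN_X_HEIGHT_PX = 20  # threshold from the Wikipedia warning quoted above


def upscale_factor(measured_x_height_px: float, target_px: int = MIN_X_HEIGHT_PX) -> float:
    """Return the factor by which an image should be enlarged.

    A factor of 1.0 means the text is already tall enough, so the
    image never gets shrunk.
    """
    if measured_x_height_px <= 0:
        raise ValueError("measured_x_height_px must be positive")
    return max(1.0, target_px / measured_x_height_px)


def scaled_size(width: int, height: int, factor: float) -> tuple[int, int]:
    """Return the new (width, height) in pixels, rounded up."""
    return math.ceil(width * factor), math.ceil(height * factor)


if __name__ == "__main__":
    factor = upscale_factor(8)             # text measured at 8 px tall
    print(factor)                          # 2.5
    print(scaled_size(800, 600, factor))   # (2000, 1500)
```

The resulting size could then be passed to whatever resizing tool you use. Scanning at a higher resolution in the first place is usually a better fix than enlarging a small image afterwards.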


PDFs and scanners

Your handwritten notes would be more useful in Microsoft Word format because you could do lots of things with them. For example, you could change the typeface, size and spacing, correct and amend your notes, add illustrations, and so on. But unless you have extremely neat, clear and very consistent handwriting, that probably won't be possible. Instead, think about converting them to high-quality, scanned PDF files that you can store on a hard drive or in the cloud.


You can feed these PDF files to OCR software and hope that it will recognise enough words to make your notes searchable. If not, you will probably have to tag them manually. Either way, if someone does come up with an OCR program that can read your handwriting - not impossible, though I've already waited 30 years for one - you will be ready with sharp PDF files, rather than curling originals where the paper has aged and the ink has faded.


Of course, if you are going to scan your notes then you must already have a scanner, or be prepared to buy one. A cheap Epson or Canon flat-bed scanner should give good results, though it is time-consuming to scan a lot of pages. If you intend to do a lot of scanning, consider a sheet-fed model like the Brother ADS-2100. You can also get scanners that include OCR, such as the Fujitsu ScanSnap iX500 Duplex, which scans both sides of the paper at once. (The scanner's OCR software usually runs on your PC.)


Scanning services

If you have to buy a decent scanner and perhaps good quality OCR software for a one-off project, add up the cost and divide it by the number of pages of notes to find the cost per page. It's a boring job, so perhaps you should add the cost of your time. The result might prompt you to abandon the whole idea, or start looking for a company to do it for you.
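The calculation described above can be sketched in a few lines of Python. Every figure in the usage example (a 300 pound outlay, 500 pages, one minute per page, 12 pounds an hour for your time) is a made-up assumption for illustration, not a real price. The function name is hypothetical too.

```python
def cost_per_page_pence(
    equipment_pence: int,
    pages: int,
    minutes_per_page: float = 0.0,
    hourly_rate_pence: int = 0,
) -> float:
    """Return the do-it-yourself cost per page, in pence.

    equipment_pence covers the scanner plus any OCR software.
    The time cost is optional: leave the last two arguments at zero
    to ignore the value of your own time.
    """
    if pages <= 0:
        raise ValueError("pages must be positive")
    # Total time cost: pages x minutes each x hourly rate, converted to hours.
    time_cost_pence = pages * minutes_per_page * hourly_rate_pence / 60
    return (equipment_pence + time_cost_pence) / pages


if __name__ == "__main__":
    # Hypothetical figures only: 300 pounds of kit, 500 pages of notes.
    print(cost_per_page_pence(30_000, 500))           # equipment only
    print(cost_per_page_pence(30_000, 500, 1, 1200))  # plus your time
```

Under those assumed figures the equipment alone works out at 60p per page, and 80p once your time is counted. Substitute your own prices and page count before drawing any conclusion.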


Most of the companies that provide scanning services cater for businesses that need to clear away large volumes of paper records. However, some cater for low-volume and home users. One example is Oxford-based Scanning Geeks, which charges 25p per page for documents up to A3 in size. (One page means one side of a page.) They can do OCR ("Textual Data Capture") as well. Ideally, find a good local company where you can drop off your notes securely and collect them afterwards.


It's an expensive route if you have lots of paper: it could cost £3,000 to scan the contents of a four-drawer filing cabinet. But if you only have 100 to 500 pages of notes to scan, it could be the best option.
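For a rough sense of scale, the Python sketch below applies the 25p-per-page price quoted above to the 100 to 500 page range. It also works backwards from the 3,000 pound filing-cabinet figure, on the assumption (not stated in the article) that the same flat rate applies with no extras or volume discounts. The function names are made up for illustration.

```python
PENCE_PER_PAGE = 25  # the Scanning Geeks price quoted above (one side = one page)


def service_cost_pounds(pages: int, pence_per_page: int = PENCE_PER_PAGE) -> float:
    """Return the cost in pounds of having `pages` sides scanned."""
    if pages < 0:
        raise ValueError("pages must not be negative")
    return pages * pence_per_page / 100


def pages_for_budget(budget_pounds: int, pence_per_page: int = PENCE_PER_PAGE) -> int:
    """Return how many whole pages a budget covers at a flat per-page rate."""
    if pence_per_page <= 0:
        raise ValueError("pence_per_page must be positive")
    return budget_pounds * 100 // pence_per_page


if __name__ == "__main__":
    print(service_cost_pounds(100))  # 25.0
    print(service_cost_pounds(500))  # 125.0
    print(pages_for_budget(3000))    # 12000
```

So 100 to 500 sides would cost between 25 and 125 pounds at that rate, while 3,000 pounds would correspond to about 12,000 sides if the flat-rate assumption held.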





