Michael has a large pile of handwritten notepads that he would like to convert into Microsoft Word documents
Your handwritten notes would be more useful in Microsoft Word format because you could do lots of things with them.
I have many A4 pads of handwritten notes, which I would like to convert into Microsoft Word documents. To type them all in would take a very long time. I've noticed that Google's ability to read text from photos has vastly improved in recent months. Are you aware of a tool from Google or anyone else that can do a good job of this, please? Michael
The idea of converting written or printed text into digital text is generally called OCR, for optical character recognition, and it has similar problems to speech recognition. That is to say, if the input is close to perfect, the output can also be close to perfect.
But in practice, it works best when dealing with restricted inputs and/or limited domains. For example, it's possible to recognise the English names for numbers and the names of major UK cities, especially if you can get people to write each letter in its own little box. The same software wouldn't have the domain expertise to cope with a Russian-speaking coroner who liked to include Sanskrit quotations in his handwritten autopsies.
Handwriting matters
OCR works best with high-quality printed materials and worst of all with handwriting, so you're not starting from the best position. In my experience, you can only get handwriting recognition to work well enough by doing it in real time. That enables you to train the software to recognise your input, while the software also trains you to write characters in ways that it can understand. I've had some success with this approach, starting more than a decade ago with Microsoft OneNote (which can also record your voice in sync) running on Windows XP Tablet Edition, and more recently with a Livescribe Echo digital pen and MyScript software. However, all this has more to do with keyboard replacement strategies than with OCR.
It's generally agreed that the best OCR programs are Abbyy FineReader and Nuance's OmniPage Ultimate, though neither is suitable for cursive handwriting recognition. Both companies offer free trial versions so you can test them before you splash out. There's also CharacTell's SoftWriting, which the company says is for students taking notes in class and professionals taking notes in meetings. But it also says it is designed "for recognising non-connected handwriting and machine-printed text" (their emphasis), so I wouldn't bet on it reading your handwritten notes.
Like most if not all the programs in this field, SoftWriting has to be trained to recognise your handwriting. When it is processing a document, it will present you with words it doesn't recognise, so that you can tell it what they are. If you have 250 words on a page and the program miraculously gets 90% of them right, you will still have to correct 25 words.
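To put numbers on that correction burden, here is a minimal Python sketch. The function name `words_to_correct` is purely illustrative and has nothing to do with SoftWriting itself; it just repeats the arithmetic above for any page count.

```python
def words_to_correct(words_per_page: int, accuracy: float, pages: int = 1) -> int:
    """Estimate how many words need manual correction.

    accuracy is the fraction of words recognised correctly (0.0 to 1.0).
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    # Round per page so floating-point noise does not turn 25 into 24.
    errors_per_page = round(words_per_page * (1.0 - accuracy))
    return errors_per_page * pages


print(words_to_correct(250, 0.90))  # 25
```

Multiply that by a few hundred pages and the size of the proofreading job becomes clear, even at an accuracy that handwriting recognition is unlikely to reach.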
If you want to try a few pages as an experiment, then you can download FreeOCR for Windows, though be careful not to install any crapware that may be included. FreeOCR is based on the widely used Tesseract OCR engine, which was originally developed by Hewlett-Packard in England in the 1980s. HP made it open source in 2005, and Google now maintains the source code.
You can also use FreeOCR online by uploading PDF files to free-ocr.com. Google Docs and various other services also use the same Tesseract OCR engine.
Wikipedia warns that "Tesseract's output will be very poor quality if the input images are not preprocessed to suit it: Images (especially screenshots) must be scaled up such that the text x-height is at least 20 pixels, any rotation or skew must be corrected or no text will be recognized, low-frequency changes in brightness must be high-pass filtered, or Tesseract's binarization stage will destroy much of the page, and dark borders must be manually removed, or they will be misinterpreted as characters".
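The first of those requirements is simple arithmetic. Below is a minimal sketch, assuming you have measured the x-height of the text in your scan yourself. The function names and the 8-pixel example are hypothetical, and the code only works out how much to enlarge an image; it does not call Tesseract or handle skew, brightness or borders.

```python
import math

MIN_X_HEIGHT_PX = 20  # minimum x-height quoted in the Wikipedia warning


def upscale_factor(measured_x_height_px: float,
                   target_px: float = MIN_X_HEIGHT_PX) -> float:
    """Return the enlargement factor needed for the x-height to reach target_px."""
    if measured_x_height_px <= 0:
        raise ValueError("x-height must be positive")
    # Never shrink an image that is already large enough.
    return max(1.0, target_px / measured_x_height_px)


def scaled_size(width: int, height: int, factor: float) -> tuple:
    """Return the new (width, height) in pixels, rounded up."""
    return math.ceil(width * factor), math.ceil(height * factor)


factor = upscale_factor(8)             # text measured at 8 pixels high
print(factor)                          # 2.5
print(scaled_size(800, 600, factor))   # (2000, 1500)
```

The resize itself would then be done in whatever image editor or library you already use.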
PDFs and scanners
Your handwritten notes would be more useful in Microsoft Word format because you could do lots of things with them. For example, you could change the typeface, size and spacing, correct and amend your notes, add illustrations, and so on. But unless you have extremely neat, clear and very consistent handwriting, that probably won't be possible. Instead, think about converting them to high-quality, scanned PDF files that you can store on a hard drive or in the cloud.
You can feed these PDF files to OCR software and hope that it will recognize enough words to make your notes searchable. If not, you will probably have to tag them manually. Either way, if someone does come up with an OCR program that can read your handwriting - not impossible, though I've already waited 30 years for one - you will be ready with sharp PDF files, rather than curling originals where the paper has aged and the ink has faded.
Of course, if you are going to scan your notes then you must already have a scanner, or be prepared to buy one. A cheap Epson or Canon flat-bed scanner should give good results, though it is time-consuming to scan a lot of pages. If you intend to do a lot of scanning, consider a sheet-fed model like the Brother ADS-2100. You can also get scanners that include OCR, such as the Fujitsu ScanSnap iX500 Duplex, which scans both sides of the paper at once. (The scanner's OCR software usually runs on your PC.)
Scanning services
If you have to buy a decent scanner and perhaps good quality OCR software for a one-off project, add up the cost and divide it by the number of pages of notes to find the cost per page. It's a boring job, so perhaps you should add the cost of your time. The result might prompt you to abandon the whole idea, or start looking for a company to do it for you.
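That sum can be sketched in a few lines of Python. All the figures in the example (a 300 pound scanner, 100 pounds of software, 500 pages) are made-up assumptions for illustration, as is the function name; substitute your own prices and page count.

```python
def diy_cost_per_page(scanner_cost: float, software_cost: float, pages: int,
                      minutes_per_page: float = 0.0,
                      hourly_rate: float = 0.0) -> float:
    """Cost per page of scanning it yourself, optionally including your time."""
    if pages <= 0:
        raise ValueError("pages must be positive")
    equipment = scanner_cost + software_cost
    # Value of the time spent feeding pages through the scanner.
    time_cost = pages * (minutes_per_page / 60.0) * hourly_rate
    return (equipment + time_cost) / pages


# Hypothetical prices: equipment only, then with one minute per page
# of your time valued at 12 pounds an hour.
print(diy_cost_per_page(300, 100, 500))  # 0.8
print(round(diy_cost_per_page(300, 100, 500, minutes_per_page=1, hourly_rate=12), 2))
```

The fewer pages you have, the more the fixed cost of the equipment dominates the result.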
Most of the companies that provide scanning services cater for businesses that need to clear away large volumes of paper records. However, some cater for low-volume and home users. One example is Oxford-based Scanning Geeks, which charges 25p per page for documents up to A3 in size. (One page means one side of a page.) They can do OCR ("Textual Data Capture") as well. Ideally, find a good local company where you can drop off your notes securely and collect them afterwards.
It's an expensive route if you have lots of paper: it could cost 3,000 pounds to scan the contents of a four-drawer filing cabinet. But if you only have 100 to 500 pages of notes to scan, it could be the best option.
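For a rough idea of what a service would charge for a pile of that size, here is a small sketch using the 25p per page rate quoted above. The function name is illustrative, and the calculation assumes a flat rate with no minimum charge, delivery fee or OCR surcharge, none of which the figures above confirm.

```python
PENCE_PER_SIDE = 25  # rate quoted above; one page means one side


def service_cost_pounds(sides: int, pence_per_side: int = PENCE_PER_SIDE) -> float:
    """Cost in pounds of having a service scan the given number of sides."""
    if sides < 0:
        raise ValueError("sides cannot be negative")
    # Work in whole pence first to avoid rounding surprises.
    return sides * pence_per_side / 100


print(service_cost_pounds(100))  # 25.0
print(service_cost_pounds(500))  # 125.0
```

Remember that notes written on both sides of the paper count as two pages each.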
could be the best option
the contents of a four-drawer filing cabinet
could cost 3,000 pounds to scan
an expensive route if you have lots of paper
where you can drop off your notes securely and collect them afterwards
ideally, find a good local company
textual data capture
one page means one side of a page
which charges 25p per page for documents up to A3 in size
some cater for low-volume and home users
to clear away large volumes of paper records
provide scanning services cater for businesses
The result might prompt you to abandon the whole idea
you should add the cost of your time
add up the cost and divide it by the number of pages of notes to find the cost per page
for a one-off project
a decent scanner
runs on your PC
which scans both sides of the paper at once
consider a sheet-fed model
intend to do a lot of scanning
it is time-consuming to scan a lot of pages
should give good results
A cheap Epson or Canon flat-bed scanner
be prepared to buy one
to scan your notes
where the paper has aged and the ink has faded
rather than curling originals
will be ready with sharp PDF files
have already waited 30 years for one
does come up with an OCR program that can read your handwriting - not impossible
have to tag them manually
to make your notes searchable
can feed these PDF files to OCR software
can store on a hard drive or in the cloud
scanned PDF files
very consistent handwriting
amend your notes
could change the typeface, size and spacing
they will be misinterpreted as characters
must be manually removed
Tesseract's binarization stage will destroy much of the page
low-frequency changes in brightness must be high-pass filtered
any rotation or skew must be corrected
must be scaled up such that the text x-height
are not preprocessed to suit it
warns that Tesseract's output will be very poor quality
various other services
maintains the source code
which was originally developed by Hewlett-Packard
the widely used Tesseract OCR engine
be careful not to install any crapware that may be included
as an experiment
miraculously gets 90% of them right
will present you with words it doesn't recognize
when it is processing a document
has to be trained to recognize your handwriting
Like most if not all the programs in this field
I wouldn't bet on it
their emphasis
it is designed for recognizing non-connected handwriting
before you splash out
though neither is suitable for cursive handwriting recognition
has more to do with keyboard replacement strategies than with OCR
record your voice in sync
have had some success with this approach
the software also trains you to write characters in ways that it can understand
enables you to train the software to recognize your input
by doing it in real time
starting from the best position
worst of all with handwriting
works best with high-quality printed materials
in his handwritten autopsies
the domain expertise to cope with a Russian-speaking coroner
to write each letter in its own little box
is possible to recognize the English names
restricted inputs
when dealing with restricted inputs and limited domains
the input is close to perfect
speech recognition
generally called OCR for optical character recognition
converting written or printed text into digital text
are you aware of a tool
has vastly improved in recent months
could do lots of things with them
would like to convert into Microsoft Word documents
a large pile of handwritten notepads