OCR (Optical Character Recognition) and its application

2021/09/21 16:43:14 technology

What is Optical Character Recognition?



Optical Character Recognition (OCR) is a technology that converts PDF, Word, and Excel files or images of text into machine-encoded (structured) text that software and AI tools can process.


With OCR, large volumes of paper documents in many formats and forms can be digitized into machine-readable text, which not only makes storage easier but also makes the data convenient to enter, retrieve, and analyze in various systems.


Imagine how many file boxes full of documents are in the basement of a city, government, university, or hospital.



How does OCR work?

Variations in fonts and in how individual characters are written make this a challenging problem. Before an OCR algorithm is applied, the image must be preprocessed so that it can be "read".


• Preprocessing

OCR software usually "preprocesses" the image to increase the chance of successful recognition.

Techniques include:

1. De-skew (correction)

If the document was not aligned properly when scanned, it may need to be rotated a few degrees clockwise or counterclockwise so that the lines of text are perfectly horizontal or vertical.

2. Remove noise

Remove stray specks and smooth edges.

3. Binarization

Convert the image to black and white (called a "binary image" because there are only two colors). The task of binarization is to separate the text from the background in a simple and accurate way.

4. Eliminate lines

Clean up boxes and lines that are not part of any character.

5. Layout analysis or "partition"

Identifies columns, paragraphs, headings, etc. as distinct blocks. This is especially useful for multi-column layouts and tables.

6. Line and word detection

Establish the baseline for word and character shapes, and separate words as needed.

7. Script recognition

In multi-language documents, the script may change at the word level, so identifying the script is essential before the appropriate script-specific OCR is applied.

8. Character isolation or "segmentation"

For per-character OCR, characters that have become joined in the image must be separated, and single characters that artifacts have broken into several pieces must be rejoined.

9. Normalize

Normalize aspect ratio and scale.
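As an illustration of step 3, binarization can be sketched with a global threshold. The sketch below assumes Otsu's method, which this article does not name, and a grayscale image given as nested lists of integers from 0 (black) to 255 (white). The function names are hypothetical, and a real OCR package may well use a different approach.

```python
def otsu_threshold(gray):
    """Pick the 0-255 threshold that best separates dark pixels from light ones."""
    hist = [0] * 256
    for row in gray:
        for p in row:
            hist[p] += 1

    total = sum(hist)
    sum_all = sum(i * hist[i] for i in range(256))
    best_t, best_var = 0, -1.0
    count_dark, sum_dark = 0, 0
    for t in range(256):
        count_dark += hist[t]
        sum_dark += t * hist[t]
        count_light = total - count_dark
        if count_dark == 0 or count_light == 0:
            continue  # one class is empty, so there is nothing to separate
        mean_dark = sum_dark / count_dark
        mean_light = (sum_all - sum_dark) / count_light
        # Between-class variance (up to a constant factor); larger is better.
        between_var = count_dark * count_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(gray):
    """Return a matrix with 1 for dark (text) pixels and 0 for light (background)."""
    t = otsu_threshold(gray)
    return [[1 if p <= t else 0 for p in row] for row in gray]


if __name__ == "__main__":
    scan = [
        [250, 245, 30, 240],
        [248, 20, 25, 243],
        [251, 247, 35, 249],
    ]
    for row in binarize(scan):
        print(row)
```

The output uses 1 for black and 0 for white, the same encoding as the binary matrix described later in the article.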


• Feature extraction

There are two main methods for extracting features in OCR:

1. The feature detection algorithm defines characters by evaluating the lines and strokes of the characters.

2. The working principle of pattern recognition is to recognize the entire character.

We can identify a line of text by searching for a line of white pixels with black pixels in the middle. Similarly, we can identify where the characters start and end.
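The search for white rows and columns described above can be sketched as follows. This is a minimal illustration that assumes the page is already a binary matrix (1 for black, 0 for white) stored as nested lists; the function names are hypothetical.

```python
def find_runs(has_ink):
    """Return (start, end) pairs for consecutive True entries; end is exclusive."""
    runs, start = [], None
    for i, ink in enumerate(has_ink):
        if ink and start is None:
            start = i
        elif not ink and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(has_ink)))
    return runs


def find_text_lines(binary):
    """A text line is a run of rows containing black pixels, bounded by all-white rows."""
    return find_runs([any(row) for row in binary])


def find_characters(binary, top, bottom):
    """Within rows top..bottom-1, a character is a run of columns containing black pixels."""
    width = len(binary[0])
    columns = [any(binary[r][c] for r in range(top, bottom)) for c in range(width)]
    return find_runs(columns)


if __name__ == "__main__":
    page = [
        [0, 0, 0, 0, 0, 0],
        [0, 1, 1, 0, 1, 0],
        [0, 1, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 1, 1, 0, 0],
        [0, 0, 0, 0, 0, 0],
    ]
    for top, bottom in find_text_lines(page):
        print("line rows", (top, bottom), "characters", find_characters(page, top, bottom))
```

Touching or overlapping characters would not be split by this simple approach, which is why the segmentation step in preprocessing matters.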


The following figure shows the visual effects of these methods:



(Method 1: feature detection and pattern recognition for a line of text)



(Method 2: Pattern recognition of a single character)


Next, we convert the image of the character into a binary matrix in which each white pixel is 0 and each black pixel is 1, as shown in the following figure:



(sample of binary matrix)


Then, using the distance formula, we can find the distance from the center of the matrix to the farthest 1.



(distance formula)


Then we draw a circle with that radius and divide it into finer-grained sections.
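One possible reading of these two steps is sketched below: measure the Euclidean distance from the matrix center to each black pixel, take the largest as the radius, then split that radius into concentric rings and record the share of black pixels in each ring. The choice of rings (rather than, say, angular sectors), the ring count, and the function names are assumptions made for illustration.

```python
import math


def ink_distances(matrix):
    """Euclidean distance from the matrix center to every black (1) pixel."""
    rows, cols = len(matrix), len(matrix[0])
    center_r, center_c = (rows - 1) / 2, (cols - 1) / 2
    return [
        math.hypot(r - center_r, c - center_c)
        for r in range(rows)
        for c in range(cols)
        if matrix[r][c] == 1
    ]


def farthest_ink_distance(matrix):
    """Radius of the smallest center-anchored circle that covers all black pixels."""
    return max(ink_distances(matrix), default=0.0)


def ring_features(matrix, rings=4):
    """Fraction of black pixels that falls in each of `rings` concentric rings."""
    dists = ink_distances(matrix)
    if not dists:
        return [0.0] * rings
    radius = max(dists)
    counts = [0] * rings
    for d in dists:
        # Pixels exactly on the outer circle go into the last ring.
        idx = 0 if radius == 0 else min(int(d / radius * rings), rings - 1)
        counts[idx] += 1
    return [n / len(dists) for n in counts]


if __name__ == "__main__":
    plus_sign = [
        [0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0],
        [1, 1, 1, 1, 1],
        [0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0],
    ]
    print(farthest_ink_distance(plus_sign))
    print(ring_features(plus_sign, rings=4))
```

Because the features are fractions of the character's own pixels measured against its own radius, they change little when the same shape is drawn at a different size.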


At this stage, the algorithm compares each segment with a database of matrices representing characters in different fonts to determine, statistically, the most likely character.


By processing every line and every character in this way, printed documents and other unstructured data sources can be brought into the digital world.



(compare each segment with the matrix database)
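The comparison against a matrix database can be illustrated with a very small nearest-template matcher. This sketch swaps in a simple technique, the Hamming distance (number of differing pixels) between whole binary matrices, in place of the per-segment statistics described above. The 3x3 templates are toy data invented for the example.

```python
# Toy "database": one 3x3 binary matrix per character (hypothetical data).
TEMPLATES = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
}


def hamming(a, b):
    """Count the pixels that differ between two same-sized binary matrices."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def classify(glyph, templates=TEMPLATES):
    """Return the label of the template with the fewest differing pixels."""
    return min(templates, key=lambda label: hamming(glyph, templates[label]))


if __name__ == "__main__":
    noisy_t = [[1, 1, 1], [0, 1, 0], [0, 0, 0]]  # a "T" with one pixel missing
    noisy_l = [[1, 0, 0], [1, 0, 0], [1, 1, 0]]  # an "L" with one pixel missing
    print(classify(noisy_t), classify(noisy_l))
```

A practical system would normalize each glyph to the template size first and would hold many templates per character to cover different fonts.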


• Post-processing

OCR accuracy can be improved if the output is constrained by a lexicon, that is, a list of words that are allowed to appear in the document. For example, the lexicon might be the specialized vocabulary of a specific field.
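A minimal sketch of such a vocabulary constraint, using the standard-library `difflib.get_close_matches` to snap each recognized word to its closest entry in the list. The word list, the similarity cutoff of 0.75, and the function name are assumptions chosen for the example.

```python
import difflib


def constrain_to_lexicon(words, lexicon, cutoff=0.75):
    """Replace each word with its closest lexicon entry when one is similar enough."""
    corrected = []
    for word in words:
        if word in lexicon:
            corrected.append(word)
            continue
        matches = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
        # Keep the original word when nothing in the lexicon is close enough.
        corrected.append(matches[0] if matches else word)
    return corrected


if __name__ == "__main__":
    invoice_terms = ["invoice", "total", "amount", "payment"]
    ocr_words = ["inv0ice", "tota1", "amount", "xyz"]
    print(constrain_to_lexicon(ocr_words, invoice_terms))
```

A higher cutoff makes the correction more conservative; a lower one fixes more errors but risks replacing words that were read correctly and are simply missing from the list.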


Free OCR libraries that apply such techniques to improve accuracy are available online.


The output can be a plain text stream or a file of characters, but more advanced OCR systems preserve the original page layout and produce, for example, a PDF that contains both the original page image and a searchable text layer.


• Error correction

"Near-neighbor analysis" uses co-occurrence frequencies to correct errors by noting that certain words often appear together. For example, "Washington, D.C." is far more common in English than "Washington DOC".
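The co-occurrence idea can be sketched with word-pair counts: given the previous word, pick the candidate reading that has followed it most often. The three-sentence corpus and the function names below are hypothetical; a real system would count pairs over a large body of text.

```python
from collections import Counter


def count_bigrams(sentences):
    """Count how often each pair of adjacent words appears."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


def pick_by_cooccurrence(previous_word, candidates, bigram_counts):
    """Choose the candidate seen most often right after previous_word."""
    return max(candidates, key=lambda word: bigram_counts[(previous_word, word)])


if __name__ == "__main__":
    corpus = [
        "Washington, D.C. is the capital",
        "She moved to Washington, D.C. last year",
        "Open the DOC file",
    ]
    counts = count_bigrams(corpus)
    # The OCR engine is unsure whether the second word is "DOC" or "D.C.".
    print(pick_by_cooccurrence("Washington,", ["DOC", "D.C."], counts))
```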


• Grammar

Knowledge of the grammar of the language being scanned can also help: for example, knowing whether a word is likely to be a verb or a noun allows greater accuracy.



OCR use cases

OCR engines have been developed into a range of domain-specific OCR applications, including receipts, invoices, checks, and legal documents.

• Data entry for business documents, such as checks, passports, invoices, bank statements, and receipts

• Automatic license plate recognition

• Passport recognition and information extraction at airports

• Automatic extraction of key information from insurance documents

• Extraction of business card information into a contact list

• Producing digital versions of large volumes of printed documents, such as book scanning

• Making electronic images of printed documents searchable, such as Google Books

• Real-time conversion of handwriting to control a computer (pen computing)


OCR use cases by industry

• Banking

➢ The banking industry, like other financial sectors such as insurance and securities, is a major consumer of OCR.

➢ The most common use of OCR is in check processing:

➢ Handwritten checks are scanned

➢ The contents are converted into digital text

➢ The signature is verified

➢ The check is cleared in real time

Although recognition of printed checks is close to 100% accurate (only the signature needs to be verified against a pre-existing database), full handwriting recognition still has a long way to go.


However, as deep learning methods are applied to handwriting OCR, the problem may not be as unsolvable as it seems.


From the payer to the bank to the payee, reducing check clearing processing time is an advantage for everyone.



• Law

Few industries can produce as much paperwork as the legal industry, so OCR has multiple applications here.


Even the simplest OCR reader can be used to digitize, store, index, and search all printed documents: affidavits, judgments, filings, declarations, wills, etc.


This technique is also suitable for records written in Chinese, Arabic, and other scripts.


For an industry that relies heavily on history, quickly obtaining legal documents from millions of past cases is undoubtedly an advantage.


• Healthcare

Another industry that works well with OCR is healthcare. The entire medical history can be scanned and stored on the computer: medical reports, x-rays, disease records, treatment or diagnosis, tests, hospital records, insurance payments, etc. These can all be accessed in one place and can be searched.


In fact, storing an entire hospital's records digitally has great benefits for epidemiology and logistics (maintaining proper stocks of medicines, equipment, and other consumables).



(OCR for pharmaceutical industry applications)

• Supply chain

In the food, beverage, pharmaceutical, and cosmetics industries, quality control at every link is vital for complying with safety and anti-counterfeiting requirements.


An item must remain under the control of the supply chain at every moment, with information available about its source and location.

Although product tracking is generally considered a barcode application, OCR lets you read the batch number, expiration date, and serial number to track the product at every stage of the packaging cycle, from packaging and labeling to palletizing.

Barcodes and OCR are often used together to maximize the accuracy of information collection.
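Once a label has been read, the batch number, expiration date, and serial number still have to be picked out of the recognized text. The sketch below assumes a hypothetical label layout ("LOT: ...", "EXP: YYYY-MM-DD", "SN: ...") and uses regular expressions; real labels follow whatever format the manufacturer or regulator prescribes. It also shows a simple cross-check of the OCR result against a barcode value.

```python
import re
from datetime import date

# Patterns for a hypothetical label layout; adjust to the real label format.
BATCH_RE = re.compile(r"\bLOT[:\s]+([A-Z0-9]+)")
EXPIRY_RE = re.compile(r"\bEXP[:\s]+(\d{4})-(\d{2})-(\d{2})")
SERIAL_RE = re.compile(r"\bSN[:\s]+(\d+)")


def parse_label(ocr_text):
    """Extract batch, expiry, and serial fields from OCR output, when present."""
    fields = {}
    batch = BATCH_RE.search(ocr_text)
    if batch:
        fields["batch"] = batch.group(1)
    expiry = EXPIRY_RE.search(ocr_text)
    if expiry:
        year, month, day = map(int, expiry.groups())
        try:
            fields["expiry"] = date(year, month, day)
        except ValueError:
            pass  # a misread produced an impossible date, so leave it out
    serial = SERIAL_RE.search(ocr_text)
    if serial:
        fields["serial"] = serial.group(1)
    return fields


def matches_barcode(fields, barcode_batch):
    """Cross-check: the batch read by OCR should equal the one from the barcode."""
    return fields.get("batch") == barcode_batch


if __name__ == "__main__":
    label = "LOT: A1234\nEXP: 2025-06-30\nSN: 000123456"
    fields = parse_label(label)
    print(fields, matches_barcode(fields, "A1234"))
```

Flagging a mismatch between the two readings for manual review is one way the two technologies can reinforce each other.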



Of course, international freight forwarding also involves many documents during delivery: packing lists, bills of lading, invoices, SI, health certificates, arrival notices, declaration elements, VGM documents, receipts, bank slips, etc. All of these appear as unstructured data, which can be recognized and structured through OCR.




Benefits of OCR

Powerful:

You can save your files in formats such as .pdf, and OCR helps convert them into readable text. These files can then be easily searched and used by any system.


Editability:

You may want to amend an old contract written a few years ago, or revise an old will. After using OCR to digitize the file, you can easily edit it with a word processor instead of retyping the entire document.


Accessibility:

OCR-scanned files can be made accessible in a shared database. This is especially useful for banks, because they can view a customer's previous credit history anytime, anywhere.


Another purpose is to make government files public so that your land and property ownership records or your grandfather's birth certificate can be found instantly anywhere.


Storability:

Digitization reduces the space required for storage from an entire room (if not several) to bytes on a server, increasing productivity and saving space.


Backup:

Compared with keeping expensive paper copies, digital backups can be made very cheaply and in practically unlimited numbers.


Translatability:

Modern OCR can handle a large number of languages, from Arabic to Indic languages to Chinese. This means that papers in one language can be digitized, searched, and translated into any other language, which can greatly reduce the need for professional translation.


How OCR will help your business

OCR has several advantages as a means of digitization. A business often has a great deal of data and documents: contracts, waybills, government forms, permits, certificates, price lists, catalogs, etc.


After digitizing, you can compare them with several other digital documents. Therefore, by comparing documents, you can easily get the best prices, services, terms and conditions, etc.


By using OCR, you can check for differences from the original terms and conditions of a contract you signed. Similarly, amounts on checks can be verified, invoices can be compared, and so on.
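Once two versions of a contract have been converted to text, spotting changed terms becomes an ordinary text comparison. Here is a minimal sketch using the standard-library `difflib.unified_diff`; the contract wording and the function name are invented for the example.

```python
import difflib


def changed_lines(original_text, revised_text):
    """Return removed ('-') and added ('+') lines between two document versions."""
    diff = list(
        difflib.unified_diff(
            original_text.splitlines(),
            revised_text.splitlines(),
            fromfile="original",
            tofile="revised",
            lineterm="",
            n=0,  # no surrounding context lines, only the changes
        )
    )
    # Skip the two file-header lines, then drop the "@@" position markers.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


if __name__ == "__main__":
    signed = "Payment due in 30 days.\nPrice: 100 USD per unit."
    received = "Payment due in 60 days.\nPrice: 100 USD per unit."
    for line in changed_lines(signed, received):
        print(line)
```

OCR errors will also show up as differences, so in practice the flagged lines are a list for a person to review rather than proof of a changed term.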


In addition, digital documents can be accessed for up-to-date analysis, showing you where to improve, how to plan your taxes, and what your real financial situation is.


These are actually the advantages of digitalization. OCR may be a key step in digital transformation.




Thanks: Forough Karandish

Editor: Zhu Yapo

MBA, graduated from Beijing University of Science and Technology, Zhihong, Peking University, Singapore -Royce, JCI, Ariba and other international companies, co-founder of Shanghai Trend Research Technology.
