Developing a Korean-alphabet OCR application (1)
Background of Development
Although Korean character OCR programs are used in many products, most of these OCR systems are developed internally (not open source), perform poorly, or are very expensive. There also aren't many research papers discussing the performance of these OCR systems, unlike the situation for Chinese and English characters.
While designing an evaluation method for a Korean handwriting GAN project, I had a hard time finding quality open source OCR programs for Korean text, so I decided to make one myself.
The structure of the Korean alphabet (Han-geul) differs from that of other writing systems. The full set of characters is very large (11,000+), but unlike Chinese characters, each Korean character can be split into 3 components: Chosung (initial consonant), Jungsung (vowel), and Jongsung (final consonant). There are roughly 20~30 possible letters for each component (19, 21, and 27 respectively, with the Jongsung being optional). Because of this characteristic, developing a Korean-specialized OCR application may differ considerably from developing OCR for other languages.
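The three-way split can be computed directly from Unicode code points. Below is a minimal sketch, assuming the input is a precomposed Hangul syllable in the range U+AC00 to U+D7A3 (11,172 code points). The function and constant names are my own, chosen for illustration, and the Jongsung count of 28 includes the "no final consonant" case.

```python
from typing import Tuple

# Precomposed Hangul syllables start at U+AC00 in Unicode.
HANGUL_BASE = 0xAC00
NUM_CHOSUNG = 19    # initial consonants
NUM_JUNGSUNG = 21   # vowels
NUM_JONGSUNG = 28   # 27 final consonants + 1 slot for "no final consonant"
NUM_SYLLABLES = NUM_CHOSUNG * NUM_JUNGSUNG * NUM_JONGSUNG  # 11,172


def decompose(syllable: str) -> Tuple[int, int, int]:
    """Return (chosung, jungsung, jongsung) indices for one Hangul syllable."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < NUM_SYLLABLES:
        raise ValueError(f"{syllable!r} is not a precomposed Hangul syllable")
    chosung, rest = divmod(offset, NUM_JUNGSUNG * NUM_JONGSUNG)
    jungsung, jongsung = divmod(rest, NUM_JONGSUNG)
    return chosung, jungsung, jongsung


def compose(chosung: int, jungsung: int, jongsung: int) -> str:
    """Inverse of decompose: rebuild the syllable from its three indices."""
    offset = (chosung * NUM_JUNGSUNG + jungsung) * NUM_JONGSUNG + jongsung
    return chr(HANGUL_BASE + offset)


if __name__ == "__main__":
    print(decompose("한"))
    print(compose(*decompose("글")))
```

With this mapping, a model can predict three small labels (19, 21, and 28 classes) instead of one label out of 11,172, which is the idea the rest of this project explores.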
Existing Korean OCR APIs/Projects
Google Tesseract OCR: Tesseract is an open source OCR engine originally developed at Hewlett-Packard and later sponsored by Google. The engine is very powerful for English character detection, but it struggles to recognize Korean strings.
Naver CLOVA OCR API: An OCR engine by NAVER. The API is very expensive: the basic plan costs $6,000 for 100,000 images (about $0.06 per image).
GitHub projects: parksunwoo, MijeongJeon, Wongi-Choi1014. The second project uses a very noisy data augmentation process that brings the data closer to in-the-wild images, and reaches 87% test accuracy. The third project adds no noise and reports 97% test accuracy.
Other OCR Programs: Many applications, such as Korean search engines, word processors, and converters, have OCR functionality embedded in the software.
A Google search for ‘Korean OCR’ doesn’t turn up many quality GitHub projects or research papers.
Project Objective
- Compare multiple well-known deep learning model architectures for Korean OCR.
- Test whether splitting characters into the three components benefits training speed and performance.
- Compare performance against other applications and projects, and discuss improvements.
- Provide an open-source Korean OCR pipeline.
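To compare a split (three-component) model fairly against a model that predicts the whole character, the split model has to be scored at the character level: a character only counts as correct when all three components match. The following is a rough sketch of such a metric, assuming each prediction and target is a (Chosung, Jungsung, Jongsung) index triple. The function names are hypothetical and not taken from any of the projects listed above.

```python
from typing import List, Tuple

# One label per character: (chosung, jungsung, jongsung) indices.
Label = Tuple[int, int, int]


def _check(preds: List[Label], targets: List[Label]) -> int:
    """Validate the inputs and return the number of samples."""
    if len(preds) != len(targets) or not targets:
        raise ValueError("preds and targets must be non-empty and equally long")
    return len(targets)


def component_accuracy(preds: List[Label], targets: List[Label]) -> Tuple[float, float, float]:
    """Accuracy of each component on its own (chosung, jungsung, jongsung)."""
    n = _check(preds, targets)
    hits = [0, 0, 0]
    for pred, target in zip(preds, targets):
        for i in range(3):
            hits[i] += int(pred[i] == target[i])
    return hits[0] / n, hits[1] / n, hits[2] / n


def character_accuracy(preds: List[Label], targets: List[Label]) -> float:
    """Fraction of characters where all three components are correct."""
    n = _check(preds, targets)
    return sum(pred == target for pred, target in zip(preds, targets)) / n


if __name__ == "__main__":
    targets = [(18, 0, 4), (0, 0, 1), (1, 2, 3)]
    preds = [(18, 0, 4), (0, 0, 0), (1, 2, 3)]  # second Jongsung is wrong
    print(component_accuracy(preds, targets))
    print(character_accuracy(preds, targets))
```

Character accuracy can never exceed the lowest per-component accuracy, so reporting both helps show which component limits the split model.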