Tesseract OCR tutorial
How does Tesseract for OCR work?
Optical Character Recognition (OCR) is a technology used to convert an image of text into machine-readable text. The OCR engine or OCR software works by using the following steps:
- Preprocessing of the Image
- Text Localization
- Character Segmentation
- Character Recognition
- Post Processing
Extracting text from images or scanned documents is a critical requirement for various industries and applications. Whether it's converting scanned materials into editable formats, extracting information for analysis, or automating information retrieval, Optical Character Recognition (OCR) technology plays a pivotal role.
One powerful OCR solution that has been popular for a long time is Tesseract. Tesseract began as a Ph.D. research project at HP Labs in Bristol. It gained popularity and was developed by HP between 1984 and 1994. In 2005 HP released Tesseract as open-source software. From 2006 until November 2018 it was developed by Google. Currently, it is maintained by open-source developers around the world.
Tesseract provides developers with a robust and versatile toolset for integrating OCR capabilities into their applications.
In this article, we will explore Tesseract for OCR in detail. We will delve into its benefits, understand how it works, and provide code examples to demonstrate its usage. Whether you are a seasoned developer or just starting your journey, this article will equip you with the knowledge to use Tesseract effectively.
Let's start by highlighting the reasons why Tesseract stands out among other open-source OCR solutions in the market.
Why use Tesseract API?
First, let's see why you could use Tesseract for your projects:
1. Wide range of supported languages
One of the key benefits of Tesseract is its extensive language support. It can recognize text in over 100 languages. This multilingual capability makes Tesseract suitable for global applications and projects that have diverse language requirements.
2. An open-source solution
Tesseract is an open-source OCR engine, available under the Apache 2.0 license. This means that the software is freely available for commercial use. For developers, it also means that they can access the source code, modify it to suit their needs, and contribute to its improvement. The open-source nature of Tesseract fosters a collaborative community that continuously enhances its capabilities and ensures compatibility with different platforms. Since Google stopped maintaining Tesseract in 2018, it has been continuously maintained by open-source developers. The current major version, 5.0.0, was released on November 30, 2021.
3. Wrappers like Pytesseract
Various wrapper libraries and APIs have been built on top of Tesseract. The wrappers offer additional functionality and continuing support. They simplify the usage of Tesseract by providing a more user-friendly, high-level interface in popular programming languages like Python, Java, Go, etc. Pytesseract, for example, enables developers to easily integrate Tesseract OCR functionality into their Python applications, reducing the learning curve and making OCR implementation more accessible.
By leveraging wrappers like Pytesseract, developers can utilize Tesseract's powerful OCR capabilities without dealing with the intricacies of low-level API interactions. This abstraction layer allows for quicker development and prototyping, making Tesseract more approachable for semi-technical end users as well.
How does Tesseract work?
At the time of writing this article, Tesseract 5.3.2 is the latest version. From version 4.0.0 onwards, Tesseract uses an LSTM-based architecture.
Long Short-Term Memory (LSTM) is a special type of RNN architecture capable of learning long-term dependencies. By using a cell state and various gates, it provides a solution to the vanishing gradient problem that can occur when training traditional RNNs.
Legacy Tesseract 3.x depended on a multi-stage process in which we can distinguish the following steps:
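To make the cell state and the gates concrete, here is a minimal sketch of a single LSTM step for scalar values in plain Python. It follows the textbook LSTM update and is not Tesseract's actual network code; the function name lstm_step and all weight values are made up for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step for a scalar input x, hidden state h_prev and cell state c_prev.

    params maps each gate name to (w_x, w_h, bias); the values are illustrative only.
    """
    def pre_activation(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    forget = sigmoid(pre_activation("forget"))         # how much old cell state to keep
    write = sigmoid(pre_activation("input"))           # how much of the candidate to store
    expose = sigmoid(pre_activation("output"))         # how much cell state to output
    candidate = math.tanh(pre_activation("candidate"))

    # The cell state is updated additively, which helps gradients survive long sequences
    c = forget * c_prev + write * candidate
    h = expose * math.tanh(c)
    return h, c


# Hypothetical weights: every gate uses the same (w_x, w_h, bias)
params = {name: (0.5, 0.5, 0.0) for name in ("forget", "input", "output", "candidate")}
h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.5]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

When the forget gate is close to 1 and the input gate is close to 0, the cell state passes through a step almost unchanged, which is how information can be carried across many time steps.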
Fig: Tesseract 3.x process from paper
- Input: Tesseract takes an image with words as input, assuming it's already prepared with clear text regions.
- Connected Component Analysis: It breaks down the image into individual parts that make up letters and symbols.
- Blobs and Lines: These parts are grouped into blocks called "blobs," and blobs are organized into lines of text.
- Word Segmentation: Lines are split into individual words based on the spacing between characters.
- Two-step Recognition: Tesseract tries to read each word in two steps. In the first pass, it does its best to recognize words. Words that are recognized become training examples for a smarter, adaptive classifier in the second pass.
- Second Pass: In the second pass, Tesseract goes back to fix any mistakes it made in the first pass.
- Final Adjustments: It fine-tunes spacing between words and checks for small capital letters.
In a nutshell, Tesseract 3.x takes an image with text, figures out where the words are, tries to read them twice to improve accuracy, and makes final adjustments for better results.
The modernization of the Tesseract tool was an effort to clean up the code and add a new LSTM model. The input image is processed in boxes (rectangles), line by line, which are fed into the LSTM model to produce the output. In the image below we can visualize how it works.
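The connected component step above can be illustrated with a toy example. The sketch below labels groups of touching "ink" pixels in a tiny binary grid and returns one bounding box per group. It is a simplified illustration in plain Python, not Tesseract's implementation; the function name connected_components and the sample grid are made up.

```python
from collections import deque


def connected_components(grid):
    """Return bounding boxes (left, top, right, bottom) of 8-connected groups of 1-pixels."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search over all pixels touching this one
            queue = deque([(r, c)])
            seen[r][c] = True
            left, top, right, bottom = c, r, c, r
            while queue:
                y, x = queue.popleft()
                left, right = min(left, x), max(right, x)
                top, bottom = min(top, y), max(bottom, y)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and grid[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((left, top, right, bottom))
    return boxes


# 1 = ink, 0 = background; three separate shapes
page = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 1, 1, 0, 0],
]
blobs = connected_components(page)
print(blobs)  # [(0, 0, 1, 1), (4, 0, 4, 2), (1, 3, 2, 3)]
```

Each box corresponds to one blob; a real engine would then group such boxes into text lines and words using their positions and spacing.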
How to install the latest Tesseract
Installing Tesseract on Windows is easy with the precompiled binaries found here. Don't forget to edit the "path" environment variable and add the Tesseract path.
For Ubuntu
sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
For macOS
brew install tesseract
To check if everything went right in the previous steps, try the following on the command line:
tesseract --version
Below is the output on macOS, which should be similar on Ubuntu as well:
tesseract 5.3.2
leptonica-1.82.0
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.5.1) : libpng 1.6.40 : libtiff 4.5.1 : zlib 1.2.11 : libwebp 1.3.1 : libopenjp2 2.5.0
Support NEON
Found libarchive 3.6.2 zlib/1.2.11 liblzma/5.4.1 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.4
Found libcurl/7.86.0 SecureTransport (LibreSSL/3.3.6) zlib/1.2.11 nghttp2/1.47.0
Now, you can install the Python wrapper for Tesseract using pip in your environment.
pip install pytesseract
How to use the Tesseract library
As mentioned earlier, we can use the command line utility or the Tesseract API to integrate it into our C++ and Python applications. In the basic usage, we specify the following:
1. Input filename: We use test_image.jpg in the examples below
2. OCR language: The language in our basic examples is set to English (eng). On the command line and in pytesseract, the language is specified using the -l option. See the list of supported languages.
3. OCR Engine Mode (OEM): From Tesseract 4 onwards we have two OCR engines - 1) Legacy engine 2) Neural nets LSTM engine. There are four modes of operation to choose from using the --oem option.
- Legacy engine only (0)
- Neural nets LSTM engine only (1)
- Legacy + LSTM engines (2)
- Default, based on what is available (3)
4. Page Segmentation Mode (PSM): By default, Tesseract expects a page of text when it segments an image. If you're just seeking to OCR a small region, try a different segmentation mode, using the --psm argument. There are 14 modes available which can be found here. By default, Tesseract fully automates the page segmentation but does not perform orientation and script detection. In the examples below, we will stick with psm = 3 (i.e. PSM_AUTO). When PSM is not specified, it defaults to 3 in the CLI and pytesseract but to 6 in the C++ API.
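The options above can be collected into a single config string before it is handed to the CLI or to pytesseract. The helper below is a small sketch: build_config is a hypothetical name, not part of Tesseract or pytesseract, and it only checks that the mode numbers fall within the ranges described above.

```python
# OCR engine modes as listed above
OEM_MODES = {
    0: "Legacy engine only",
    1: "Neural nets LSTM engine only",
    2: "Legacy + LSTM engines",
    3: "Default, based on what is available",
}


def build_config(lang="eng", oem=3, psm=3):
    """Return a Tesseract option string such as '--oem 1 --psm 3 -l eng'."""
    if oem not in OEM_MODES:
        raise ValueError(f"oem must be one of {sorted(OEM_MODES)}, got {oem}")
    if psm not in range(14):  # the 14 page segmentation modes are numbered 0 to 13
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return f"--oem {oem} --psm {psm} -l {lang}"


print(build_config("eng", oem=1, psm=3))  # --oem 1 --psm 3 -l eng
```

The resulting string can be passed as the config argument of pytesseract.image_to_string, in the same way as the hand-written config strings in the examples that follow.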
Command Line Usage (CLI)
The example below shows how to perform OCR using the Tesseract CLI. The language is chosen to be English and the OCR engine mode is set to 1 (i.e. Neural nets LSTM only).
Output to ocr_text.txt:
tesseract test_image.jpg ocr_text -l eng --oem 1 --psm 3
Output to terminal:
tesseract test_image.jpg stdout -l eng --oem 1 --psm 3
OCR with OpenCV and pytesseract
Pytesseract is a Python wrapper for Tesseract. It can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others.
The basic usage requires us first to read the image using OpenCV and then pass the image to the image_to_string method of the pytesseract module along with the language.
# Install opencv and pytesseract in your Python environment
pip install opencv-python
pip install pytesseract
import cv2
import pytesseract

if __name__ == '__main__':
    img = cv2.imread("test_image.jpg")
    # OpenCV loads images as BGR, while pytesseract expects RGB
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    # define config parameters
    # -l eng for using English language (the default language is eng)
    # --oem 1 sets LSTM only mode
    config = r'--oem 1 -l eng --psm 3'
    text = pytesseract.image_to_string(img, config=config)
    print(text)
Preprocessing for Tesseract
There are a variety of reasons you might not get good-quality output from Tesseract. You need to preprocess the image before sending it to Tesseract.
This includes rescaling, noise removal, binarization, deskewing, etc. You will find the full list here.
Let’s do OCR on the above image:
img = cv2.imread("test.jpg")
text = str(pytesseract.image_to_string(img, config='--oem 1 --psm 6'))
print(text)
As you can see, without any preprocessing we already have a pretty good OCR extraction. Let's apply two simple preprocessing steps to try to improve accuracy: rescaling and converting to grayscale.
We now have an improved OCR extraction.
Multi-language support
Let's check the languages available by running the following command in the terminal:
tesseract --list-langs
You can download the .traineddata file for the language you need from here and place it in the /opt/homebrew/share/tessdata/ directory as shown in the above image (this should be the same as where the tessdata directory is installed) and it should be ready to use. You can export it:
export TESSDATA_PREFIX=/opt/homebrew/share/tessdata/
In my case, I have added kan.traineddata, chi_sim.traineddata, and spa.traineddata.
eng_path = "english.png"
eng_im = cv2.imread(eng_path)
cv2.imshow("english", eng_im)  # imshow needs a window name as its first argument
# No need to pass -l eng because eng is the default
config = r'--psm 6'
text = pytesseract.image_to_string(eng_im, config=config)
print(text)
Output:
Hello World!
spanish_path = "spanish.png"
spanish_im = cv2.imread(spanish_path)
cv2.imshow("spanish", spanish_im)
config = r'-l spa --psm 6'
text = pytesseract.image_to_string(spanish_im, config=config)
print(text)
Output:
¡Hola Mundo!
kan_path = "kan.png"
kan_im = cv2.imread(kan_path)
cv2.imshow("kannada", kan_im)
config = r'-l kan --psm 6'
text = pytesseract.image_to_string(kan_im, config=config)
print(text)
Output:
¡ಹಲೋ ವರ್ಲ್ಡ್!
We can also work with multiple languages.
kan_chi_path = "kan+chi.png"
kan_chi_im = cv2.imread(kan_chi_path)
cv2.imshow("kannada and chinese", kan_chi_im)
config = r'-l kan+chi_sim --psm 6'
text = pytesseract.image_to_string(kan_chi_im, config=config)
print(text)
Output:
¡ಹಲೋ ವರ್ಲ್ಡ್!
你 好 世 界 !
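Before combining languages, it can help to confirm that each trained data file is actually present. The sketch below scans a tessdata directory for .traineddata files and builds the -l option with language codes joined by "+". The helper names are hypothetical, and a temporary directory with empty files stands in for the real tessdata folder so the example is self-contained.

```python
import tempfile
from pathlib import Path


def available_languages(tessdata_dir):
    """List language codes that have a .traineddata file in tessdata_dir."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


def lang_option(requested, tessdata_dir):
    """Build the -l option, e.g. '-l kan+chi_sim', after checking the files exist."""
    installed = set(available_languages(tessdata_dir))
    missing = [code for code in requested if code not in installed]
    if missing:
        raise FileNotFoundError(f"No .traineddata file for: {', '.join(missing)}")
    return "-l " + "+".join(requested)


# Demo: a temporary directory with empty placeholder files instead of real models
with tempfile.TemporaryDirectory() as tmp:
    for code in ("eng", "kan", "chi_sim"):
        (Path(tmp) / f"{code}.traineddata").touch()
    print(available_languages(tmp))               # ['chi_sim', 'eng', 'kan']
    print(lang_option(["kan", "chi_sim"], tmp))   # -l kan+chi_sim
```

In a real setup, the directory argument would be the folder that TESSDATA_PREFIX points to.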
Faster OCR extraction
If speed is a major concern for you, you can replace your tessdata language models with tessdata_fast models, which are 8-bit integer precision versions of the tessdata models.
According to the tessdata_fast GitHub repository:
This repository contains fast integer versions of trained models for the Tesseract Open Source OCR Engine.
These models only work with the LSTM OCR engine of Tesseract 4.
- This is a speed/accuracy compromise as to what offers the best "value for money" in speed vs accuracy.
- For some languages, this is still best, but for most not.
- The "best value for money" network configuration was then integerized for further speed.
- Most users will want to use these trained data files to do OCR and these will be shipped as part of Linux distributions, e.g. Ubuntu 18.04.
- Fine tuning/incremental training will NOT be possible from these fast models, as they are 8-bit integer.
- When using the models in this repository, only the new LSTM-based OCR engine is supported. The legacy tesseract engine is not supported with these files, so Tesseract's oem modes '0' and '2' won't work with them.
To use tessdata_fast models instead of tessdata, all you need to do is download your tessdata_fast language data file from here and place it inside your $TESSDATA_PREFIX directory.
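As an alternative to replacing files in place, Tesseract's --tessdata-dir option can point a single run at a separate folder of fast models. The helper below is a sketch under that assumption: fast_model_config is a hypothetical name, the directory path is a placeholder, and the check simply mirrors the note above that oem modes 0 and 2 do not work with these files.

```python
def fast_model_config(tessdata_fast_dir, lang="eng", psm=3, oem=1):
    """Build a config string that points Tesseract at a folder of tessdata_fast models."""
    if oem in (0, 2):
        # The fast models contain no legacy engine data
        raise ValueError("tessdata_fast models do not support oem modes 0 and 2")
    return f'--tessdata-dir "{tessdata_fast_dir}" --oem {oem} --psm {psm} -l {lang}'


# "/opt/tessdata_fast" is a placeholder path, not a standard location
print(fast_model_config("/opt/tessdata_fast", lang="spa", psm=6))
# --tessdata-dir "/opt/tessdata_fast" --oem 1 --psm 6 -l spa
```

Keeping the fast models in their own folder makes it easy to compare speed and accuracy against the standard models without overwriting them.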
Extract boxes along with text
Till now, we have used the pytesseract.image_to_string() method, which returns the OCR text. With pytesseract, we can also get the bounding box information for the OCR output.
The code block below will give you bounding box information for each character detected by Tesseract during OCR.
import pytesseract
import cv2

if __name__ == "__main__":
    img = cv2.imread("invoice.jpg")
    height, width, channels = img.shape
    boxes = pytesseract.image_to_boxes(img, output_type=pytesseract.Output.DICT)
    print(boxes.keys())
    # Box coordinates start at the bottom-left corner, OpenCV's at the top-left, so y is flipped
    for left, bottom, right, top in zip(boxes["left"], boxes["bottom"], boxes["right"], boxes["top"]):
        img = cv2.rectangle(img, (left, height - bottom), (right, height - top), (0, 255, 0), 2)
    cv2.imshow("img", img)
    cv2.waitKey(0)
Output:
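The "height - bottom" and "height - top" arithmetic in the loop above is needed because the character boxes use a coordinate system whose origin is the bottom-left corner of the image, while OpenCV draws from the top-left. A small helper, with a hypothetical name, makes the conversion explicit:

```python
def box_to_opencv(left, bottom, right, top, image_height):
    """Convert a bottom-left-origin box to OpenCV corner points (top-left origin)."""
    top_left = (left, image_height - top)
    bottom_right = (right, image_height - bottom)
    return top_left, bottom_right


# A box spanning y = 20..50 measured from the bottom of a 100-pixel-high image
print(box_to_opencv(left=10, bottom=20, right=30, top=50, image_height=100))
# ((10, 50), (30, 80))
```

The two returned points can be passed directly to cv2.rectangle as opposite corners.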
We can use pytesseract.image_to_data() to get boxes around words. Let's look at the code block below:
import pytesseract
import cv2

if __name__ == "__main__":
    img = cv2.imread("invoice.jpg")
    height, width, channels = img.shape
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    print(data.keys())
Output:
dict_keys(['level', 'page_num', 'block_num', 'par_num', 'line_num', 'word_num', 'left', 'top', 'width', 'height', 'conf', 'text'])
We will use the left, top, width, and height keys to create the box. All the other keys can be used for many other use cases.
for i, (left, top, width, height) in enumerate(zip(data["left"], data["top"], data["width"], data["height"])):
    # Draw only the boxes Tesseract is reasonably confident about
    if float(data["conf"][i]) > 70:
        (x, y, w, h) = (left, top, width, height)
        img = cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imshow("img", img)
cv2.waitKey(0)
Output:
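As one example of what the other keys are good for, the sketch below groups confident words back into text lines using page_num, block_num, par_num, and line_num. The function name words_to_lines and the sample dictionary are made up for illustration; the dictionary only mimics the shape of the image_to_data output so that the snippet runs without an image.

```python
from collections import defaultdict


def words_to_lines(data, min_conf=70):
    """Join words with confidence above min_conf into one string per text line."""
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        # Skip empty entries (page, block and line rows) and low-confidence words
        if not word.strip() or float(data["conf"][i]) <= min_conf:
            continue
        key = (data["page_num"][i], data["block_num"][i],
               data["par_num"][i], data["line_num"][i])
        lines[key].append(word)
    return [" ".join(words) for _, words in sorted(lines.items())]


# Hand-written sample in the same shape as the image_to_data dictionary
sample = {
    "page_num": [1, 1, 1, 1, 1],
    "block_num": [1, 1, 1, 1, 1],
    "par_num": [1, 1, 1, 1, 1],
    "line_num": [0, 1, 1, 2, 2],
    "conf": [-1, 96, 91, 95, 40],
    "text": ["", "Invoice", "No.", "Total", "due"],
}
print(words_to_lines(sample))  # ['Invoice No.', 'Total']
```

The word "due" is dropped because its confidence of 40 is below the threshold, which is the same filtering idea used when drawing the boxes.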
Limitations of Tesseract
Tesseract comes with some limitations that should be taken into consideration when evaluating its performance for various tasks. As a Machine Learning engineer who has worked with both Tesseract and commercial OCR engines like Google Vision AI and Amazon Textract, I have encountered some limitations of Tesseract that are important to highlight.
- Preprocessing Dependency: Tesseract requires meticulous preprocessing to optimize results, varying with image quality and conditions. Tesseract works best when there's a clean segmentation of the foreground text from the background. In practice, ensuring these sorts of setups is often extremely challenging.
- Scanned Copies: It's less effective with scanned documents due to issues like artifacts and skewed text.
- Complex Layouts: Tesseract has problems with recognizing the reading order of the page. Hence, it struggles with intricate layouts, multi-column text, and unconventional formatting.
- Handwriting Recognition: Handwritten text is a challenge, as Tesseract is designed for printed text.
- Language and Fonts: Performance fluctuations are observed for less common languages and fonts.
- Gibberish Output: Tesseract may generate gibberish and report it as OCR output, affecting data accuracy.
- Customization Complexity: Customizing Tesseract requires understanding its parameters, involving trial and error.
- Resource Intensive: Processing demands are high, impacting speed and resource consumption.
In summary, Tesseract excels in text extraction but demands preprocessing, and has limitations with scanned and complex content.
Conclusion
The evolution of Optical Character Recognition (OCR) technology is truly remarkable, with its roots tracing back as early as 1914 and the invention of the Optophone, a device that employed the unique conductive properties of selenium in light and darkness. Over the passage of time, OCR has undergone a significant transformation, moving from the utilization of elements like selenium to harnessing the power of advanced deep learning techniques.
Tesseract performs well when document images adhere to specific guidelines: clean foreground-background segmentation, proper horizontal alignment, and high-quality images without blurriness or noise.
Tesseract API, with its rich history and constant development, shines as a versatile solution for Optical Character Recognition (OCR). Its latest LSTM engine has been trained in over 100 languages. This makes it one of the best open-source OCR solutions. I hope this article has provided you with a clear understanding of how you can use Tesseract.
Amit is a self-taught Machine Learning engineer with expertise in application areas for logistics, eCommerce, health-tech, linguistics, and Document AI. Using Machine Learning, Natural Language Processing, and MLOps in his day-to-day work, Amit helps Docsumo build end-to-end document processing solutions.