Today, I finally solved a text-extraction problem that had bugged me for 3 days. I had searched the internet for a solution and could not find a good one.
I just wanted to extract text from images! So why not code my own script!
Almost all the available online converters require a premium subscription in order to use their OCR technology.
The few I found could not do exactly what I wanted!
So I had about 20 pages of PDF documents that were really just images, scanned with a cam scanner app and sent over! My challenge was to convert them to text, since the original document could have been misplaced.
I decided to write my own script to solve the issue at hand! I mainly used two important libraries every Pythonista should know 🙇‍♂️.
🔸Pytesseract
🔸OpenCV
✨ Pytesseract
Pytesseract is a Python wrapper for Google's Tesseract-OCR Engine. It is also useful as a stand-alone invocation script for tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others. Tesseract is an excellent open-source OCR engine, but it can't read PDFs on its own, so I had to convert my PDFs to PNG images first.
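I have not shown which tool did the PDF-to-PNG conversion, so here is one possible sketch of that step. It assumes the third-party pdf2image package (pip install pdf2image), which in turn needs the Poppler utilities installed. The names page_filename and pdf_to_pngs are made up for illustration.

```python
from pathlib import Path
from typing import List


def page_filename(stem: str, page_number: int) -> str:
    """Build a sortable PNG name such as 'scan_page_003.png'."""
    return f"{stem}_page_{page_number:03d}.png"


def pdf_to_pngs(pdf_path: str, out_dir: str = ".", dpi: int = 300) -> List[str]:
    """Render every page of a PDF to a PNG file and return the saved paths.

    Assumption: pdf2image and Poppler are installed on this machine.
    """
    # Imported here so page_filename works even without pdf2image.
    from pdf2image import convert_from_path

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    # convert_from_path returns one Pillow image per PDF page.
    pages = convert_from_path(pdf_path, dpi=dpi)
    for number, page in enumerate(pages, start=1):
        target = out / page_filename(Path(pdf_path).stem, number)
        page.save(target, "PNG")
        saved.append(str(target))
    return saved
```

Zero-padding the page number keeps the files in page order when they are sorted by name, which matters once the text of all pages ends up in one file.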
✨ OpenCV
OpenCV-Python is a library of Python bindings designed to solve computer vision problems. It makes use of NumPy, a highly optimized library for numerical operations with a MATLAB-style syntax. All the OpenCV array structures are converted to and from NumPy arrays.
This also makes it easier to integrate with other libraries that use NumPy, such as SciPy and Matplotlib. I also used the built-in os module for navigating the files on my machine.
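To make the NumPy point concrete: an image loaded with cv2.imread is a plain NumPy array whose shape is (height, width, channels). Here is a small sketch of that. describe_shape and load_grayscale are illustrative names, not part of OpenCV, and the grayscale conversion is an optional extra that my script does not do.

```python
def describe_shape(shape: tuple) -> str:
    """Describe an OpenCV/NumPy image shape: (height, width[, channels])."""
    height, width = shape[0], shape[1]
    # Grayscale images are 2-D arrays, so they have no channel axis.
    channels = shape[2] if len(shape) == 3 else 1
    return f"{width}x{height} px, {channels} channel(s)"


def load_grayscale(path: str):
    """Read an image with OpenCV and return it as a 2-D grayscale array."""
    import cv2  # imported here so describe_shape works without OpenCV

    image = cv2.imread(path)  # NumPy array in BGR order, or None on failure
    if image is None:
        raise FileNotFoundError(f"OpenCV could not read {path!r}")
    print(describe_shape(image.shape))
    return cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
```

Because the result is an ordinary array, it can be passed straight to NumPy, SciPy or Matplotlib functions without any conversion.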
Tesseract Installation
For Linux users:
sudo apt-get update
sudo apt-get install tesseract-ocr
sudo apt-get install libtesseract-dev
Windows
I am using a Windows machine, so bear with me here. The installation is quite different!
💨 Open the Tesseract at UB Mannheim GitHub page
💨 Download the latest installer for your system: 64-bit or 32-bit
💨 Run the exe file as admin
💨 Install Tesseract
💨 Now pip install pytesseract in the terminal.
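Yeah, you read that right! The installer only sets up the Tesseract engine itself; pytesseract, the Python wrapper, is a separate package that you still have to pip install. Linux users need this pip install too.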
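If you want to double-check that both pieces are in place, something like the following should do. It is only a sketch using the standard library; check_ocr_setup is a name I made up.

```python
import importlib.util
import shutil


def check_ocr_setup(binary: str = "tesseract", package: str = "pytesseract") -> dict:
    """Report whether the OCR engine and the Python wrapper can be found."""
    return {
        # Path to the engine executable on PATH, or None if it is not there.
        "engine_path": shutil.which(binary),
        # True when the wrapper package is importable in this environment.
        "wrapper_installed": importlib.util.find_spec(package) is not None,
    }


if __name__ == "__main__":
    print(check_ocr_setup())
```

On Windows, engine_path can come back as None even after a successful install, because the engine's folder is not necessarily on PATH. That is not a problem as long as you point pytesseract at the exe yourself.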
The rest apply to all. So let's proceed 🚀 >>>>
OpenCV Installation
Easy peasy, just:
pip install opencv-python
Create your .py file and we start coding 🚀 >>>>
import cv2
import pytesseract
import os
By default, Tesseract is installed in Program Files on Windows. If you did everything correctly, add this below our imports!
# Telling pytesseract where to find the Tesseract executable
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
Yeah, we need to tell Python where to find Tesseract 👆
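A hard-coded path only works on a machine laid out like mine. If the script has to run elsewhere too, one option is to look for the engine first and only then set the path. This is a sketch, not something pytesseract provides: resolve_tesseract_cmd and WINDOWS_DEFAULT are hypothetical names.

```python
import os
import shutil
from typing import Optional, Sequence

# Default install location on Windows, the same path used above.
WINDOWS_DEFAULT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def resolve_tesseract_cmd(
    binary: str = "tesseract",
    candidates: Sequence[str] = (WINDOWS_DEFAULT,),
) -> Optional[str]:
    """Return the first usable Tesseract executable, or None if none is found."""
    # Prefer whatever is already on PATH (the usual case on Linux).
    on_path = shutil.which(binary)
    if on_path:
        return on_path
    # Otherwise fall back to the known install locations.
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None
```

You would then write cmd = resolve_tesseract_cmd() and, if cmd is not None, assign it to pytesseract.pytesseract.tesseract_cmd exactly as in the line above.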
If you have a bunch of images like me, you may want to list what is in the current working directory.
# List the files in the current working directory
my_folder = os.listdir()
print(my_folder)
Just make sure your image(s) are in the same root directory where you are running your program.
Now filter out images only. This depends on the extension. Mine are jpgs, yours might be pngs. Just adjust to your format.
my_images = []  # Images to extract text from
for file in my_folder:
    # print(file)
    if file.endswith(".jpg"):
        my_images.append(file)
print(my_images)
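If your folder mixes formats or has upper-case extensions like SCAN.JPG, the same filtering can be done a bit more forgivingly with pathlib. This is just an alternative sketch; find_images is a name I picked, and the default extension list is an assumption you should adjust.

```python
from pathlib import Path
from typing import Iterable, List


def find_images(
    folder: str = ".",
    extensions: Iterable[str] = (".jpg", ".jpeg", ".png"),
) -> List[str]:
    """Return the sorted names of image files in a folder, matched by extension."""
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        entry.name
        for entry in Path(folder).iterdir()
        # suffix.lower() makes "SCAN.JPG" match ".jpg"
        if entry.is_file() and entry.suffix.lower() in wanted
    )
```

Calling my_images = find_images() gives the same kind of list as the loop above. The names are sorted as plain text, so page10.jpg comes before page2.jpg unless the page numbers are zero-padded.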
Now, let's create a simple function underneath that runs through our image container with a for loop, reading the images one by one and extracting the text.
def my_reader():
    for image in my_images:
        # Read image with OpenCV
        read_image = cv2.imread(image)
        # Extract text using the Tesseract engine
        text = pytesseract.image_to_string(read_image)
        # Append the extracted text to our output file (closed automatically)
        with open("my_extract.txt", "a+") as my_extract:
            my_extract.write(text)
    return "Done Sir! It was Fun"


print(my_reader())
Refer to code comments for explanation. Yeah! That's It! You have extracted text from the Image(s)!
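One thing to watch with the function above: because the file is opened in append mode, running the script twice writes everything twice, and you cannot tell where one page ends and the next begins. Here is a possible variation that starts a fresh file and adds a marker per image. It is a sketch, and extract_to_file and tesseract_ocr are names I invented; the OCR step is passed in as a function so the file handling can be tried without Tesseract.

```python
from typing import Callable, Iterable


def extract_to_file(
    images: Iterable[str],
    ocr: Callable[[str], str],
    out_path: str = "my_extract.txt",
) -> int:
    """Run `ocr` on each image path and write the text to one file.

    Each image gets its own marker line. Returns the number of images processed.
    """
    count = 0
    # "w" starts a fresh file, so re-running does not duplicate text.
    with open(out_path, "w", encoding="utf-8") as out:
        for name in images:
            out.write(f"--- {name} ---\n")
            out.write(ocr(name).strip() + "\n\n")
            count += 1
    return count


def tesseract_ocr(path: str) -> str:
    """OCR one image the same way my_reader does."""
    import cv2
    import pytesseract

    image = cv2.imread(path)
    if image is None:  # cv2.imread returns None instead of raising
        raise FileNotFoundError(f"Could not read image: {path}")
    return pytesseract.image_to_string(image)
```

With the list from earlier, the call would be extract_to_file(my_images, tesseract_ocr).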
I also extracted text from a sample image for you. My output:
TOP 6 MOST USEFUL
PYTHON 3.9
FEATURES
1. DICTIONARY UNION OPERATORS
2.TYPE HINTING GENERICS IN STANDARD COLLECTIONS
3. TIME ZONE DATABASE PRE-INSTALLED
4. EASILY REMOVE PREFIXES AND SUFFIXES
5.NEW PYTHON PARSER
6.BETTER MODULES FOR GCD AND LCM
Ronnie Atuhaire 😎