Today, I finally solved a text-extraction problem that had bugged me for 3 days. I searched the internet for a solution and could not find a good one.
I just wanted to extract text from images! So why not code my own script?
Almost all the available online converters require a premium subscription in order to use their OCR technology.
The few I found could not do exactly what I wanted!
So I had about 20 pages of PDF documents that were really just images, scanned by a certain cam scanner and sent over! My challenge was to convert them to text, since the original document may have been misplaced.
I decided to write my own script to solve the issue at hand! I mainly used two important libraries every Pythonista should know 🙇‍♂️.
Pytesseract is a Python wrapper for Google's Tesseract-OCR Engine. It is also useful as a stand-alone invocation script to tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others. Tesseract is an excellent open-source engine for OCR, but it can't read PDFs on its own. So I had to convert my PDFs to PNG images first.
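The conversion step itself is not shown in this post, so here is a minimal sketch of one way it could be done. It assumes the third-party pdf2image package (which in turn needs Poppler installed); that choice, the function names, and the output naming scheme are all my own illustration, not necessarily what was used for the original documents.

```python
from pathlib import Path


def page_image_name(pdf_path, page_number):
    """Build an output name like 'scan_page_003.png' for a given PDF page."""
    return f"{Path(pdf_path).stem}_page_{page_number:03d}.png"


def pdf_to_pngs(pdf_path, out_dir=".", dpi=300):
    """Save every page of pdf_path as a PNG and return the written paths.

    Assumption: pdf2image (and Poppler) are installed. This is a sketch,
    not the method used in the article.
    """
    # Imported here so the naming helper works even without pdf2image.
    from pdf2image import convert_from_path

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for number, page in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        target = out_dir / page_image_name(pdf_path, number)
        page.save(target, "PNG")  # each page is a Pillow image
        written.append(str(target))
    return written


print(page_image_name("scan.pdf", 3))  # scan_page_003.png
```

Zero-padded page numbers keep the files in page order when they are sorted by name later on.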
OpenCV-Python is a library of Python bindings designed to solve computer vision problems. OpenCV-Python makes use of NumPy, which is a highly optimized library for numerical operations with a MATLAB-style syntax. All the OpenCV array structures are converted to and from NumPy arrays.
This also makes it easier to integrate with other libraries that use NumPy, such as SciPy and Matplotlib. I also used the built-in os module for walking and navigating through my machine.
For Linux users:
sudo apt-get update
sudo apt-get install tesseract-ocr
sudo apt-get install libtesseract-dev
I am using a Windows machine, so bear with me here. The installation is quite different!
💨 Open the Tesseract at UB Mannheim GitHub page
💨 Download the latest installer: 64-bit or 32-bit
💨 Run the .exe file as admin
💨 Install Tesseract
💨 Then run pip install pytesseract in the terminal.
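Yeah, you read that right! On Windows, the installer only gives you the Tesseract engine, so you still need to pip install the Python wrapper on top of it. (Linux users need the pip install step too.)
The rest applies to everyone. So let's proceed 🚀 >>>>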
Installing OpenCV is easy peasy, just run:
pip install opencv-python
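Before moving on, it can help to confirm that Python actually sees both packages. The snippet below is a small sketch using only the standard library; the helper name check_packages is made up for illustration. Note that the pip package opencv-python is imported under the name cv2.

```python
import importlib.util


def check_packages(names):
    """Map each top-level module name to True if Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# cv2 is the import name of opencv-python.
status = check_packages(["cv2", "pytesseract"])
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

If either one shows up as missing, the pip install probably ran against a different Python interpreter or virtual environment than the one running the script.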
Create a .py file and we start coding 🚀 >>>>
import cv2
import pytesseract
import os
By default, Tesseract is installed in Program Files on Windows. If you did everything correctly, add this below our imports!
# Telling Python where to find Tesseract
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
Yeah, we need to tell Python where to find Tesseract 👆
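If you are not sure where Tesseract ended up, a small lookup can save some guessing. This is a sketch using only the standard library: it checks the system PATH first and then falls back to the default Windows location mentioned above. The function name find_tesseract and the constant are hypothetical helpers, not part of pytesseract.

```python
import os
import shutil

# Default install location on Windows; adjust if you installed elsewhere.
DEFAULT_WINDOWS_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(default=DEFAULT_WINDOWS_PATH):
    """Return a path to the tesseract executable, or None if not found."""
    on_path = shutil.which("tesseract")  # works when tesseract is on PATH
    if on_path:
        return on_path
    if os.path.isfile(default):
        return default
    return None


tesseract_path = find_tesseract()
print(tesseract_path)
# If a path was found, it could be handed to pytesseract like this:
# pytesseract.pytesseract.tesseract_cmd = tesseract_path
```

On Linux the apt install usually puts tesseract on the PATH already, so the explicit assignment is mostly a Windows concern.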
If you have a bunch of images like me, you may want to list the contents of the current working directory.
# list the files in the current working directory
my_folder = os.listdir()
print(my_folder)
Just make sure your image(s) are in the same root directory where you are running your program.
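If your images live in a different folder, one option is to build full paths instead of relying on the working directory, since os.listdir only returns bare file names. A minimal sketch, where list_folder is a hypothetical helper and the demo folder is a throwaway one created just for this example:

```python
import os
import tempfile


def list_folder(folder):
    """Return full paths for every entry in folder, sorted by name."""
    return [os.path.join(folder, name) for name in sorted(os.listdir(folder))]


# Demo on a temporary folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as demo_folder:
    for name in ("page2.jpg", "page1.jpg"):
        open(os.path.join(demo_folder, name), "w").close()  # empty stand-in files
    demo_paths = list_folder(demo_folder)
    print(demo_paths)
```

Paths built this way can be passed straight to cv2.imread no matter where the script is started from.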
Now filter out images only. This depends on the extension. Mine are jpgs, yours might be pngs. Just adjust to your format.
my_images = []  # Data to be extracted
for file in my_folder:
    # print(file)
    if file.endswith("jpg"):
        my_images.append(file)
print(my_images)
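Scanner apps do not always agree on extensions (.JPG, .jpeg, .png and so on), so the filter can be made a little more forgiving. Here is a sketch, with filter_images and its default extension list being my own illustrative choices:

```python
def filter_images(names, extensions=(".jpg", ".jpeg", ".png")):
    """Keep only names ending in one of the extensions, ignoring case."""
    # str.endswith accepts a tuple, so one call covers every extension.
    return sorted(name for name in names if name.lower().endswith(extensions))


sample = ["b.png", "a.JPG", "notes.txt", "c.jpeg"]
print(filter_images(sample))
```

Keep in mind that sorted uses plain alphabetical order, so page10.jpg comes before page2.jpg unless the page numbers are zero-padded.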
Now, let's create a simple function underneath that runs through our image container with a for loop, reading the images one by one and extracting text.
def my_reader():
    for image in my_images:
        # Read image with openCV
        read_image = cv2.imread(image)
        # Extract text using tesseract engine
        text = pytesseract.image_to_string(read_image)
        # create a new file and write our extracted text
        my_extract = open("my_extract.txt", "a+")
        my_extract.write(text)
        # close the file
        my_extract.close()
    return "Done Sir! It was Fun"

print(my_reader())
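One thing to watch with the function above: the file is opened in "a+" mode, so running the script twice appends the same text again. Below is a sketch of a variation that overwrites the output on each run and labels each page. The names extract_all and fake_extract are hypothetical, and the extractor is passed in as a parameter so the demo runs even without Tesseract installed.

```python
import os
import tempfile


def extract_all(images, extract, out_path):
    """Write the text of each image to out_path, one labeled section per image.

    extract is any callable that takes an image path and returns a string.
    """
    with open(out_path, "w", encoding="utf-8") as out:  # "w" overwrites old runs
        for image in images:
            text = extract(image)
            out.write(f"--- {image} ---\n")
            out.write(text.strip() + "\n\n")
    return len(images)


# Stand-in extractor for the demo; it does no real OCR.
def fake_extract(path):
    return f"text from {path}"


demo_path = os.path.join(tempfile.gettempdir(), "my_extract_demo.txt")
count = extract_all(["page1.jpg", "page2.jpg"], fake_extract, demo_path)
with open(demo_path, encoding="utf-8") as f:
    demo_output = f.read()
print(demo_output)
```

With the real libraries in place, the extractor could be a small function that returns pytesseract.image_to_string(cv2.imread(path)), which is the same pair of calls used in my_reader.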
Refer to the code comments for an explanation. Yeah! That's it! You have extracted text from the image(s)!
I also extracted text from a sample image for you.
My Output:
TOP 6 MOST USEFUL PYTHON 3.9 FEATURES
1. DICTIONARY UNION OPERATORS
2.TYPE HINTING GENERICS IN STANDARD COLLECTIONS
3. TIME ZONE DATABASE PRE-INSTALLED
4. EASILY REMOVE PREFIXES AND SUFFIXES
5.NEW PYTHON PARSER
6.BETTER MODULES FOR GCD AND LCM
Enjoyed the article? Kindly subscribe & follow me for related content😊!
Ronnie Atuhaire 😎