Ronnie Atuhaire's Blog 😎

How I Extracted Text From Images With a Few Lines of Python!

Today, I finally solved a text-extraction problem that had been bugging me for 3 days. I had searched the internet for a solution and could not find one that worked well enough.

I just wanted to extract text from images! So why not code my own script?

Almost all the available online converters require a premium subscription in order to use their OCR technology.

The few I found could not do exactly what I wanted!

So I had about 20 pages of documents in a PDF image format that had been scanned with a cam scanner app and sent over to me. My challenge was to convert them to text, since the original document may have been misplaced.

I decided to write my own script to solve the issue at hand! I mainly used two important libraries every Pythonista should know 🙇‍♂️.
🔸 Pytesseract
🔸 OpenCV

✨ Pytesseract

Pytesseract is a Python wrapper for Google's Tesseract-OCR Engine. It is also useful as a stand-alone invocation script to tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others.

Tesseract is an excellent open-source engine for OCR. But it can't read PDFs on its own, so I had to convert my PDFs to PNG images first.
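The conversion step itself is not covered in this post. As a minimal sketch of one way it could be done, the snippet below assumes the third-party pdf2image package (which in turn needs Poppler installed); that choice is an assumption for illustration, and the names `page_filename` and `pdf_to_pngs` are made up for this example.

```python
from pathlib import Path


def page_filename(pdf_path, page_number):
    """Build a zero-padded PNG name such as 'scan_page_003.png' (hypothetical naming scheme)."""
    return f"{Path(pdf_path).stem}_page_{page_number:03d}.png"


def pdf_to_pngs(pdf_path, out_dir=".", dpi=300):
    """Render every page of a PDF to a PNG file and return the written paths."""
    # Assumption: pdf2image is installed. It is imported here so the rest of
    # the script still loads on machines that do not have it.
    from pdf2image import convert_from_path

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    # convert_from_path returns one PIL image per page of the PDF.
    for number, page in enumerate(convert_from_path(str(pdf_path), dpi=dpi), start=1):
        target = out_dir / page_filename(pdf_path, number)
        page.save(target, "PNG")
        written.append(target)
    return written
```

Calling `pdf_to_pngs("scan.pdf")` would then leave files like `scan_page_001.png` next to the script, ready for the OCR step. The zero-padded page number keeps the pages in order when the file names are sorted.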

✨ OpenCV

OpenCV-Python is a library of Python bindings designed to solve computer vision problems. OpenCV-Python makes use of NumPy, which is a highly optimized library for numerical operations with a MATLAB-style syntax. All the OpenCV array structures are converted to and from NumPy arrays.

This also makes it easier to integrate with other libraries that use NumPy, such as SciPy and Matplotlib. I also used the built-in os module for navigating through the files on my machine.

Tesseract Installation

For Linux users:

sudo apt-get update
sudo apt-get install tesseract-ocr
sudo apt-get install libtesseract-dev

Windows
I am using a Windows machine, so bear with me here. The installation is quite different!

💨 Open the Tesseract at UB Mannheim GitHub link
💨 Download the latest installer: 64 Bit or 32 Bit
💨 Run the exe file as admin
💨 Install Tesseract
💨 Now pip install pytesseract in the terminal.

Yeah, you read that right! The installer only gives you the Tesseract engine itself. You still need to pip install pytesseract, the Python wrapper, on top of it.

The rest applies to all. So let's proceed 🚀 >>>>

OpenCV Installation
Easy peasy, just:

pip install opencv-python

Create your py file and we start coding 🚀 >>>>

import cv2
import pytesseract
import os

By default, Tesseract is installed in Program Files on Windows. If you did everything correctly, add this below our imports!

# Telling Python where to find the Tesseract executable
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

Yeah, we need to tell Python where to find Tesseract 👆
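Hard-coding the path works, but it breaks as soon as Tesseract is installed somewhere else. As a hedged alternative, the sketch below first looks for the executable on the PATH with shutil.which and only then falls back to the default location above. The names `locate_executable` and `WINDOWS_DEFAULT` are made up for this illustration.

```python
import os
import shutil


def locate_executable(name="tesseract", fallbacks=()):
    """Return the path to `name` from PATH, else the first fallback that exists, else None."""
    found = shutil.which(name)
    if found:
        return found
    for candidate in fallbacks:
        if os.path.isfile(candidate):
            return candidate
    return None


# Default install location on Windows, as used earlier in this post.
WINDOWS_DEFAULT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

tesseract_path = locate_executable("tesseract", fallbacks=[WINDOWS_DEFAULT])
print(tesseract_path)

# If a path was found, it could then be handed to pytesseract:
# pytesseract.pytesseract.tesseract_cmd = tesseract_path
```

If this prints None, Tesseract is most likely not installed, or it lives in a folder that is neither on the PATH nor in the fallback list.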

If you have a bunch of images like me, you may want to list the files in the current working directory.

# List the files in the current working directory
my_folder = os.listdir()
print(my_folder)

Just make sure your image(s) are in the same root directory where you are running your program.

Now filter out the images only. This depends on the extension. Mine are JPGs, yours might be PNGs. Just adjust to your format.

my_images = []  # Names of the image files to read
for file in my_folder:
    # print(file)
    # Compare in lowercase so that .JPG files are picked up too
    if file.lower().endswith(".jpg"):
        my_images.append(file)
print(my_images)
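If your folder mixes several formats, the same filter can be sketched with pathlib. This is just an illustrative variant, not what I ran: the extension set `IMAGE_EXTENSIONS` and the helper name `find_images` are assumptions you should adjust to your own files.

```python
from pathlib import Path

# Assumed set of extensions; adjust to the formats you actually have.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def find_images(folder=".", extensions=IMAGE_EXTENSIONS):
    """Return the sorted names of files in `folder` whose extension is in `extensions`."""
    return sorted(
        entry.name
        for entry in Path(folder).iterdir()
        # Skip sub-folders and compare the suffix in lowercase.
        if entry.is_file() and entry.suffix.lower() in extensions
    )
```

`find_images()` with no arguments looks in the current working directory, so it can stand in for the os.listdir() filter when more than one format is involved.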

Now, let's create a simple function underneath that loops over our image container, reads the images one by one, and extracts the text.

def my_reader():
    # Open the output file once, in append mode, for the whole run
    with open("my_extract.txt", "a", encoding="utf-8") as my_extract:
        for image in my_images:
            # Read the image with OpenCV
            read_image = cv2.imread(image)
            # cv2.imread returns None when a file cannot be read, so skip it
            if read_image is None:
                continue
            # Extract the text using the Tesseract engine
            text = pytesseract.image_to_string(read_image)
            # Append the extracted text to the file
            my_extract.write(text)
    return "Done Sir! It was Fun"


print(my_reader())
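One thing to watch with multi-page scans: os.listdir() gives no ordering guarantee, and a plain sort puts page10.jpg before page2.jpg. The sketch below is an optional variant under my own assumptions. It sorts the names in natural order and labels each page in the output file. The OCR call is passed in as the `ocr` parameter, and `natural_key` and `extract_in_order` are made-up helper names.

```python
import re


def natural_key(name):
    """Split a name into text and number chunks so 'page10' sorts after 'page2'."""
    return [int(part) if part.isdecimal() else part.lower()
            for part in re.split(r"(\d+)", name)]


def extract_in_order(image_names, ocr, out_path="my_extract.txt"):
    """Run `ocr` on each image in natural order and append labelled text to `out_path`."""
    ordered = sorted(image_names, key=natural_key)
    with open(out_path, "a", encoding="utf-8") as out_file:
        for name in ordered:
            # A header line makes it easy to see where each page starts.
            out_file.write(f"===== {name} =====\n")
            out_file.write(ocr(name))
            out_file.write("\n")
    return ordered
```

With the imports from this post, it could be called as `extract_in_order(my_images, lambda name: pytesseract.image_to_string(cv2.imread(name)))`. Passing the OCR step in as a function also makes it easy to try the file handling without Tesseract installed.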

Refer to code comments for explanation. Yeah! That's It! You have extracted text from the Image(s)!

I also extracted text from this image as a sample for you (Top-6-Most-Useful-Python-3.9-features.png). My output:

TOP 6 MOST USEFUL

PYTHON 3.9
FEATURES

1. DICTIONARY UNION OPERATORS

2.TYPE HINTING GENERICS IN STANDARD COLLECTIONS
3. TIME ZONE DATABASE PRE-INSTALLED

4. EASILY REMOVE PREFIXES AND SUFFIXES

5.NEW PYTHON PARSER

6.BETTER MODULES FOR GCD AND LCM

Github Repo

Enjoyed the article? Kindly subscribe & follow me for related content 😊!

Follow me on Twitter

Ronnie Atuhaire 😎
