Grabbing Screen Text with a Shell Script

2024-03-18 · 9 min

Recently, I stumbled upon an OCR tool for Linux. However, I didn’t like the idea of needing a GUI app, so I wrote a shell script connecting the right CLI tools instead.

This article dissects “grab-text”, a simple POSIX-compatible shell script I wrote for grabbing text from your screen.


The examples in this article are simplified; you can find the full script on GitHub.

The What & Why

The idea is simple: grab text from the current screen. Doing it right, however, can be quite complicated.

macOS comes with a built-in solution, but there’s no equivalent in the usual Linux display and window managers.

In a Hacker News discussion about a GUI app providing this functionality, I found custom shell solutions in the comments that do most of the work by cobbling together a few CLI tools. Since these scripts were quite simplistic, I decided to create my own version with a few extras.

How to Solve the Task

The overall task breaks down into the following steps:

Take a screenshot, preferably letting the user select an area.
Process the screenshot to improve text detection.
Perform OCR on the screenshot.
Copy the result to the clipboard.

These steps delineate the different tool categories we need to connect to achieve the desired outcome:

Screenshotter
Image processing
Text recognition
Clipboard management
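Put together naively, one tool per category can already be strung into a single pipeline. The sketch below wires maim, tesseract and xclip together. It leaves out image processing, temporary files and all error handling. The function name grab_text_naive is hypothetical and not part of grab-text, and the pipeline assumes all three tools are installed.

```shell
#!/bin/sh
# Minimal sketch (hypothetical, not grab-text itself):
# screenshot -> text recognition -> clipboard in one pipeline.
# Assumes maim, tesseract and xclip are installed; no image processing,
# no temporary files and no error handling.

grab_text_naive() {
    # maim --select writes the selected area as PNG to stdout (no file given).
    # tesseract reads the image from stdin and prints the text to stdout.
    # xclip stores whatever it reads in the CLIPBOARD selection.
    maim --select \
        | tesseract stdin stdout 2>/dev/null \
        | xclip -selection clipboard -in
}
```

Bound to a hotkey, this works as long as every tool is installed and the selection isn’t cancelled. If anything fails, though, the user gets no feedback at all.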

With the groundwork laid out and the necessary tools identified, we can move on to designing the actual script.

What Kind of Script Do We Want?

Before diving into the code itself, we need to consider several key questions:

Is this script intended as a temporary fix, meant to run a handful of times before being discarded, or is it being developed as a long-term solution?
Should the script be portable, or does it only run on my machine?

The first question determines how much documentation and error handling a script needs, whereas the second affects the overall approach to the code and the features available.

Documentation & Error Handling

In the case of grab-text, the goal is to serve as a lasting solution. Therefore, thorough documentation is crucial to ensure that my future self can easily understand and maintain the code without frustration.

When it comes to error management, the script should handle the most common failure modes. So, sufficient error handling from the get-go is a must. Furthermore, the script will most likely be executed from a GUI environment and run in the background, so it must somehow communicate with the user if anything goes wrong.

Script Portability

Writing shell scripts just for yourself can significantly boost your productivity. But making scripts more portable so that other people might use them, regardless of their actual shell environment, is even better!

One thing I recommended in the past is using a more flexible shebang to target bash:

shell

#!/usr/bin/env bash

The benefit is that rather than hardcoding #!/bin/bash, the script uses the first bash found in the user’s $PATH. That makes it more portable. On macOS, for example, you probably don’t want the outdated system version but the one you installed via a package manager.

For grab-text, though, I wanted a bit more portability than restricting myself to bash.

If the script doesn’t require any bash-specific features, why not use sh instead and make the script POSIX compatible?

Let’s take a look at the bash features we need to give up when going the sh route:

No shell options via shopt
No [[ ... ]] conditionals; use [ ... ] instead
No bash-style arrays
No function keyword; declare functions as name() { ... }
No local variables

Command substitution with $(...) is fine, though. It’s part of POSIX and preferable to backticks.

Given the overall requirements, sticking to POSIX-compliant sh scripting appears feasible.
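For reference, here is a sketch of POSIX replacements for several of the bash-only constructs above. The function names are made up for illustration.

```shell
#!/bin/sh
# Sketch: POSIX replacements for common bash-only constructs.
# All names here are illustrative only.

# bash: function say_hello { ...; }    POSIX: name() { ...; }
say_hello() {
    printf 'Hello, %s\n' "$1"
}

# bash: [[ $file == *.png ]]    POSIX: pattern matching via case
is_png() {
    case "$1" in
        *.png) return 0 ;;
        *) return 1 ;;
    esac
}

# bash: bins=(maim scrot); echo "${#bins[@]}"
# POSIX: the positional parameters are the only array; set -- fills them.
count_words() {
    set -- $1   # intentionally unquoted: split the string on whitespace
    printf '%s\n' "$#"
}

# bash: local tmp=...    POSIX: no local; a function body in ( ... )
# runs in a subshell, so its assignments cannot leak into the caller.
uppercase() (
    tmp=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    printf '%s\n' "${tmp}"
)
```

The subshell body is a handy stand-in for local variables. The price is that such a function can’t set globals for its caller, which matters for a pattern like returning a result through FOUND_BIN.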

Now, it’s finally time to dive into the code!

Refining What the Script Is Supposed to Do

While I’ve already outlined the core task earlier, it doesn’t paint the complete picture of what the script needs to do to achieve its task.

To extend the script’s portability beyond the shell to the tools it uses, it should support multiple binaries for each category.

With that in mind, and looking at the problem more closely with our “technical glasses”, the overall task looks more like this:

Detect available binaries.
Create a working directory and ensure cleanup.
Take a screenshot.
Optimize the screenshot.
Recognize text.
Copy text to clipboard.
Notify user of success.

Let’s go over the steps to define what they need to do.

Detect available binaries

There are five categories of tools involved:

Take a screenshot
Optimize the screenshot
Recognize text (OCR)
Copy to clipboard
Notify user

Even though I could have added alternatives to each category, I decided to support different tools only for screenshots and the clipboard. Additionally, the optimization and notification categories are optional.

To streamline the script and avoid duplicating code, I’ve created a function to locate a binary:

shell

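_gt_find_required_binary() {
    BINS="$1"       # space-separated candidates, in order of preference
    CATEGORY="$2"   # category name, only used in the error message

    # ${BINS} is intentionally unquoted so word splitting yields one name per iteration.
    for BIN in ${BINS}; do
        if command -v "$BIN" >/dev/null 2>&1; then
            FOUND_BIN="${BIN}"   # "return" the result via a global variable
            return 0
        fi
    done

    >&2 printf "ERROR: No binary for category '%s' found. Compatible options: %s.\n" "${CATEGORY}" "${BINS}"
    exit 1
}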

The function accepts a string of space-separated binary names and stores the first one it finds in the global variable FOUND_BIN. If none of them is installed, it prints an error and exits.

It’s used like this:

shell

SCREENSHOT_BINS="maim scrot gnome-screenshot"

FOUND_BIN=""
_gt_find_required_binary "${SCREENSHOT_BINS}" "Screenshot"
SCREENSHOT_BIN="${FOUND_BIN}"
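One practical reason to return the result through a global variable instead of printing it concerns command substitution. If the function were called as SCREENSHOT_BIN=$(...), it would run in a subshell, and its exit 1 would only end that subshell, not the script. The sketch below shows the command-substitution alternative with a hypothetical helper, _gt_first_binary. It never exits and leaves it to the caller to decide whether a missing tool is fatal, which also suits the optional categories.

```shell
#!/bin/sh
# Hypothetical helper (not part of grab-text): print the first available
# binary from a space-separated list, or return 1 if none is installed.
# Callers use $(...), which runs in a subshell, so the helper must not
# call exit itself; that decision stays with the caller.
_gt_first_binary() {
    for BIN in $1; do   # intentionally unquoted to split the list
        if command -v "${BIN}" >/dev/null 2>&1; then
            printf '%s\n' "${BIN}"
            return 0
        fi
    done
    return 1
}

# Required category: abort if nothing was found.
# SCREENSHOT_BIN=$(_gt_first_binary "maim scrot gnome-screenshot") || exit 1

# Optional category: an empty result simply skips the step later on.
# OPTIMIZE_BIN=$(_gt_first_binary "mogrify") || OPTIMIZE_BIN=""
```

As a side effect, BIN can’t leak into the caller here, because the loop runs inside the subshell created by $(...).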

Create a working directory and ensure cleanup

Every run produces two files: a screenshot and a text file holding the recognized text. Neither is needed once the text is in the clipboard.

Creating a temporary directory is done with mktemp, and the file names are based on invocation time:

shell

# -d rather than GNU's --directory, so BSD/macOS mktemp works too
WORKING_DIR=$(mktemp -d)
BASENAME=$(date +"%Y-%m-%d--%H-%M-%S")

The system will usually clean up temporary directories eventually, but we should tidy up after ourselves right away instead of leaving clutter behind.

The best way for a script to do that is a trap:

shell

_gt_cleanup () {
    if [ -z "${1}" ]; then
        return
    fi

    if [ ! -d "${1}" ]; then
        return
    fi

    rm -r -- "${1}"
}

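trap '_gt_cleanup "${WORKING_DIR}"' EXIT
trap 'exit 1' INT TERM

The function _gt_cleanup checks that its first argument is set and names a directory, and then removes it recursively.

The cleanup trap runs whenever the script exits. INT and TERM are turned into a regular exit 1, which in turn fires the EXIT trap. A handler for INT or TERM that doesn’t exit would let the script carry on after the signal. This way, whatever happens, the script attempts to remove its files.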
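To check that cleanup also happens when a script is killed, the following sketch runs a small child script. The child creates a working directory, installs a cleanup trap on EXIT and maps INT and TERM to exit 1. That mapping matters because a signal handler that doesn’t exit lets the script resume after the signal. The child then sends itself SIGTERM. All file names are throwaway temp files; nothing here is taken from grab-text.

```shell
#!/bin/sh
# Sketch: verify that an EXIT trap removes the working directory even when
# the script is terminated by SIGTERM. The child script is written to a
# temporary file so the nested quoting stays readable.

CHILD_SCRIPT=$(mktemp)
DIR_RECORD=$(mktemp)

cat > "${CHILD_SCRIPT}" <<'EOF'
WORKING_DIR=$(mktemp -d)
printf '%s\n' "${WORKING_DIR}" > "$1"   # tell the parent which dir was made

trap 'rm -r -- "${WORKING_DIR}"' EXIT   # cleanup on every exit
trap 'exit 1' INT TERM                  # turn signals into a regular exit

kill -TERM $$                           # simulate being killed mid-run
echo "not reached" >&2                  # the TERM trap exits first
EOF

sh "${CHILD_SCRIPT}" "${DIR_RECORD}"
CHILD_STATUS=$?
LEFTOVER=$(cat "${DIR_RECORD}")

if [ -d "${LEFTOVER}" ]; then
    echo "working directory was left behind (exit status ${CHILD_STATUS})"
else
    echo "working directory was cleaned up (exit status ${CHILD_STATUS})"
fi

rm -f -- "${CHILD_SCRIPT}" "${DIR_RECORD}"
```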

Take a screenshot

The call depends on the actual binary used, so a case statement is required:

shell

case "${SCREENSHOT_BIN}" in
    scrot)
        scrot \
            --select \
            --freeze \
            --quality 100 \
            "${WORKING_DIR}/${BASENAME}.png"
        ;;
    
    maim)
        maim \
            --select \
            --nodrag \
            --quality=10 \
            "${WORKING_DIR}/${BASENAME}.png"
        ;;

    gnome-screenshot)
        gnome-screenshot \
            --area \
            --file \
            "${WORKING_DIR}/${BASENAME}.png"
        ;;
esac

To support another screenshot tool, I’d just add it to SCREENSHOT_BINS to be detected and add another case clause with the correct arguments to produce a png file.
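As an illustration of that extension point, here is a sketch for a Wayland setup using grim and slurp. These two tools are an assumption on my part and aren’t covered by the snippets above. The case clause is wrapped in a hypothetical _gt_take_screenshot function so it can be tried in isolation. slurp lets the user select a region and prints its geometry, and grim -g captures exactly that region.

```shell
#!/bin/sh
# Sketch: adding a screenshot tool means (1) listing it in SCREENSHOT_BINS
# and (2) adding a case clause that writes a PNG to the given path.
# grim/slurp are an assumed example; _gt_take_screenshot is hypothetical.
# A real version would also need to check that slurp is installed.

SCREENSHOT_BINS="maim scrot gnome-screenshot grim"

_gt_take_screenshot() {
    BIN="$1"
    OUTPUT="$2"

    case "${BIN}" in
        grim)
            # slurp prints the selected region as geometry;
            # a non-zero status means the selection was cancelled.
            GEOMETRY=$(slurp) || return 1
            grim -g "${GEOMETRY}" "${OUTPUT}"
            ;;
        maim)
            maim --select --nodrag --quality=10 "${OUTPUT}"
            ;;
        *)
            >&2 printf "ERROR: Unsupported screenshot tool '%s'.\n" "${BIN}"
            return 1
            ;;
    esac
}
```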

Optimize the screenshot

To avoid forcing users to install more dependencies than absolutely necessary, this step is optional:

shell

if command -v mogrify >/dev/null 2>&1; then
    mogrify \
        -modulate 100,0 \
        -resize 400% \
        "${WORKING_DIR}/${BASENAME}.png"
fi

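If mogrify is available, it edits the PNG in place. -modulate 100,0 keeps the brightness at 100% and sets the saturation to 0, turning the image grayscale, and -resize 400% enlarges it fourfold. Both steps are meant to improve OCR accuracy.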

Recognize text

The tesseract call includes the languages to be detected, which are defined in the script’s header.

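The call itself is what you’d expect. Note that the second path is an output base name: tesseract appends .txt to it, which is where the text file used in the next step comes from.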

shell

tesseract \
    -l "${LANG_CODES}" \
    "${WORKING_DIR}/${BASENAME}.png" \
    "${WORKING_DIR}/${BASENAME}" \
    >/dev/null 2>&1
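The script’s header isn’t shown here. As an assumption, LANG_CODES would hold tesseract language codes joined with a plus sign, such as eng+deu. Because the call above discards all of tesseract’s output, a missing language pack is easy to overlook. A hypothetical pre-flight check based on tesseract --list-langs could look like this:

```shell
#!/bin/sh
# Assumed header setting: tesseract joins multiple languages with "+".
LANG_CODES="eng+deu"

# Hypothetical pre-flight check: succeed only if every code in $1 appears
# in the output of `tesseract --list-langs` (a header line followed by one
# language code per line). On failure, MISSING_LANG names the first gap.
_gt_check_langs() {
    INSTALLED=$(tesseract --list-langs 2>&1) || return 1

    # No arrays in POSIX sh: split "eng+deu" on "+" via IFS instead.
    OLD_IFS="${IFS}"
    IFS="+"
    for CODE in $1; do
        if ! printf '%s\n' "${INSTALLED}" | grep -qx -- "${CODE}"; then
            IFS="${OLD_IFS}"
            MISSING_LANG="${CODE}"
            return 1
        fi
    done
    IFS="${OLD_IFS}"
    return 0
}

# Possible use, with _gt_die from the error-handling section:
# _gt_check_langs "${LANG_CODES}" \
#     || _gt_die "ERROR: tesseract language '${MISSING_LANG}' is not installed."
```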

Copy text to Clipboard

As there are multiple possible binaries, a case statement is needed again:

shell

case "${CLIPBOARD_BIN}" in
    xsel)
        xsel -bi < "${WORKING_DIR}/${BASENAME}.txt"
        ;;
    
    xclip)
        xclip -selection clipboard -in < "${WORKING_DIR}/${BASENAME}.txt"
        ;;
esac
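One failure mode this step doesn’t catch: OCR runs without errors but finds no text, and the clipboard gets overwritten with an empty or whitespace-only result. Here is a sketch of a guard, assuming you’d rather keep the previous clipboard contents in that case. _gt_has_text is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical guard: succeed only if the file exists and contains at least
# one non-whitespace character. [:space:] also covers tabs and form feeds.
_gt_has_text() {
    [ -f "$1" ] && grep -q '[^[:space:]]' "$1"
}

# Intended use right before the clipboard case statement:
# _gt_has_text "${WORKING_DIR}/${BASENAME}.txt" \
#     || _gt_die "ERROR: No text recognized."
```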

Notify User of Success

The total processing time depends on the size of the selected area, so the script sends a notification once it’s done.

To make the code reusable for error handling, too, I created another function:

shell

_gt_notify () {
    SUMMARY="$1"
    URGENCY="${2:-"normal"}"
    if command -v "notify-send" >/dev/null 2>&1; then
        notify-send --urgency "${URGENCY}" --expire-time=3000 "Grab Text: ${SUMMARY}"
    fi
}

It accepts a string for the notification’s summary and, optionally, a second argument for its urgency.
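Since notify-send is optional, a system without libnotify means the user sees no message at all. The following variant also writes every message to stderr, so it still shows up when the script runs in a terminal or its output is logged. This is a sketch, not grab-text’s actual function.

```shell
#!/bin/sh
# Sketch of a _gt_notify variant (not the original): always log to stderr,
# and additionally show a desktop notification if notify-send exists.
_gt_notify () {
    SUMMARY="$1"
    URGENCY="${2:-normal}"

    # stderr is visible in terminals and logs even without libnotify
    >&2 printf 'Grab Text: %s\n' "${SUMMARY}"

    if command -v notify-send >/dev/null 2>&1; then
        notify-send --urgency "${URGENCY}" --expire-time=3000 "Grab Text: ${SUMMARY}"
    fi
}

# Usage:
# _gt_notify "Text copied to clipboard."
# _gt_notify "ERROR: Something went wrong." "critical"
```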

More Considerations

Now that all the essential parts are in place, it’s time for some improvements, particularly in error handling, as mentioned earlier.

In the same way that we alert the user upon successful completion, it’s just as important to inform them when an error occurs.

Therefore, let’s introduce another function to be called if something goes wrong:

shell

_gt_die () {
    _gt_notify "${1}" "critical"
    exit 1
}

The function tries to notify the user with a critical message and then exits the script with status 1.

To use it, check the exit status of the previous command right away, for example, after taking the screenshot (a case statement returns the status of the last command it ran):

shell

if [ $? -ne 0 ]; then
    _gt_die "ERROR: Taking screenshot with '${SCREENSHOT_BIN}' failed."
fi

For single commands, it can be chained with the || operator, which runs _gt_die only if the command fails:

shell

tesseract \
    -l "${LANG_CODES}" \
    "${WORKING_DIR}/${BASENAME}.png" \
    "${WORKING_DIR}/${BASENAME}" \
    >/dev/null 2>&1 \
|| _gt_die "ERROR: OCR with 'tesseract' failed."
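One caveat when chaining with ||: _gt_die only ends the script when it runs in the current shell. Inside a command substitution it runs in a subshell, so its exit 1 ends only that subshell. The sketch below contrasts both forms. demo_wrong and demo_right are hypothetical, and _gt_notify is reduced to a stderr-only stand-in.

```shell
#!/bin/sh
# Sketch: "|| _gt_die" aborts the script only when it runs in the current
# shell. Inside $(...) it runs in a subshell, so exit 1 ends just that
# subshell. _gt_notify is a stderr-only stand-in here.

_gt_notify () { >&2 printf 'Grab Text: %s\n' "$1"; }
_gt_die () {
    _gt_notify "${1}" "critical"
    exit 1
}

demo_wrong() {
    # _gt_die exits only the subshell; execution continues below.
    TEXT=$(false || _gt_die "ERROR: OCR failed.")
    echo "still running"
}

demo_right() {
    # The assignment takes the substitution's status, checked in this shell.
    TEXT=$(false) || _gt_die "ERROR: OCR failed."
    echo "not reached"
}
```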

And there you have it!

Conclusion

This journey began after reading a few comments on Hacker News, which led to creating a portable script that fits my needs and, hopefully, the needs of others as well.

The methodical process laid out here, writing a simple shell script complete with thorough documentation and comprehensive error handling, may seem like overkill for a one-off script. But never forget that any “I’ll only need it once” script might turn into a business-critical tool later.

So save yourself some headaches and invest a little bit more time and keystrokes upfront to write a more polished and easier-to-maintain script.

Your future self will thank you!

Resources

grab-text on GitHub

Tools

Supported Screenshot tools:

maim
scrot
gnome-screenshot

Supported Clipboard tools:

xsel
xclip

Optional tools:

ImageMagick (mogrify) for optimizing the screenshot before OCR
libnotify (notify-send) for status notifications

#shell #tools
