
Up to a certain point, I sincerely believed that in today’s world manual CAPTCHA recognition was gradually becoming an anachronism, especially for something as simple as an image CAPTCHA, where you merely read text off a picture and type it in as plain text. But as it turns out, things aren’t quite so straightforward.

Sure, a thinking machine (read: AI) might easily decipher simple text on an image, but as the task grows more complex, more questions inevitably arise. And what if the task is large-scale? Then comes the next question: trust in the result, especially if the early outputs were ambiguous.
But let’s go in order:
Introduction and the Role of the Picture CAPTCHA Solver
I needed to count the number of objects in a satellite image (in principle, an engaged and motivated human could perform this task effortlessly, but I wanted to automate it). The most obvious solution that came to mind was to use a neural network—after all, they’ve advanced to the point where they can read images, analyze them, and deliver meaningful answers.
At the initial stage, no one even considered employing an image CAPTCHA solver service, where real people are responsible for deciphering the CAPTCHA—the task wasn’t about a CAPTCHA at all, right? Spoiler alert: I was mistaken! It turns out that projects involving data annotation often grow out of such CAPTCHA recognition services where humans are the ones doing the deciphering.
Regarding my specific, localized task: a data-annotation workflow on a CAPTCHA recognition platform simply didn’t fit. What I needed was a concrete numerical answer, whereas annotation means marking specific areas. And here we begin to see an answer to the question: “So, what does an image CAPTCHA solver service have to do with this?”
The solution was found precisely along these lines! We replaced the CAPTCHA image with our satellite photo and submitted it to the CAPTCHA recognition service under the guise of a simple image CAPTCHA, with the task of counting the number of objects, and instructed the respondent to enter the correct digit into the text field.
But I digress—let’s return to the matter at hand…
Using a Neural Network to Recognize Objects in a Satellite Image: Why the Smart Machine Couldn’t Handle a Simple CAPTCHA-Style Task
The scenario: a multitude of satellite images of hotels and resorts where it was necessary to digitize beach umbrellas and other related items—in simple terms, count the number of specific objects in the image.
It sounds like a simple task, and at first glance it isn’t overly complicated; but as you delve deeper, pitfalls start to emerge: low image quality, color contrasts that rule out fully confident object detection, and, of course, the imperfections inherent in the methods themselves.
The neural network was tasked with counting the number of circular objects in the image, and several different images were used. During the process it became evident that the neural network was struggling—it produced outright nonsense (its final count was significantly higher than the actual number).
At first, everything seemed promising—I uploaded the first image and formulated a prompt along the lines of: “You are a top-notch data analyst; study this image and count the number of umbrellas present.”
After reviewing the neural network’s answer, I recalled a joke about a military officer and two scientists in a hot air balloon:
Two scientists, while flying in a hot air balloon, encountered some difficulties and the balloon started to descend. Below, they saw a man and asked him, “Sir, could you please tell us where we are?”
“In a hot air balloon,” the man replied without hesitation.
“That’s a soldier,” one scientist said to the other.
“Why do you say that?” asked the second scientist.
“Because the answer was quick and precise—but utterly useless!”

Just like that, the neural network responded very quickly and precisely—reporting, say, 60 umbrellas. Only, there were fewer umbrellas than that.
Subsequent attempts involved defining specific areas for counting, but to no avail… The answers varied with each attempt, suggesting significant hallucination by the model.
Methods That Didn’t Work (Unlike the Picture CAPTCHA Solver Service)
I apologize on behalf of the authors of the methods the neural network attempted to use for counting, but in practice, none of them proved effective. Here’s what the artificial intelligence tried when working with the images:

1. Hough Transform for Circle Detection (Hough Circle Transform)
📌 Method Overview:
Automated search for circular objects in the image using OpenCV’s HoughCircles.
The method employs contour detection algorithms and highlights circles based on brightness gradients.
📌 Advantages:
✅ Effectively detects umbrellas when there is a clear contrast with the background.
✅ Operates quickly and automatically.
📌 Disadvantages:
❌ May mistakenly count other circular objects (e.g., stones, swimming pools).
❌ Struggles with partially obscured objects.
📌 Result for the Image:
Beach umbrellas: 88 (an overestimated count due to false positives).
2. Morphological Analysis
📌 Method Overview:
Utilizes morphological operations (opening, closing, noise filtering) to isolate umbrellas and deck chairs.
Initially, the image is binarized followed by contour detection.
📌 Advantages:
✅ Effective when objects are regularly arranged.
✅ Helps remove noise and minor details.
📌 Disadvantages:
❌ Performs poorly if umbrellas overlap or have complex shadows.
❌ May overlook subtle objects.
📌 Result for the Image:
Beach umbrellas: 44 (more accurate than the Hough method, yet still prone to errors).
3. K-means Clustering
📌 Method Overview:
Clusters pixels based on color and intensity using the K-means algorithm.
This allows for differentiating objects based on tone and spatial distribution.
📌 Advantages:
✅ Good for grouping similar objects.
✅ Suitable for analyzing large images with recurring elements.
📌 Disadvantages:
❌ Cannot differentiate objects by shape, only by color.
❌ Requires precise tuning of the number of clusters.
📌 Result for the Image:
Beach umbrellas: 132 (overlap with other objects resulted in an inflated count).
4. Histogram Analysis
📌 Method Overview:
Analyzes the distribution of pixels based on brightness and color.
It evaluates the frequency of certain hues (for example, white umbrellas).
📌 Advantages:
✅ Simple to implement.
✅ Can be applied even on low-quality images.
📌 Disadvantages:
❌ Does not separate objects by shape.
❌ Can result in over- or underestimation if the beach has many white objects.
📌 Result for the Image:
Beach umbrellas: 87 (close to the real number, yet still not entirely accurate).
In the final step, I asked the network to analyze the image without relying on the aforementioned methods—just like a human would. Here’s what we obtained:
Visual Counting (Human-like Analysis)
📌 Method Overview:
The image is examined visually and objects are counted manually.
Beach umbrellas are recognized by their round shape and usually light color.
Deck chairs, on the other hand, have a rectangular shape and are arranged in structured rows.
📌 Advantages:
✅ High accuracy, provided the objects are well-distinguished.
✅ Allows for consideration of landscape features and differences between objects.
📌 Disadvantages:
❌ Time-consuming when analyzing a large number of images.
❌ Subjectivity may introduce human error.
📌 Result for the Image:
Beach umbrellas: 38
Deck chairs: 80–90
Surprisingly, this method produced the most accurate result of all the ones discussed (even though the neural network still couldn’t count the exact number). And again, I have serious doubts that the neural network truly performed a visual count as it claimed.
Thus, the attempt to use a neural network for satellite image recognition completely failed.
Satellite Image Recognition Masquerading as an Image CAPTCHA (The Synergy Between CAPTCHA Solving and Object Annotation Services)
As I mentioned earlier, the human resource available through CAPTCHA services—which employ real people for object recognition tasks—can yield positive results, especially when one doesn’t tackle the task head-on but instead exercises a bit of creativity (a necessary touch given the nontrivial nature of the task).
So here’s the approach: we gave the workers a satellite image with the task of counting the umbrellas in it, entering the number as their answer. A simple check then told us whether the task was performed diligently. Counting every object in 3–4 seconds is unrealistic and implies a shortcut; but if the solve time was longer and the answer wasn’t simply “123” (a typical auto-typed reply), the response was accepted.
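That plausibility check can be sketched as a small filter. The concrete thresholds (minimum solve time, maximum plausible count) are my illustrative choices, not values from the experiment:

```python
def accept_answer(answer: str, solve_seconds: float) -> bool:
    """Plausibility filter for a worker's response (thresholds are illustrative)."""
    # Counting dozens of objects in 3-4 seconds is unrealistic: reject fast answers.
    if solve_seconds < 5.0:
        return False
    # The answer must be a plain number, not free text.
    stripped = answer.strip()
    if not stripped.isdigit():
        return False
    value = int(stripped)
    # "123" is the classic auto-typed placeholder; a real count is also bounded.
    if value == 123 or not (0 < value <= 500):
        return False
    return True

print(accept_answer("38", 25.0))   # True: plausible count, plausible time
print(accept_answer("123", 3.5))   # False: too fast and a placeholder answer
```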
Thus, manual image CAPTCHA solving produced a better outcome than the neural network. For comparison, I used the GPT-4 model, since at the time of writing it was the only one capable of handling images.

A More Advanced Approach to Recognizing Satellite Images Under the Guise of an Image CAPTCHA
Taking this further, I envision assembling a pool of reliable workers who do not intentionally make errors, who follow the task requirements precisely (even if it’s a small niche task, they execute it correctly), and then collaborating with this specific group.
However, this would require closer cooperation with the CAPTCHA service’s support team, as it is unlikely one could figure everything out independently—but the idea, in my view, holds potential.
The Financial Aspect: Which Is More Expensive, an Image CAPTCHA Solver or AI?
Now, regarding costs—in the demonstration mode, the cost for a single recognition was $0.001 per image, meaning that for 1,000 images, it cost $1 (which might explain why workers were not overly keen on this task). But if you raise the price, say to $3 per 1,000 images, it becomes more attractive, at least in theory.
Let’s compare this with the cost of using the OpenAI API with GPT-4: according to the pricing, 1 million tokens costs $0.0150, plus fees for caching and the response; for our purpose, the per-million-token figure is the one that matters.
Since I sent the image to the neural network in base64 (after converting it to that format), each image took up roughly 850,000 tokens, give or take. Factoring in caching and the network’s response fee, we’re looking at about $0.0150 per image, which for 1,000 images comes to $15.
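The arithmetic behind these figures fits in a few lines (using the prices and token counts quoted above; real API pricing varies by model and changes over time, so treat the constants as a snapshot):

```python
# Figures quoted above; check current provider pricing before relying on them.
captcha_price_per_image = 0.001          # $ per solved image in demo mode
tokens_per_image = 850_000               # rough size of one base64-encoded image
gpt_price_per_million_tokens = 0.0150    # $ per 1M tokens, before extra fees

captcha_cost_1000 = captcha_price_per_image * 1000
gpt_cost_per_image = tokens_per_image / 1_000_000 * gpt_price_per_million_tokens
gpt_cost_1000 = 0.0150 * 1000            # ~$0.0150/image once caching/response fees are added

print(f"CAPTCHA service: ${captcha_cost_1000:.2f} per 1,000 images")
print(f"GPT-4, tokens only: ${gpt_cost_per_image:.4f} per image")
print(f"GPT-4, with fees as estimated above: ${gpt_cost_1000:.2f} per 1,000 images")
```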

Even at this level, the difference is apparent, wouldn’t you agree? I’m not urging anyone to adopt any specific method—just presenting the raw numbers and my experience. Mine, at least; others might have different examples.