kentavr009 Jun 3 at 04:55

How an AI CAPTCHA Solver Works: From OCR to Deep Learning

Translation

Original author: Alex

CAPTCHA has become a familiar part of the internet: distorted text, “find all the traffic lights” images, audio riddles, and other challenges meant to distinguish humans from machines. Every bot developer or QA engineer automating web scenarios has at least once watched a script stumble over a CAPTCHA. A natural question arises: can a program be taught to solve CAPTCHAs the way a human does, quickly and reliably? In this article I look at how AI CAPTCHA solvers are built, from classical OCR methods to modern neural networks.

CAPTCHA Types and Why Bots Find Them Hard - CAPTCHA Solver AI

Before breaking a CAPTCHA, let’s look at the kinds that exist and why algorithms have trouble with them. CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) comes in many forms. The main types can be classified as follows:

Text CAPTCHAs — classic images with distorted characters (letters, digits) that must be typed manually. In the past such tasks were solved with OCR (Optical Character Recognition). But as distortions and noise intensified, simple recognition became difficult: characters overlap, bend, and are covered with interference added deliberately to defeat segmentation and recognition. This is a problem for bots, since the program must isolate and recognize each character, which is non-trivial without training.
reCAPTCHA v2 (Google) — the familiar “I’m not a robot” checkbox (which analyzes user behavior) and, if suspicious, a pop-up window with a 3 × 3 grid of images where you must pick pictures according to a criterion (e.g., all squares with cars). This CAPTCHA combines behavioral analysis with a visual object-classification task. Bots struggle because they must understand image content—a computer-vision problem.
reCAPTCHA v3 and Cloudflare Turnstile — invisible next-generation CAPTCHAs. They require no user action: the backend analyzes many behavioral, environmental, and browser parameters and assigns a hidden “suspicion rating.” If the rating is low, the user is considered human; if high, an extra check may follow. For a bot this is a tough barrier, because it must imitate human behavior across many signals, not solve a specific puzzle.
hCaptcha and FunCaptcha — alternative CAPTCHAs from other vendors (Intuition Machines and Arkose Labs, respectively). Essentially similar to reCAPTCHA v2: the user gets either a set of images to classify or an interactive task (e.g., rotate a 3-D object or find an element). Each system adds its own variations of visual puzzles.
GeeTest and other puzzles — popular on Asian services: puzzle-piece variants (drag a fragment into the correct position), shuffled image tiles, simple questions, or sliders. For example, a puzzle CAPTCHA asks the user to align a cut-out fragment with a hole in the picture, which requires coordination and image understanding. Bots find this hard because it requires both pattern recognition and simulated human input (mouse movement).
Audio CAPTCHA — usually supplements a visual CAPTCHA for users with impaired vision. A distorted recording of numbers or words plays, and the user must make them out and type them. The assumption is that humans cope with noisy speech better than machines do. Yet these CAPTCHAs are not flawless: Stanford researchers managed to crack audio CAPTCHAs automatically with a success rate of up to 75 % using speech-recognition algorithms. With powerful ASR (Automatic Speech Recognition) models, audio riddles no longer guarantee protection.
Behavioral and invisible CAPTCHAs — I already mentioned reCAPTCHA v3 and Turnstile; there are also hidden tests integrated into a site (for example, honeypot fields that only bots fill in, or analysis of form-filling speed). All of these are a newer kind of CAPTCHA that checks how natural the user's actions look. The bot faces not a specific puzzle but the need to pass for a real user: move the mouse “like a human,” wait random delays, and present a “clean” browser. Such checks are harder to defeat directly with an algorithm, so workarounds are needed, e.g., obtaining a reCAPTCHA token via an API or using browser-fingerprint databases.
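
The passive checks from the last item (a honeypot field and form-filling speed) can be illustrated with a minimal server-side sketch. The field name `website`, the 2-second threshold, and the function name are assumptions made for this example, not taken from any particular product.

```python
def looks_automated(form: dict, seconds_to_submit: float,
                    honeypot_field: str = "website",
                    min_seconds: float = 2.0) -> bool:
    """Return True when a submission trips either passive check.

    honeypot_field and min_seconds are illustrative assumptions.
    """
    # The honeypot input is hidden with CSS, so a human leaves it empty.
    # A naive bot that fills every field gives itself away.
    if form.get(honeypot_field, "").strip():
        return True
    # A form submitted faster than a person could type is suspicious.
    return seconds_to_submit < min_seconds


human = {"email": "a@example.com", "website": ""}
bot = {"email": "b@example.com", "website": "http://spam.example"}

print(looks_automated(human, 14.2))  # False
print(looks_automated(bot, 14.2))    # True: honeypot was filled
print(looks_automated(human, 0.3))   # True: submitted too fast
```

Real systems combine many more signals into a score; two hard rules like these are only the simplest form of the idea.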

For more detail, see my previous article about CAPTCHA types.

Each CAPTCHA type requires a special solving approach. A universal CAPTCHA solver remains a challenge: the bot system must read text, classify pictures, or synthesize behavior, depending on the check encountered. Let’s move on to how algorithms try to solve CAPTCHAs and to the evolution of these methods.

From OCR to Neural Networks (CAPTCHA AI Solver): The Evolution of Bypassing CAPTCHA

The first attempts at automatic CAPTCHA solving were closely tied to the development of OCR (Optical Character Recognition). A classical text CAPTCHA—distorted letters/digits on a noisy background—was essentially designed to puzzle OCR systems. Old CAPTCHA versions could be cracked with relatively simple methods: filtering the image, extracting contours, segmenting into individual characters, and template matching or standard OCR engines like Tesseract. For some simple CAPTCHAs you could skip “intelligence” altogether: just overlay several sample digits to get a mask unique to each character and find matches in the picture. But such tricks suit only the most primitive and uniform CAPTCHAs.
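
To make the template-matching trick concrete, here is a minimal sketch in pure Python. It assumes the most primitive case: fixed-width 3×5 glyphs at known positions with a one-pixel gap. The tiny bitmaps and the function names are invented for illustration.

```python
# Hypothetical 3x5 bitmap templates ("1" = ink, "0" = background).
TEMPLATES = {
    "1": ["010", "110", "010", "010", "111"],
    "4": ["101", "101", "111", "001", "001"],
    "7": ["111", "001", "010", "010", "010"],
}
GLYPH_W = 3


def match_glyph(glyph):
    """Pick the template that agrees with the glyph on the most pixels."""
    def score(template):
        return sum(g == t
                   for g_row, t_row in zip(glyph, template)
                   for g, t in zip(g_row, t_row))
    return max(TEMPLATES, key=lambda ch: score(TEMPLATES[ch]))


def read_fixed_grid(image, n_chars, gap=1):
    """Cut the image at fixed offsets and match each slice to a template."""
    text = []
    for i in range(n_chars):
        left = i * (GLYPH_W + gap)
        glyph = [row[left:left + GLYPH_W] for row in image]
        text.append(match_glyph(glyph))
    return "".join(text)


# The string "714" with one noisy pixel flipped in the middle glyph.
image = [
    "11100100101",
    "00101100101",
    "01000100111",
    "01000110001",
    "01001110001",
]
print(read_fixed_grid(image, n_chars=3))  # 714
```

The approach tolerates a stray pixel, but it depends entirely on knowing where each character sits and what font it uses, which is exactly what later CAPTCHAs took away.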

CAPTCHA Complication vs. Algorithm Improvement. CAPTCHA creators responded to cracks by increasing complexity: characters were more heavily distorted, color noise and background patterns were added, fonts became inconsistent. CAPTCHAs appeared with stuck-together characters where letters overlap. All this hindered segmentation—the key step for classical OCR. Machine learning entered the game: researchers trained models to distinguish CAPTCHA characters even in noise. Back in the 2000s there were papers applying SVM and other algorithms to recognize specific CAPTCHA generators. But the breakthrough came with deep learning.
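
Why merged characters hurt classical OCR can be shown with a sketch of the simplest segmentation method, splitting at ink-free columns. The bitmaps are toy data invented for this example.

```python
def segment_columns(image):
    """Return (start, end) column ranges separated by ink-free columns."""
    width = len(image[0])
    ink = [any(row[x] == "1" for row in image) for x in range(width)]
    segments, start = [], None
    for x, has_ink in enumerate(ink):
        if has_ink and start is None:
            start = x                      # a character begins
        elif not has_ink and start is not None:
            segments.append((start, x))    # a character ends at a blank column
            start = None
    if start is not None:
        segments.append((start, width))
    return segments


# Three well-separated glyphs.
clean = [
    "11100100101",
    "00101100101",
    "01000100111",
    "01000100001",
    "01001110001",
]
# Same image, but one extra pixel bridges the gap between glyphs 2 and 3.
touching = clean[:4] + ["01001111001"]

print(segment_columns(clean))     # [(0, 3), (4, 7), (8, 11)]
print(segment_columns(touching))  # [(0, 3), (4, 11)]
```

A single bridging pixel is enough to fuse two characters into one segment, so every later recognition step receives the wrong input.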

In 2014 Google announced a sensational result: its neural network learned to solve the hardest distorted-text reCAPTCHA challenges with 99.8 % accuracy (how ironic: Google itself led the way in breaking the very defense it operated, reCAPTCHA). The machine outperformed humans at what was meant to be a purely human task! This made text distortion largely pointless: if an algorithm reads the characters better than people do, the protection loses its meaning. This is probably why Google quickly moved reCAPTCHA from noisy text to pictures and behavior evaluation.

On that irony, another thought suggests itself: if not for the vanity of some and the foolishness of others, we might still be at the stage of the first reCAPTCHA or even simple text CAPTCHAs. The moment some enthusiast found a solution, he promptly posted it for all to see, which in turn prompted CAPTCHA developers to make things harder… My tongue is my enemy…

A similar plot is unfolding with image CAPTCHAs. It was initially thought that, say, recognizing traffic lights in photos is harder for computers than for people because humans have advanced visual perception. But with the revolution in computer vision (deep convolutional networks) this asymmetry disappeared. A modern image-classification model can detect objects with high accuracy; consider how well your phone recognizes animals, signs, and other objects in photos. A fresh example: in 2024 researchers reported that an advanced YOLO model solved reCAPTCHA v2 image challenges (traffic lights and the like) in 100 % of cases, whereas the best earlier results were ~70 %. Moreover, the AI bot had to get through about as many pictures as an average human before the system let it through. One would like to believe that the era announced by the slogan “we have officially entered the post-CAPTCHA era,” in which classical checks can no longer tell a human from a smart machine, has finally arrived, but it feels like this is not yet the end.

It is important to understand that deep learning not only increased accuracy but changed the approach itself. Previously, a script had to follow a set of rigid steps: filter the background, split the characters, recognize them separately. Now an end-to-end neural network can be trained: you feed it a CAPTCHA picture, and it outputs the text string (or the probability of the needed class for an image). The network can learn auxiliary tasks such as segmentation internally, without hand-coded rules. For example, a DenseNet variation called DFCR, coming from China (from where else, if not the nation that once gave the world gunpowder, right?), achieved > 99.9 % accuracy on CAPTCHAs with noise and stuck-together characters, because the deep convolutional network learned to see separate symbols even in difficult cases and classify them confidently.

For clarity, a small table:

Approach to Cracking CAPTCHA	Classical (OCR, scripts)	Modern (AI / neural nets)
Requirements	Filtering rules, OCR libraries, templates. Requires manual tuning per CAPTCHA.	Trained ML model (CNN/RNN). Needs a training dataset but then works more universally.
Character Segmentation	Necessary: the boundary of each character must be found before recognition. Fails frequently with interference or merged letters.	Not explicitly needed: an end-to-end model recognizes the full text at once, segmenting implicitly through its internal features. Even stuck-together symbols are recognized correctly.
Accuracy on Difficult CAPTCHAs	Limited, often < 90 % with heavy distortions. To improve, heuristics must be added for the specific case.	Near 100 % with sufficient training. Makes fewer mistakes than humans on typical tasks but may be vulnerable to totally new types not in the training data.
Adaptability	Poor transfer to new CAPTCHA types: code/logic must be reworked.	Can be fine-tuned on new data. Universal architectures (e.g., ResNet, Transformer) apply to various tasks.
Solving Speed	High (milliseconds) since algorithms are simple, but on difficult CAPTCHAs may waste time on segmentation attempts.	High: neural nets perform recognition in tens of milliseconds on GPU. A bottleneck is data preparation and, for services, the task queue (discussed later).

As we can see, AI has surpassed classical methods in flexibility and efficiency. But how exactly do neural networks solve the CAPTCHA problem? Let’s examine the main architecture types applied to crack different CAPTCHAs.

Neural Networks vs. CAPTCHA: CNN, RNN, CRNN, Transformers (How AI CAPTCHA Solvers Work)

Modern AI CAPTCHA solvers rely on a rich arsenal of deep-learning models. Architecture choice usually depends on CAPTCHA type. Here are the main approaches:

Convolutional Neural Networks (CNN) — specialize in images. CNNs learn to pull out meaningful features from a picture: letter contours, textures, object shapes. Therefore, in CAPTCHAs they are primarily used for character recognition or image classification. A simpler option: train a CNN to recognize individual characters (0–9, A–Z)—then the CAPTCHA image must first be sliced into symbols. A more advanced option feeds the entire CAPTCHA through a CNN, obtaining a feature for each image section. However, CNNs alone do not model a sequence of characters, so in complex CAPTCHAs they are supplemented with recurrent layers.
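
What “pulling out features” means at the level of a single filter can be sketched without any framework. The kernel below is hand-written for illustration; in a real CNN its values are learned during training. As in deep-learning libraries, the operation is technically a cross-correlation (the kernel is not flipped).

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


# Toy image: dark left half, bright right half (a vertical stroke edge).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked kernel that responds to dark-to-bright transitions.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

for row in conv2d(image, vertical_edge):
    print(row)  # each row is [0, 2, 0]: the filter fires only at the edge
```

A CNN stacks many such filters, so later layers respond to strokes, curves, and eventually whole character shapes.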

Recurrent Neural Networks (RNN) — a family of networks capable of processing sequences (data series). In CAPTCHA context, RNNs are used to read text left-to-right, as a person does. For example, you can first extract image features (vector representations of image columns) and feed them into an RNN, which sequentially “reads” them and outputs a sequence of characters. Classic modules — LSTM or GRU — can remember context, which is useful if characters influence each other (say, the algorithm tries to consider probabilities of letter combinations). RNNs are especially helpful for dynamic or sequential CAPTCHAs: e.g., when the user enters several digits appearing one by one, or for audio CAPTCHAs (where the sound sequence must be converted to character sequence). Nonetheless, by themselves RNNs work worse with pictures, so they are often combined with CNNs.

CRNN (Convolutional Recurrent Neural Network) — a CNN and RNN combination that has become the de-facto standard for recognizing text CAPTCHAs and text in images in general. A typical scheme: a convolutional network (e.g., several conv + pooling layers) extracts a CAPTCHA feature map that can be treated as a sequence of features along the image width. Then comes a recurrent block (often a BiLSTM, a bidirectional LSTM), which processes this feature sequence and takes neighboring context into account. The RNN output is then transformed into a predicted sequence of characters. Such a model is often trained with CTC loss (Connectionist Temporal Classification), which aligns the per-column outputs with the real CAPTCHA text without being told where each character sits. Thanks to CTC, the model does not need perfect character segmentation: it learns to “stretch” the output to the needed length itself. The result: a CRNN can read an entire CAPTCHA even when characters overlap or their count varies from CAPTCHA to CAPTCHA.

In real projects CRNNs have repeatedly demonstrated their effectiveness. For example, a CNN+BiLSTM model trained on 20 k synthetic CAPTCHAs (random letters and digits with various fonts and noise lines) showed high accuracy on previously unseen CAPTCHAs, and the model even guessed symbols with unfamiliar distortions. Compared with a classical approach (splitting a CAPTCHA image into five parts and classifying each fragment with a separate CNN model), the end-to-end LSTM model was much more reliable and generalized better across fonts.
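
How a CTC-trained CRNN turns per-column predictions into text of variable length can be shown with the standard greedy decoding rule: take the most likely symbol per frame, merge consecutive repeats, then drop the blank. The three-symbol alphabet and the probabilities are made up for this sketch; a real model would emit them from its final softmax layer.

```python
BLANK = "-"  # the special "no character" symbol that CTC adds


def ctc_greedy_decode(frame_probs, alphabet):
    """Greedy CTC decoding: argmax per frame, merge repeats, drop blanks."""
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    decoded, prev = [], None
    for idx in best:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank.
        if idx != prev and alphabet[idx] != BLANK:
            decoded.append(alphabet[idx])
        prev = idx
    return "".join(decoded)


alphabet = [BLANK, "A", "B"]
# Seven image columns; the best path is "A A - A B B -".
frame_probs = [
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.10, 0.20, 0.70],
    [0.10, 0.10, 0.80],
    [0.90, 0.05, 0.05],
]
print(ctc_greedy_decode(frame_probs, alphabet))  # AAB
```

The blank between the two runs of "A" is what lets the decoder output a doubled letter, and the repeat-merging is what lets a wide character span several columns without being counted twice.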

Transformers and Attention — the newest class of models that has conquered NLP and CV. In CAPTCHA context, transformers are still at the research stage but have huge potential. A Transformer handles sequences without recurrence, thanks to a self-attention mechanism. For example, you can take a Vision Transformer (ViT) — split the CAPTCHA image into patches, pass them through self-attention layers to obtain feature vectors. Then apply a text decoder (another transformer) that will generate text based on the picture, “attentively” looking at the necessary image areas via the attention mechanism. Essentially, this is similar to how large models now describe pictures with text. There are already examples where transformers have been successfully applied to crack CAPTCHAs: a Swin-Transformer based architecture showed > 90 % accuracy on complex text CAPTCHAs, surpassing classic CNN+RNN. And in 2023 there were attempts to involve large language models (LLM) for logical CAPTCHAs, though accuracy was a modest ~63 %. But the trend is clear: transformers can combine vision and language, solving even CAPTCHAs with scene descriptions or complex questions.
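
The self-attention step at the heart of a transformer can be sketched in a few lines of plain Python. This simplified version uses the patch vectors directly as queries, keys, and values; a real Vision Transformer first applies learned projection matrices, which are left out here. The two-dimensional “patch features” are toy values.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(patches):
    """Scaled dot-product attention with Q = K = V = patches."""
    d = len(patches[0])
    out = []
    for q in patches:
        # Similarity of this patch to every patch, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in patches]
        weights = softmax(scores)
        # New representation: weighted average of all patch vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, patches))
                    for i in range(d)])
    return out


# Three toy patch vectors; patches 0 and 2 look alike.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
for vec in self_attention(patches):
    print([round(x, 3) for x in vec])
```

Each output vector mixes information from all patches at once, weighted by similarity, which is how the model relates distant parts of an image without reading it column by column.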

Generative Adversarial Networks (GAN) — although GANs do not directly “solve” CAPTCHAs, they contributed from another angle. The idea of GANs—adversarial training between a generator and a discriminator—was applied to generate CAPTCHAs resembling real ones, to improve solver training. The idea is simple: a generator creates CAPTCHA images and a discriminator (essentially, a CAPTCHA solver) tries to distinguish generated from real. During training the generator begins to produce CAPTCHAs increasingly difficult for the discriminator—effectively, the network learns on automatically generated “hard” examples, helping to increase recognition accuracy. This approach allows unlimited training data and adapts to new CAPTCHA distortions.

Practice: Tools and Services for Solving CAPTCHA Using AI

Theory is interesting and even at times not boring, but how to apply all the above in practice? Let’s look at existing tools, from open-source libraries to commercial services that position themselves as AI CAPTCHA solvers.

Open-Source: GitHub and Research Communities

In recent years the community has published many projects demonstrating how to bypass different CAPTCHAs. A simple GitHub search for “captcha solver ai” or “AI captcha solver GitHub” yields dozens of repositories. As a rule, these are either research projects or utilities tailored to a particular CAPTCHA.

Text CAPTCHA solvers on neural networks. For example, one project (the CAPTCHA-Solver repository) describes in detail how to build a CNN+BiLSTM model. The authors generate a set of ~20 k synthetic CAPTCHAs (random letters and digits with different fonts and noise lines) and train the model to recognize sequences of length 5. The code uses PyTorch and TensorFlow, with OpenCV and Pillow for image processing, and the pytesseract library serves as a baseline for comparing quality. The trained model solves > 95 % of test CAPTCHAs in fractions of a second, whereas standard OCR fails on most complex distortions. Similar projects publish datasets too: for example, there is an open dataset of ~30 k CAPTCHAs from mail.ru with scripts to train a model on them.

Scripts based on OpenCV + OCR. Some repositories offer solutions without deep learning for simple CAPTCHAs: they find contours, cut out the characters, and run Tesseract, or even, as mentioned above, compare against bitmap templates. Such projects are appealing in their simplicity and can serve as a baseline: if a CAPTCHA is simple, there is no need to build neural networks. However, such CAPTCHAs are almost gone from popular sites, since spammers defeated them long ago, so nowadays more intelligent algorithms are what matter.

Browser extensions and scripts to bypass CAPTCHAs. In web automation there are well-known tools that can be plugged into Selenium / Puppeteer. For example, the open-source extension Buster (for Chrome / Firefox) automatically presses the “play audio” button in reCAPTCHA and sends the file to the Google Speech-to-Text API; the recognized text is entered back, bypassing the CAPTCHA for free. The SolveCaptcha extension instead solves CAPTCHAs with AI using its own internal algorithms. Another example is the 2captcha-solver library (npm, Python), which integrates with the 2Captcha service to send a CAPTCHA for solving and receive the answer in code. GitHub has a “captcha-solving” topic with collections of such tools. Many of them support several services at once (reCAPTCHA, hCaptcha, FunCaptcha, etc.) and automatically identify the CAPTCHA type on a page. An open-source tool usually provides a convenient API, and “under the hood” may use either external services or built-in models (like Buster for audio or SolveCaptcha for other CAPTCHA types).
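
The “send a CAPTCHA, receive the answer in code” flow usually boils down to submit-then-poll. The sketch below shows only that control flow. The `submit` and `fetch_result` callables, the `FakeService` class, and the default intervals are hypothetical stand-ins, not the API of 2Captcha or any other real service.

```python
import time
from typing import Callable, Optional


def solve_via_service(submit: Callable[[bytes], str],
                      fetch_result: Callable[[str], Optional[str]],
                      image: bytes,
                      poll_interval: float = 5.0,
                      timeout: float = 120.0,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Submit a task, then poll until an answer arrives or time runs out."""
    task_id = submit(image)
    waited = 0.0
    while waited <= timeout:
        answer = fetch_result(task_id)   # None means "not ready yet"
        if answer is not None:
            return answer
        sleep(poll_interval)
        waited += poll_interval
    raise TimeoutError(f"task {task_id} not solved within {timeout} s")


class FakeService:
    """Stand-in for a remote solver: answers on the third poll."""

    def __init__(self):
        self.polls = 0

    def submit(self, image: bytes) -> str:
        return "task-1"

    def fetch_result(self, task_id: str) -> Optional[str]:
        self.polls += 1
        return "x7Kp2" if self.polls >= 3 else None


service = FakeService()
# sleep is replaced with a no-op so the demo finishes instantly.
answer = solve_via_service(service.submit, service.fetch_result,
                           b"fake-image-bytes", sleep=lambda s: None)
print(answer)  # x7Kp2
```

Passing the transport functions in as arguments keeps the loop independent of any one provider, which mirrors how multi-service tools switch backends.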

GitHub has not only CAPTCHA-solving projects—some are dedicated to CAPTCHA generation. For example, the captcha library in Python allows generating typical text CAPTCHAs for training models.

Commercial Services: Humans (Human CAPTCHA Solver) vs. Machines (AI CAPTCHA Solver)

If you have neither the desire nor the opportunity to develop your own neural network, ready-made services come to the rescue. Historically they relied on human labor: you send a CAPTCHA image to the server, a real person looks at it and types the answer within a few seconds, and you get it back. Classic representatives: 2Captcha (aka RuCaptcha), SolveCaptcha, DeathByCaptcha, etc. Now AI-only services are entering the market, offering CAPTCHA solving without human participation, faster and cheaper.

For clarity, here is a table comparing the main options, human and AI:

Service / Approach	Example	Solution Method	Time and Success	Approximate Cost
Human	2Captcha, Anti-Captcha	Live people worldwide type CAPTCHAs for pay.	~7–20 s on reCAPTCHA (images faster). ~99 % accuracy (several people answer if necessary).	~$2–3 per 1 000 solved reCAPTCHAs (~$0.002–0.003 each). About $0.5–1 per 1 000 simple text CAPTCHAs.
Artificial Intelligence	noCaptchaAi	Specialized neural networks and browser emulation.	~5 s on reCAPTCHA v2 (often limited by CAPTCHA’s own minimum time). Accuracy up to 99 % on supported types. Possible failures on brand-new types.	~$0.8–1 per 1 000 solved reCAPTCHAs (~$0.0008–0.001 each).
Hybrid	SolveCaptcha (extension), others	Tries AI first; if unsuccessful, involves a human.	Combines pros: AI instantly solves easy 80–90 %, humans finish the rest. Total time ~5–15 s, success ~99.9 %.	~$1–2 per 1 000 (price depends on the share of tasks for humans)

Note that human services may face worker unavailability at times, while AI services may require model updates when CAPTCHA algorithms change.
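
The pricing in the table reduces to simple arithmetic, sketched below. The sample inputs ($0.9 and $2.5 per 1,000, an 85 % AI share) are picked from within the ranges quoted in the table purely for illustration, and the function names are made up.

```python
def cost_per_solve(price_per_1000: float) -> float:
    """Convert a price per 1,000 CAPTCHAs into a price per single solve."""
    return price_per_1000 / 1000


def hybrid_price_per_1000(ai_price: float, human_price: float,
                          ai_share: float) -> float:
    """Blended price when AI handles ai_share of tasks and humans the rest."""
    return ai_share * ai_price + (1 - ai_share) * human_price


print(cost_per_solve(2.5))                              # 0.0025
print(round(hybrid_price_per_1000(0.9, 2.5, 0.85), 2))  # 1.14
```

With these sample numbers the blended price lands at about $1.14 per 1,000, inside the $1–2 range listed for hybrid services. The larger the share that falls back to humans, the closer the price moves to the human rate.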

As we can see, AI CAPTCHA solvers are already economically advantageous in many cases. It is not surprising that even classic services are beginning to adopt machine learning to avoid losing the market to competitors.

The Future of AI CAPTCHA Solvers in the AI Era

There is a sense that the current CAPTCHA market situation looks like a tipping point: classic Turing tests for users are doing an ever poorer job. AI has learned to read distorted text, see objects in photos, and decipher audio—sometimes with superhuman accuracy. Add synthetic data, GANs, and distributed computing, and any specific CAPTCHA will sooner or later be cracked by a machine.

CAPTCHA developers, of course, are not sitting idle. There is a significant shift toward invisible checks (behavioral factors) and toward extensive contextual data (browser history, device parameters, even analysis of mouse movements, phone tilt angle, and more). Ideally the check should happen without the user noticing it. For example, Cloudflare Turnstile asks no questions at all, performing a “security check” in the background, and in Cloudflare's opinion this is more effective than classical CAPTCHAs. Another trend is multilayer authentication: before showing a CAPTCHA, the system analyzes whether it already knows the user (logged in, has a token, origin). Possibly the CAPTCHA of the future will move entirely from the UI to the backend, and for suspicious cases measures like SMS verification or biometrics will be applied (which already goes beyond classical CAPTCHA).

With the development of web protocols and identification, we may eventually access the web via a trusted attestation (through a government portal account, a device, or a digital passport)—and then extra checks will no longer be needed. But there are people who voluntarily give their government accounts to scammers…

No, this is definitely not the end!
