The first text-based CAPTCHA (we'll call it simply CAPTCHA for brevity) was deployed in 1997 by the AltaVista search engine. It prevented bots from submitting Uniform Resource Locators (URLs) to the search index.
Back then it was a decent defense. However, progress can't be stopped, and this defense was soon bypassed with the OCR software available at the time (FineReader, for example).
CAPTCHAs became more complex: noise and distortions were added so that off-the-shelf OCR software could no longer recognize the text. Then OCR engines custom-built for the task appeared, which cost the attacking side extra money and expertise. CAPTCHA developers, in turn, had to understand the challenges attackers faced and which distortions to add in order to make automated recognition harder.
Owing to a misunderstanding of the principles OCR engines are based on, some CAPTCHAs were given distortions that were more of a hassle for regular users than for a machine.
OCR for different types of CAPTCHAs was built on heuristics, and the hardest part was segmenting the CAPTCHA into stand-alone symbols, which could then be recognized easily by a CNN (LeNet-5, for example); an SVM also showed good results, even on raw pixels.
In this article I'll try to cover the whole history of CAPTCHA recognition, from heuristics to modern automated recognition systems. We'll figure out whether the CAPTCHA is still alive.
I’ll review the yandex.com CAPTCHA. The Russian version of the same CAPTCHA is more complex.
It looks like all it takes is binarizing the image (converting it to black and white), and the job is done, because the segmentation part looks easy enough.
To lower the efficiency of heuristic algorithms, Yandex focused on making binarization difficult: some CAPTCHAs require a different binarization threshold for different parts of the image, so the threshold has to be selected adaptively.
An average CAPTCHA consists of 14 symbols. If we have a classifier with 99% per-symbol accuracy, and all CAPTCHA images are segmented correctly, that gives us 87% accuracy on a two-word CAPTCHA (0.99^14 ≈ 0.87). Note that model complexity is tied to the number of classes (letters, digits, signs) used in the whole CAPTCHA set: the more classes there are, the harder it is to achieve high per-symbol accuracy with a basic model. That's why the yandex.ru CAPTCHA is more complex: it uses both Russian and English words.
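The arithmetic behind that estimate is simple: assuming perfect segmentation and independent per-symbol errors, full-CAPTCHA accuracy is the per-symbol accuracy raised to the number of symbols.

```python
def captcha_accuracy(per_symbol_acc: float, n_symbols: int) -> float:
    """Probability of solving a whole CAPTCHA, assuming perfect
    segmentation and independent per-symbol recognition errors."""
    return per_symbol_acc ** n_symbols

# 99% per symbol over 14 symbols -> roughly 87% per CAPTCHA
print(round(captcha_accuracy(0.99, 14), 2))  # 0.87
```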
We can see the CAPTCHA's weaknesses as well: after correct binarization the letters are easily segmented, and we can also run a dictionary check on the result.
Below is a description of the recognition algorithm, along with the training and test sets.
The data collection stage
Let’s download some CAPTCHAs and split them into training and test sets.
We used a VPN to download CAPTCHA images from yandex.com. When I tried to create an account manually in a browser, the system detected me as a bot, so I suspect I got more complex CAPTCHAs than usual. To quote the Yandex team: “On the second stage, if we still think that the request is suspicious but the level of certainty is not high, we show the simplest CAPTCHA. When we think we have met a robot, the complexity can be raised. Simple and effective.” The browser type (Opera, Chrome, Edge) didn't seem to affect the result. Overall we collected 4,847 CAPTCHAs for the training set and 354 for the test set, and spent a few dollars labeling them with the decaptcher.com service.
Heuristic recognition algorithm
The algorithm consists of several steps: binarization, noise cleaning, extraction of the two words, and, if necessary, normalizing the slant of each word, followed by segmentation and recognition.
1. Image binarization: converting a color or grayscale image into a black-and-white one. As a result we get background (0) and object (1) pixels, which also suppresses some noise. The binarization algorithms are described at the end. Each stage has parameters, which we'll optimize later. From here on we deal with binarized images only.
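A minimal sketch of the simplest variant, a single global threshold (the adaptive methods come later); the threshold value here is illustrative:

```python
import numpy as np

def binarize(gray: np.ndarray, threshold: int) -> np.ndarray:
    """Map a grayscale image to background (0) and object (1).
    Dark pixels (ink) below the threshold become the object."""
    return (gray < threshold).astype(np.uint8)

img = np.array([[250, 30], [200, 10]], dtype=np.uint8)
print(binarize(img, 128))  # [[0 1]
                           #  [0 1]]
```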
2. Cleaning noise from the binarized image. We extract the connected components of the image; these are our candidate objects. Then, based on the number of pixels in each component, we split them into objects and noise: everything within a predefined range (from A to B pixels) is an object, the rest is filtered out. But the dot of the letter i, for example, could then be discarded as noise, so we add a proximity parameter C: the maximum distance between a small blob and a genuine object for the blob to be kept. The parameters A, B, and C are found by simple brute force, maximizing the number of correctly recognized CAPTCHAs.
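The step above can be sketched as follows; this is a plain BFS component labeling with the A/B size filter and the proximity rule for small blobs, not the article's exact implementation:

```python
import numpy as np
from collections import deque

def clean_noise(binary: np.ndarray, a: int, b: int, c: float) -> np.ndarray:
    """Keep components with a <= size <= b pixels; a smaller blob survives
    only if it lies within distance c of a kept object (e.g. an 'i' dot).
    a, b, c are tuned by brute force on the training set."""
    h, w = binary.shape
    labels = np.zeros((h, w), dtype=int)
    comps = []                                   # lists of (y, x) pixels
    for y in range(h):
        for x in range(w):
            if binary[y, x] and not labels[y, x]:
                labels[y, x] = len(comps) + 1    # new component label
                q, pix = deque([(y, x)]), []
                while q:                         # 4-connected flood fill
                    cy, cx = q.popleft()
                    pix.append((cy, cx))
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < h and 0 <= nx < w and binary[ny, nx] and not labels[ny, nx]:
                            labels[ny, nx] = len(comps) + 1
                            q.append((ny, nx))
                comps.append(pix)
    kept = [p for p in comps if a <= len(p) <= b]
    for p in (p for p in comps if len(p) < a):   # rescue nearby small blobs
        cy = sum(y for y, _ in p) / len(p)
        cx = sum(x for _, x in p) / len(p)
        if any(min((cy - y) ** 2 + (cx - x) ** 2 for y, x in k) <= c * c for k in kept):
            kept.append(p)
    out = np.zeros_like(binary)
    for p in kept:
        for y, x in p:
            out[y, x] = 1
    return out
```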
3. Extracting the two words. The gap between the words lies approximately at the center of the image. We find the exact position using a window of width X, another optimization parameter. Within this window we compute a column-wise brightness histogram and use a threshold to decide whether each column is a space or part of an object. The longest run of space columns marks the separation point between the two words. This step has two parameters to tune: the window width and the threshold.
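A minimal sketch of this projection-based split, assuming a binarized image with ink = 1:

```python
import numpy as np

def split_words(binary: np.ndarray, window: int, threshold: int):
    """Split a two-word CAPTCHA at the longest run of near-empty columns
    inside a window around the image centre. `window` and `threshold`
    are the two parameters selected on the training set."""
    h, w = binary.shape
    centre = w // 2
    lo, hi = max(0, centre - window // 2), min(w, centre + window // 2)
    profile = binary[:, lo:hi].sum(axis=0)        # ink pixels per column
    best_len, best_start, run, start = 0, centre - lo, 0, 0
    for i, ink in enumerate(profile):
        if ink <= threshold:                      # "space" column
            if run == 0:
                start = i
            run += 1
            if run > best_len:
                best_len, best_start = run, start
        else:
            run = 0
    split = lo + best_start + best_len // 2       # middle of the gap
    return binary[:, :split], binary[:, split:]
```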
4. Slant normalization (for the CAPTCHAs that need it). To improve letter recognition accuracy, we normalize the slant of some CAPTCHAs. We apply the morphological closing operation so the word merges into a single object, then detect the object's orientation, i.e. the angle between the x-axis and the major axis of the ellipse in which the object is inscribed.
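The orientation can be estimated from the object's second-order image moments; a sketch (the morphological closing step is omitted here, and the image would then be rotated by the negative of this angle):

```python
import numpy as np

def orientation_angle(binary: np.ndarray) -> float:
    """Angle (radians) between the x-axis and the major axis of the
    ellipse with the same second-order moments as the object pixels."""
    ys, xs = np.nonzero(binary)
    x = xs - xs.mean()                # centre the pixel coordinates
    y = ys - ys.mean()
    mu20 = (x * x).mean()             # central second-order moments
    mu02 = (y * y).mean()
    mu11 = (x * y).mean()
    return 0.5 * np.arctan2(2 * mu11, mu20 - mu02)
```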
5. Segmentation. By now we have images of the two words, and we need to split them into separate letters for recognition. In our case the letters do not intersect, so we can extract the connected components of the image; these are the symbols we pass to the recognizer.
6. Symbol recognition. To recognize the symbols I used the LeNet-5 convolutional network, with 26 classes instead of the original 10. For pre-training I generated a set of 52,000 letter images (2,000 per class) from several fonts with various distortions and trained the network on it. By tuning the binarization and segmentation parameters we then obtain a set of real letters for classifier training; this set is used on the second and subsequent training iterations, while the first iteration used only synthetically generated symbols. The process is iterative; overall I gathered 47,000 letter images. The classes were unevenly distributed, but that was expected, since the CAPTCHA uses common words rather than random strings. The final classifier reached 98.48% accuracy.
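A Keras sketch of a LeNet-5-style classifier adapted to 26 letter classes; the exact layer sizes and input resolution are assumptions, not the article's precise configuration:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_lenet5(num_classes: int = 26, input_shape=(32, 32, 1)):
    """LeNet-5-style CNN: two conv/pool stages, then dense layers.
    The original had 10 digit classes; here we use 26 letters."""
    return keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(6, 5, activation="tanh"),
        layers.AveragePooling2D(2),
        layers.Conv2D(16, 5, activation="tanh"),
        layers.AveragePooling2D(2),
        layers.Flatten(),
        layers.Dense(120, activation="tanh"),
        layers.Dense(84, activation="tanh"),
        layers.Dense(num_classes, activation="softmax"),
    ])

model = build_lenet5()
# model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(letters, labels, ...)  # letters: 32x32 binarized symbol crops
```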
Here are the recognition results depending on the binarization approach. At first I used a single binarization threshold for all CAPTCHAs, selected on the training set; this approach gave 15% accuracy on both the training and the test set.
Otsu's method gave 13.6% accuracy on the training set and 12.25% on the test set.
The Sauvola binarization method gave 26.01% accuracy on the training set and 25.8% on the test set.
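For reference, Otsu's method picks the single global threshold that maximizes between-class variance of the histogram (Sauvola instead computes a local threshold per pixel from windowed mean and standard deviation, which is why it copes better with uneven backgrounds). A from-scratch sketch of Otsu:

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Return the global threshold maximizing between-class variance
    of the grayscale histogram (Otsu's method)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    cum_p = np.cumsum(prob)                       # class-0 weight up to t
    cum_mean = np.cumsum(prob * np.arange(256))
    global_mean = cum_mean[-1]
    best_t, best_var = 0, -1.0
    for t in range(255):
        w0, w1 = cum_p[t], 1 - cum_p[t]
        if w0 == 0 or w1 == 0:
            continue
        m0 = cum_mean[t] / w0                     # class means
        m1 = (global_mean - cum_mean[t]) / w1
        var = w0 * w1 * (m0 - m1) ** 2            # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t
```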
Noticing that the CAPTCHAs can be split into groups by background, let's try clustering. The features are extracted with the zoning method: we split the image into non-overlapping areas of a fixed size and take the average brightness of each area. The smaller the zone, the more precise the description of the image. For clustering we use K-means. Classical methods for choosing the optimal number of clusters suggested 2; empirically we settled on 5. The clusters contain: 1 – 842 images, 2 – 1,300 images, 3 – 1,237 images, 4 – 770 images, 5 – 698 images. For each cluster we select its own Sauvola binarization parameters on the training set. As a result we get 31.22% accuracy on the training set and 30.8% on the test set. Here are examples from every cluster:
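The zoning feature extraction can be sketched in a few lines of NumPy; the resulting vectors would then be fed to a K-means implementation (e.g. scikit-learn's `KMeans` with `n_clusters=5`):

```python
import numpy as np

def zoning_features(gray: np.ndarray, zone: int) -> np.ndarray:
    """Zoning: split the image into non-overlapping zone x zone cells
    and take the mean brightness of each cell as one feature."""
    h, w = gray.shape
    h2, w2 = h - h % zone, w - w % zone   # crop to a multiple of the zone size
    cells = gray[:h2, :w2].reshape(h2 // zone, zone, w2 // zone, zone)
    return cells.mean(axis=(1, 3)).ravel()

# Example: a 4x4 image, dark left half and bright right half, zone=2
img = np.zeros((4, 4)); img[:, 2:] = 255
print(zoning_features(img, 2))  # [  0. 255.   0. 255.]
```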
At this point we have correctly binarized CAPTCHAs. We can also try a heavier approach to binarization, such as U-Net, a network with 13.5+ million parameters. With it we get 39.2% accuracy on the training set and 38.7% on the test set.
So, what about a fully automated recognition pipeline with no heuristics at all? The input is just CAPTCHAs with their answers, and the output is an OCR model. The Keras site recently published example code that, with small modifications, can be used to recognize almost any text-based CAPTCHA: https://keras.io/examples/vision/captcha_ocr/
With this approach we get 55% accuracy on the training set and 39% on the test set. The network overfit because the training set was small for a network of this size, but with properly selected regularization it trained successfully. Augmenting the training set with mirrored images raises accuracy to 58% on the training set and 43% on the test set. Adding new CAPTCHAs to the training set improves accuracy further.
The network often makes mistakes on CAPTCHAs of the following type:
This is consistent with what the Yandex CAPTCHA authors note in their article: “The most challenging datasets with recognized words nowadays are heavily distorted texts (irregular text recognition).” In our particular case, however, it is rather a limitation of the network architecture: the features produced by the CNN layers are read by the LSTM layer strictly left to right, even though a single vertical slice may contain parts of multiple symbols. This is easy to demonstrate by creating a synthetic set of such CAPTCHAs (see fig. 1), with 10,000 for training and 1,000 for testing. The trained network gets as few as 20% of them right and 80% wrong.
However, if we synthesize CAPTCHAs by writing the text in one line and adding distortions to make it more complex (see fig. 2), we get 97% right and only 3% wrong.
To increase the number of correctly recognized Yandex CAPTCHAs, we can use a text detector, followed by tilt normalization and then neural-network recognition. Here are examples of complex CAPTCHAs processed by the text detector (fig. 3).
With the text detector we get 60% accuracy on the training set and 51% on the test set. I used a synthetically pretrained text detector, CRAFT (Character-Region Awareness For Text detection); the recognition of individual words was done by the network trained on Yandex CAPTCHAs.
While engineering the CAPTCHA, Yandex developers were solving two problems: making potential OCR less effective while keeping the CAPTCHA user-friendly. It's worth noting that we often see an approach based on the assumption that a CAPTCHA hard for a human to read will be hard for a machine too (fig. 4).
However, that's not correct: a neural network can easily ‘break’ such CAPTCHAs with higher accuracy than a human. It doesn't require devising heuristic algorithms; it's enough to download a ready-made network and train it (tweaking some layers and regularization if needed), and it will solve 99% of text-based CAPTCHAs.
As a result, we can state that engineering a text-based CAPTCHA today requires understanding the methods and approaches that could be used to break it, and using that knowledge to create a CAPTCHA that effectively withstands bots while remaining easy for humans to read.
Please feel free to share your thoughts in the comments: what would a text-based CAPTCHA resistant to machine recognition look like?