A team of researchers from the universities of Arizona, Georgia and South Florida has developed a machine learning-based CAPTCHA solver that they claim can overcome 94.4% of real-world challenges on dark websites.
The goal of the study was to create a system that can streamline cyber threat intelligence, which currently requires human intervention to manually solve CAPTCHAs.
The cost of cybercrime keeps rising, with cyberattacks and data breaches occurring every day. Making the dark web easier to search is therefore essential for taking targeted preventive action.
Dark web CAPTCHA
CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is used by websites to differentiate real users from bots.
These challenges are ubiquitous on the dark web to protect platforms from the constant DDoS attacks that competing platforms launch against each other.
These DDoS attacks are carried out by botnets, and so having a strong CAPTCHA layer on the login page can keep the threat under control.
However, each website implements its own custom CAPTCHA challenge, which makes it nearly impossible to develop a tool that can solve most of them.
As a result, collecting cyber threat intelligence from illicit dark web marketplaces and forums is difficult and costly, because human analysts have to be involved in the CAPTCHA-solving step.
Machine learning approach
To solve this practical problem, the researchers developed a system that relies on interpreting raster (pixel-based) images, which sets it apart from other recent studies that also used generative adversarial network (GAN)-based approaches.

Source: Arxiv.org
The new solver can distinguish between letters and numbers by looking at them one by one: it denoises the image, identifies the boundaries between characters, and segments the content into individual characters.
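The denoise, find-boundaries, segment sequence can be illustrated with a toy example. The sketch below is not the researchers' DW-GAN code: it assumes an already binarized image (1 = ink, 0 = background), removes isolated specks with a simple neighbour rule, and splits characters at empty columns. The function names `denoise`, `column_spans` and `segment` are hypothetical, chosen only for this illustration.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumption: binarized image, 1 = ink, 0 = background


def denoise(img: Image) -> Image:
    """Drop isolated ink pixels: keep a pixel only if one of its 8 neighbours is ink."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not img[y][x]:
                continue
            neighbours = sum(
                img[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            )
            out[y][x] = 1 if neighbours >= 1 else 0
    return out


def column_spans(img: Image) -> List[Tuple[int, int]]:
    """Return half-open (start, end) column ranges that contain ink."""
    w = len(img[0])
    has_ink = [any(row[x] for row in img) for x in range(w)]
    spans: List[Tuple[int, int]] = []
    start = None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                      # a character begins
        elif not ink and start is not None:
            spans.append((start, x))       # an empty column ends it
            start = None
    if start is not None:
        spans.append((start, w))
    return spans


def segment(img: Image) -> List[Image]:
    """Denoise, then cut the image into one sub-image per character."""
    clean = denoise(img)
    return [[row[a:b] for row in clean] for a, b in column_spans(clean)]


# Toy input: two 2-pixel-wide glyphs plus one noise speck in the last column.
toy = [[int(c) for c in row] for row in [
    "1100011000",
    "1100011001",
    "1100011000",
    "0000000000",
]]

print(column_spans(toy))           # the speck counts as a third "character"
print(column_spans(denoise(toy)))  # after denoising only the two glyphs remain
```

Real dark web CAPTCHAs have touching characters and textured backgrounds, so a production solver needs far more robust denoising and boundary detection than this column-gap rule.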
As such, the size of the CAPTCHA challenge does not affect solver efficiency much, especially when measuring cumulative performance for three attempts.
When it comes to character recognition, the solver uses samples taken from multiple local regions to identify fine features like lines and edges, so it can’t be “fooled” by character rotation, font size changes or color blends.
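Sampling local regions to pick up lines and edges reads like the filtering step of a convolutional network. The following minimal sketch shows that idea only, under stated assumptions: the kernel `VERTICAL_EDGE` is hand-written here, whereas a trained recognizer would learn its filters, and `convolve_valid` is a hypothetical helper, not a function from the paper's code.

```python
from typing import List

Grid = List[List[int]]


def convolve_valid(img: Grid, kernel: Grid) -> Grid:
    """Slide the kernel over every local region (no padding) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(img), len(img[0])
    return [
        [
            sum(img[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(w - kw + 1)
        ]
        for y in range(h - kh + 1)
    ]


# Hand-written filter that responds where brightness increases left to right.
VERTICAL_EDGE = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

# 4x6 patch: background on the left, a stroke starting at column 3.
patch = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

response = convolve_valid(patch, VERTICAL_EDGE)
for row in response:
    print(row)  # non-zero values mark the windows that straddle the stroke edge
```

Because the same small filter is applied at every position, the response depends on the local shape of a stroke rather than on where in the image the character sits.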
Tests in real conditions
The researchers tested their most optimized DW-GAN solver model against Yellow Brick, a now-defunct dark web marketplace that hosted listings of illicit content.
As the paper explains:
Using a crawler enhanced by our DW-GAN, we were able to collect 1,831 illegal products from Yellow Brick. Among these products were 286 cybersecurity-related items (102 stolen credit cards, 131 stolen accounts, 9 scans of forged documents, and 44 hacking tools) and 1,223 drug-related products.
Overall, gathering information on the “Yellow Brick” market with DW-GAN took about 5 hours without human intervention. In particular, each HTTP request took 8.8 seconds to load a new web page; therefore, crawling 1,831 pages took 268.5 minutes. Solving the recurring CAPTCHA challenges (for every 15 HTTP requests) took our DW-GAN-enhanced crawler 18.6 seconds.
Overall, the proposed framework could automatically break CAPTCHA with no more than three attempts. Breaking all CAPTCHA images takes about 76 minutes [sic] in total for all 1,831 product pages, a fully automated process.
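The quoted timing figures can be checked with a few lines of arithmetic. This is a reader's sanity check, not a calculation from the paper, and it rests on two assumptions that the text does not state outright: that one CAPTCHA appears for every 15 HTTP requests, and that 18.6 seconds is the time spent per CAPTCHA.

```python
pages = 1831                 # product pages crawled (from the quoted text)
seconds_per_request = 8.8    # load time per HTTP request (from the quoted text)
requests_per_captcha = 15    # assumption: one CAPTCHA per 15 requests
seconds_per_captcha = 18.6   # assumption: time per CAPTCHA, not a total

crawl_minutes = pages * seconds_per_request / 60
captcha_count = pages // requests_per_captcha
captcha_minutes = captcha_count * seconds_per_captcha / 60
total_hours = (crawl_minutes + captcha_minutes) / 60

print(f"crawl:    {crawl_minutes:.1f} min")
print(f"CAPTCHAs: {captcha_count} challenges, {captcha_minutes:.1f} min")
print(f"total:    {total_hours:.1f} h")
```

Under this reading, the crawl time reproduces the quoted 268.5 minutes and the total lands near the quoted 5 hours, while CAPTCHA solving comes to roughly 38 minutes rather than 76. The source does not say how the 76-minute figure was derived.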
Of course, this test data comes from one particular dark web market, but the researchers expect a similar level of performance on any site that uses text-based CAPTCHAs.
Potential consequences
Intelligent, high-performance CAPTCHA solvers like this one have the potential to disrupt the space, at least on the dark web, where less sophisticated challenges are used.
The authors have published the final version of their solver on GitHub, but not the training dataset of 50,000 CAPTCHA images.
Someone could presumably build on this model to derive something that also works on weak clearnet CAPTCHA implementations.
As the paper notes regarding this possibility: “Although this study mainly focuses on the CAPTCHA of the dark web as a more difficult problem, the method proposed in this study should be applicable to other types of CAPTCHA without loss of generality”.
This new solver may have been developed for the noble purpose of fighting cybercrime, but it still has the potential to affect those who use the dark web for anonymity and safe information exchange.