Bot or human? Detecting malicious bots with machine learning in 2021


Authors:   Benjamin Fabre, Antoine V


The presentation discusses the use of machine learning and various techniques to detect and block bad bots in real-time.
  • DataDome collects billions of signals a day to detect bad bots
  • Different approaches are used to obtain the best bot detection possible
  • Machine learning techniques are used to detect credential stuffing attacks
  • Verification of good bots is necessary but can be difficult
  • Bad signatures can be extracted from different categories of signals
  • A wide range of machine learning approaches are used to detect bad bots
The presentation also describes how web scraping bots were using Facebook's link preview feature as a proxy; after DataDome contacted Facebook, Facebook applied rate limiting to fix the issue.
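The summary notes that verifying good bots is necessary but can be difficult. One widely documented way to check a crawler's claimed identity is forward-confirmed reverse DNS: resolve the client IP to a hostname, check that the hostname belongs to the operator's domain, then resolve that hostname back and confirm it returns the same IP. The sketch below is a minimal illustration under stated assumptions, not the speakers' implementation: the function name `is_verified_crawler`, the injectable `reverse` and `forward` resolvers, and the suffix list are all hypothetical choices made for this example.

```python
import socket
from typing import Callable, List, Sequence

# Assumption: illustrative suffixes only. Check each crawler operator's
# own documentation for the hostnames it actually publishes.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_lookup(ip: str) -> str:
    """PTR lookup: IP address -> primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> List[str]:
    """A-record lookup: hostname -> list of IPv4 addresses."""
    return socket.gethostbyname_ex(hostname)[2]


def is_verified_crawler(
    ip: str,
    suffixes: Sequence[str] = TRUSTED_SUFFIXES,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """Forward-confirmed reverse DNS check for a claimed crawler IP."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:
        # No PTR record or resolver failure: cannot confirm the claim.
        return False
    # The leading dot in each suffix rejects lookalikes such as
    # "evilgooglebot.com".
    if not hostname.endswith(tuple(suffixes)):
        return False
    try:
        # The hostname must resolve back to the original IP; otherwise
        # the PTR record could have been set by the attacker.
        return ip in forward(hostname)
    except OSError:
        return False
```

The resolvers are passed in as parameters so the check can be exercised with fake DNS data, and so a production version could swap in a cached or asynchronous resolver. The default `forward_lookup` only handles IPv4.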


Abstract: Detecting malicious bots has become an extremely complex task. Bot developers deliberately design their software to bypass bot detection systems. They attack from perfect browsers and mobile apps, leveraging exactly the same browsers as humans or headless browsers like Headless Chrome. They know how to forge attributes that are commonly used for bot detection: they manipulate HTTP headers and their values and order, and change their browser fingerprints. Bad bots are also distributed in extremely elaborate ways. Many use residential IPs with excellent reputations, and they make very few requests per IP, sometimes only one. Finally, the best bots perfectly mimic human behavior. For example, they can imitate realistic mouse movements and keystrokes, using generative adversarial networks.

So what does it take to efficiently distinguish advanced bots from real humans?

This talk will reveal the inner workings of a modern bot detection engine. We will see which signals are collected, and how they are enriched. We will discuss why it is mandatory to analyze both server-side and client-side signals. We will explore the challenges of authenticating good bots, and how to detect frameworks such as Puppeteer extra stealth, Playwright, Selenium and Headless Chrome. Finally, we will take a deep dive into machine learning approaches for bad bot detection, with a demonstration of how the respective strengths of supervised and unsupervised machine learning can be combined for maximum predictive accuracy.

Outline:
1. Intro: What does a bad bot look like in 2021?
  1.1. Bots use perfect browsers and apps
  1.2. Bots attack from clean IP addresses
  1.3. Bots run on real devices
  1.4. Bots behave like humans
2. Overview of current bot detection techniques
  2.1. Signals: why you need both server-side and client-side signals
  2.2. IP reputation: how to extract valuable data from the humble IP address
  2.3. So you say you’re Google? Authenticating good bots
  2.4. Signature-based detection for simple bots
  2.5. Detecting advanced bots with machine learning
3. Deep dive: Machine learning approaches for bot attack detection
  3.1. Detecting proxies, forged headers, URL browsing, and more with supervised ML
  3.2. Detecting CAPTCHA farms with semi-supervised ML
  3.3. Outlier detection with unsupervised ML
  3.4. Detection techniques for single-request attacks
4. Feedback loops: managing false positives and preserving the human user experience
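
Outline item 3.3 refers to outlier detection with unsupervised ML. The abstract does not say which algorithms the speakers use, so the sketch below substitutes a deliberately simple statistical stand-in: the modified z-score based on the median absolute deviation (MAD), which needs no labels and is robust to the very outliers it is looking for. The function name `mad_outliers`, the session identifiers, and the single "requests per session" feature are hypothetical; a real detection engine would work on many signals at once.

```python
from statistics import median
from typing import Dict, List


def mad_outliers(values: Dict[str, float], threshold: float = 3.5) -> List[str]:
    """Return the keys whose modified z-score exceeds `threshold`.

    Modified z-score = 0.6745 * |x - median| / MAD, where MAD is the
    median of the absolute deviations from the median. A cut-off of 3.5
    is a commonly used default for this score.
    """
    xs = list(values.values())
    med = median(xs)
    mad = median(abs(x - med) for x in xs)
    if mad == 0:
        # At least half the values are identical: the score is undefined,
        # so this sketch flags nothing rather than dividing by zero.
        return []
    return [
        key
        for key, x in values.items()
        if 0.6745 * abs(x - med) / mad > threshold
    ]


# Hypothetical feature: number of requests observed per session.
requests_per_session = {
    "s1": 12, "s2": 15, "s3": 11, "s4": 14, "s5": 13, "s6": 400,
}
print(mad_outliers(requests_per_session))  # prints ['s6']
```

A volume feature like this one would miss the low-and-slow bots described in the abstract, which may send a single request per IP; it is shown here only to make the unsupervised idea concrete, since no labeled examples of bots are needed to flag the anomalous session.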