Razor Wall
Blazing Fast Website Attack Blocking


Changelog

2015-12-06: Initial
2016-09-15: Added  : Fast Teardown
                     Use ipset centric logic
                     Sense if TARPIT available
                     Logging Enhancements
2017-04-04: Added  : Whitelisting + Smart recidivism
            Added  : Geo Tracking + Browser Fingerprinting


Welcome + Background

Welcome to Razor Wall!

A simple idea...

Razor-shred anyone attacking your site + leave them squirming + screaming on your Razor Wall.

This site distills my 20+ years of Linux machine hosting + admin (1994-present) blocking attacks, including using fail2ban + fail2ban enhancements I've been considering for high traffic sites. I might open tickets against fail2ban to have its logic extended or, more likely, eventually create a new product. While fail2ban is fantastic software, as of fail2ban-0.9.2 there are several areas of improvement which might be pursued.

Enhancement Areas of Current fail2ban Implementation

Whitelisting

In the case of Bot Blocking, some bots must be whitelisted. Examples include search engine crawlers, like GoogleBot.

The normal approach for dealing with these bots is to check the UA (User Agent string) to determine if an IP should be blocked or allowed. This is cumbersome, as UAs constantly change. This approach is also broken, as anyone can forge a UA.

Solution: Keep an ipset for whitelisted IPs, like all GoogleBot IPs, which can be accurately acquired + updated via DNS in an automated fashion.
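A minimal sketch of the acquisition side, assuming Google's published reverse-DNS verification (genuine GoogleBot names end in .googlebot.com or .google.com) + a hypothetical "razor-white" ipset:

```shell
# Hypothetical sketch: verify a crawler IP by its reverse-DNS name suffix,
# forward-confirm the name, then add the IP to a whitelist ipset.
# The "razor-white" set name is made up for this example.
verify_crawler_name() {
  case "$1" in
    *.googlebot.com|*.google.com) echo allow ;;
    *)                            echo deny  ;;
  esac
}

whitelist_ip() {
  ip="$1"
  name=$(dig +short -x "$ip")   # reverse DNS lookup
  name="${name%.}"              # strip trailing dot
  if [ "$(verify_crawler_name "$name")" = allow ] &&
     [ "$(dig +short "$name")" = "$ip" ]; then   # forward-confirm
    ipset add razor-white "$ip" -exist
  fi
}
```

Run from cron against recently seen crawler IPs to keep the set current without touching UAs at all.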

Smart recidivism

Currently fail2ban handles recidivism crudely by tracking duplicate /var/log/fail2ban.log entries + blocking IPs for a day or week, based on bad behavior. Adding granularity to this facility means writing a specific filter for each entry to track. Maintaining all these rules becomes cumbersome.

Solution: Allow Recidivism to be defined at a jail/filter level.
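A hypothetical config sketch of what per-jail recidivism might look like (this syntax does not exist in fail2ban; the names + format are made up):

```ini
[apache-dos]
enabled  = true
bantime  = 300
; hypothetical per-jail recidivism: 3 bans within 1 day promotes to a 1 week ban
recidive = 3/1d => 1w
```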

GEO Tracking

Simple facility. Load all IPs for a GEO into an ipset, then treat the entire ipset in certain ways. One example might be to treat all non-English speaking GEOs as suspect for a site with English only content. Then force every visit from an IP in these GEOs to solve a CAPTCHA before being allowed to proceed. Once solved, IPs should be whitelisted, so it's likely best to maintain a whitelist + blacklist for each GEO.
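One way to load a GEO quickly is ipset's restore format. The sketch below (set name is arbitrary) turns a CIDR list on stdin into a restore script:

```shell
# Emit an `ipset restore` script for one GEO set, reading CIDRs from stdin.
# Feed the output to `ipset restore` (requires root); the set name is arbitrary.
geo_restore_script() {
  set_name="$1"
  echo "create $set_name hash:net -exist"
  while read -r cidr; do
    [ -n "$cidr" ] && echo "add $set_name $cidr -exist"
  done
}
```

Usage: geo_restore_script geo-cn < cn.cidr | ipset restore - then a single iptables rule matching the set can redirect the whole GEO to a CAPTCHA or block it outright.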

Browser Fingerprinting

Most security tools are IP based. Exceptions include Apache's mod_security + mod_security2, which are heavyweight, tend to be inaccessible to mere mortals + consume tremendous resources.

Today many IPs serve as VPN or NAT drops, meaning there can be 100s or 1000s of devices all sharing a single IP, so doing any type of IP based logic becomes complex.

Another problem exists with an IP based approach. All current WiFi protocols are hackable given a captured flow of roughly 85,000 packets. This low packet count means any IP can potentially be hosting some sort of Bot or other nefarious software, which has gained IP access by hacking either a residential or office WiFi network.

Solution: Solving this for non-HTTP protocols is tricky. For HTTP protocols, Apache's advanced filtering layers can be used + integrated with other types of server services, to provide true browser fingerprinting.

This means RazorWall's code should be tooled from the beginning to support both IP based actions + Browser Fingerprint based actions.

Process Multi-Threading

Allow process threading, meaning allow spawning of a separate process for each high churn log file, like /var/log/apache2/access.log or netstat loggers.

Use process threading, versus fail2ban's monolithic program, to allow more complex, custom logic + support for any script or compiled language.
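A sketch of the spawning side, assuming a hypothetical razor-worker command which consumes one log file:

```shell
# Spawn one background worker process per log file. First argument is the
# worker command (e.g. a hypothetical `razor-worker` that tails + parses one
# log); remaining arguments are the log files to watch.
spawn_watchers() {
  cmd="$1"; shift
  for log in "$@"; do
    "$cmd" "$log" &   # each high-churn log gets its own process
  done
  wait                # reap all workers
}
```

Usage: spawn_watchers razor-worker /var/log/apache2/access.log /var/log/auth.log - any script or compiled language works as the worker.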

Fast Teardown

Rather than tearing down individual rules, which can take many minutes with fail2ban, tear down by flushing chains.

Run teardown process when razorwall starts too. This allows cleanup if razorwall dies for some reason + no teardown occurs.
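A sketch of flush-based teardown, with hypothetical chain/set names ("razorwall", "razor-block"); pass echo as the runner to preview the commands without root:

```shell
# Flush-based teardown: one call per chain/set instead of one call per rule.
# Chain + set names are hypothetical. Pass "echo" to dry-run without root.
razor_teardown() {
  run="${1:-}"
  $run iptables -F razorwall     # empty the whole chain in one call
  $run ipset flush razor-block   # empty the whole ban set in one call
}
```

Calling this at startup as well gives the crash-cleanup behavior described above.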

Use ipset centric logic

Rather than using iptables, use ipset to speed up adding/deleting/processing on high traffic machines which may produce 1000s of iptables rules.
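The pattern in sketch form, with hypothetical names: one static iptables rule matches the set, so each ban becomes a single O(1) ipset add rather than a new iptables rule (pass echo to dry-run):

```shell
# One static iptables rule matches the whole set; every subsequent ban is an
# O(1) ipset add. Names ("razor-block") are hypothetical; "echo" dry-runs.
razor_block_init() {
  run="${1:-}"
  $run ipset create razor-block hash:ip timeout 600 -exist
  $run iptables -I INPUT -m set --match-set razor-block src -j DROP
}

razor_ban() {
  run="${1:-}"; ip="$2"
  $run ipset add razor-block "$ip" -exist   # no iptables churn per ban
}
```

The timeout option also makes unbans automatic, avoiding the rule-deletion cost entirely.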

Sense if TARPIT available

Prefer -j TARPIT over -j REJECT --reject-with icmp-port-unreachable

Provide razorwall-tarpit package which installs weird dependencies on Ubuntu.
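One way to sense TARPIT availability, assuming a module-based xtables-addons install (the modinfo probe is a heuristic, not definitive):

```shell
# Probe for the xtables-addons TARPIT target; fall back to REJECT if absent.
tarpit_or_reject() {
  if modinfo xt_TARPIT >/dev/null 2>&1; then
    echo "TARPIT"
  else
    echo "REJECT --reject-with icmp-port-unreachable"
  fi
}
```

Usage: iptables ... -j $(tarpit_or_reject) when building ban rules.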

Logging Enhancements

Provide more verbose/descriptive logging. Allow regular expressions (failregex + ignoreregex) to be named + allow named expression + exact part of regular expression match to be logged.

Provide mechanism to do iptables logging of matched rules for client reporting of number of attacks blocked on day + week + month + year + total time increments.
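The per-rule packet counters iptables already keeps can feed these reports. A sketch which sums the counter column from iptables -nvx -L <chain> output (chain name hypothetical):

```shell
# Sum blocked-packet counters from `iptables -nvx -L <chain>` output on stdin.
# Column 1 of each rule line is the packet count; the first two lines are headers.
sum_blocked_packets() {
  awk 'NR > 2 { total += $1 } END { print total + 0 }'
}
```

Usage: iptables -nvx -L razorwall | sum_blocked_packets - snapshot the total on day + week + month + year boundaries to build the client report increments.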

Data Accumulation

Change to a generic accumulation model where...

  1. Plugins to parse log file records into fields.
  2. Describe which fields to accumulate as entities.
  3. Allow perl closures to be used anywhere.
  4. Provide logging of blocked attempts, so a LOG rule sits at the top of each chain, before the REJECT rule.
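Step 1 above in sketch form - a trivial "parse plugin" for Apache combined-format logs, emitting the two fields most rules accumulate on:

```shell
# Sketch of a parse plugin: split Apache combined-log lines into the
# client IP (field 1) + HTTP status (field 9) for downstream accumulation.
parse_access_fields() {
  awk '{ print $1, $9 }'
}
```

Real plugins would emit whichever named fields the accumulation config asks for; this shows the shape of the contract.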

Logfile Size

Some machines I host generate 10G-20G+/day of log files, which means the sun will go nova before fail2ban even gets started.

Just starting fail2ban-server requires log files to be completely read.

Recently I started up fail2ban on a server with mild traffic, which had a 69M /var/log/apache2/access.log file + had to kill fail2ban after 4 hours, as it was still reading.

A similar problem exists in fail2ban-regex, where on the same 69M file I had to kill fail2ban-regex after 4 hours of running.

NOTE: The fail2ban startup problem has been resolved in fail2ban by introducing the [tail] argument after logfile(s) in the logpath setting for each jail.
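Following the note above, a jail entry might look like this (check your fail2ban version's jail.conf manual for the exact placement of the tail keyword):

```ini
[apache-dos]
enabled = true
; "tail" after the filename starts monitoring at the end of the file
logpath = /var/log/apache2/access.log tail
```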

NOTE: The fail2ban-regex problem still exists + the workaround is to extract a snippet of log file lines to test into a test file.

Python or fail2ban is too Slow For DOS/DDOS Mitigation

When a slowloris (DOS) attack targets a site, fail2ban seems to somehow lose its marbles (technical term). The easiest mitigation approach is to tune Apache to survive DOS/DDOS long enough to start logging 400 or 408 errors, then use fail2ban to catch these + block them.

Unfortunately, this fails on machines with multiple/fast CPUs + large amounts of memory. Behavior of fail2ban is very odd: Apache starts logging 400/408 errors + fail2ban seems to never see these. If slowloris is stopped, Apache processes taper off from whatever their max is set to be (1000-2000 is a good number, in most cases) + eventually drop back down to Apache's MinSpareServers setting. And fail2ban never seems to wake up. I suspect this is a bug in the python-pyinotify code, because if Apache is cycled in the middle of an attack, then once Apache stops + restarts + starts to log, fail2ban works.

Even raising fail2ban's priority, so fail2ban runs at a higher priority than Apache, seems to have no effect.

Even - netstat -ntp - shows TCP resources have been released.

Using - ps -eo pid,tid,class,rtprio,ni,pri,psr,pcpu,stat,wchan:14,comm | egrep 'apache|fail2ban' - shows 100s of apache processes in lock_file_wait state + fail2ban scheduled. And - lsof | grep access.log | wc -l - shows 1000s of access.log file descriptors open. How Apache opens access.log for writing may have something to do with this situation + I'm still unsure why, as inotail uses the inotify() API, same as python-pyinotify, so likely we're back to a python-pyinotify bug of some sort.

Trying to debug this would be difficult. Using inotail piped into a perl script works far better.

This does appear to be some sort of fail2ban bug, as it appears correct records are written to the sqlite3 database: if fail2ban is bounced (stop/restart), correct bans are built for the [apache-dos] jail (400/408 trashing) + also [recidive] if the [apache-dos] ban time is set low.

Sharing State Across Hosts

Currently fail2ban only shares state across a single filter rule match. If simple rules changed into state machines, the following common situation could be caught + blocked.

  1. Reference made to http://www.example.com which 301s to the bare domain.
  2. Real browsers will request the bare domain http://example.com (www stripped) directly after the first 301.
  3. So for real browsers you'll have a 301 + 200 for real page, very close to each other in time.
  4. Stupid Scrapers + Bot Farms obfuscate themselves by throttling/delaying requests, so will generate a 301 + then one of these...
  5. Catching the initial IPs generating dangling or nonsensical (non-browser) 301 requests with no immediate 200 would dramatically stop attacks on high traffic sites.

Having rules as state machines provides a way to say "block any IP generating a 301 with no immediate 200" for 5 minutes. Then with the recidivism jail running, multiples of these can be promoted to day or week long blocks.
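A sketch of that state machine over pre-parsed "epoch ip status" tuples (the window length is a made-up parameter):

```shell
# Read "epoch ip status" lines; print IPs whose 301 never got a 200 within
# $1 seconds — the dangling-301 state machine described above.
dangling_301_ips() {
  awk -v window="${1:-10}" '
    $3 == 301 { pending[$2] = $1; next }                  # 301 opens a pending state
    $3 == 200 && ($2 in pending) {
      if ($1 - pending[$2] <= window) delete pending[$2]  # closed in time => real browser
    }
    END { for (ip in pending) print ip }                  # still dangling => ban candidate
  '
}
```

Each printed IP would feed the 5 minute block, with the recidivism jail handling promotion.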

Solution Roadmap

  1. TODO: Use Next Gen Tools - nft + xtables-addons (TARPIT/DELUDE/CHAOS) + ipset as seem appropriate.
  2. DONE: Startup Jail Generation - simple jail.d/*.conf processing
  3. TODO: Better Ruleset Config - new ruleset config file format which merges jail + filter + action config into a single file, with fallbacks to common configs, like fail2ban currently implements. Build in flexibility to extend filter or action syntax, to set the stage for state machine logic.
  4. TODO: Adaptive Ruleset Config - where rulesets like honeypots can be generated for all sites on a machine, where each site has a unique honeypot key which changes every time the Banner is bounced (stopped/restarted). This implies the ability to run programs every time the Banner starts, so a "Disallow: honeypot-$key.txt" may be added to all robots.txt files or in the case of a CMS like http://WordPress.org
  5. DONE: High Traffic Site Ruleset - including start of DDOS TARPIT logic
  6. DONE: Full Tarpit Ruleset - which requires...
  7. TODO: Proof of Concept - using inotail + alarm signal processing
  8. TODO: Rewrite - using...
  9. TODO: State Machine Syntax - likely most flexible is to define rules as perl classes or closures. This allows evolution to track any new case scenarios which might be imagined.
  10. TODO: Machine Rings - share ban/unban actions between participating machines. This opens the door for highly effective global DDOS countermeasures + also propagating data from a master to many slaves, so in case of a failover of a site from a master IP to a slave IP, all the same bans are already active.
  11. TODO: Enhanced Logging Granularity - allow a tag to be associated with each failregex, so logging then becomes [jail:tag] for all logging, rather than just [jail], to quickly tell which regexp is firing for any given match.
  12. TODO: Enhanced Logging Statistics - track bans/jail + bans/[jail:tag] for client reports.
  13. TODO: Report Generation - total number of attacks blocked + in progress of tracking on site-wide + global basis.
  14. TODO: Handle New Logs - track directory part of log files + catch additions of new log files when they appear.
  15. TODO: Pass Matched Patterns to Logging - to delineate what part of the pattern matched, for example, surfacing exactly what matched for a ruleset matching all 400 + 500 Apache errors.
  16. TODO: Pass Ruleset Variables to Logging - for example, [rule[:tag][:time] to show ban time in logs.

Copyright © 1994-2015 by