2015-12-06: Initial release
2016-09-15: Added: Fast Teardown, Use ipset centric logic, Sense if TARPIT available, Logging Enhancements
2017-04-04: Added: Whitelisting + Smart Recidivism; Added: Geo Tracking + Browser Fingerprinting
Welcome to Razor Wall!
A simple idea...
Razor-shred anyone attacking your site + leave them squirming + screaming on your Razor Wall.
This site distills my 20+ years of Linux machine hosting + admin experience (1994-present) blocking attacks, including using fail2ban + fail2ban enhancements I've been considering for high traffic sites. I might open tickets against fail2ban to have its logic extended, or more likely eventually create a new product. While fail2ban is fantastic software, as of fail2ban-0.9.2 there are several areas of improvement which might be pursued.
In the case of Bot Blocking, some bots must be whitelisted. Examples include search engine crawlers like GoogleBot.
The normal approach for dealing with these bots is to check the UA (User Agent string) to determine if an IP should be blocked or allowed. This is cumbersome, as UAs constantly change. This approach is also broken, as anyone can forge a UA.
Solution: Keep an ipset for whitelisted IPs, like all GoogleBot IPs, which can be accurately acquired + updated via DNS in an automated fashion.
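A minimal sketch of the idea, assuming ipset + dig are installed (set name + example IP are illustrative): verify a claimed GoogleBot IP via reverse + forward DNS, the method Google itself documents, then whitelist it ahead of any blocking chains.

    ipset -exist create googlebot-whitelist hash:ip

    verify_googlebot() {
        ip="$1"
        # Reverse lookup must land in googlebot.com or google.com
        host=$(dig +short -x "$ip" | sed 's/\.$//')
        case "$host" in
            *.googlebot.com|*.google.com) ;;
            *) return 1 ;;
        esac
        # Forward lookup must resolve back to the same IP (defeats forged rDNS)
        dig +short "$host" | grep -qx "$ip"
    }

    verify_googlebot 66.249.66.1 && ipset -exist add googlebot-whitelist 66.249.66.1

    # Accept whitelisted crawlers before any blocking chains run
    iptables -I INPUT -m set --match-set googlebot-whitelist src -j ACCEPT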
Currently fail2ban handles Recidivism crudely by tracking duplicate /var/log/fail2ban.log entries + blocking IPs for a day or week, based on bad behavior. Adding granularity to this facility means writing a specific filter for each entry to track. Maintaining all these rules becomes cumbersome.
Solution: Allow Recidivism to be defined at a jail/filter level.
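For contrast, the stock [recidive] jail looks roughly like this (values approximate those shipped with fail2ban 0.9; check your jail.conf). Every new behavior to track means another filter watching /var/log/fail2ban.log, rather than a per-jail setting:

    [recidive]
    enabled   = true
    logpath   = /var/log/fail2ban.log
    banaction = iptables-allports
    # 1 week ban after 5 repeat offenses inside 1 day
    bantime   = 604800
    findtime  = 86400
    maxretry  = 5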
Simple facility. Load all IPs for a GEO into an ipset, then treat the entire ipset in certain ways. One example might be to treat all non-English speaking GEOs as suspect for a site with English only content, then force every visit from an IP in these GEOs to solve a CAPTCHA before being allowed to proceed. Once solved, the IP should be whitelisted, so it's likely best to maintain a whitelist + blacklist for each GEO.
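A minimal sketch of loading one GEO, assuming country CIDR lists from ipdeny.com (any GeoIP source works; set names illustrative). The CAPTCHA redirect itself lives in the web tier; here suspect-GEO traffic is just logged, with the solved-CAPTCHA whitelist checked first:

    geo=cn
    ipset -exist create geo-$geo hash:net
    wget -qO- "http://www.ipdeny.com/ipblocks/data/countries/$geo.zone" |
    while read -r net; do
        ipset -exist add geo-$geo "$net"
    done

    # Whitelist (solved CAPTCHA) checked first, suspect GEO second
    ipset -exist create geo-$geo-white hash:ip
    iptables -I INPUT -m set --match-set geo-$geo-white src -j ACCEPT
    iptables -A INPUT -m set --match-set geo-$geo src -p tcp --dport 80 \
             -j LOG --log-prefix "GEO-$geo: "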
Most security tools are IP based. Exceptions include Apache's mod_security + mod_security2, which are heavyweight, tend to be inaccessible for mere mortals + consume tremendous resources.
Today many IPs serve as VPN or NAT drops, meaning there can be 100s or 1000s or more devices all sharing a single IP, so doing any type of IP based logic becomes complex.
Another problem exists using an IP approach. All current WiFi protocols are hackable given a packet flow of roughly 85,000 packets. This low packet count means any IP can potentially be hosting some sort of Bot or other nefarious software, which has gained IP access by hacking either a residential or office WiFi network.
Solution: Solving this for non-HTTP protocols is tricky. For HTTP, Apache's advanced filtering layers can be used + integrated with other server services to provide true browser fingerprinting.
This means RazorWall's code should be tooled from the beginning to support both IP based actions + Browser Fingerprint based actions.
Allow process threading, meaning allow spawning of a separate process for each high churn log file, like /var/log/apache2/access.log or netstat loggers.
Use process threading, versus fail2ban's monolithic program, to allow more complex, custom logic + support for any script or compiled language, as sketched below.
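A minimal sketch of the model (worker script names are hypothetical): one long-lived reader per high churn source, each worker written in whatever language fits the job.

    #!/bin/sh
    # One dedicated process per high churn source; razorwall only supervises
    tail -F /var/log/apache2/access.log | razorwall-http-worker &
    tail -F /var/log/auth.log           | razorwall-ssh-worker  &
    razorwall-netstat-logger            | razorwall-conn-worker &
    wait   # block here; a real supervisor would respawn dead workers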
Rather than teardown by individual rules, which can take many minutes with fail2ban, teardown by flushing chains.
Run teardown process when razorwall starts too. This allows cleanup if razorwall dies for some reason + no teardown occurs.
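A minimal sketch of fast teardown, assuming razorwall keeps all its rules in its own chains + sets (names illustrative). Run at shutdown AND first thing at startup, so a crashed previous run leaves nothing behind:

    razorwall_teardown() {
        # Detach our chain from the built-in chain, ignoring "not found"
        iptables -D INPUT -j RAZORWALL 2>/dev/null
        # Flush + delete the whole chain in one shot, not rule by rule
        iptables -F RAZORWALL 2>/dev/null
        iptables -X RAZORWALL 2>/dev/null
        # Destroy every razorwall ipset
        ipset -n list | grep '^razorwall-' | xargs -r -n1 ipset destroy
    }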
Rather than using iptables, use ipset to speed up adding/deleting/processing on high traffic machines which may produce 1000s of iptables rules.
Prefer -j TARPIT over -j REJECT --reject-with icmp-port-unreachable
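A sketch combining both points (chain/set names illustrative; 203.0.113.7 is a documentation IP). One set-match rule replaces 1000s of per-IP rules, + TARPIT availability is sensed by simply trying it, falling back to REJECT when the xtables-addons module is absent:

    # Dedicated chain hooked into INPUT
    iptables -N RAZORWALL 2>/dev/null
    iptables -C INPUT -j RAZORWALL 2>/dev/null || iptables -I INPUT -j RAZORWALL

    ipset -exist create razorwall-black hash:ip timeout 86400

    # Sense TARPIT: try it first, fall back to REJECT if it fails to load
    if ! iptables -A RAZORWALL -p tcp -m set --match-set razorwall-black src \
                  -j TARPIT 2>/dev/null; then
        iptables -A RAZORWALL -p tcp -m set --match-set razorwall-black src \
                 -j REJECT --reject-with icmp-port-unreachable
    fi

    # Banning is now one O(1) set operation, not an iptables rule per IP
    ipset -exist add razorwall-black 203.0.113.7 timeout 3600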
Provide a razorwall-tarpit package which installs TARPIT's weird dependencies (xtables-addons) on Ubuntu.
Provide more verbose/descriptive logging. Allow regular expressions (failregex + ignoreregex) to be named + allow both the named expression + the exact part of the regular expression which matched to be logged.
Provide mechanism to do iptables logging of matched rules for client reporting of number of attacks blocked on day + week + month + year + total time increments.
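A minimal sketch of one way to feed such reports (chain/set names from the sketch above): iptables already keeps per-rule packet counters, so a daily cron job can read + zero them, with week/month/year/total rollups built from the daily log.

    # Daily cron: append today's blocked-packet total, then zero the counters
    total=$(iptables -L RAZORWALL -nvx | awk '/razorwall-black/ {sum += $1} END {print sum+0}')
    echo "$(date +%F) $total" >> /var/log/razorwall/blocked.log
    iptables -Z RAZORWALL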
Change to a generic accumulation model where...
Some machines I host generate 10G-20G+/day of log files, which means the sun will go nova before fail2ban even gets started.
Just starting fail2ban-server requires log files to be completely read.
Recently I started up fail2ban on a server with mild traffic, which had a 69M /var/log/apache2/access.log file + had to kill fail2ban after 4 hours, as it was still reading.
A similar problem exists in fail2ban-regex, where on the same 69M file I had to kill fail2ban-regex after 4 hours of running.
NOTE: The fail2ban startup problem has been resolved in fail2ban by introducing the [tail] argument after logfile(s) in the logpath setting for each jail.
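For example, a jail.local entry using it might look like this (exact syntax varies by fail2ban version, so check jail.conf(5)); tail starts reading at end-of-file instead of replaying gigabytes of history:

    [apache-dos]
    enabled = true
    filter  = apache-dos
    logpath = /var/log/apache2/access.log tail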
NOTE: The fail2ban-regex problem still exists + the workaround is to extract a snippet of log file lines to test into a test file.
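The workaround in practice: slice off a few thousand recent lines + point fail2ban-regex at the slice instead of the full 69M file (filter name illustrative):

    tail -n 5000 /var/log/apache2/access.log > /tmp/access-test.log
    fail2ban-regex /tmp/access-test.log /etc/fail2ban/filter.d/apache-dos.conf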
When a slowloris (DOS) attack targets a site, fail2ban seems to somehow lose its marbles (technical term). The easiest mitigation approach is to tune Apache to survive DOS/DDOS long enough to start logging 400 or 408 errors. Then use fail2ban to catch these + block them.
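A sketch of that mitigation (filter name matches the [apache-dos] jail mentioned below; the regex assumes common/combined log format): Apache's mod_reqtimeout makes slowloris connections die as 408s, then a simple filter catches the 400/408 lines.

    # /etc/apache2/conf-available/reqtimeout.conf - survive long enough to log
    RequestReadTimeout header=20-40,MinRate=500 body=20,MinRate=500

    # /etc/fail2ban/filter.d/apache-dos.conf
    [Definition]
    failregex = ^<HOST> .*" (400|408)\s
    ignoreregex =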
Unfortunately, this fails on machines with multiple/fast CPUs + large amounts of memory. Behavior of fail2ban is very odd. Apache starts logging 400/408 errors + fail2ban seems to never see these. If slowloris is stopped, Apache processes taper off from whatever their max is set to be (1000-2000 is a good number, in most cases) + eventually drop back down to Apache's MinSpareServers setting. And fail2ban never seems to wake up. I suspect this is a bug in the python-pyinotify code, because if Apache is cycled in the middle of an attack, then once Apache stops + restarts + starts to log, fail2ban works.
Even renicing fail2ban, so fail2ban runs at a higher priority than Apache, seems to have no effect.
Even - netstat -ntp - shows TCP resources have been released.
Using - ps -eo pid,tid,class,rtprio,ni,pri,psr,pcpu,stat,wchan:14,comm | egrep 'apache|fail2ban' - shows 100s of apache processes in lock_file_wait state + fail2ban scheduled. And - lsof | grep access.log | wc -l - shows 1000s of access.log file descriptors open. How Apache opens access.log for writing may have something to do with this situation + I'm still unsure why, as inotail uses the inotify() API, same as python-pyinotify, so likely we're back to a python-pyinotify bug of some sort.
Trying to debug this would be difficult. Using inotail piped into a perl script works far better.
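A minimal sketch of that pipeline (threshold + set name illustrative; assumes the razorwall-black set from the earlier sketch + common/combined log format): inotail follows the log via inotify, the perl side counts 400/408s per IP + bans on a flood.

    inotail -f /var/log/apache2/access.log | perl -ne '
        next unless /" (400|408)\s/;          # 400/408 status after the request field
        ($ip) = split;                        # client IP is the first field
        if (++$hits{$ip} >= 10) {             # 10 strikes: ban via ipset
            system("ipset", "-exist", "add", "razorwall-black", $ip);
            delete $hits{$ip};
        }
    '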
It does appear to be some sort of fail2ban bug, as it appears correct records are written to the sqlite3 database: if fail2ban is bounced (stop/restart), correct bans are built for the [apache-dos] jail (400/408 trashing) + also for [recidive] if the [apache-dos] ban time is set low.
Currently fail2ban only shares state across a single filter rule match. If simple rules were changed into state machines, then the following common situation could be caught + blocked.
Having rules as state machines provides a way to say "block any IP generating a 301 with no immediate 200" for 5 minutes. Then with the recidivism jail running, multiples of these can be promoted to day or week long blocks.
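A minimal sketch of that exact rule as a tiny state machine (set name + 5 minute window from above; everything else illustrative, + the set must have been created with timeout support):

    inotail -f /var/log/apache2/access.log | perl -ne '
        ($ip) = split;
        ($st) = /" (\d{3})\s/;
        if    ($st == 301) { $pending{$ip} ||= time }   # state: awaiting a 200
        elsif ($st == 200) { delete $pending{$ip} }     # redirect followed: forgive
        for $p (keys %pending) {                        # no 200 inside 5 minutes: ban
            next if time - $pending{$p} < 300;
            system("ipset", "-exist", "add", "razorwall-black", $p, "timeout", "300");
            delete $pending{$p};
        }
    '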
Copyright © 1994-2015 by David Favor