Continuing to look into the recent issues, I think the bots are causing a possibly unintentional slowloris attack. We are seeing really large amounts of traffic from a small number of IPs. The site handles it okay for a while, but at some point (maybe while there are transient network issues somewhere) the httpd process count starts spiking, we hit the max servers limit, and no more httpd processes can be started. I've already bumped the max servers setting a couple of times to take advantage of the newly increased server memory.
I've added some mod_reqtimeout configuration that I hope will help if it is a slowloris issue.
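For reference, a minimal mod_reqtimeout sketch along these lines (the thresholds here are illustrative guesses, not necessarily what's deployed on the server):

```apache
# Give clients 20 seconds to finish sending the request headers,
# extended up to 40 seconds as long as they keep sending at least
# 500 bytes/second. Apply the same idea to the request body.
# Slowloris-style clients that trickle bytes get disconnected.
RequestReadTimeout header=20-40,MinRate=500 body=20,MinRate=500
```

The MinRate extension matters: it lets slow-but-legitimate clients finish while still cutting off connections that send a byte every few seconds just to hold a worker open.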
I've also added additional bot configuration to phpBB, which makes it generate output a little differently. Most importantly, it leaves off the session id query parameter, so the bots see fewer "unique" URLs if they are still considering the sid query parameter to be part of a unique URL. Besides the obvious Bing/Googlebot traffic, the newly configured high-traffic bots are:
DotBot
PetalBot
SemrushBot
Amazonbot
Neevabot
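For crawlers that honor it, robots.txt is another lever on top of the phpBB bot settings. A sketch (the rules are illustrative, not what's actually deployed; note that Crawl-delay is ignored by Googlebot):

```text
# Throttle the heaviest third-party crawlers.
User-agent: SemrushBot
Crawl-delay: 10

User-agent: PetalBot
Crawl-delay: 10

User-agent: DotBot
Crawl-delay: 10
```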
It was interesting to see that Amazon/Alexa is crawling the web; it makes sense if they are going to compete with Google on the voice search front.
The worst offender by far is Neevabot, with over a million requests to the site in just a couple of weeks. If the phpBB bot settings don't help with them I might have to block their IPs.
The corporate bots aren't the only offenders. We also have what appear to be several individuals who are concerned about having copies of the wiki and have implemented bots of various forms to try to archive it. If the bots were all well written and only fetched the content pages it wouldn't be much of a problem, but most of them tend to do things like archive the user pages, the talk pages, the special pages, and every single page diff. Some of them are causing 40k hits per day to the site, so I'll probably need to block them as well.
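If blocking does become necessary, a sketch of what that could look like with Apache 2.4 access control (the addresses below are placeholder documentation ranges, not the actual offenders):

```apache
<Location "/">
    <RequireAll>
        Require all granted
        # Hypothetical offender addresses/ranges
        Require not ip 203.0.113.0/24
        Require not ip 198.51.100.7
    </RequireAll>
</Location>
```

Blocking at the firewall instead would keep the connections from tying up httpd workers at all, but the Apache rule is easier to maintain alongside the rest of the site config.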
We do have a copy of the wiki available for download at
https://files.osdev.org/osdev_wiki.zip if you want the information in an offline form; it's not perfect, but it mostly works. It had gotten a little stale because the generation was hanging. I've fixed that. The issue was that people have been uploading larger animated GIFs for their OS images, so I had to adopt the change from
https://gerrit.wikimedia.org/r/c/mediaw ... e/+/91501/ to allocate more memory to the image conversion process.
The wiki archive is a simple wget command:
Code:
wget --inet4-only --no-check-certificate --mirror -k -p \
     --reject '*=*,User:*,Special:*,User_talk:*' \
     --exclude-directories='User:*,User:*/*,User:*/*/*,User_talk:*,User_talk:*/*,User_talk:*/*/*,Special:*,Special:*/*,Special:*/*/*' \
     --user-agent="osdev-mirror" \
     https://wiki.osdev.org/Main_Page
If anyone wants to suggest better options for generating an offline copy of the content pages in the wiki, or post-processing that should be performed on it, I'd welcome it.