Per-site rules were already configured for this, but their regexp was
slightly wrong. In any case we should never crawl URLs like this unless
they are normalized, so for now just add them to the hardcoded exclusion
rules.
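A minimal sketch of what such a hardcoded exclusion could look like; the
rule list, patterns, and function name are illustrative, not the actual
configuration:

    import re
    from urllib.parse import urlsplit

    # Non-normalized forms we never want to crawl: empty path segments
    # ("//") and dot segments ("/./", "/../").
    HARDCODED_EXCLUSIONS = [
        re.compile(r"//"),
        re.compile(r"/\.\.?(/|$)"),
    ]

    def is_excluded(url: str) -> bool:
        # Only the path is checked, so the "//" after the scheme is ignored.
        path = urlsplit(url).path
        return any(rule.search(path) for rule in HARDCODED_EXCLUSIONS)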
We only support this for our main website, which uses a sitemap, so
implement it only for that provider. Always probe sitemap_internal.xml as
well, since we never try to access any external sites on it anyway.
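A rough sketch of the probe, assuming a requests-based provider; the
function name and the fallback to sitemap.xml are assumptions:

    import requests

    def probe_sitemap(base_url: str) -> str | None:
        # sitemap_internal.xml is always tried first; external sites are
        # never followed from it, so probing it is harmless.
        for name in ("sitemap_internal.xml", "sitemap.xml"):
            url = f"{base_url.rstrip('/')}/{name}"
            try:
                resp = requests.get(url, timeout=10)
            except requests.RequestException:
                continue
            if resp.ok:
                return url
        return None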
Replaces the old search code with something that is less of a spaghetti
mess (i.e. not evolved over a long stretch of time) and more stable
(actual error handling instead of random crashes).
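An illustrative sketch of the kind of explicit error handling meant here;
the function name and the use of requests/logging are assumptions, not the
actual implementation:

    import logging
    import requests

    log = logging.getLogger(__name__)

    def fetch_page(url: str) -> str | None:
        # Return the page body, or None on any expected network failure,
        # instead of letting the exception crash the crawl.
        try:
            resp = requests.get(url, timeout=15)
            resp.raise_for_status()
        except requests.RequestException as exc:
            log.warning("fetch failed for %s: %s", url, exc)
            return None
        return resp.text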
Crawlers are now also multithreaded to cope with the higher latency of
some sites.
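A minimal sketch of such a multithreaded fetch loop; the function name,
worker count, and the injected fetch callable are assumptions:

    from concurrent.futures import ThreadPoolExecutor, as_completed
    from typing import Callable

    def crawl_all(urls: list[str],
                  fetch: Callable[[str], str | None],
                  workers: int = 8) -> dict[str, str]:
        # Fetch many URLs concurrently so one slow site does not stall
        # the whole crawl; failed fetches (None) are simply skipped.
        results: dict[str, str] = {}
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = {pool.submit(fetch, url): url for url in urls}
            for future in as_completed(futures):
                body = future.result()
                if body is not None:
                    results[futures[future]] = body
        return results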