Explicitly exclude URLs containing ".." in search crawling

Per-site exclusion rules were configured for this, but the regexp was
slightly wrong. In any case, we should never crawl URLs like this
unless they are normalized, so for now just add them to the hardcoded
exclusion rules.
Magnus Hagander
2017-11-08 12:02:58 -05:00
parent 01846cefc9
commit 4ce8184e65


@@ -31,6 +31,8 @@ class GenericSiteCrawler(BaseSiteCrawler):
             self.queue.put((x, 0.5, False))

     def exclude_url(self, url):
+        if ".." in url:
+            return True
         if self.robots and self.robots.block_url(url):
             return True
         for r in self.extra_excludes:
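
For illustration only, here is a minimal standalone sketch of the check order in the hunk above, alongside one possible way to normalize a URL first. Nothing here is taken from the crawler beyond the names `exclude_url`, `block_url`, and `extra_excludes`. `normalize_url` is a hypothetical helper, the function signatures are assumptions, and treating `extra_excludes` as compiled regexps matched with `search()` is also an assumption.

```python
import posixpath
import re
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Hypothetical helper: collapse "." and ".." segments in the URL path."""
    parts = urlsplit(url)
    path = posixpath.normpath(parts.path) if parts.path else "/"
    # normpath drops a trailing slash; restore it so directory URLs survive.
    if parts.path.endswith("/") and not path.endswith("/"):
        path += "/"
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


def exclude_url(url, extra_excludes=(), robots=None):
    """Return True if the URL should be skipped (sketch of the hunk's order)."""
    # Hardcoded rule from this commit: plain substring test, checked first.
    if ".." in url:
        return True
    # robots is assumed to be any object exposing block_url(url).
    if robots is not None and robots.block_url(url):
        return True
    # Assumption: per-site rules are compiled regexps matched with search().
    return any(r.search(url) for r in extra_excludes)


if __name__ == "__main__":
    raw = "/docs/9.6/../10/index.html"
    print(exclude_url(raw))                 # skipped as-is
    print(exclude_url(normalize_url(raw)))  # crawlable once normalized
```

Because the hardcoded rule is a substring test, it also excludes URLs where ".." appears outside a path segment, for example inside a file name or a query string. That is broader than a per-site regexp, which matches the "for now" wording in the commit message.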