(from github.com/charles-pinkston)
My crawler is still taking a long time to crawl my site. When I look through the fess-crawler.log file, I’m seeing the same URL being crawled/indexed a number of times. I have my crawler set to use 25 threads (Crawler > Web), and Simultaneous Crawler Config set to 5 (System > General). I’m seeing the same page at most 25 times, so I suspect it is related to the number of threads. Is there any way to force Fess to skip pages that have already been crawled/indexed?
On a related note, I’m seeing a lot of redirects. We have a URL rewrite rule that enforces https and a trailing slash on URLs. Is there any way to use the Pattern Match to have the crawler apply those rewrites before the pages are indexed?
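
For illustration, the normalization described above (force https, add a trailing slash) could be sketched roughly as follows. This is plain Python, not Fess configuration: `normalize_url` is a hypothetical helper, and leaving file-like paths (last segment containing a dot) untouched is an assumption, not necessarily what the actual rewrite rule does.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Hypothetical helper: force https and a trailing slash on the path."""
    parts = urlsplit(url)
    path = parts.path or "/"
    last_segment = path.rsplit("/", 1)[-1]
    # Assumption: a last segment containing a dot is a file (e.g. b.pdf)
    # and should not get a trailing slash.
    if not path.endswith("/") and "." not in last_segment:
        path += "/"
    # Always emit https, keeping host, query string, and fragment as they were.
    return urlunsplit(("https", parts.netloc, path, parts.query, parts.fragment))


if __name__ == "__main__":
    for u in ("http://example.com/docs", "https://example.com/docs/"):
        print(u, "->", normalize_url(u))
```

If the crawler saw only the normalized form, both example URLs would collapse to a single entry instead of triggering a redirect for the first one.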