Slow crawl/Index - same page crawled multiple times and redirects

(from github.com/charles-pinkston)
My crawler is still taking a long time to crawl my site. When I look through the fess-crawler.log file, I’m seeing the same URL being crawled/indexed a number of times. I have the crawler set to use 25 threads (Crawler > Web) and Simultaneous Crawler Config set to 5 (System > General). The same page shows up at most 25 times, so I think it has something to do with the number of threads. Is there any way to force Fess to skip already crawled/indexed pages?

On a related issue, I’m seeing a lot of redirects. We have a URL rewrite rule that enforces https and a trailing slash on URLs. Is there any way to use the Pattern Match to make the crawler apply those rewrites before the pages are indexed?

(from github.com/marevol)

> I’m seeing the same URL being crawled/indexed a number of times.

Could you attach fess-crawler.log?

> Is there any way to use the Pattern Match

See Path Mapping.
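
For anyone wondering what such a rule could look like: a Path Mapping entry pairs a regular expression with a replacement string. The Python sketch below only illustrates the two rewrites asked about (force https, add a trailing slash). The host www.example.com, both patterns, and the `normalize` helper are assumptions made up for the example, not taken from this thread or from Fess itself. Fess uses Java regular expressions, where a backreference in the replacement is written `$1` rather than `\1`, so test any pattern in Fess before relying on it.

```python
import re

# Hypothetical rewrite rules as (pattern, replacement) pairs, applied in order.
# www.example.com is a placeholder host, not one from this thread.
RULES = [
    # Force https on the site host.
    (re.compile(r"^http://www\.example\.com/"), "https://www.example.com/"),
    # Append a trailing slash when the last path segment contains no dot
    # (so it does not look like a file) and there is no query or fragment.
    (re.compile(r"^(https://www\.example\.com(?:/[^/?#.]+)*/[^/?#.]+)$"), r"\1/"),
]


def normalize(url: str) -> str:
    """Apply every rewrite rule in order and return the resulting URL."""
    for pattern, replacement in RULES:
        url = pattern.sub(replacement, url)
    return url


if __name__ == "__main__":
    for u in (
        "http://www.example.com/about",
        "https://www.example.com/about/",
        "https://www.example.com/docs/file.pdf",
    ):
        print(u, "->", normalize(u))
```

If the crawler sees only the rewritten form, the http and no-slash variants should no longer be fetched and redirected one by one.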

(from github.com/zackhorvath)
We had a similar issue, and I ended up resolving it by getting aggressive with included and excluded URLs for crawling. We do the HTTPS redirect on a load balancer, and we were getting errors when Fess reached out via HTTP. I’ve pasted a snippet of our configuration below; hope it helps!

Included URLs:
https://www.fredhutch.org/.*

Excluded URLs:
.*amp.*
.*\.gif
.*\.jpg
.*\.jpeg
.*\.jpe
.*\.pcx
.*\.png
.*\.tiff
.*\.bmp
.*\.ics
.*\.msg
.*\.css
.*\.js
http://www.fredhutch.org/.*
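
A quick way to sanity-check lists like these before a long crawl is to run a few sample URLs through them. The sketch below is only an approximation: it assumes each pattern must match the whole URL and that an exclude wins over an include, which is my understanding of the crawler's filter, not something confirmed in this thread. The `should_crawl` helper and the sample paths are made up for illustration, and only a subset of the excludes above is copied in.

```python
import re

# Patterns copied from the configuration above (subset of the excludes).
INCLUDED = [r"https://www.fredhutch.org/.*"]
EXCLUDED = [
    r".*amp.*",
    r".*\.gif",
    r".*\.jpg",
    r".*\.css",
    r".*\.js",
    r"http://www.fredhutch.org/.*",
]


def should_crawl(url: str) -> bool:
    """Return True if url matches an include pattern and no exclude pattern.

    Assumption: patterns are matched against the entire URL.
    """
    if not any(re.fullmatch(p, url) for p in INCLUDED):
        return False
    return not any(re.fullmatch(p, url) for p in EXCLUDED)


if __name__ == "__main__":
    # The paths below are hypothetical examples.
    for u in (
        "https://www.fredhutch.org/en/about.html",
        "http://www.fredhutch.org/en/about.html",
        "https://www.fredhutch.org/logo.gif",
        "https://www.fredhutch.org/en/campus.html",
    ):
        print(should_crawl(u), u)
```

One thing this makes visible: under these assumptions `.*amp.*` rejects any URL containing "amp" anywhere, so a path such as /en/campus.html would be skipped too. That may or may not be what you want.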