Slow crawl - Crawler Queue - Multiple records per URL

This is a follow-up to Issue 1870 - I’ve been focusing on other tasks for a while and am just getting back to looking at this.

I have my Fess crawler configured to run 5 threads (Crawler > Web) and 5 Simultaneous Crawlers (System > General). Per Issue 1420, I’m seeing multiple (up to 25) documents in the .crawler.queue index.
My crawls (roughly 28k pages) often take well over 24 hours.

In my configuration I’ve set:

  • crawler.document.cache.enabled=false
  • index.number_of_shards=15

I’m trying to understand how the crawler and index work together. Is it attempting to index each one of the records in the .crawler.queue (i.e. does it try to index each page up to 25 times)? Would it be better if I dropped, say, ‘Threads’ to 1 so it would just record each page 5 times?

What is your server spec, and did you check fess-crawler.log?

Here is a scrubbed example of what I’m seeing a lot of in my fess-crawler.log - it’s a set of 13 documents that appear to be re-crawled 4 times.


Mixed in with these types of lines, I see a lot of records that read like:

2019-08-23 01:55:54,142 [IndexUpdater] INFO Processing no docs (Doc:{access 1ms, cleanup 12ms}, Mem:{used 1GB, heap 2GB, max 4GB})


2019-08-23 01:55:58,122 [CoreLib-TimeoutManager] INFO [SYSTEM MONITOR] {"os":{"memory":{"physical":{"free":5131128832,"total":33565925376},"swap_space":{"free":5966655488,"total":6442446848}},"cpu":{"percent":8},"load_averages":[0.96, 0.96, 1.31]},"process":{"file_descriptor":{"open":380,"max":1048576},"cpu":{"percent":0,"total":3417340},"virtual_memory":{"total":10622029824}},"jvm":{"memory":{"heap":{"used":1171142704,"committed":2208976896,"max":5298978816,"percent":22},"non_heap":{"used":200336960,"committed":207163392}},"pools":{"direct":{"count":56,"used":270876673,"capacity":270876672},"mapped":{"count":0,"used":0,"capacity":0}},"gc":{"young":{"count":3186,"time":50595},"old":{"count":43,"time":5445}},"threads":{"count":64,"peak":65},"classes":{"loaded":15980,"total_loaded":16241,"unloaded":261},"uptime":93357653},"elasticsearch":null,"timestamp":1566525358122}


2019-08-23 01:52:11,536 [Crawler-20190822000000-1-2] INFO Crawling URL:
2019-08-23 01:52:11,545 [Crawler-20190822000000-1-2] INFO Redirect to URL:

An a tag in your page specifies one URL, but the actual URL is different, so the web server redirects to it.
It’s better for your page to use the actual URL in the a tag.

Thanks for the response. That makes sense, but it does introduce a bit of a problem: how can we update all of the links on our site to include a trailing /?

There are two approaches that might work for us, but I’m not sure how Fess would deal with them:

  • We could potentially add a JavaScript file to append the slash, but I’m not sure if Fess crawls the URLs after JS has loaded.
  • We could potentially add a canonical URL link to the specific pages. Does Fess honor those links?
    – e.g. <link rel="canonical" href="" />
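For the first approach, here is a minimal sketch of what the trailing-slash script might look like. It only illustrates the idea the first bullet describes; the function name and the example.com origin are hypothetical, and this would only help the crawler if it executed page JavaScript.

```javascript
// Hypothetical sketch: normalize internal links so they end with a
// trailing slash, avoiding the server-side redirect. Function name
// and example.com origin are placeholders, not part of Fess.
function addTrailingSlash(href) {
  const url = new URL(href, "https://example.com/");
  // Skip URLs that already end in "/" or whose last path segment
  // looks like a file (contains a dot, e.g. ".html", ".pdf").
  const lastSegment = url.pathname.split("/").pop();
  if (!url.pathname.endsWith("/") && !lastSegment.includes(".")) {
    url.pathname += "/";
  }
  return url.href;
}

// In the browser this could be applied to every internal anchor:
// document.querySelectorAll('a[href^="/"]').forEach(a => {
//   a.href = addTrailingSlash(a.getAttribute("href"));
// });
```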


  • Fess does not load JS files.
  • Fess handles canonical links, but the canonical URL is processed after the redirect.