Slow crawl - Crawler Queue - Multiple records per URL

(from github.com/charles-pinkston)
This is a follow-up to Issue 1870 - I’ve been focusing on other tasks for a while and am just getting back to looking at this.

I have my Fess crawler configured to run 5 threads (Crawler > Web) and 5 Simultaneous Crawlers (System > General). Per Issue 1420, I’m seeing multiple (up to 25) documents in the .crawler.queue index.
My crawls (roughly 28k pages) often take well over 24 hours.
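
For reference, a quick way to see how many queue records exist per URL is a terms aggregation against that index. Below is a minimal sketch, assuming the default Elasticsearch endpoint on localhost:9200 and an aggregatable url field in the queue documents - adjust the host, index name, and field for your setup:

  # Count queue records per URL via a terms aggregation.
  # Assumes Elasticsearch on localhost:9200 and an aggregatable "url" field.
  import json
  import urllib.request

  query = {
      "size": 0,
      "aggs": {
          "urls": {
              "terms": {"field": "url", "size": 50, "min_doc_count": 2}
          }
      },
  }

  req = urllib.request.Request(
      "http://localhost:9200/.crawler.queue/_search",
      data=json.dumps(query).encode("utf-8"),
      headers={"Content-Type": "application/json"},
  )
  with urllib.request.urlopen(req) as resp:
      result = json.load(resp)

  # Print each URL that has more than one queue record, with its count.
  for bucket in result["aggregations"]["urls"]["buckets"]:
      print(bucket["doc_count"], bucket["key"])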

In my fess_config.properties I’ve set:

  • crawler.document.cache.enabled=false
  • index.number_of_shards=15

I’m trying to understand how the crawler and the index work together. Is it attempting to index each of the records in the .crawler.queue (i.e. does it try to index each page up to 25 times)? Would it be better if I were to drop, say, ‘Threads’ to 1 so it would just record each page 5 times?

(from github.com/marevol)
What is your server spec? And did you check fess-crawler.log?

(from github.com/charles-pinkston)
Here is a scrubbed example of what I’m seeing a lot of in my fess-crawler.log - it’s a set of 13 documents that appear to be re-crawled 4 times.

Gist

Mixed in with these types of lines, I see a lot of records that read like:

2019-08-23 01:55:54,142 [IndexUpdater] INFO Processing no docs (Doc:{access 1ms, cleanup 12ms}, Mem:{used 1GB, heap 2GB, max 4GB})

or

2019-08-23 01:55:58,122 [CoreLib-TimeoutManager] INFO [SYSTEM MONITOR] {"os":{"memory":{"physical":{"free":5131128832,"total":33565925376},"swap_space":{"free":5966655488,"total":6442446848}},"cpu":{"percent":8},"load_averages":[0.96, 0.96, 1.31]},"process":{"file_descriptor":{"open":380,"max":1048576},"cpu":{"percent":0,"total":3417340},"virtual_memory":{"total":10622029824}},"jvm":{"memory":{"heap":{"used":1171142704,"committed":2208976896,"max":5298978816,"percent":22},"non_heap":{"used":200336960,"committed":207163392}},"pools":{"direct":{"count":56,"used":270876673,"capacity":270876672},"mapped":{"count":0,"used":0,"capacity":0}},"gc":{"young":{"count":3186,"time":50595},"old":{"count":43,"time":5445}},"threads":{"count":64,"peak":65},"classes":{"loaded":15980,"total_loaded":16241,"unloaded":261},"uptime":93357653},"elasticsearch":null,"timestamp":1566525358122}
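
To correlate those monitor lines with the slow crawl, here is a rough sketch for pulling the heap numbers out of them. It assumes fess-crawler.log is in the working directory and that the JSON payload starts at the first brace on each monitor line:

  # Extract heap usage from the [SYSTEM MONITOR] lines in fess-crawler.log.
  import json

  with open("fess-crawler.log", encoding="utf-8") as log:
      for line in log:
          if "[SYSTEM MONITOR]" not in line:
              continue
          payload = json.loads(line[line.index("{"):])
          heap = payload["jvm"]["memory"]["heap"]
          print(payload["timestamp"],
                "heap used:", heap["used"],
                "max:", heap["max"],
                "percent:", heap["percent"])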

(from github.com/marevol)

2019-08-23 01:52:11,536 [Crawler-20190822000000-1-2] INFO Crawling URL: https://mywebsite.com/what-if-more-than-one-user-is-involved
2019-08-23 01:52:11,545 [Crawler-20190822000000-1-2] INFO Redirect to URL: https://mywebsite.com/what-if-more-than-one-user-is-involved/

An <a> tag in your page specifies https://mywebsite.com/what-if-more-than-one-user-is-involved, but the actual URL is https://mywebsite.com/what-if-more-than-one-user-is-involved/.
So, the web server redirects to it.
It’s better for your page to use https://mywebsite.com/what-if-more-than-one-user-is-involved/ in the <a> tag.
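
A quick way to confirm this is to request the URL without the trailing slash and compare it with the final URL after the redirect. A small sketch (mywebsite.com is the scrubbed placeholder from the log above):

  # Request the non-slash URL and print the URL after redirects are followed.
  import urllib.request

  url = "https://mywebsite.com/what-if-more-than-one-user-is-involved"
  with urllib.request.urlopen(url) as resp:
      # resp.url is the URL after the web server's redirect was followed.
      print(url, "->", resp.url)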

(from github.com/charles-pinkston)
Thanks for the response. That makes sense, but it does introduce a bit of a problem, since updating all of the links on our site to include a trailing / is not straightforward for us.

There are two approaches that might work for us, but I’m not sure how Fess would deal with them:

  • We could potentially add a JavaScript file to append the slash, but I’m not sure if Fess crawls the URLs after JS has loaded.
  • We could potentially add a canonical URL link to the specific pages. Does Fess honor those links?
    – e.g. <link rel="canonical" href="https://mywebsite.com/what-if-more-than-one-user-is-involved/" />

(from github.com/marevol)

  • Fess does not load JS files.
  • Fess handles canonical, but the canonical link is processed after the redirect (see the sketch below).
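
If you go the canonical route, a quick sketch for checking which canonical URL a page actually declares - it assumes a plain <link rel="canonical" href="..."> tag in the head and uses the scrubbed placeholder domain from above:

  # Fetch a page and print the canonical URL declared in its <head>.
  import urllib.request
  from html.parser import HTMLParser

  class CanonicalFinder(HTMLParser):
      canonical = None
      def handle_starttag(self, tag, attrs):
          attrs = dict(attrs)
          if tag == "link" and attrs.get("rel") == "canonical":
              self.canonical = attrs.get("href")

  url = "https://mywebsite.com/what-if-more-than-one-user-is-involved/"
  with urllib.request.urlopen(url) as resp:
      html = resp.read().decode("utf-8", errors="replace")

  finder = CanonicalFinder()
  finder.feed(html)
  print("canonical:", finder.canonical)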