Slow crawl - Crawler Queue - Multiple records per URL

This is a follow-up to Issue 1870 - I’ve been focusing on other tasks for a while and am just getting back to looking at this.

I have my Fess crawler configured to run 5 threads (Crawler > Web) and 5 Simultaneous Crawlers (System > General). Per Issue 1420, I’m seeing multiple (up to 25) documents in the .crawler.queue index.
My crawls (roughly 28k pages) often take well over 24 hours.

In my configuration I’ve set:

  • crawler.document.cache.enabled=false
  • index.number_of_shards=15

I’m trying to understand how the crawler and index work together. Is it attempting to index each one of the records in the .crawler.queue (i.e. does it try to index each page up to 25 times)? Would it be better if I dropped, say, ‘Threads’ to 1 so it would just record each page 5 times?

What is your server spec, and did you check fess-crawler.log?

Here is a scrubbed example of what I’m seeing a lot of in my fess-crawler.log - it’s a set of 13 documents that appear to be re-crawled 4 times.


Mixed in with these types of lines, I see a lot of records that read like:

2019-08-23 01:55:54,142 [IndexUpdater] INFO Processing no docs (Doc:{access 1ms, cleanup 12ms}, Mem:{used 1GB, heap 2GB, max 4GB})


2019-08-23 01:55:58,122 [CoreLib-TimeoutManager] INFO [SYSTEM MONITOR] {"os":{"memory":{"physical":{"free":5131128832,"total":33565925376},"swap_space":{"free":5966655488,"total":6442446848}},"cpu":{"percent":8},"load_averages":[0.96, 0.96, 1.31]},"process":{"file_descriptor":{"open":380,"max":1048576},"cpu":{"percent":0,"total":3417340},"virtual_memory":{"total":10622029824}},"jvm":{"memory":{"heap":{"used":1171142704,"committed":2208976896,"max":5298978816,"percent":22},"non_heap":{"used":200336960,"committed":207163392}},"pools":{"direct":{"count":56,"used":270876673,"capacity":270876672},"mapped":{"count":0,"used":0,"capacity":0}},"gc":{"young":{"count":3186,"time":50595},"old":{"count":43,"time":5445}},"threads":{"count":64,"peak":65},"classes":{"loaded":15980,"total_loaded":16241,"unloaded":261},"uptime":93357653},"elasticsearch":null,"timestamp":1566525358122}


2019-08-23 01:52:11,536 [Crawler-20190822000000-1-2] INFO Crawling URL:
2019-08-23 01:52:11,545 [Crawler-20190822000000-1-2] INFO Redirect to URL:

An a tag in your page specifies one URL, but the actual URL is different, so the web server redirects to it.
It’s better for your page to use the actual URL in the a tag.

Thanks for the response. That makes sense, but it does introduce a bit of a problem: how can we update all of the links on our site to include a trailing /?

There are two approaches that might work for us, but I’m not sure how Fess would deal with them:

  • We could potentially add a JavaScript file to append the slash, but I’m not sure if Fess crawls the URLs after JS has loaded.
  • We could potentially add a canonical URL link to the specific pages. Does Fess honor those links?
    – e.g. <link rel="canonical" href="" />
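For the first approach, here is a minimal sketch of what the trailing-slash script might look like. It only illustrates the idea the first bullet describes; the function name and the example.com origin are hypothetical, and this would only help the crawler if it executed page JavaScript.

```javascript
// Hypothetical sketch: normalize internal links so they end with a
// trailing slash, avoiding the server-side redirect. Function name
// and example.com origin are placeholders, not part of Fess.
function addTrailingSlash(href) {
  const url = new URL(href, "https://example.com/");
  // Skip URLs that already end in "/" or whose last path segment
  // looks like a file (contains a dot, e.g. ".html", ".pdf").
  const lastSegment = url.pathname.split("/").pop();
  if (!url.pathname.endsWith("/") && !lastSegment.includes(".")) {
    url.pathname += "/";
  }
  return url.href;
}

// In the browser this could be applied to every internal anchor:
// document.querySelectorAll('a[href^="/"]').forEach(a => {
//   a.href = addTrailingSlash(a.getAttribute("href"));
// });
```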


  • Fess does not load JS files.
  • Fess handles canonical links, but the canonical URL is processed after the redirect.