Crawling Configuration Recommendations

discuss · November 7, 2019, 12:32pm

(from github.com/AraGenji)
Hi @marevol,
I have multiple Windows file shares with about 700,000 files to crawl in total.
Over the last few days I experimented with the number of shards for the .crawler and fess indices, the number of threads, the interval time, and the simultaneous crawler config setting, and I turned off the document cache as you suggested in ticket #1716.
As of now, the initial crawl would take about a week. Is this the expected performance of Fess, or did I do something completely wrong? My Fess instance is running on a CentOS 7 VM with 4 CPU cores and 12 GB of RAM. I would be really thankful if you could give me some numbers to dial into the settings mentioned above to hopefully make things a little faster.

discuss · November 8, 2019, 9:55am

(from github.com/AraGenji)
OK, I changed the following in fess_config.properties and now the performance is acceptable.

indexer.webfs.update.interval=10000 --> indexer.webfs.update.interval=100
indexer.unprocessed.document.size=10 --> indexer.unprocessed.document.size=1000

But the sending process seems to be a bottleneck in my case. Or is it normal that the crawl finishes first and only afterwards the data gets moved to the fess index?
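
For reference, here are the two changed lines as they would appear in fess_config.properties after the edit. This is only a sketch restating the values from this post; the comments are annotations added for illustration, and suitable values for another environment may differ.

```properties
# Value after the change described above (previously 10000).
indexer.webfs.update.interval=100
# Value after the change described above (previously 10).
indexer.unprocessed.document.size=1000
```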

discuss · November 9, 2019, 1:25am

(from github.com/marevol)
Crawler performance depends on a lot of factors, such as settings, server spec, network,…
So, there is no single solution. The settings above and https://github.com/codelibs/fess/issues/1716#issuecomment-398908408 are some of the options.
If you need more support for your environment, please contact commercial support.