I have a large data lake containing files of virtually every file extension.
To index large PDF files I have increased the max content length to 256 MB,
but the crawling chokes on something, and I cannot figure out what exactly it is.
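For context, the content-length increase was made along these lines. This is only a sketch based on the stock `app/WEB-INF/classes/crawler/contentlength.xml` shipped with Fess; the exact file location and the surrounding XML may differ per install, and the `application/pdf` entry shown here is just the one relevant to my case:

```xml
<!-- crawler/contentlength.xml (sketch; paths/values per my setup) -->
<component name="contentLengthHelper"
           class="org.codelibs.fess.crawler.helper.ContentLengthHelper"
           instance="singleton">
  <!-- default limit for all other mime types: 10 MB -->
  <property name="defaultMaxLength">10485760</property>
  <!-- raised limit for PDFs: 256 MB = 268435456 bytes -->
  <postConstruct name="addMaxLength">
    <arg>"application/pdf"</arg>
    <arg>268435456</arg>
  </postConstruct>
</component>
```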
When the crawler hangs, I get these entries in the Fess log:
Processing 0/1005 docs (Doc:{access 360ms}, Mem:{used 987.722MB, heap 16.384GB, max 16.384GB}) | @timestamp=2026-04-04T14:26:31.098Z log.level=INFO ecs.version=1.2.0 service.name=fess event.dataset=crawler process.thread.name=IndexUpdater log.logger=org.codelibs.fess.indexer.IndexUpdater
Processing 0/1005 docs (Doc:{access 459ms}, Mem:{used 987.748MB, heap 16.384GB, max 16.384GB}) | @timestamp=2026-04-04T14:26:36.557Z log.level=INFO ecs.version=1.2.0 service.name=fess event.dataset=crawler process.thread.name=IndexUpdater log.logger=org.codelibs.fess.indexer.IndexUpdater
Processing 0/1005 docs (Doc:{access 400ms}, Mem:{used 987.95MB, heap 16.384GB, max 16.384GB}) | @timestamp=2026-04-04T14:26:41.957Z log.level=INFO ecs.version=1.2.0 service.name=fess event.dataset=crawler process.thread.name=IndexUpdater log.logger=org.codelibs.fess.indexer.IndexUpdater
Processing 0/1005 docs (Doc:{access 372ms}, Mem:{used 987.982MB, heap 16.384GB, max 16.384GB}) | @timestamp=2026-04-04T14:26:47.330Z log.level=INFO ecs.version=1.2.0 service.name=fess event.dataset=crawler process.thread.name=IndexUpdater log.logger=org.codelibs.fess.indexer.IndexUpdater
It stays on 0/1005 indefinitely.
I can then check the crawler logs, find the last directory crawled, and use that path as a new
crawling root; crawling that path then succeeds. So my impression is that my files are not the culprit, but I may be wrong.
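The way I locate the last crawled path is roughly this. It assumes the crawler logs each target with a `Crawling URL:` prefix; check your own `fess-crawler.log` for the exact wording, as it may differ by version. The sample log file below is made up purely for illustration:

```shell
# Hypothetical stand-in for fess-crawler.log; replace with your real log path.
cat > /tmp/fess-crawler.sample.log <<'EOF'
2026-04-04 14:20:01 INFO Crawling URL: smb://server/share/dirA/file1.pdf
2026-04-04 14:22:13 INFO Crawling URL: smb://server/share/dirB/file2.docx
EOF

# The last URL the crawler touched before hanging is the candidate culprit;
# its parent directory becomes the new crawling root for the retry.
grep 'Crawling URL:' /tmp/fess-crawler.sample.log | tail -n 1
```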
Of course I can exclude things, and I have already done so with a huge extension list, but I want to understand what is going on and tackle the problem correctly.
Any help is greatly appreciated.