I have a large data lake containing files of virtually every file extension.
To index large PDF files I have increased the max content length to 256 MB,
but the crawling chokes on something, and I cannot figure out what exactly it is.
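For context, the content-length increase was made along these lines. This is only a sketch based on the stock `app/WEB-INF/classes/crawler/contentlength.xml` shipped with Fess; the exact file location and the surrounding XML may differ per install, and the `application/pdf` entry shown here is just the one relevant to my case:

```xml
<!-- crawler/contentlength.xml (sketch; paths/values per my setup) -->
<component name="contentLengthHelper"
           class="org.codelibs.fess.crawler.helper.ContentLengthHelper"
           instance="singleton">
  <!-- default limit for all other mime types: 10 MB -->
  <property name="defaultMaxLength">10485760</property>
  <!-- raised limit for PDFs: 256 MB = 268435456 bytes -->
  <postConstruct name="addMaxLength">
    <arg>"application/pdf"</arg>
    <arg>268435456</arg>
  </postConstruct>
</component>
```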
When the crawler hangs, I get these entries in the Fess log:
Processing 0/1005 docs (Doc:{access 360ms}, Mem:{used 987.722MB, heap 16.384GB, max 16.384GB}) | @timestamp=2026-04-04T14:26:31.098Z log.level=INFO ecs.version=1.2.0 service.name=fess event.dataset=crawler process.thread.name=IndexUpdater log.logger=org.codelibs.fess.indexer.IndexUpdater
Processing 0/1005 docs (Doc:{access 459ms}, Mem:{used 987.748MB, heap 16.384GB, max 16.384GB}) | @timestamp=2026-04-04T14:26:36.557Z log.level=INFO ecs.version=1.2.0 service.name=fess event.dataset=crawler process.thread.name=IndexUpdater log.logger=org.codelibs.fess.indexer.IndexUpdater
Processing 0/1005 docs (Doc:{access 400ms}, Mem:{used 987.95MB, heap 16.384GB, max 16.384GB}) | @timestamp=2026-04-04T14:26:41.957Z log.level=INFO ecs.version=1.2.0 service.name=fess event.dataset=crawler process.thread.name=IndexUpdater log.logger=org.codelibs.fess.indexer.IndexUpdater
Processing 0/1005 docs (Doc:{access 372ms}, Mem:{used 987.982MB, heap 16.384GB, max 16.384GB}) | @timestamp=2026-04-04T14:26:47.330Z log.level=INFO ecs.version=1.2.0 service.name=fess event.dataset=crawler process.thread.name=IndexUpdater log.logger=org.codelibs.fess.indexer.IndexUpdater
It stays on 0/1005 indefinitely.
I can then check the crawler logs, find the last directory crawled, and use that path as a new
crawling root; crawling that path then succeeds. So my impression is that my files are not the culprit, but I may be wrong.
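The way I locate the last crawled path is roughly this. It assumes the crawler logs each target with a `Crawling URL:` prefix; check your own `fess-crawler.log` for the exact wording, as it may differ by version. The sample log file below is made up purely for illustration:

```shell
# Hypothetical stand-in for fess-crawler.log; replace with your real log path.
cat > /tmp/fess-crawler.sample.log <<'EOF'
2026-04-04 14:20:01 INFO Crawling URL: smb://server/share/dirA/file1.pdf
2026-04-04 14:22:13 INFO Crawling URL: smb://server/share/dirB/file2.docx
EOF

# The last URL the crawler touched before hanging is the candidate culprit;
# its parent directory becomes the new crawling root for the retry.
grep 'Crawling URL:' /tmp/fess-crawler.sample.log | tail -n 1
```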
Of course I can exclude things, and I have already done so with a huge extension list, but I want to understand what is going on and tackle the problem correctly.
Any help is greatly appreciated.