Setting maxContentLength in Config Parameters of File System crawler is erratic

(from github.com/doct)
Setting client.maxContentLength=100000000 in Config Parameters of the File Crawling Configuration only affects some of the crawled files, but not others. It is not clear when the value is used and when it is not.

Whether the configured value is applied does not depend on the file type.

This observation is based on entries in the “Failure URL” System Info section.

Edit: reducing client.maxContentLength makes the non-default value get applied more often. I suspect this is a Java VM memory issue; however, I can't find any error log entries about it.

Edit 2: doubling FESS_MIN_MEM to 512m and quadrupling FESS_MAX_MEM to 4g doesn’t seem to have any impact on the issue.

Properties for bug report:
file.separator=
file.encoding=UTF-8
java.runtime.version=9+181
java.vm.info=mixed mode
java.vm.name=Java HotSpot™ 64-Bit Server VM
java.vm.vendor=Oracle Corporation
java.vm.version=9+181
os.arch=amd64
os.name=Windows 10
os.version=10.0
user.country=GB
user.language=en
user.timezone=Europe/Berlin
suggest.document=true
purge.searchlog.day=-1
thumbnail.enabled=false
append.query.parameter=false
search.log=false
web.api.popularword=true
purge.userinfo.day=-1
purge.suggest.searchlog.day=30
purge.joblog.day=-1
purge.by.bots=Crawler,crawler,Bot,bot,Slurp,Yeti,Baidu,Steeler,ichiro,hotpage,Feedfetcher,ia_archiver,Y!J-BRI,Google Desktop,Seznam,Tumblr,YandexBot,Chilkat,CloudFront,Mediapartners,MSIE 6
login.link.enabled=true
user.info=false
user.favorite=false
login.required=false
result.collapsed=false
crawling.thread.count=5
ldap.memberof.attribute=memberOf
csv.file.encoding=UTF-8
crawling.incremental=true
web.api.json=true
day.for.cleanup=3
failure.countthreshold=-1
suggest.searchlog=true

(from github.com/marevol)
Did you change app/WEB-INF/classes/crawler/contentlength.xml?
The minimum of the two values is applied.
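For reference, a sketch of what app/WEB-INF/classes/crawler/contentlength.xml typically looks like in a default Fess install (the component class name, DTD, and the 10 MB default are from memory and may differ between Fess versions):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE components PUBLIC "-//DBFLUTE//DTD LastaDi 1.0//EN"
	"http://dbflute.org/meta/lastadi10.dtd">
<components>
	<component name="contentLengthHelper"
			class="org.codelibs.fess.crawler.helper.ContentLengthHelper" instance="singleton">
		<!-- Default cap in bytes; raise this so it does not undercut
		     client.maxContentLength=100000000 set in Config Parameters. -->
		<property name="defaultMaxLength">100000000</property>
	</component>
</components>
```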

(from github.com/doct)
Thanks, that worked out fine!

That leaves the question of why the Config Parameter setting was only applied sporadically, though.

(from github.com/marevol)
client.maxContentLength is applied to all documents, and it is checked before contentlength.xml.
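Taken together, the two comments imply that the effective cap is simply the smaller of the two configured limits. A minimal sketch of that resolution rule (the class and method names are illustrative, not Fess internals; the numbers are the reporter's value and the usual 10 MB default):

```java
// Sketch of the limit-resolution rule described above (assumption: the
// effective cap is the minimum of client.maxContentLength and the
// contentlength.xml value for the document's type).
public class EffectiveLimit {
    // Returns the content-length cap the crawler would actually enforce.
    static long resolve(long clientMaxContentLength, long xmlMaxLength) {
        return Math.min(clientMaxContentLength, xmlMaxLength);
    }

    public static void main(String[] args) {
        long clientMax = 100_000_000L; // client.maxContentLength from the report
        long xmlDefault = 10_485_760L; // 10 MB, common contentlength.xml default
        System.out.println(resolve(clientMax, xmlDefault)); // prints 10485760
    }
}
```

This would explain why raising client.maxContentLength alone has no visible effect: the lower contentlength.xml default still wins.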

(from github.com/doct)

client.maxContentLength is applied to all documents

The point is, it wasn't applied to all documents, just some.

(from github.com/marevol)
I could not reproduce it.
Which file types is it not applied to?

(from github.com/doct)
Whether client.maxContentLength was applied did not depend on the file type. One could have assumed it would be applied to some file types and not to others; that was not the case. Instead, the configured client.maxContentLength was sometimes applied and sometimes not.

In the meantime, I have removed Fess. Thanks for your support.

(from github.com/rodrigobml)
I need to index very large documents, up to 5 GB. Do I just need to change contentlength.xml and -Xms/-Xmx in fess_config.properties?

(from github.com/15738519635)

I need to index very large documents, up to 5 GB. Do I just need to change contentlength.xml and -Xms/-Xmx in fess_config.properties?

@rodrigobml
did you manage to index your large documents of around 5 GB? How did you do it?