When I run the crawler from Fess, installed in a VM, against an SMB server on a VM, the first crawl collects everything without any problems. However, after the crawl has run several more times, some files end up registered twice.
Have you seen a similar problem, or do you have any idea what might cause it?
I checked the existing issues but could not find a matching one.
The environment is as follows.
■ Environment
fess: 12.3.0
elasticsearch: 6.4.0 (internal fess)
Target file extension: .xlsx
(from github.com/tanayoshi1002)
When I checked fess-crawler.log for a file that was registered multiple times, I found that it was crawled as usual, without Not Modified being logged, even though the file had not been changed.
2019-09-27 18:32:42,903 [Crawler-20190717183000-10-2] INFO Crawling URL: smb://url ...
In addition, the following INFO message appeared near that log entry. Could it be related to the problem?
2019-09-27 18:33:24,729 [IndexUpdater] INFO Processing 11/16 docs (Doc: {access 3ms}, Mem: {used 169MB, heap 255MB, max 1007MB})
About 500 files are collected per crawl, and the crawl runs about every 10 minutes.
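To see which URLs are fetched repeatedly, a small script can count the "Crawling URL:" lines per URL in fess-crawler.log. This is only a sketch: it assumes every fetch is logged in the same shape as the line quoted above, and the function names and the sample smb://host/share/... URLs are hypothetical, not part of Fess.

```python
from collections import Counter

# Marker text taken from the log line quoted in this thread.
MARKER = "Crawling URL:"


def count_crawled_urls(lines):
    """Count how many times each URL appears in 'Crawling URL:' log lines."""
    counts = Counter()
    for line in lines:
        # partition() returns an empty separator when the marker is absent.
        _, found, rest = line.partition(MARKER)
        if found:
            counts[rest.strip()] += 1
    return counts


def repeated_urls(lines, threshold=2):
    """Return only the URLs logged at least `threshold` times."""
    counts = count_crawled_urls(lines)
    return {url: n for url, n in counts.items() if n >= threshold}


def report(path="fess-crawler.log"):
    """Print the repeated URLs found in the log file at `path` (assumed location)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for url, n in sorted(repeated_urls(handle).items()):
            print(f"{n}\t{url}")
```

Calling report() with the path to the log prints one line per repeated URL. Because the crawl runs about every 10 minutes, counts are easier to interpret when the log is first narrowed to a known time window, so that files fetched once per crawl can be told apart from files fetched more often.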
Change the maximum file size to crawl (from 10 MB to 20 MB):
\fess-12.3.0\app\WEB-INF\classes\crawler\contentlength.xml
<property name="defaultMaxLength">20971520</property><!-- 20M -->
Run fess.bat.
Access FESS from the browser and change the following settings.
※ Settings not listed here are left at the values they have when newly created.
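As a quick check of the defaultMaxLength value used in the steps above, the byte count can be derived from the size in megabytes. This is a minimal sketch that assumes 1 MB means 1024 * 1024 bytes; the helper name is hypothetical.

```python
def megabytes_to_bytes(megabytes):
    """Convert megabytes to bytes, taking 1 MB as 1024 * 1024 bytes."""
    return megabytes * 1024 * 1024


print(megabytes_to_bytes(20))  # 20971520, the value set in contentlength.xml
print(megabytes_to_bytes(10))  # 10485760
```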
(from github.com/tanayoshi1002)
Sorry, I tried again with Java 11, but the problem still reproduced.
Due to my circumstances I can only test with the embedded Elasticsearch, but I will check the debug log.
I will add more if I find anything.