Fess collects the same file twice

(from github.com/tanayoshi1002)
Hi marevol,

When I run the crawler from Fess installed on a VM against an SMB server that also runs on a VM,
the first crawl collects everything without problems, but after several more crawls some files end up collected twice.
Have you seen anything similar, or do you have any idea what the cause might be?
I searched the existing issues but couldn’t find a matching one.
The environment is as follows.
■ Environment
fess: 12.3.0
elasticsearch: 6.4.0 (internal fess)
Target file extension: .xlsx

Thanks
tanayoshi

(from github.com/marevol)
Did you check fess-crawler.log?

(from github.com/tanayoshi1002)
I checked fess-crawler.log for a file that was registered multiple times.
The file was crawled as usual, without Not Modified being logged, even though the file had not been changed.
2019-09-27 18:32:42,903 [Crawler-20190717183000-10-2] INFO Crawling URL: smb://url...

In addition, the following INFO message appeared around the same place in the log. Could it be related?
About 500 files are collected per crawl, and a crawl runs roughly every 10 minutes.
2019-09-27 18:33:24,729 [IndexUpdater] INFO Processing 11/16 docs (Doc: {access 3ms}, Mem: {used 169MB, heap 255MB, max 1007MB})
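
For anyone checking the same thing, here is a minimal sketch (not part of Fess) that scans fess-crawler.log for `Crawling URL:` lines in the format quoted above and lists every URL that was logged more than once, together with the timestamps. The function names and the `smb://host/...` URLs in the example are made up for illustration, and the sketch assumes one log entry per line.

```python
import re
from collections import defaultdict

# Matches log lines shaped like the one quoted above, e.g.
# 2019-09-27 18:32:42,903 [Crawler-20190717183000-10-2] INFO Crawling URL: smb://host/share/a.xlsx
# Optional spaces after the colons in the time are tolerated.
CRAWL_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\s?\d{2}:\s?\d{2},\d{3})\s+"
    r"\[(?P<thread>[^\]]+)\]\s+INFO\s+Crawling URL:\s*(?P<url>.+?)\s*$"
)


def crawl_times_by_url(lines):
    """Map each crawled URL to the list of timestamps at which it was logged."""
    times = defaultdict(list)
    for line in lines:
        match = CRAWL_LINE.match(line)
        if match:  # other lines (e.g. IndexUpdater messages) are skipped
            times[match.group("url")].append(match.group("ts"))
    return dict(times)


def urls_crawled_more_than_once(lines):
    """Keep only the URLs that appear in more than one 'Crawling URL' line."""
    return {
        url: stamps
        for url, stamps in crawl_times_by_url(lines).items()
        if len(stamps) > 1
    }
```

To use it on a real log, open the file and pass the file object directly, for example `urls_crawled_more_than_once(open("fess-crawler.log", encoding="utf-8"))`. A URL showing up in several crawls is not a problem in itself; the list only narrows down which entries to compare against the Not Modified messages.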

(from github.com/marevol)
I could not reproduce it. Could you provide steps to reproduce it?

(from github.com/tanayoshi1002)
The steps are as follows.

  1. Store more than 500 files on the file server.
  2. Change the maximum file size to crawl (from 10M to 20M).
    \fess-12.3.0\app\WEB-INF\classes\crawler\contentlength.xml
    <property name="defaultMaxLength">20971520</property><!-- 20M -->
  3. Run fess.bat.
  4. Access Fess from the browser and change the following settings.
    ※ Settings not listed here are left at their values from a fresh installation.

| Settings | Item | Value |
|---|---|---|
| General - Crawler | Delete previous document | -1 |
| Scheduler - Default Crawler | Schedule | */15 * * * * |
| File Config (Create New) | Name | any |
| File Config (Create New) | Path | smb://url… |
| File Config (Create New) | Search target path | .*.ppt$ .*.pptx$ .*.doc$ .*.docx$ .*.xls$ .*.xlsx$ .*.xlsm$ .*.pdf$ .*.txt$ .*.csv$ .*.xml$ .*.html$ .*.js$ .*.c$ .*.h$ .*.java$ .*.hpp$ .*.cpp$ .*.gz$ .*.tar$ .*.zip$ |
| File Auth (Create New) | Host name | Target host name |
| File Auth (Create New) | Scheme | Samba |
| File Auth (Create New) | User name | User name that can access the target |
| File Auth (Create New) | Password | User password |
| File Auth (Create New) | File crawl settings | Crawler created with File Config |

  5. Leave it running for a few days.
    ※ I’m not sure whether it is related, but I restarted my PC once during that time.
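
As a side check on step 2, a small sketch that reads the `defaultMaxLength` property line quoted above and converts it to MiB, to confirm that 20971520 bytes really is 20M. The helper name `max_length_bytes` is made up for this example, and the sketch parses only that single `<property>` element, not the whole contentlength.xml.

```python
import xml.etree.ElementTree as ET

MIB = 1024 * 1024  # bytes per MiB


def max_length_bytes(property_xml):
    """Return the byte value of a <property name="defaultMaxLength"> element."""
    elem = ET.fromstring(property_xml)
    if elem.tag != "property" or elem.get("name") != "defaultMaxLength":
        raise ValueError("not a defaultMaxLength property element")
    return int(elem.text.strip())


# The property line from step 2 (the trailing XML comment is left out here).
snippet = '<property name="defaultMaxLength">20971520</property>'
size = max_length_bytes(snippet)

print(size, "bytes =", size / MIB, "MiB")
print("10M in bytes:", 10 * MIB)
```

So the value in step 2 is exactly 20 × 1024 × 1024, and the 10M it replaces corresponds to 10485760 bytes.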

(from github.com/marevol)
Does it reproduce with the latest version?

(from github.com/tanayoshi1002)
Yes, it reproduces. I confirmed it with the following versions.
FESS: 13.4.0
Elasticsearch: 7.4.0 (internal fess)
JDK: 13.0.1

(from marevol (Shinsuke Sugaya) · GitHub)
It does not reproduce in our environment.
I suggest checking the debug logs when it occurs.

> Elasticsearch: 7.4.0 (internal fess)

Embedded elasticsearch is not suitable for an evaluation.

> JDK: 13.0.1

Fess 13 supports only Java 11.

(from github.com/tanayoshi1002)
Sorry about that. I tried again with Java 11, and the problem still reproduces.
Due to constraints on my side I can only test with the embedded Elasticsearch, but I will check the debug log.
I will post an update if I find anything.