I am using FESS to crawl local or SMB shared files (for example, .pdf files), but it failed.
My File Crawling Configuration is below:

ID: sGWHM48BEQDFG8pcj9ZH
Name: A-shared
Paths: file:/var/lib/fess/
Included Paths For Crawling: .*\.(pdf|doc|docx|xls|xlsx|ppt|pptx|txt)$
Excluded Paths For Crawling:
Included Paths For Indexing:
  .*?/var/lib/fess/.*\.pdf$
  .*?/var/lib/fess/.*\.txt$
  .*?/var/lib/fess/.*\.doc$
  .*?/var/lib/fess/.*\.docx$
Excluded Paths For Indexing:
Config Parameters:
Depth:
Max Access Count:
The number of Thread: 5
Interval time: 1000 ms
Boost: 1.0
Permissions: {role}guest
Virtual Hosts:
Status: Enabled
Description:
I started the job in the Job Scheduler, but no results appear on the search page.
I checked the crawling logs:
Crawling Information
Session ID: 20240503083101
Crawler start time: 2024-05-03T08:31:06.605+0000
Crawl start time (Web/File system): 2024-05-03T08:31:06.651+0000
Crawl exec time (Web/File system): 31305 ms
Indexing exec time (Web/File system): 48 ms
Index size (Web/File system): 0
Crawl end time (Web/File system): 2024-05-03T08:31:47.406+0000
Crawler status: true
Crawler end time: 2024-05-03T08:31:47.407+0000
Crawler exec time: 40802 ms
and found that Index size (Web/File system) is zero.
I deployed FESS using docker-compose with the following command:

docker-compose --env-file .env.elasticsearch -f compose.yaml -f compose-elasticsearch8.yaml up -d
A directory is a target path that the crawler scans. Therefore, you need to add directories to Included Paths For Crawling. However, the file protocol cannot determine whether a path is a directory, so you should use the SMB protocol and add .*/ to the setting.
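To illustrate this, a corrected configuration might look like the sketch below; the SMB host and share name are placeholders, not values from this thread:

Paths: smb://fileserver/share/
Included Paths For Crawling:
  .*/
  .*\.(pdf|doc|docx|xls|xlsx|ppt|pptx|txt)$

The .*/ line is what lets the crawler descend into directories, whose URIs end with a slash and therefore never match a file-extension pattern. As far as I know, Fess evaluates these patterns as Java regular expressions against each URI it encounters; a standalone check (with made-up URIs) shows the effect:

// PathPatternCheck.java -- a minimal sketch, assuming the Included Paths
// patterns are matched as Java regular expressions against full URIs.
import java.util.regex.Pattern;

public class PathPatternCheck {
    public static void main(String[] args) {
        // Matches files by extension, but never a directory URI.
        Pattern files = Pattern.compile(".*\\.(pdf|doc|docx|xls|xlsx|ppt|pptx|txt)$");
        // Matches directory URIs, which end with a slash.
        Pattern dirs = Pattern.compile(".*/");

        String[] uris = {
            "smb://fileserver/share/",           // a directory (placeholder URI)
            "smb://fileserver/share/manual.pdf", // a target file (placeholder URI)
        };
        for (String uri : uris) {
            System.out.printf("%-36s files=%-5b dirs=%b%n",
                    uri, files.matcher(uri).matches(), dirs.matcher(uri).matches());
        }
    }
}

With only the extension pattern configured, the directory URI matches nothing, so the crawler never enters it and the files inside are never reached.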
Many thanks to Shinsuke Sensei.
I tried several settings to search my PDF files but failed; my settings are shown below. Would you please tell me the right way to search my PDF files?
Shinsuke Sensei, I really appreciate your kindness.
I set the parameters following your instructions; however, I still get no search results from the network share containing PDF files.
I am confused about my settings and have no idea how to solve this.
Shinsuke Sensei, I am very glad to tell you that I finally got the right search results.
However, I have no idea why it works.
Most importantly, thank you for your patience and kindness in helping me out.
If Sensei could explain why it works, or add it to the documentation, I would appreciate it even more.
Since I do not have enough information to reproduce the issue, I am not certain of the cause. If you encounter an unknown problem, it is better to check the log messages at the debug level. To change the log level, you can set .logLevel("debug") in the crawler job.
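For reference, here is a sketch of where that call goes. The Default Crawler job script (Admin > Scheduler) already contains a logLevel call in its method chain; the exact chain varies between Fess versions, so the rest is abbreviated and only the changed argument is the point:

// Default Crawler job script (abbreviated sketch; keep the rest of
// your stock script's method chain unchanged)
return container.getComponent("crawlJob")
        .logLevel("debug")  // was "info"; enables debug-level crawler logs
        // ...remaining stock method calls...
        .execute();

Remember to restore the level to "info" after debugging, since debug logging is verbose.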