Multiple Crawlers with includes/excludes on same path

Hi @all,

is it possible to define multiple crawlers with different includes/excludes on the same path?
The reason we would like to split this up is the sheer number of files to crawl and index.
We previously had a single job that included the entire path, but it kept running into OOM errors and never finished cleanly.

For this reason, the question is whether a construct like the following would work:

Job a:
Path: smb://mysrv/data/
Includes Crawling: ^(.?(?:(A|a)_ ?(A|a).).$
Excludes Crawling: .*.(?i)(db|tmp|lnk|inf)
Includes Indexing: .*.(?i)(pdf|xlsx|xlsm|xls|pptx|pptm|ppt|docx|docm|doc|rtf|vsd|odt|csv|txt)

Job b:
Path: smb://mysrv/data/
Includes Crawling: ^(.?(?:(B|b)_ ?(B|b).).$
Excludes Crawling: .*.(?i)(db|tmp|lnk|inf)
Includes Indexing: .*.(?i)(pdf|xlsx|xlsm|xls|pptx|pptm|ppt|docx|docm|doc|rtf|vsd|odt|csv|txt)
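
Put differently, the intent is roughly the following, assuming the top-level folders really do start with A_/a_ and B_/b_ (these anchored patterns are only an illustration, not our exact expressions):

Job a Includes Crawling: ^smb://mysrv/data/[Aa]_.*$
Job b Includes Crawling: ^smb://mysrv/data/[Bb]_.*$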

Currently it looks as if only one of the two definitions takes effect: the jobs are running, but they do not show any crawled or indexed documents.

Or do you have any other suggestion for how to crawl large file servers?

Thanks!

I think it's better to increase the heap size (-Xmx) for the crawler in fess_config.properties.

Any recommendation for setting the Xms/Xmx values?
Currently I use the following for the crawler:

JVM options

-Xms16g\n\
-Xmx32g\n\
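
For completeness, the relevant part of our fess_config.properties looks roughly like this (jvm.crawler.options as in the default config; the remaining shipped options are left unchanged):

jvm.crawler.options=\
-Xms16g\n\
-Xmx32g\n\
... (remaining default options unchanged)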

We have now moved to a server with 340+ GB of memory, so memory should not be the problem.

Thanks

It depends on your crawling requirements.

Currently we are indexing only files with these extensions:
pdf|xlsx|xlsm|xls|pptx|pptm|ppt|docx|docm|doc|rtf|vsd|odt|csv|txt

We would like to crawl a large SMB-based file server with approximately 1,300 subdirectories and 16 TB of data.
The whole crawling process can take some time; we don't need daily up-to-date results. It would be fine if the results are refreshed every 3 or 4 days.
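
If splitting into several jobs is the way to go, we were thinking of something along these lines, e.g. bucketing the top-level directories by first letter across a few jobs (patterns are only a sketch and untested):

Job 1 Includes Crawling: ^smb://mysrv/data/[A-Ha-h].*$
Job 2 Includes Crawling: ^smb://mysrv/data/[I-Pi-p].*$
Job 3 Includes Crawling: ^smb://mysrv/data/[Q-Zq-z0-9].*$

All jobs would keep the same Path (smb://mysrv/data/) and the same Excludes/Includes Indexing patterns as above.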