Proper crawler setup

(from github.com/abolotnov)
If I want to crawl only *.example.com (both http and https) and want to exclude images, js files, css files - what should my crawler setup look like?

I tried many combinations, but I either get external sites crawled, or only http and not https. It also looks like the excluded URLs are overridden by the included ones: I managed to keep the crawler on the domain, more or less, but I can’t make it ignore the unwanted files.

My setup looks like this:

Thank you!

(from github.com/marevol)
Did you check fess-crawler.log?

(from github.com/abolotnov)
What should I be looking for? I think the config works as expected, but I can’t figure out how to:

  • Limit the crawler to *.example.com - right now, based on my config, it only collects www.example.com
  • Exclude the unwanted file types - these files still get indexed

(from github.com/marevol, Shinsuke Sugaya)

> Exclude the unwanted file types - these files still get indexed

1 line is 1 regex.
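As a concrete illustration of the one-regex-per-line rule, the exclusion field could look like the sketch below. The original config was not shown, so the exact field name and the set of extensions are assumptions; adjust to whatever file types you actually want to skip:

```
Excluded URLs For Crawling:
.*\.js$
.*\.css$
.*\.png$
.*\.jpe?g$
.*\.gif$
```

Note that these patterns match against the full URL, so a `$`-anchored pattern like `.*\.css$` will not catch a URL with a query string (e.g. `style.css?v=2`); a looser variant such as `.*\.css.*` covers that case at the cost of some false positives.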

(from github.com/abolotnov)
I will break them into multiple lines. Is there a way to limit crawling to *.example.com so www.example.com and more.example.com both get indexed?

(from github.com/marevol)
https?://.*\.example\.com/.*
It’s a Java regex.
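Since the pattern is a plain Java regex, it can be sanity-checked with `java.util.regex` before putting it into the crawler config. The URLs below are made-up examples, not taken from the thread:

```java
import java.util.regex.Pattern;

public class CrawlPatternCheck {
    public static void main(String[] args) {
        // The suggested pattern: any subdomain of example.com, http or https.
        Pattern included = Pattern.compile("https?://.*\\.example\\.com/.*");

        // URLs that should be crawled
        System.out.println(included.matcher("http://www.example.com/index.html").matches());   // true
        System.out.println(included.matcher("https://more.example.com/docs/page").matches());  // true

        // URLs that should be rejected
        System.out.println(included.matcher("https://example.org/index.html").matches());      // false
        // The bare domain (no subdomain) does NOT match this pattern:
        System.out.println(included.matcher("https://example.com/index.html").matches());      // false
    }
}
```

One caveat visible in the last check: `https?://.*\.example\.com/.*` requires a subdomain, so `example.com` itself is excluded. If the bare domain should be crawled too, a variant like `https?://(.*\.)?example\.com/.*` makes the subdomain optional.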