Proper crawler setup

(from github.com/abolotnov)
If I want to crawl only *.example.com (both http and https) and want to exclude images, js files, css files - what should my crawler setup look like?

I tried many combinations, but I either get external sites crawled, or only http and not https. It also looks like the excluded URLs are overridden by the included ones: I managed to keep the crawler on the domain, more or less, but I can’t make it ignore the unwanted files.

My setup looks like this:

Thank you!

(from github.com/marevol)
Did you check fess-crawler.log?

(from github.com/abolotnov)
What should I be looking for? I think the config works as expected, but I can’t figure out how to:

  • Limit the crawler to *.example.com - right now, based on my config, it only collects www.example.com
  • Exclude the unwanted file types - these files still get indexed

(from github.com/marevol, Shinsuke Sugaya)

> Exclude the unwanted file types - these files still get indexed

1 line is 1 regex.
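As a concrete illustration of the one-regex-per-line rule, the exclusion field could look like the sketch below. The original config was not shown, so the exact field name and the set of extensions are assumptions; adjust to whatever file types you actually want to skip:

```
Excluded URLs For Crawling:
.*\.js$
.*\.css$
.*\.png$
.*\.jpe?g$
.*\.gif$
```

Note that these patterns match against the full URL, so a `$`-anchored pattern like `.*\.css$` will not catch a URL with a query string (e.g. `style.css?v=2`); a looser variant such as `.*\.css.*` covers that case at the cost of some false positives.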

(from github.com/abolotnov)
I will break them into multiple lines. Is there a way to limit crawling to *.example.com so www.example.com and more.example.com both get indexed?

(from github.com/marevol)
https?://.*\.example\.com/.*
It’s a Java regex.
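Since the pattern is a plain Java regex, it can be sanity-checked with `java.util.regex` before putting it into the crawler config. The URLs below are made-up examples, not taken from the thread:

```java
import java.util.regex.Pattern;

public class CrawlPatternCheck {
    public static void main(String[] args) {
        // The suggested pattern: any subdomain of example.com, http or https.
        Pattern included = Pattern.compile("https?://.*\\.example\\.com/.*");

        // URLs that should be crawled
        System.out.println(included.matcher("http://www.example.com/index.html").matches());   // true
        System.out.println(included.matcher("https://more.example.com/docs/page").matches());  // true

        // URLs that should be rejected
        System.out.println(included.matcher("https://example.org/index.html").matches());      // false
        // The bare domain (no subdomain) does NOT match this pattern:
        System.out.println(included.matcher("https://example.com/index.html").matches());      // false
    }
}
```

One caveat visible in the last check: `https?://.*\.example\.com/.*` requires a subdomain, so `example.com` itself is excluded. If the bare domain should be crawled too, a variant like `https?://(.*\.)?example\.com/.*` makes the subdomain optional.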