Include rules take precedence over Exclude rules.

(from github.com/jdeathe)
Steps to reproduce:

  1. Set up a Web Crawling Configuration with the following rules:
  • URLs:
    http://fess.codelibs.org/
    
  • Included URLs For Crawling:
    http://fess.codelibs.org/.*
    
  • Excluded URLs For Crawling:
    NULL
    
  • Included URLs For Indexing:
    http://fess.codelibs.org/.*
    
  • Excluded URLs For Indexing:
    http://fess.codelibs.org/dev/.*
    http://fess.codelibs.org/apidocs/.*
    
  2. Start the Crawl.

  3. After results are generated, search for “api”.

  4. Results include content from fess.codelibs.org/apidocs/overview-summary.html.

Expected outcome:

  1. The “Included URLs For Indexing” rule should limit the indexed results to content hosted on http://fess.codelibs.org/.

  2. The “Excluded URLs For Indexing” rule should prevent the /apidocs and /dev content from being indexed.

Actual outcome:

The outcome of point 1 is as expected: results are limited to the expected host domain. However, the include rule appears to override point 2, with results for /apidocs and /dev being included in the index.

Please advise whether this is a bug or an issue with the regular expression rules in use.
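
As a sanity check (a standalone java.util.regex snippet, not Fess itself, with an illustrative class name), the URL reported in the reproduction steps matches both the “Included URLs For Indexing” pattern and the “Excluded URLs For Indexing” pattern, which suggests the regular expressions themselves are not at fault:

    import java.util.regex.Pattern;

    public class IndexingRuleCheck {
        public static void main(String[] args) {
            // The URL reported in the reproduction steps and the patterns from the configuration above.
            String url = "http://fess.codelibs.org/apidocs/overview-summary.html";
            Pattern include = Pattern.compile("http://fess.codelibs.org/.*");
            Pattern exclude = Pattern.compile("http://fess.codelibs.org/apidocs/.*");

            // Both lines print "true".
            System.out.println("matches include: " + include.matcher(url).matches());
            System.out.println("matches exclude: " + exclude.matcher(url).matches());
        }
    }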

(from github.com/marevol)
The processing order is “Excluded URLs For Indexing” -> “Included URLs For Indexing”.
So, try setting “Included URLs For Indexing” to empty.
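
For illustration only, here is a minimal sketch of that precedence (hypothetical code, not the actual Fess implementation), assuming a non-empty include list overrides the exclude check while an empty include list leaves the exclude check in effect:

    import java.util.List;
    import java.util.regex.Pattern;

    class IndexFilterSketch {
        // Returns true if the URL would be indexed under the precedence described above.
        static boolean shouldIndex(String url, List<Pattern> includes, List<Pattern> excludes) {
            // The exclude patterns are evaluated first.
            boolean excluded = excludes.stream().anyMatch(p -> p.matcher(url).matches());
            // The include patterns are evaluated afterwards; when the include list is
            // non-empty, a matching URL is kept even if it was excluded above.
            if (!includes.isEmpty()) {
                return includes.stream().anyMatch(p -> p.matcher(url).matches());
            }
            return !excluded;
        }
    }

Under this sketch, http://fess.codelibs.org/apidocs/overview-summary.html is indexed while the include pattern http://fess.codelibs.org/.* is present, and is dropped once “Included URLs For Indexing” is left empty, which matches both the reported behaviour and the workaround.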

(from github.com/jdeathe)
@marevol Thanks for confirming that. I have used that method and can confirm that the exclude rules are processed as expected after removing the include rule.