(from github.com/jdeathe)
Steps to reproduce:
- Set up a Web Crawling Configuration with the following rules:
  - URLs:
    http://fess.codelibs.org/
  - Included URLs For Crawling:
    http://fess.codelibs.org/.*
  - Excluded URLs For Crawling:
    NULL
  - Included URLs For Indexing:
    http://fess.codelibs.org/.*
  - Excluded URLs For Indexing:
    http://fess.codelibs.org/dev/.* http://fess.codelibs.org/apidocs/.*
- Start the Crawl.
- After results are generated, search for "api".
- Results include content from fess.codelibs.org/apidocs/overview-summary.html.
Expected outcome:
- The “Included URLs For Indexing” rule should limit the indexed results to content hosted on http://fess.codelibs.org/.
- The “Excluded URLs For Indexing” rule should prevent the /apidocs and /dev content from being indexed.
Actual outcome:
The first expected outcome holds: results are limited to the expected host domain. However, the include rule appears to override the exclude rule from the second point, because results for /apidocs and /dev are still included in the index.
Please advise whether this is a bug or an issue with the regular expression rules in use.
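
To help separate a regex problem from a crawler problem, the indexing rules can be checked in isolation. The following is a minimal Python sketch, not Fess's actual implementation. It assumes that each pattern must match the whole URL, that exclude rules take precedence over include rules, and that every pattern is a separate entry. The function name `should_index` and the `/dev/index.html` URL are hypothetical and used only for illustration.

```python
import re

# Patterns copied from the configuration in this report, one regex per entry.
INCLUDE_FOR_INDEXING = [r"http://fess.codelibs.org/.*"]
EXCLUDE_FOR_INDEXING = [
    r"http://fess.codelibs.org/dev/.*",
    r"http://fess.codelibs.org/apidocs/.*",
]


def should_index(url, includes, excludes):
    """Hypothetical filter: a URL is indexed only if no exclude rule matches
    and at least one include rule matches (assumed semantics, not Fess's)."""
    if any(re.fullmatch(pattern, url) for pattern in excludes):
        return False
    # An empty include list is treated here as "include everything".
    return not includes or any(re.fullmatch(pattern, url) for pattern in includes)


urls = [
    "http://fess.codelibs.org/",
    "http://fess.codelibs.org/apidocs/overview-summary.html",
    "http://fess.codelibs.org/dev/index.html",  # hypothetical example URL
]
for url in urls:
    print(url, should_index(url, INCLUDE_FOR_INDEXING, EXCLUDE_FOR_INDEXING))

# If both exclude patterns were read as one regex (joined by a space),
# that single pattern would not match either path.
joined = [" ".join(EXCLUDE_FOR_INDEXING)]
print(should_index(urls[1], INCLUDE_FOR_INDEXING, joined))
```

Under these assumptions the patterns themselves exclude the /apidocs and /dev URLs when they are evaluated as separate entries, while the joined form excludes neither. If Fess evaluates them differently, the result may not carry over.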