Include rules take precedence over Exclude rules.

(from github.com/jdeathe)
Steps to reproduce:

  1. Set up a Web Crawling Configuration with the following rules:
  • URLs:
    http://fess.codelibs.org/
    
  • Included URLs For Crawling:
    http://fess.codelibs.org/.*
    
  • Excluded URLs For Crawling:
    NULL
    
  • Included URLs For Indexing:
    http://fess.codelibs.org/.*
    
  • Excluded URLs For Indexing:
    http://fess.codelibs.org/dev/.*
    http://fess.codelibs.org/apidocs/.*
    
  2. Start the Crawl.

  3. After results are generated, search for “api”.

  4. Results include content from fess.codelibs.org/apidocs/overview-summary.html.

Expected outcome:

  1. The “Included URLs For Indexing” rule should limit the indexed results to content hosted on http://fess.codelibs.org/.

  2. The “Excluded URLs For Indexing” rule should prevent the /apidocs and /dev content from being indexed.

Actual outcome:

The outcome of point 1 is as expected: results are limited to the expected host domain. However, the include rule appears to override point 2, with results for /apidocs and /dev being included in the index.

Please advise whether this is a bug or an issue with the regular expression rules in use.
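
As a sanity check (a standalone java.util.regex snippet, not Fess itself, with an illustrative class name), the URL reported in the reproduction steps matches both the “Included URLs For Indexing” pattern and the “Excluded URLs For Indexing” pattern, which suggests the regular expressions themselves are not at fault:

    import java.util.regex.Pattern;

    public class IndexingRuleCheck {
        public static void main(String[] args) {
            // The URL reported in the reproduction steps and the patterns from the configuration above.
            String url = "http://fess.codelibs.org/apidocs/overview-summary.html";
            Pattern include = Pattern.compile("http://fess.codelibs.org/.*");
            Pattern exclude = Pattern.compile("http://fess.codelibs.org/apidocs/.*");

            // Both lines print "true".
            System.out.println("matches include: " + include.matcher(url).matches());
            System.out.println("matches exclude: " + exclude.matcher(url).matches());
        }
    }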

(from github.com/marevol)
The processing order is “Excluded URLs For Indexing” -> “Included URLs For Indexing”.
So, try setting “Included URLs For Indexing” to empty.
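
For illustration only, here is a minimal sketch of that precedence (hypothetical code, not the actual Fess implementation), assuming a non-empty include list overrides the exclude check while an empty include list leaves the exclude check in effect:

    import java.util.List;
    import java.util.regex.Pattern;

    class IndexFilterSketch {
        // Returns true if the URL would be indexed under the precedence described above.
        static boolean shouldIndex(String url, List<Pattern> includes, List<Pattern> excludes) {
            // The exclude patterns are evaluated first.
            boolean excluded = excludes.stream().anyMatch(p -> p.matcher(url).matches());
            // The include patterns are evaluated afterwards; when the include list is
            // non-empty, a matching URL is kept even if it was excluded above.
            if (!includes.isEmpty()) {
                return includes.stream().anyMatch(p -> p.matcher(url).matches());
            }
            return !excluded;
        }
    }

Under this sketch, http://fess.codelibs.org/apidocs/overview-summary.html is indexed while the include pattern http://fess.codelibs.org/.* is present, and is dropped once “Included URLs For Indexing” is left empty, which matches both the reported behaviour and the workaround.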

(from github.com/jdeathe)
@marevol Thanks for confirming that. I have used that method and can confirm that the exclude rules are processed as expected after removing the include rule.