How to skip a specific page?

discuss · March 21, 2017, 9:05am

(from github.com/pcolmer)
We have a page on our wiki that Fess cannot index because the page is too long. I’d like to configure the crawler to ignore that page.

The URL of the page is https://wiki.linaro.org/WordIndex

In the web crawling configuration, I’ve got the following:

URLs:
https://wiki.linaro.org/

Included URLs for crawling:
https://wiki.linaro.org/.*

Excluded URLs for crawling:
.*/.*\?.*
.*/.*\.png
.*/.*\.jpg
.*/.*\.gif
.*/.*\.ico
.*/.*\.css
.*/.*\.js
WordIndex

but the crawler is still trying to access that page. Do I need to add it to “Excluded URLs for indexing” as well?

Or have I got the syntax wrong?

Thanks.

Philip

discuss · March 21, 2017, 1:08pm

(from marevol (Shinsuke Sugaya) · GitHub)

Try removing the “Included URLs for crawling” setting.
When both included and excluded URLs are specified, the included URLs win.

discuss · March 21, 2017, 1:47pm

(from github.com/pcolmer)
If I remove that setting, does Fess default to crawling the base URL anyway?

discuss · March 21, 2017, 2:17pm

(from github.com/marevol)
Oops, for crawling, Excluded URLs win.
I think your settings should be:

URLs:
https://wiki.linaro.org/

Included URLs for crawling:
https://wiki.linaro.org/.*

Excluded URLs for crawling:
.*\?.*
.*\.png
.*\.jpg
.*\.gif
.*\.ico
.*\.css
.*\.js

Excluded URLs for indexing: (if you want to crawl https://wiki.linaro.org/WordIndex)
.*/WordIndex.*
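
To illustrate why the bare WordIndex entry had no effect while .*/WordIndex.* does, here is a rough Python sketch of the filtering logic. It rests on two assumptions that are not confirmed in this thread: that Fess tests each pattern against the whole URL (Python's re.fullmatch stands in for that here), and that an excluded match overrides an included one, as stated above. The names matches_any and is_allowed are hypothetical and not part of Fess.

```python
import re

# Patterns taken from the settings above.
INCLUDED = [r"https://wiki.linaro.org/.*"]
EXCLUDED = [r".*\?.*", r".*\.png", r".*\.jpg", r".*\.gif",
            r".*\.ico", r".*\.css", r".*\.js"]


def matches_any(url, patterns):
    # Assumption: a pattern must match the entire URL, not just part of it.
    return any(re.fullmatch(p, url) for p in patterns)


def is_allowed(url, included, excluded):
    # Assumption: excluded patterns take precedence over included ones.
    if matches_any(url, excluded):
        return False
    # An empty include list is treated as "allow everything" in this sketch.
    return not included or matches_any(url, included)


page = "https://wiki.linaro.org/WordIndex"

# Bare "WordIndex" does not cover the whole URL, so the page is still allowed.
print(is_allowed(page, INCLUDED, EXCLUDED + [r"WordIndex"]))

# ".*/WordIndex.*" covers the whole URL, so the page is filtered out.
print(is_allowed(page, INCLUDED, EXCLUDED + [r".*/WordIndex.*"]))
```

Under those assumptions the first call reports the page as allowed and the second reports it as filtered, which is consistent with the crawler still visiting the page when only WordIndex was listed.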