Exclude url from indexing only

discuss · July 8, 2019, 2:03pm

(from github.com/shwetazilpe)
Hi,
I need your advice on how to debug this issue further, or on how to crawl a particular URL while excluding it from indexing.

I created a web crawler with the following configuration:

URLs: 
https://example.com/
https://example.com/alltutoriallist

Included URLs For Crawling : 
https://example.com/.*

Excluded URLs For Crawling : 
(?i).*.(css|css?.*js|jpeg|jpg|gif|png|bmp|wmv|exe|mp4|pdf|doc|docx|ppt|pptx|xls|xlsx)$

Included URLs For Indexing : 
https://example.com/.*

Excluded URLs For Indexing : 
(?i).*.(css|css?.*js|jpeg|jpg|gif|png|bmp|wmv|exe|mp4|pdf|doc|docx|ppt|pptx|xls|xlsx)$
https://example.com/alltutoriallist

https://example.com/alltutoriallist has links to the tutorials.
I want to crawl all the URLs under the domain "example.com", including https://example.com/alltutoriallist, but I do not want https://example.com/alltutoriallist itself to be indexed in Elasticsearch, that is, it should not appear in the Fess search results. I only need the URLs that are linked from /alltutoriallist, yet this URL also shows up in the search results. How can I exclude it?

Any particular configuration I should check?

Thanks in advance,
Shweta

discuss · July 8, 2019, 8:53pm

(from github.com/marevol)

URLs: 
https://example.com/alltutoriallist

Included URLs For Crawling : 
https://example.com/.*

Excluded URLs For Crawling : 
(?i).*(css|css?.js|jpeg|jpg|gif|png|bmp|wmv|exe|mp4|pdf|doc|docx|ppt|pptx|xls|xlsx)$

Included URLs For Indexing : 

Excluded URLs For Indexing : 
https://example.com/alltutoriallist

Try the settings above, and then check fess-crawler.log.
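
The intended effect of these settings can be sketched in Python. This is only an illustration, not Fess's actual implementation: it assumes that each pattern must match the whole URL, that an exclude match always wins, and that an empty include list admits everything. The function names `allowed` and `classify` and the sample URLs other than /alltutoriallist are hypothetical.

```python
import re

# Patterns copied from the settings above. The matching rules are assumptions.
INCLUDE_CRAWL = [r"https://example.com/.*"]
EXCLUDE_CRAWL = [
    r"(?i).*(css|css?.js|jpeg|jpg|gif|png|bmp|wmv|exe|mp4|pdf"
    r"|doc|docx|ppt|pptx|xls|xlsx)$"
]
INCLUDE_INDEX = []  # left empty, as in the settings above
EXCLUDE_INDEX = [r"https://example.com/alltutoriallist"]


def allowed(url, includes, excludes):
    """Hypothetical filter: reject on any exclude match, then apply includes."""
    if any(re.fullmatch(p, url) for p in excludes):
        return False
    # Assumption: an empty include list means "no restriction".
    return not includes or any(re.fullmatch(p, url) for p in includes)


def classify(url):
    """Return (crawled, indexed) for a URL under the assumed rules."""
    crawled = allowed(url, INCLUDE_CRAWL, EXCLUDE_CRAWL)
    # A page that is never fetched cannot be indexed.
    indexed = crawled and allowed(url, INCLUDE_INDEX, EXCLUDE_INDEX)
    return crawled, indexed


if __name__ == "__main__":
    for url in (
        "https://example.com/alltutoriallist",
        "https://example.com/tutorials/intro",  # hypothetical tutorial page
        "https://example.com/logo.png",         # hypothetical image
    ):
        print(url, classify(url))
```

Under these assumptions the listing page is fetched, so its links are followed, but it is kept out of the index, while the tutorial pages it links to are both crawled and indexed.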

discuss · July 9, 2019, 11:02am

(from github.com/shwetazilpe)
It worked. Thank you, @marevol.