(from github.com/shwetazilpe)
Hi,
I need your expert advice on how to debug this issue further, or on how to crawl a particular URL while excluding it from indexing.
I created a web crawler with the following configuration:
URLs: https://example.com/
https://example.com/alltutoriallist
Included URLs For Crawling : https://example.com/.*
Excluded URLs For Crawling : (?i)..(css|css?.js|jpeg|jpg|gif|png|bmp|wmv|exe|mp4|pdf|doc|docx|ppt|pptx|xls|xlsx)$
Included URLs For Indexing : https://example.com/.*
Excluded URLs For Indexing : (?i)..(css|css?.*js|jpeg|jpg|gif|png|bmp|wmv|exe|mp4|pdf|doc|docx|ppt|pptx|xls|xlsx)$
https://example.com/alltutoriallist
https://example.com/alltutoriallist has links to the tutorials.
I want to crawl all the URLs under the domain “example.com”, including https://example.com/alltutoriallist, but I do not want https://example.com/alltutoriallist itself to be indexed in Elasticsearch, that is, I do not want it to appear in the Fess search results. I only need the URLs listed on /alltutoriallist, but this URL is also showing up in the search results. How can I exclude it?
Is there any particular configuration I should check?
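For debugging, here is a minimal sketch of how the indexing filters above could be checked offline. It rests on assumptions that are not confirmed here: that each pattern must match the whole URL, and that an exclude match wins over an include match. The function name is_indexed and the URL https://example.com/tutorial/1 are hypothetical, and the file-extension pattern is left out for brevity.

```python
import re

# Patterns taken from the "For Indexing" settings above
# (the file-extension exclude rule is omitted for brevity).
INCLUDED_FOR_INDEXING = [r"https://example.com/.*"]
EXCLUDED_FOR_INDEXING = [r"https://example.com/alltutoriallist"]


def is_indexed(url, included=INCLUDED_FOR_INDEXING, excluded=EXCLUDED_FOR_INDEXING):
    """Return True if the URL would be indexed under the assumed rules.

    Assumptions (not confirmed): a pattern must match the whole URL,
    and an exclude match takes precedence over an include match.
    """
    if any(re.fullmatch(pattern, url) for pattern in excluded):
        return False
    return any(re.fullmatch(pattern, url) for pattern in included)


if __name__ == "__main__":
    for candidate in (
        "https://example.com/alltutoriallist",
        "https://example.com/alltutoriallist/",
        "https://example.com/tutorial/1",  # hypothetical tutorial URL
    ):
        print(candidate, "->", "indexed" if is_indexed(candidate) else "excluded")
```

Under these assumptions the exclude pattern only covers the exact URL, so a variant such as https://example.com/alltutoriallist/ (with a trailing slash or a query string) would still be indexed unless the pattern ends in something like .* as well.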
Thanks in advance,
Shweta