(from github.com/micakovic)
When creating a new crawler, it is relatively easy to exclude certain paths from crawling and indexing. For example, for http://example.com, not crawling or indexing http://example.com/private/ can be achieved by adding the following to both Excluded URLs For Crawling and Excluded URLs For Indexing:
/private/
Or possibly:
/private/.*
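To make the intent concrete, here is a minimal sketch of how I understand these exclusion fields to behave — assuming each entry is treated as a regular expression tested against the URL path with an unanchored search (the product's actual matching engine may differ):

```python
import re

# Hypothetical illustration, not the product's actual matching engine:
# assume each exclusion entry is a regex searched against the URL path.
EXCLUDES = ["/private/", "/private/.*"]

def is_excluded(url: str) -> bool:
    path = re.sub(r"^https?://[^/]+", "", url)  # strip scheme and host
    return any(re.search(p, path) for p in EXCLUDES)

print(is_excluded("http://example.com/private/report.html"))  # True
print(is_excluded("http://example.com/public/index.html"))    # False
```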
However, things get much more confusing when trying to apply more complex inclusions or exclusions and match them to labels.
Let’s assume that I have the following web crawler:
URLs: http://example.com
Excluded URLs For Crawling:
http://example.com/private/.*
That’s fine. Now, let’s assume that I want to offer options in the search to limit search results to certain scopes when clicking on labels.
Let’s say that I want a label called ‘Actors’, and that clicking on that label should only display results for paths which are:
http://example.com/actors
http://example.com/actors/.*
http://example.com/.*/actors
http://example.com/.*/actors/.*
So, if someone clicks on the label ‘Actors’, they would see results for http://example.com/films/actors/123, http://example.com/series/actors/4546, but not for the rest of the site.
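For what it’s worth, the four path shapes above collapse into a single expression. Here is a quick sanity check in Python — a sketch only, since the crawler’s regex dialect and anchoring behaviour may differ from Python’s:

```python
import re

# One pattern covering all four 'Actors' shapes:
#   /actors, /actors/.*, /.*/actors, /.*/actors/.*
ACTORS = re.compile(r"^http://example\.com(?:/.*)?/actors(?:/.*)?$")

should_match = [
    "http://example.com/actors",
    "http://example.com/actors/123",
    "http://example.com/films/actors/123",
    "http://example.com/series/actors/4546",
]
should_not_match = [
    "http://example.com/films/123",
    "http://example.com/actorsfoo",  # hypothetical near-miss
]

print([bool(ACTORS.match(u)) for u in should_match])      # all True
print([bool(ACTORS.match(u)) for u in should_not_match])  # all False
```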
Included URLs for crawling and indexing can be specified both in the crawler and in the label. That is the first confusing bit; it is also not clear from the documentation which one does what.
So, in this case, should I have more than one crawler? One would crawl and index the entire site, the other would crawl and index only the included paths? Or should I have one crawler, and assign various labels with included and excluded paths to it?
I have attempted both solutions, and neither produces accurate results.
In addition to the /.*/ patterns above, I also attempted these, with equally inaccurate results:
.+/actors/.+$
.+/actors/$
^/actors/.+$
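If those patterns are being tested against full URLs rather than bare paths, their anchoring alone explains some of the misses. A quick check under that assumption (the product may anchor differently):

```python
import re

urls = [
    "http://example.com/actors",
    "http://example.com/actors/123",
    "http://example.com/films/actors",
    "http://example.com/films/actors/123",
]

# Assuming full-match semantics against the complete URL:
#   .+/actors/.+$  misses /actors and /films/actors (no trailing segment)
#   .+/actors/$    matches nothing here (no URL ends in /actors/)
#   ^/actors/.+$   matches nothing (a full URL never starts with /actors/)
for pattern in [r".+/actors/.+$", r".+/actors/$", r"^/actors/.+$"]:
    hits = [u for u in urls if re.fullmatch(pattern, u)]
    print(pattern, "->", hits)
```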
Negative lookahead rules in excluded paths also produce inaccurate results:
http://www.example.com/.+(?!actors/).+$
http://www.example.com/.+(?!actors/)$
http://www.example.com/(?!actors/).+$
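One likely reason the lookahead versions misbehave: `(?!actors/)` only guards the position immediately after the domain, so deeper actors paths still slip through the exclusion. A sketch assuming Python-style regex semantics (the crawler’s engine may not support lookaheads at all, or may anchor differently):

```python
import re

urls = [
    "http://www.example.com/films/123",         # non-actors: exclusion should match
    "http://www.example.com/films/actors/123",  # actors: should NOT match, stays in scope
]

# The attempted rule: the lookahead only inspects the text right after
# the domain, so '/films/actors/123' still matches and leaks through.
attempted = re.compile(r"http://www\.example\.com/(?!actors/).+$")
print([bool(attempted.match(u)) for u in urls])  # [True, True] -- leak

# A possible fix: let the lookahead scan the whole remainder of the URL
# for an 'actors' path segment.
fixed = re.compile(r"http://www\.example\.com/(?!.*actors(?:/|$)).+$")
print([bool(fixed.match(u)) for u in urls])      # [True, False]
```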
I have tried full URLs and paths, in crawlers alone, in labels alone, and in both crawlers and labels, and I always get false results: either URLs showing up which ought not to appear when filtered, or items not being found at all where they ought to be found.
Any ideas on how to tackle this scenario?