Included and excluded paths and labels

When creating a new crawler, it is relatively easy to exclude certain paths from crawling and indexing. For example, telling the crawler not to crawl or index a given path could be achieved by adding the following to Excluded URLs For Crawling and Excluded URLs For Indexing:*

Or possibly:


However, things get more complicated and much more confusing when trying to do more complex inclusions or exclusions and match them to labels.

Let’s assume that I have the following web crawler:

Excluded URLs For Crawling:
Excluded URLs For Crawling:*

That’s fine. Now, let’s assume that I want to offer options in the search to limit search results to certain scopes when clicking on labels.

Let’s say that I want a label called ‘Actors’, and that clicking on it should only display results for the following paths:**/actors*/actors/.*

So, if someone clicks on the label ‘Actors’, they would see results for,, but not for the rest of the site.
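To make the intended behaviour concrete, here is a minimal sketch of what I would expect the ‘Actors’ label to match. It is written with Python's `re` module rather than anything from the search engine itself, and it assumes that a pattern has to match the whole URL; I do not know whether that is how the label paths are actually evaluated. The host `https://example.com` and the helper `in_label` are placeholders for illustration only.

```python
import re

# Hypothetical site root; the real URLs are not part of this post.
BASE = "https://example.com"

# Assumed label patterns: the /actors page itself and everything below it.
ACTORS_INCLUDED = [
    re.escape(BASE) + r"/actors",
    re.escape(BASE) + r"/actors/.*",
]


def in_label(url, included_patterns):
    """Return True if the URL fully matches at least one included pattern.

    Full-match semantics are an assumption, not confirmed behaviour.
    """
    return any(re.fullmatch(p, url) for p in included_patterns)


if __name__ == "__main__":
    for path in ("/actors", "/actors/some-name", "/directors/some-name"):
        print(path, in_label(BASE + path, ACTORS_INCLUDED))
```

Under that assumption, only the first two paths belong to the label, and that is the filtering I am trying to reproduce.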

Included URLs for crawling and indexing can be specified both in the crawler and in the label. That is the first confusing bit. It is also not very clear from the documentation which one does what.

So, in this case, should I have more than one crawler? One would crawl and index the entire site, the other would crawl and index only the included paths? Or should I have one crawler, and assign various labels with included and excluded paths to it?

I have attempted both solutions, and neither produces accurate results.

In addition to /.*/, I also attempted these with equally inaccurate results.


Negative lookahead rules in excluded paths also produce inaccurate results.!actors/).+$!actors/)$!actors/).+$
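For reference, this is roughly what I expect a negative-lookahead exclusion to do, again sketched with Python's `re` and not with the crawler itself. The host, the pattern, and the helper `is_excluded` are illustrative assumptions and are not the exact rules I entered. Lookahead syntax is similar in Java regular expressions, but whether the pattern is applied to the full URL or only to part of it is something I am guessing at here.

```python
import re

# Hypothetical site root; the real URLs are not part of this post.
BASE = "https://example.com"

# Exclude every URL on the site that is NOT /actors or below /actors/.
# (?!actors(?:/|$)) fails the match when the path starts with "actors"
# followed by either "/" or the end of the URL.
EXCLUDE_NON_ACTORS = re.compile(re.escape(BASE) + r"/(?!actors(?:/|$)).*")


def is_excluded(url):
    """Return True if the whole URL matches the exclusion pattern."""
    return EXCLUDE_NON_ACTORS.fullmatch(url) is not None


if __name__ == "__main__":
    for path in ("/actors", "/actors/some-name", "/directors/some-name"):
        print(path, is_excluded(BASE + path))
```

With full-URL matching, only the `/directors/...` URL is excluded here. If the pattern is matched against something other than the full URL, the lookahead would be tested at a different position, which might explain the inaccurate results.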

I have tried URLs and paths, in crawlers alone, in labels alone, and in both crawlers and labels, and I always get wrong results: either URLs showing up that ought not to when filtered, or items not being found at all where they ought to be found.

Any ideas on how to tackle this scenario?


but not for the rest of the site.

Could you disable “Check Last Modified” and then start the crawler again?

should I have more than one crawler?

It depends on your requirements, but in your case, I think a single crawling config is better.

If you want to crawl/index with the Actors label, I think the config would be as below:

Web Config:
Included URLs For Crawling:*
Excluded URLs For Crawling:*
Labels: (Not selected)

Included Paths:**/actors*/actors/.*

This trick does not work, I’m afraid. A single crawler with multiple labels like the one described above returns no results when the expected result sits somewhere in the label domain.

See #1228