Included and excluded paths and labels

discuss · April 12, 2017, 9:32am

(from github.com/micakovic)
When creating a new crawler, it is relatively easy to exclude certain paths from crawling and indexing. For example, to keep http://example.com/private/ from being crawled or indexed when crawling http://example.com, add the following to both Excluded URLs For Crawling and Excluded URLs For Indexing:

http://example.com/private/.*

Or possibly:

/private/.*
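Whether the shorter form works depends on whether the pattern is tested against the whole URL or searched for inside it, which this thread does not settle. A minimal Python sketch of the difference, assuming whole-string matching behaves like `re.fullmatch` and substring matching like `re.search` (the sample URL is hypothetical):

```python
import re

# Hypothetical sample URL, used only for illustration.
url = "http://example.com/private/report.html"

absolute = r"http://example.com/private/.*"
path_only = r"/private/.*"

# Whole-string matching: the pattern must cover the URL from start to end.
full_absolute = re.fullmatch(absolute, url) is not None
full_path_only = re.fullmatch(path_only, url) is not None

# Substring matching: the pattern may match anywhere inside the URL.
search_path_only = re.search(path_only, url) is not None

print(full_absolute, full_path_only, search_path_only)
```

If the crawler matches whole URLs, only the absolute form excludes the page; the path-only form excludes it only under substring matching.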

However, things get more complicated and much more confusing when trying to do more complex inclusions or exclusions and match them to labels.

Let’s assume that I have the following web crawler:

URLs: http://example.com
Excluded URLs For Crawling:
    http://example.com/private/.*

That’s fine. Now, let’s assume that I want to offer options in the search to limit search results to certain scopes when clicking on labels.

Let’s say that I want to have a label which is called ‘Actors’ and that clicking on that label should only display results for paths which are:

http://example.com/actors
http://example.com/actors/.*
http://example.com/.*/actors
http://example.com/.*/actors/.*

So, if someone clicks on the label ‘Actors’, they would see results for http://example.com/films/actors/123, http://example.com/series/actors/4546, but not for the rest of the site.
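To make the intended scope concrete, here is a small Python sketch that checks URLs against the four patterns above. It assumes each pattern must match the whole URL (as `re.fullmatch` does); how Fess itself applies label paths is not confirmed here, and `has_label` is a hypothetical helper, not part of Fess.

```python
import re

# The four label patterns from the post, unchanged.
LABEL_PATTERNS = [
    r"http://example.com/actors",
    r"http://example.com/actors/.*",
    r"http://example.com/.*/actors",
    r"http://example.com/.*/actors/.*",
]


def has_label(url, patterns=LABEL_PATTERNS):
    """Return True if the URL fully matches at least one label pattern."""
    return any(re.fullmatch(p, url) for p in patterns)


for candidate in [
    "http://example.com/films/actors/123",
    "http://example.com/series/actors/4546",
    "http://example.com/films/directors/7",  # hypothetical non-actor page
]:
    print(candidate, has_label(candidate))
```

Under that assumption the two actor URLs from the example are selected and the hypothetical directors page is not, so the patterns themselves express the intended scope.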

Included URLs for crawling and indexing can be specified both in the crawler and in the label. That is the first confusing bit, and the documentation does not make clear which setting does what.

So, in this case, should I have more than one crawler? One would crawl and index the entire site, the other would crawl and index only the included paths? Or should I have one crawler, and assign various labels with included and excluded paths to it?

I have attempted both solutions, and neither produces accurate results.

In addition to /.*/, I also tried the following patterns, with equally inaccurate results:

.+/actors/.+$
.+/actors/$
^/actors/.+$

Negative lookahead rules in excluded paths also produce inaccurate results.

http://www.example.com/.+(?!actors/).+$
http://www.example.com/.+(?!actors/)$
http://www.example.com/(?!actors/).+$
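Part of the inaccuracy may come from the patterns themselves rather than from Fess. A negative lookahead is tested at a single position, and a preceding greedy `.+` can backtrack to any position where the check passes. The Python sketch below illustrates this with whole-string matching (`re.fullmatch`), which is an assumption about how the rules are applied. The two sample URLs and the `NOT_ACTORS` alternative are illustrative, not taken from the Fess documentation.

```python
import re

# The three exclusion attempts from the post, unchanged.
ATTEMPTS = [
    r"http://www.example.com/.+(?!actors/).+$",
    r"http://www.example.com/.+(?!actors/)$",
    r"http://www.example.com/(?!actors/).+$",
]

# Hypothetical sample pages.
actors_url = "http://www.example.com/films/actors/123"
other_url = "http://www.example.com/films/directors/7"

# Each tuple: (matches the actors page, matches the other page).
results = [
    (bool(re.fullmatch(p, actors_url)), bool(re.fullmatch(p, other_url)))
    for p in ATTEMPTS
]
print(results)

# Illustrative alternative: put the lookahead right after the host and let it
# scan the whole path for an "actors" segment at any depth.
NOT_ACTORS = r"http://www.example.com/(?!(?:.*/)?actors(?:/|$)).+"
print(bool(re.fullmatch(NOT_ACTORS, actors_url)),
      bool(re.fullmatch(NOT_ACTORS, other_url)))
```

All three original patterns match both sample URLs, so as exclusion rules they would drop the nested actors page along with everything else. The third pattern only spares paths that begin with `actors/` directly after the host. The alternative matches the directors page but not the actors page.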

I have tried full URLs and paths, in crawlers alone, in labels alone, and in both together. I always get wrong results: either URLs appear that ought not to appear when filtered, or items are not found at all where they ought to be found.

Any ideas on how to tackle this scenario?

discuss · April 13, 2017, 12:17am

(from marevol (Shinsuke Sugaya) · GitHub)

“but not for the rest of the site.”

Could you disable “Check Last Modified” and then start the crawler again?

“should I have more than one crawler?”

It depends on requirements, but in your case, I think one crawling config is better.

If you want to crawl/index http://example.com with Actors label, I think the config is as below:

Web Config:
URLs: http://example.com
Included URLs For Crawling:
  http://example.com/.*
Excluded URLs For Crawling:
  http://example.com/private/.*
Labels: (Not selected)

Label:
Included Paths:
  http://example.com/actors
  http://example.com/actors/.*
  http://example.com/.*/actors
  http://example.com/.*/actors/.*
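
One way to sanity-check the Web Config above before re-crawling is to run sample URLs through the include and exclude lists offline. The Python sketch below assumes a URL is crawled when it fully matches at least one include pattern (if any are set) and no exclude pattern. That rule is an assumption about the crawler's behavior, and `should_crawl` is a hypothetical helper, not a Fess API.

```python
import re

# Patterns copied from the suggested Web Config.
INCLUDED = [r"http://example.com/.*"]
EXCLUDED = [r"http://example.com/private/.*"]


def should_crawl(url, included=INCLUDED, excluded=EXCLUDED):
    """Assumed rule: match some include (when any exist) and no exclude."""
    if included and not any(re.fullmatch(p, url) for p in included):
        return False
    return not any(re.fullmatch(p, url) for p in excluded)


for candidate in [
    "http://example.com/films/actors/123",
    "http://example.com/private/notes.html",  # hypothetical private page
    "http://example.com",                     # start URL, no trailing slash
]:
    print(candidate, should_crawl(candidate))
```

Under this assumption the bare start URL http://example.com does not match the include pattern because it lacks the trailing slash. Whether that matters depends on how the crawler normalizes URLs, which is not verified here.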

discuss · May 16, 2017, 2:16pm

(from github.com/micakovic)
This trick does not work, I’m afraid. A single crawler with multiple labels like the one described above returns no results, even when the expected result sits within the paths the label covers.

discuss · October 10, 2017, 9:52am

(from github.com/micakovic)
See #1228