Append labels

(from github.com/micakovic)
Is it possible to have multiple web crawlers append labels to existing documents, rather than overwrite them?

Scenario:

Labels: films, actors, website

Web crawler called ‘films’ gets http://site.com/page1 and labels it with films, website.

page1 now appears in searches only if the label is films, or the label is website.

Web crawler called ‘actors’ runs now and finds the same page, then labels it with actors, website.

page1 now appears in searches only if the label is actors, or the label is website, but not if someone searches under the label films. This is wrong, as the page should carry all three labels: films, actors, website.

(from github.com/marevol)
No. Multiple crawlers should use the same crawling setting.
I think it’s better to use Label’s Included Paths and not set Web Crawling’s Labels.

(from github.com/micakovic)
That makes it impossible in any website where paths are not predictable. For example, if you want to crawl www.site.com and have multiple crawlers to accommodate this:

  1. Crawler #1 labels ‘books’, ‘website’. It crawls site.com/util/books, which displays a list of pages (generated dynamically from a database), with a depth of 1, so each of the listed URLs is crawled and labelled ‘books’, ‘website’. These pages could appear anywhere on the website, and there is no way to predict whether it will be site.com/books/143, site.com/743, site.com/archives/forgotten/345, and so on.

  2. Crawler #2 crawls site.com/util/magazines. Same story as above, only each of the URLs is crawled with the depth of 1 and labelled ‘magazine’, ‘website’.

  3. Crawler #3 crawls the entire site, labels ‘website’.

2 will obliterate 1, and 3 will obliterate both 1 and 2.

I would like users to be able to see all magazines when filtering by label ‘magazine’. The same for books, when filtering by label ‘books’. I would also like them to find both books and magazines when filtering by label ‘website’. Finally, I would like them to find all ‘magazines’ even if some magazines appear in both ‘magazines’ and ‘books’ labels.

What would be the suggested way to go about this?

(from marevol (Shinsuke Sugaya) · GitHub)

Use Label’s Included Paths and do not set Web Crawling’s Labels.

Then create ONE crawler for www.site.com.

(from github.com/micakovic)
And how would you dynamically populate included paths? Each time someone adds a new page, you would have to add it by hand to included paths? Each time someone removes the page, you would have to manually remove it? And if you have, say, 1M books, would you add 1M lines to included paths?

(from github.com/marevol)
How about putting all Java Regex patterns to Included Paths?
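
As a rough illustration of this suggestion, here is a minimal sketch of regex-based label assignment, written in Python for brevity. The patterns, label names and URLs are hypothetical, and it assumes a label applies when one of its patterns matches the whole URL; how Fess actually evaluates Included Paths should be checked against its documentation.

```python
import re

# Hypothetical Included Paths per label (one regex per entry).
# Assumption: a label applies when any of its patterns matches the whole URL.
INCLUDED_PATHS = {
    "books": [r"https?://site\.com/books/.*"],
    "magazines": [r"https?://site\.com/\d{4}/mag/.*"],
    "website": [r"https?://site\.com/.*"],
}


def labels_for(url):
    """Return the sorted labels whose patterns match the given URL."""
    return sorted(
        label
        for label, patterns in INCLUDED_PATHS.items()
        if any(re.fullmatch(pattern, url) for pattern in patterns)
    )


print(labels_for("http://site.com/books/143"))
print(labels_for("http://site.com/2017/mag/old34-68"))
print(labels_for("http://site.com/743"))
```

In this sketch a URL such as site.com/743 only receives ‘website’, whatever its content, because nothing in the URL itself identifies the category.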

(from github.com/micakovic)
I cannot predict all the patterns, though. There could be hundreds, or thousands of pages, and they can be anywhere within the URL structure. A web page that goes like this site.com/something/else could be suitable to be labelled as a book, but site.com/something/other should not, and site.com/else/34234 should be a book, and site.com/groups/seven should be both a book and a magazine, and be part of the global site search.

To me, it seems like the only way to go about this is to have multiple crawlers for the same site, but each of the crawlers would append labels, not overwrite them.

Example: site.com

crawler ‘site - books’:
crawl url site/util/books, depth 1, label as ‘books’, ‘website’

crawler ‘site - magazines’
crawl url site/util/magazines, depth 1, label as ‘magazines’; if ‘books’ already exists for a URL, do not overwrite it; also append ‘website’ if it does not exist.

crawler ‘site - everywhere’
crawl site.com/ label ‘website’ for each page, if ‘books’ and/or ‘magazine’ label exists for each page do not overwrite, but append.

The outcome would be, for example:

The first crawler runs and indexes site.com/page1. It is labelled as ‘books’, ‘website’.
The second crawler runs and finds site.com/page1. It does nothing.
The third crawler, finds site.com/page1, wants to add ‘website’ to it, but it is already there.

site.com/page1 now has labels ‘books’, ‘website’.

The first crawler does not reach site.com/sub/sub/page2, so it does nothing.
The second crawler does not reach site.com/sub/sub/page2 either, so it does nothing.
The third crawler finds site.com/sub/sub/page2 and labels it as ‘website’.

site.com/sub/sub/page2 is labelled as ‘website’.

The first crawler runs and finds site.com/2017/mag/old34-68. It labels it as ‘books’.
The second crawler also finds it and labels it as ‘magazines’, but leaves ‘books’ as well.
The third crawler finds it and labels it as ‘website’, leaving the first two intact.

site.com/2017/mag/old34-68 can now be found under labels ‘books’, ‘magazines’, ‘website’.

Like this, it becomes easy and manageable to categorise content on the same site, or even multiple sites, where URL patterns cannot be predicted and regex matched.
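
The append behaviour proposed above can be modelled as a set union per URL. The following is only a sketch of the requested semantics, not of anything Fess implements; the function name, the crawler definitions and the in-memory index are all hypothetical.

```python
from collections import defaultdict


def run_crawlers(crawlers):
    """Simulate crawlers that add labels to a document instead of replacing them.

    `crawlers` is a list of (labels, urls) pairs; all names here are hypothetical.
    """
    index = defaultdict(set)  # URL -> set of labels
    for labels, urls in crawlers:
        for url in urls:
            index[url] |= set(labels)  # union: existing labels are kept
    return {url: sorted(found) for url, found in index.items()}


crawlers = [
    (["books", "website"], ["site.com/page1", "site.com/2017/mag/old34-68"]),
    (["magazines", "website"], ["site.com/2017/mag/old34-68"]),
    (
        ["website"],
        ["site.com/page1", "site.com/sub/sub/page2", "site.com/2017/mag/old34-68"],
    ),
]

for url, labels in run_crawlers(crawlers).items():
    print(url, labels)
```

Because a union does not depend on ordering, the crawlers in this model could run in any sequence and still produce the same labels.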

(from github.com/marevol)
The Label setting in the UI does not support this.
To support the use case, I think it might work to use an additional field whose value is extracted from the crawled pages.

(from github.com/micakovic)
That would be an option, too. How could one set values on an additional field feature in crawlers?

(from github.com/micakovic)
With the introduction of ordered crawling (#1290) it is possible to overcome this by doing the following.

System -> General -> Simultaneous Crawler Config = 1

Now make several crawlers and name them so that the more general ones (fewer labels) are on top. Example:

100 Website
210 Actors
220 Directors
310 Movies

Make these labels.

For each of these crawlers, supply the starting URL. Start 100 Website with the site URL. Provide a small utility page for each of 210, 220, 310 by, for example, reading the database, selecting all pages which are ‘actors’, and printing each one as an href link on the utility page. Label with both ‘website’ and ‘actors’.

Repeat for others.
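
A utility page of this kind could be generated along the following lines. This is a hedged sketch only: the `pages` table, its `url` and `category` columns, the example URLs, and the use of SQLite are assumptions standing in for whatever database the site really uses.

```python
import sqlite3
from html import escape


def render_link_page(conn, category):
    """Build a bare HTML page with one link per page in the given category."""
    rows = conn.execute(
        "SELECT url FROM pages WHERE category = ? ORDER BY url", (category,)
    ).fetchall()
    links = "\n".join(
        f'<a href="{escape(url, quote=True)}">{escape(url)}</a><br>'
        for (url,) in rows
    )
    return f"<html><body>\n{links}\n</body></html>"


# Demo with an in-memory database standing in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?)",
    [
        ("http://site.com/people/17", "actors"),
        ("http://site.com/archive/old/203", "actors"),
        ("http://site.com/2017/films/9", "movies"),
    ],
)
print(render_link_page(conn, "actors"))
```

The ‘210 Actors’ crawler would then use the URL serving this page as its starting URL, presumably with a shallow depth so that only the listed pages are fetched.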

This will label all pages as ‘website’ first. Then, it will relabel pages that should be both ‘website’ and ‘actors’, and so on, until the last crawler finishes.

This makes it possible to compensate for use cases where URL schemes are unpredictable, but we still want to have structured, categorised search.
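
The effect of this workaround can be modelled with a small simulation. It assumes that crawlers run one at a time in name order and that each crawler overwrites the labels of every page it indexes; the URLs are hypothetical, and the real scheduling behaviour of Fess should be verified separately.

```python
def run_in_order(crawlers):
    """Simulate crawlers that run one at a time, sorted by name, each overwriting labels."""
    index = {}
    for name in sorted(crawlers):
        labels, urls = crawlers[name]
        for url in urls:
            index[url] = list(labels)  # overwrite: the last crawler to touch a URL wins
    return index


crawlers = {
    "310 Movies": (["website", "movies"], ["site.com/b"]),
    "100 Website": (["website"], ["site.com/a", "site.com/b", "site.com/c"]),
    "210 Actors": (["website", "actors"], ["site.com/a"]),
}

for url, labels in sorted(run_in_order(crawlers).items()):
    print(url, labels)
```

In this model a page listed by both 210 and 310 would keep only the labels of the later crawler, so a page belonging to several categories would need a crawler that carries the combined labels.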

(from github.com/farooqsheikhpk)
@micakovic @marevol

Can anyone give me more guidance on

System -> General -> Simultaneous Crawler Config = 1

I also have a use case where URL schemes are unpredictable.