Append labels

Is it possible to have multiple web crawlers append labels to existing documents, rather than overwrite them?


Labels: films, actors, website

Web crawler called ‘films’ gets page1 and labels it with films, website.

page1 now appears in searches only if the label is films, or the label is website.

Web crawler called ‘actors’ runs now and finds the same page, then labels it with actors, website.

page1 now appears in searches only if the label is actors, or the label is website, but not if someone searches this page under the label films. This is now wrong, as this page should be labelled with all three labels: films, actors, website.

No. Multiple crawlers should use the same crawling setting.
I think it’s better to use Label’s Included Paths and not set Web Crawling’s Labels.

That makes it impossible on any website where paths are not predictable. For example, suppose you want to crawl a site and have multiple crawlers to accommodate this:

  1. Crawler #1 labels ‘books’, ‘website’. It starts from a page which displays a list of book pages (generated dynamically from a database); crawling with a depth of 1, each of the listed URLs is crawled and labelled ‘books’, ‘website’. These pages could appear anywhere on the website, and there is no way to predict their paths.

  2. Crawler #2 starts from a similar listing page. Same story as above, only each of the URLs is crawled with a depth of 1 and labelled ‘magazine’, ‘website’.

  3. Crawler #3 crawls the entire site, labels ‘website’.

Crawler #2 will obliterate the labels set by #1, and #3 will obliterate those set by both #1 and #2.
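The overwriting described above can be illustrated with a minimal sketch (hypothetical, not Fess code): each crawler replaces a page’s labels with its own set, so earlier labels are lost.

```python
# Minimal sketch (not Fess code) of overwrite semantics: each crawler
# stores its full label set for every page it visits, wiping out
# whatever labels were there before.

index = {}  # page URL -> set of labels

def crawl_overwrite(pages, labels):
    for page in pages:
        index[page] = set(labels)  # later crawlers replace earlier labels

crawl_overwrite(["/books/1"], ["books", "website"])     # crawler #1
crawl_overwrite(["/books/1"], ["magazine", "website"])  # crawler #2 finds the same page
crawl_overwrite(["/books/1"], ["website"])              # crawler #3 crawls everything

print(index["/books/1"])  # {'website'} -- 'books' and 'magazine' are gone
```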

I would like users to be able to see all magazines when filtering by the label ‘magazine’, and the same for books when filtering by the label ‘books’. I would also like them to find both books and magazines when filtering by the label ‘website’. Finally, I would like them to find all magazines even if some appear under both the ‘magazines’ and ‘books’ labels.

What would be the suggested way to go about this?


use Label’s Included Paths and not set Web Crawling’s Labels.

and then create ONE crawler for the site.

And how would you dynamically populate included paths? Each time someone adds a new page, you would have to add it by hand to included paths? Each time someone removes the page, you would have to manually remove it? And if you have, say, 1M books, would you add 1M lines to included paths?

How about putting all Java Regex patterns to Included Paths?
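For reference, a label’s Included Paths field takes one Java regular expression per line, so with a predictable URL scheme it could look like this (the example.com paths are hypothetical):

```
https?://example\.com/books/.*
https?://example\.com/magazines/.*
```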

I cannot predict all the patterns, though. There could be hundreds or thousands of pages, and they can be anywhere within the URL structure. One page might look like it should be labelled as a book but should not be; another should be a book; yet another should be both a book and a magazine; and all of them should be part of the global site search.

To me, it seems like the only way to go about this is to have multiple crawlers for the same site, but each of the crawlers would append labels, not overwrite them.


crawler ‘site - books’:
crawl url site/util/books, depth 1, label as ‘books’, ‘website’

crawler ‘site - magazines’:
crawl url site/util/magazines, depth 1, label as ‘magazines’; if the ‘books’ label already exists for a URL, do not overwrite it; also append ‘website’ if it does not exist.

crawler ‘site - everywhere’:
crawl the entire site, label each page as ‘website’; if a ‘books’ and/or ‘magazine’ label already exists for a page, do not overwrite it, but append.
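The append semantics being requested could be sketched as follows (hypothetical, not existing Fess behaviour): each crawler merges its labels into whatever labels a page already carries, instead of replacing them.

```python
# Sketch of append-not-overwrite labelling: each crawler's labels are
# merged into the page's existing label set.

index = {}  # page URL -> set of labels

def crawl_append(pages, labels):
    for page in pages:
        index.setdefault(page, set()).update(labels)

crawl_append(["/p/1", "/p/3"], ["books", "website"])       # 'site - books'
crawl_append(["/p/3"], ["magazines", "website"])           # 'site - magazines'
crawl_append(["/p/1", "/p/2", "/p/3"], ["website"])        # 'site - everywhere'

# /p/3 was found by all three crawlers, so it keeps all three labels.
```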

The outcome would be, for example:

The first crawler runs and indexes a book page. It is labelled as ‘books’, ‘website’.
The second crawler runs and finds the same page. It does nothing.
The third crawler finds it and wants to add ‘website’ to it, but it is already there. The page now has the labels ‘books’, ‘website’.

The first crawler finds a page that is neither a book nor a magazine and does nothing.
The second crawler finds it and does nothing.
The third crawler finds it and labels it as ‘website’. The page is labelled as ‘website’.

The first crawler runs and finds a page that is both a book and a magazine. It labels it as ‘books’.
The second crawler also finds it and labels it as ‘magazine’, but leaves ‘books’ as well.
The third crawler finds it and labels it as ‘website’, leaving the first two intact. The page can now be found under the labels ‘books’, ‘magazines’, ‘website’.

Like this, it becomes easy and manageable to categorise content on the same site, or even multiple sites, where URL patterns cannot be predicted and regex matched.

The Label for UI setting does not support this.
I think that, to support the use case, it might work with the additional-field feature, extracting values from crawled pages.

That would be an option, too. How could one set values on an additional field feature in crawlers?

With the introduction of ordered crawling (#1290) it is possible to overcome this by doing the following.

System -> General -> Simultaneous Crawler Config = 1

Now make several crawlers and name them so that the more general ones (fewer labels) are on top. Example:

100 Website
210 Actors
220 Directors
310 Movies

Create labels with these names.

For each of these crawlers, supply the starting URL. Start 100 Website with the site URL. For 210, 220, and 310, provide a small utility page by, for example, reading the database, selecting all pages which are ‘actors’, and printing them one by one as hrefs on the utility page. Label the crawler with both ‘website’ and ‘actors’.

Repeat for others.
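Such a utility page could be generated with a short script along these lines (a sketch; the database schema, table name `pages`, and columns `url`/`category` are assumptions):

```python
# Hypothetical sketch of the 'utility page' idea: select all pages in a
# given category from a database and emit them as links, so a crawler
# started on this page with a depth of 1 reaches every one of them.
import sqlite3

def build_utility_page(db_path, category):
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT url FROM pages WHERE category = ?", (category,)
    ).fetchall()
    conn.close()
    links = "\n".join(f'<a href="{url}">{url}</a>' for (url,) in rows)
    return f"<html><body>\n{links}\n</body></html>"
```

Regenerating this page whenever the database changes keeps the crawler’s reachable set up to date without maintaining Included Paths by hand.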

This will label all pages as ‘website’ first. Then, it will relabel pages that should be both ‘website’ and ‘actors’, and so on, until the last crawler finishes.

This makes it possible to compensate for use cases where URL schemes are unpredictable, but we still want to have structured, categorised search.

@micakovic @marevol

Can anyone guide me more about

System -> General -> Simultaneous Crawler Config = 1

I also have a use case where URL schemes are unpredictable.