Append labels

Is it possible to have multiple web crawlers append labels to existing documents, rather than overwrite them?


Labels: films, actors, website

Web crawler called ‘films’ gets page1 and labels it with films, website.

page1 now appears in searches only if the label is films, or the label is website.

Web crawler called ‘actors’ runs now and finds the same page, then labels it with actors, website.

page1 now appears in searches only if the label is actors, or the label is website, but not if someone searches this page under the label films. This is now wrong, as this page should be labelled with all three labels: films, actors, website.

No. Multiple crawlers should use the same crawling setting.
I think it’s better to use Label’s Included Paths and not set Web Crawling’s Labels.

That makes it impossible on any website where paths are not predictable. For example, suppose you want to crawl a site and have multiple crawlers to accommodate this:

  1. Crawler #1 labels ‘books’, ‘website’. It starts from a page which displays a list of book pages (generated dynamically from a database); crawling with a depth of 1, each of the listed URLs is crawled and labelled ‘books’, ‘website’. These pages could appear anywhere on the website, and there is no way to predict their paths.

  2. Crawler #2 starts from a similar listing page. Same story as above, only each of the URLs is crawled with a depth of 1 and labelled ‘magazine’, ‘website’.

  3. Crawler #3 crawls the entire site, labels ‘website’.

Crawler #2 will obliterate the labels set by #1, and #3 will obliterate those set by both #1 and #2.
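The overwriting described above can be illustrated with a minimal sketch (hypothetical, not Fess code): each crawler replaces a page’s labels with its own set, so earlier labels are lost.

```python
# Minimal sketch (not Fess code) of overwrite semantics: each crawler
# stores its full label set for every page it visits, wiping out
# whatever labels were there before.

index = {}  # page URL -> set of labels

def crawl_overwrite(pages, labels):
    for page in pages:
        index[page] = set(labels)  # later crawlers replace earlier labels

crawl_overwrite(["/books/1"], ["books", "website"])     # crawler #1
crawl_overwrite(["/books/1"], ["magazine", "website"])  # crawler #2 finds the same page
crawl_overwrite(["/books/1"], ["website"])              # crawler #3 crawls everything

print(index["/books/1"])  # {'website'} -- 'books' and 'magazine' are gone
```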

I would like users to be able to see all magazines when filtering by the label ‘magazine’, and the same for books when filtering by the label ‘books’. I would also like them to find both books and magazines when filtering by the label ‘website’. Finally, I would like them to find all magazines even if some appear under both the ‘magazines’ and ‘books’ labels.

What would be the suggested way to go about this?


use Label’s Included Paths and not set Web Crawling’s Labels.

and then create ONE crawler for the site.

And how would you dynamically populate included paths? Each time someone adds a new page, you would have to add it by hand to included paths? Each time someone removes the page, you would have to manually remove it? And if you have, say, 1M books, would you add 1M lines to included paths?

How about putting all Java Regex patterns to Included Paths?
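For reference, a label’s Included Paths field takes one Java regular expression per line, so with a predictable URL scheme it could look like this (the example.com paths are hypothetical):

```
https?://example\.com/books/.*
https?://example\.com/magazines/.*
```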

I cannot predict all the patterns, though. There could be hundreds or thousands of pages, and they can be anywhere within the URL structure. One page might look like it should be labelled as a book but should not be; another should be a book; yet another should be both a book and a magazine; and all of them should be part of the global site search.

To me, it seems like the only way to go about this is to have multiple crawlers for the same site, but each of the crawlers would append labels, not overwrite them.


crawler ‘site - books’:
crawl url site/util/books, depth 1, label as ‘books’, ‘website’

crawler ‘site - magazines’:
crawl url site/util/magazines, depth 1, label as ‘magazines’; if the ‘books’ label already exists for a URL, do not overwrite it; also append ‘website’ if it does not exist.

crawler ‘site - everywhere’:
crawl the entire site, label each page as ‘website’; if a ‘books’ and/or ‘magazine’ label already exists for a page, do not overwrite it, but append.
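The append semantics being requested could be sketched as follows (hypothetical, not existing Fess behaviour): each crawler merges its labels into whatever labels a page already carries, instead of replacing them.

```python
# Sketch of append-not-overwrite labelling: each crawler's labels are
# merged into the page's existing label set.

index = {}  # page URL -> set of labels

def crawl_append(pages, labels):
    for page in pages:
        index.setdefault(page, set()).update(labels)

crawl_append(["/p/1", "/p/3"], ["books", "website"])       # 'site - books'
crawl_append(["/p/3"], ["magazines", "website"])           # 'site - magazines'
crawl_append(["/p/1", "/p/2", "/p/3"], ["website"])        # 'site - everywhere'

# /p/3 was found by all three crawlers, so it keeps all three labels.
```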

The outcome would be, for example:

The first crawler runs and indexes a book page. It is labelled as ‘books’, ‘website’.
The second crawler runs and finds the same page. It does nothing.
The third crawler finds it and wants to add ‘website’ to it, but it is already there. The page now has the labels ‘books’, ‘website’.

The first crawler finds a page that is neither a book nor a magazine and does nothing.
The second crawler finds it and does nothing.
The third crawler finds it and labels it as ‘website’. The page is labelled as ‘website’.

The first crawler runs and finds a page that is both a book and a magazine. It labels it as ‘books’.
The second crawler also finds it and labels it as ‘magazine’, but leaves ‘books’ as well.
The third crawler finds it and labels it as ‘website’, leaving the first two intact. The page can now be found under the labels ‘books’, ‘magazines’, ‘website’.

Like this, it becomes easy and manageable to categorise content on the same site, or even multiple sites, where URL patterns cannot be predicted and regex matched.

The Label for UI setting does not support this.
I think that, to support the use case, it might work with the additional-field feature, extracting values from crawled pages.

That would be an option, too. How could one set values on an additional field feature in crawlers?

With the introduction of ordered crawling (#1290) it is possible to overcome this by doing the following.

System -> General -> Simultaneous Crawler Config = 1

Now make several crawlers and name them so that the more general ones (fewer labels) are on top. Example:

100 Website
210 Actors
220 Directors
310 Movies

Create labels with these names.

For each of these crawlers, supply the starting URL. Start 100 Website with the site URL. For 210, 220, and 310, provide a small utility page by, for example, reading the database, selecting all pages which are ‘actors’, and printing them one by one as hrefs on the utility page. Label the crawler with both ‘website’ and ‘actors’.

Repeat for others.
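Such a utility page could be generated with a short script along these lines (a sketch; the database schema, table name `pages`, and columns `url`/`category` are assumptions):

```python
# Hypothetical sketch of the 'utility page' idea: select all pages in a
# given category from a database and emit them as links, so a crawler
# started on this page with a depth of 1 reaches every one of them.
import sqlite3

def build_utility_page(db_path, category):
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT url FROM pages WHERE category = ?", (category,)
    ).fetchall()
    conn.close()
    links = "\n".join(f'<a href="{url}">{url}</a>' for (url,) in rows)
    return f"<html><body>\n{links}\n</body></html>"
```

Regenerating this page whenever the database changes keeps the crawler’s reachable set up to date without maintaining Included Paths by hand.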

This will label all pages as ‘website’ first. Then, it will relabel pages that should be both ‘website’ and ‘actors’, and so on, until the last crawler finishes.

This makes it possible to compensate for use cases where URL schemes are unpredictable, but we still want to have structured, categorised search.

@micakovic @marevol

Can anyone guide me more about

System -> General -> Simultaneous Crawler Config = 1

I also have a use case where URL schemes are unpredictable.