Crawling with references to other sites

discuss · December 12, 2016, 3:14pm

(from github.com/pcolmer)
I want to set up Fess with multiple crawlers where each crawler looks after a single web site with a label to match (thus making it easy to filter search results by web site).

However, some of our web sites link to PDFs that are hosted on other web sites. Is there a way that I can easily get those PDFs included in the search results?

Thanks.

discuss · December 12, 2016, 9:27pm

(from github.com/marevol)
You can use Included Paths in the Label settings (do not specify labels in the web crawling config).
http://fess.codelibs.org/10.3/admin/labeltype-guide.html#included-paths

discuss · December 13, 2016, 10:23am

(from github.com/pcolmer)
So if I’ve got something like:

https://www.96boards.org/.*

as the primary path to be included, can I then have:

.*/.*\.pdf

to have Fess include PDFs that it finds through links from www.96boards.org? Or do I need to spell out each URL explicitly?
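
A minimal sketch of how those two patterns would classify URLs, assuming each included path is treated as a regular expression that must match the whole URL (an assumption about Fess, not confirmed in this thread). Python's re module is used only for illustration; Fess runs on Java, and the two regex dialects agree for patterns this simple. The function name and the example URLs are hypothetical.

```python
import re

# Patterns quoted from the post above.
INCLUDED_PATHS = [
    r"https://www.96boards.org/.*",
    r".*/.*\.pdf",
]

def is_included(url, patterns=INCLUDED_PATHS):
    """Return True if any pattern matches the whole URL (assumed semantics)."""
    return any(re.fullmatch(p, url) is not None for p in patterns)

# Hypothetical URLs, for illustration only.
examples = [
    "https://www.96boards.org/products/",
    "https://static.example.com/docs/board-manual.pdf",
    "https://static.example.com/docs/index.html",
]
for url in examples:
    print(url, is_included(url))
```

Under that assumption, the second pattern admits a PDF on any host, while non-PDF pages on other hosts stay excluded.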

discuss · December 13, 2016, 10:24am

(from github.com/pcolmer)
When you say “do not specify labels in web crawling config”, is that because configuring the labels with the paths to be included is the correct way to do it? It is a little bit confusing because you can specify labels in the web crawling config, so I’m trying to understand the best/correct way to do things.

discuss · December 13, 2016, 1:21pm

(from github.com/marevol)
I might be missing something…
If you want PDFs in the search results, use facet search (right-side menu: File Type):
https://search.n2sm.co.jp/search/?q=test&ex_q=filetype%3Apdf
This does not require label or other settings.
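
As a rough sketch, the facet link above can be rebuilt by adding an ex_q=filetype:pdf parameter to an ordinary search URL. The helper name is hypothetical; only the q and ex_q parameters and the base URL come from the link in this post.

```python
from urllib.parse import urlencode

def pdf_search_url(base_url, query):
    """Build a search URL restricted to PDFs via the ex_q parameter."""
    params = {"q": query, "ex_q": "filetype:pdf"}
    # urlencode percent-encodes the colon, giving filetype%3Apdf.
    return base_url.rstrip("/") + "/?" + urlencode(params)

print(pdf_search_url("https://search.n2sm.co.jp/search", "test"))
```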

If you want to categorize search results, Label is useful.
There are two ways to set labels on indexed documents:

1. Select labels in the Crawling Config: http://fess.codelibs.org/10.3/admin/webconfig-guide.html#labels
2. Specify a URL pattern in the Label settings: http://fess.codelibs.org/10.3/admin/labeltype-guide.html#included-paths
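
The second approach can be pictured as follows. This is only a sketch, assuming a document receives every label whose Included Paths pattern matches its full URL; that rule, the label names, and the example URLs are illustrative assumptions, not taken from the Fess source.

```python
import re

# Hypothetical label names mapped to their "Included Paths" patterns.
LABEL_INCLUDED_PATHS = {
    "96boards": [r"https://www.96boards.org/.*"],
    "othersite": [r"https://www.example.org/.*"],
}

def labels_for(url, label_paths=LABEL_INCLUDED_PATHS):
    """Return the sorted labels whose patterns match the whole URL."""
    return sorted(
        label
        for label, patterns in label_paths.items()
        if any(re.fullmatch(p, url) is not None for p in patterns)
    )

print(labels_for("https://www.96boards.org/documentation/"))
print(labels_for("https://cdn.example.net/manual.pdf"))
```

Under this assumption, a PDF hosted on another site gets no label unless one of the label's patterns also covers its URL.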

discuss · December 13, 2016, 1:24pm

(from github.com/pcolmer)

If you want PDFs in the search results, use facet search (right-side menu: File Type):
https://search.n2sm.co.jp/search/?q=test&ex_q=filetype%3Apdf
This does not require label or other settings.

The challenge I’ve got is that the PDFs are not hosted on the web site that is being crawled. So I’m thinking that having:

https://www.96boards.org/.*
.*/.*\.pdf

might get the crawler to retrieve and then index any PDFs - stored on any web site - that are referenced from pages on www.96boards.org.

Looking at the Fess log, it looks like those paths are working, except that the PDFs are not returned from Elasticsearch when I search for words that I know appear in them. Do I need to put something into Included URLs For Indexing as well, or should the system index everything it retrieves?

discuss · December 13, 2016, 1:35pm

(from github.com/pcolmer)
I think this is working as I need it to be. Thanks for creating such a great platform!