(from github.com/pcolmer)
I want to set up Fess with multiple crawlers, where each crawler looks after a single web site and has a matching label (making it easy to filter search results by web site).
However, some of our web sites link to PDFs that are hosted on other web sites. Is there a way that I can easily get those PDFs included in the search results?
(from github.com/pcolmer)
When you say “do not specify labels in web crawling config”, is that because configuring the labels with the paths to be included is the correct way to do it? It is a little bit confusing because you can specify labels in the web crawling config, so I’m trying to understand the best/correct way to do things.
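To check my understanding, is the suggested setup something like the sketch below, where each label carries the per-site path filter and the web crawling configs have no labels assigned at all? (The field names are from the Labels admin page as I understand it; the second label is just a made-up example.)

```
Label: 96boards
  Included Paths: https://www\.96boards\.org/.*

Label: anothersite
  Included Paths: https://www\.anothersite\.example/.*
```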
The challenge I’ve got is that the PDFs are not hosted on the web site that is being crawled. So I’m thinking that having:
https://www.96boards.org/.*
*/.*\.pdf
might get the crawler to retrieve and then index any PDFs - stored on any web site - that are referenced from pages on www.96boards.org.
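One thing I'm unsure about: if these fields are Java regular expressions (as the docs suggest), then `*/.*\.pdf` has a leading `*` with nothing to repeat, so I wonder whether a pair of patterns like the following would be more robust. The second pattern is my guess at matching a PDF hosted on any site:

```
https://www\.96boards\.org/.*
https?://.*\.pdf
```

That way, any PDF linked from a crawled page should fall inside the included set, whichever host serves it.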
Looking at the Fess log, it looks like those paths are being matched … except that the PDFs are not being returned by Elasticsearch when I search for words that I know appear in them. Do I need to put something into Included URLs For Indexing as well, or should the system index everything it retrieves?
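In case it helps with debugging, here is roughly how I've been checking whether the PDFs made it into the index at all, by querying Elasticsearch directly rather than going through the Fess UI. This is just a sketch: the `fess.search` index alias and the `url` field match my Fess version, but both may differ in others, so adjust as needed.

```python
# Sketch: list indexed documents whose URL ends in .pdf by querying
# Elasticsearch directly. Assumes Elasticsearch on localhost:9200 and
# a Fess index reachable via the "fess.search" alias; adjust for your setup.
import requests

resp = requests.get(
    "http://localhost:9200/fess.search/_search",
    json={
        "query": {"wildcard": {"url": "*.pdf"}},  # match any indexed PDF URL
        "_source": ["url", "title"],
        "size": 10,
    },
)
resp.raise_for_status()
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"].get("url"))
```

If that query returns nothing, the PDFs were never indexed (a crawling/indexing config issue); if it returns the URLs but the Fess UI still finds nothing, the problem is more likely on the label/filter side.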