Crawl PDF pages

andionita · October 16, 2023, 9:51am

Hi,

I am interested in crawling, indexing and querying content from PDF pages with Fess. When I provide URLs in the “Included URLs to crawl” list, the web crawler does not return any PDF pages. How do I set it up to crawl PDF pages?

Thanks in advance.

Andrei

shinsuke · October 16, 2023, 11:46am

Could you provide more details for the web crawling setting?

andionita · October 16, 2023, 12:44pm

The URLs to the PDFs have the following form:

https://www\.example\.com/en/text_\d+\.htm\?selectedLocale=en

andionita · October 16, 2023, 12:49pm

In the meantime I managed to crawl PDFs from another site, so PDFs themselves are not the problem.

Could it be that URLs containing ? are not matched?

shinsuke · October 17, 2023, 11:06am

Did you add a pattern for the PDF path? You need to add patterns for all the URLs you want to crawl. So, to check whether your setting is correct, it’s better to try a simple pattern such as https://www\.example\.com/en/.* first.
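
For anyone who wants to test patterns outside Fess, here is a minimal sketch in Python. It rests on assumptions: Fess is Java-based, and this sketch assumes an included pattern must match the whole URL; Python's re module stands in here only because these simple patterns use the same syntax in both regex flavours. The two sample URLs and the is_included helper are made up for illustration and are not part of Fess.

```python
import re

# Pattern quoted earlier in the thread (note the escaped "?").
page_pattern = r"https://www\.example\.com/en/text_\d+\.htm\?selectedLocale=en"
# Broad pattern suggested for testing the setting.
broad_pattern = r"https://www\.example\.com/en/.*"

# Hypothetical URLs, invented for this example only.
page_url = "https://www.example.com/en/text_123.htm?selectedLocale=en"
pdf_url = "https://www.example.com/en/docs/text_123.pdf"


def is_included(pattern: str, url: str) -> bool:
    """Return True if the whole URL matches the pattern (assumed behaviour)."""
    return re.fullmatch(pattern, url) is not None


# Print one line per (pattern, URL) pair so the differences are easy to see.
for name, pattern in [("page", page_pattern), ("broad", broad_pattern)]:
    for url in (page_url, pdf_url):
        print(f"{name:5} {is_included(pattern, url)!s:5} {url}")
```

Under these assumptions, an escaped \? matches a literal ? without trouble, so the question mark itself should not be what blocks the crawl. The narrow pattern does not cover a PDF that lives under a different path, though, while the broad pattern does, which is why starting broad and narrowing afterwards is a useful check.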