Crawl PDF pages


I would like to crawl, index and query the content of PDF pages with Fess. Although I added URLs to the “Included URLs to crawl” list, the web crawler is not returning any PDF pages. How do I set it up to crawl PDF pages?

Thanks in advance.


Could you provide more details for the web crawling setting?

The URLs to the PDFs have the following form


In the meantime I managed to crawl PDFs from another site, so PDFs themselves are not the problem.

Could it be that URLs containing ? are not matched?

Did you add a pattern for the PDF path? You need to add patterns that cover every URL you want to crawl. To check whether your setting is correct, it’s better to start with a simple pattern, for example: https://www\.example\.com/en/.*
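
On the question about ?: the include patterns are regular expressions, so an unescaped ? is a quantifier and not a literal character. Here is a rough sketch of the effect, written in Python only for illustration. Fess is Java-based, and I am assuming it matches each pattern against the whole URL; the PDF URL and the is_included helper below are made up for the example and are not part of Fess.

```python
import re

def is_included(url, patterns):
    """Return True if any pattern matches the entire URL (assumed behavior)."""
    return any(re.fullmatch(p, url) for p in patterns)

# Hypothetical PDF URL with a query string; not taken from the original post.
pdf_url = "https://www.example.com/en/download?file=report.pdf"

# Broad pattern from the reply above: ".*" also covers "?" and the query part.
broad = [r"https://www\.example\.com/en/.*"]

# Unescaped "?" makes the preceding "d" optional instead of matching "?".
unescaped = [r"https://www\.example\.com/en/download?file=.*"]

# Escaped "\?" matches a literal question mark.
escaped = [r"https://www\.example\.com/en/download\?file=.*"]

print(is_included(pdf_url, broad))      # True
print(is_included(pdf_url, unescaped))  # False
print(is_included(pdf_url, escaped))    # True
```

So if your pattern spells out the query part of the URL, try writing the ? as \? or simply end the pattern with .* as in the simple pattern above.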