(from github.com/guenther-orth)
I’ve been looking for a solution to this problem, but I couldn’t find one in the “Issues” section.
I use Fess to search PDFs in a file system. In this file system there are folders and files with German umlauts, for example:
folder: /var/tmp/fess/PDF/Häuser
files: /var/tmp/fess/PDF/Wohnungen/Leistungssätze.pdf
In the Failure URL list I found the following details:
org.codelibs.fess.exception.ContentNotFoundException: Not Found: file:/var/tmp/fess/PDF/H%EF%BF%BD%EF%BF%BDuser Parent: file:/var/tmp/fess/PDF
How can I crawl these folders and files?
Is this a problem with the Fess crawler or with Elasticsearch?
What kind of logs and config data do you need to analyse this issue?
Thanks a lot
Guenther
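A note on the failure URL above: `%EF%BF%BD` is the URL-encoded UTF-8 form of U+FFFD, the Unicode replacement character. Two of them appear where the single letter "ä" should be, which matches the two bytes that "ä" occupies in UTF-8. One plausible reading is that the file name bytes were valid UTF-8 but were decoded with a charset that cannot represent them. The Python sketch below reproduces that pattern; the function name `mangle` and the choice of ASCII as the wrong charset are assumptions for illustration, not something taken from the Fess source.

```python
from urllib.parse import quote


def mangle(name: str, wrong_charset: str = "ascii") -> str:
    """Simulate reading UTF-8 file-name bytes with the wrong charset.

    Hypothetical helper for illustration only; it does not reflect
    how Fess is implemented internally.
    """
    raw = name.encode("utf-8")  # bytes as stored on a UTF-8 file system
    # Each undecodable byte is replaced by U+FFFD.
    decoded = raw.decode(wrong_charset, errors="replace")
    # URL-encode the result (quote() encodes non-ASCII as UTF-8).
    return quote(decoded)


if __name__ == "__main__":
    print(mangle("Häuser"))  # prints H%EF%BF%BD%EF%BF%BDuser
```

If that reading is right, the names on disk are fine and the crawler needs to be told which encoding to use for file names.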
(from github.com/marevol)
Is Häuser encoded in UTF-8 in your environment?
(from github.com/guenther-orth)
Yes, I think so. I use LANG="de_DE.UTF-8" in my environment (Debian 9.7).
I can’t change the charset of the folder or of the file:
root@debian:/var/tmp/fess/PDF# file -bi Häuser
inode/directory; charset=binary
root@debian:/var/tmp/fess/PDF/Häuser# file -bi 40002889_4200_Leistungssätze.pdf
application/pdf; charset=binary
The program convmv can’t convert from binary to UTF-8.
How can I change this to utf8? Or do I have to set an encoding in the configuration of fess?
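A note on the `file -bi` output above: `file` reports on the contents of a file, not on the encoding of its name, so `charset=binary` is the expected answer for a directory or a PDF and says nothing about whether the name is UTF-8. To check the names themselves, one option is to look at their raw bytes. The following is a minimal Python sketch; the helper names `is_valid_utf8` and `non_utf8_names` are made up for this example, and the path in the usage section is the one from this thread.

```python
import os
from typing import List


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def non_utf8_names(root: str) -> List[bytes]:
    """Walk root and collect raw paths whose names are not valid UTF-8."""
    bad = []
    # Passing bytes to os.walk makes it yield raw, undecoded names.
    for dirpath, dirnames, filenames in os.walk(os.fsencode(root)):
        for name in dirnames + filenames:
            if not is_valid_utf8(name):
                bad.append(os.path.join(dirpath, name))
    return bad


if __name__ == "__main__":
    for path in non_utf8_names("/var/tmp/fess/PDF"):
        print(path)
```

An empty result means every name under the directory is already valid UTF-8, in which case renaming with convmv would not be needed.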
(from github.com/guenther-orth)
Now I found a solution:
In /etc/fess/fess_config.properties I had to set the following parameters:
- crawler.document.html.default.lang=de
- crawler.document.file.name.encoding=UTF-8
- crawler.document.file.default.lang=de
Every parameter was empty by default.
Fess now crawls the folders and files with German umlauts, and I can search the documents.
Now I can test whether Fess is able to crawl all 4.8 TB of PDFs.
Thanks for your help!
Guenther