"charset = unicode" page is not crawled correctly.

discuss · June 15, 2017, 9:25am

(from github.com/Kizuna-Fukami)
I am using FESS Ver 11.1.1.
It seems that HTML pages that declare “unicode” as their charset are not crawled correctly.
Is there any solution?

problem:
Pages that declare “charset = unicode”, like the sample below, are not indexed properly.
The HTML file itself appears to be indexed by the crawl, but its content does not seem to be decoded correctly.

Microsoft Office now seems to generate this kind of “unicode” HTML page when a document is saved as a web page.
This is a problem for me because these pages do not show up in search results.

To be clear, the problem is not limited to HTML generated by Microsoft Office: any page declared as unicode is not searchable.

Thanks.

Sample:

discuss · June 15, 2017, 12:18pm

(from github.com/marevol)
Could you please provide the sample file?

discuss · June 15, 2017, 11:31pm

(from github.com/Kizuna-Fukami)
Thank you for your quick response.
I attached three sample files: shift_jis, utf-8, and unicode.
Only the “unicode” page is not searchable.

sample.zip

discuss · June 16, 2017, 3:51am

(from github.com/marevol)
Thank you for the info.
Although I do not think “unicode” is a valid value for the meta charset, it will be supported as UTF-16LE in the next release.
https://stackoverflow.com/questions/20529313/is-charset-unicode-utf-8-utf-16-or-something-else
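
For anyone who needs to handle such pages outside of FESS in the meantime, the idea of treating “unicode” as UTF-16LE can be sketched roughly as below. This is only an illustration in Python, not the actual FESS change: the names CHARSET_ALIASES, resolve_charset and decode_page are hypothetical, and the fallback to UTF-8 for unknown labels is an assumption.

```python
import codecs

# Hypothetical alias table for labels that Python's codec registry does not
# recognise. Only the "unicode" -> UTF-16LE mapping comes from this thread.
CHARSET_ALIASES = {"unicode": "utf-16-le"}


def resolve_charset(label, default="utf-8"):
    """Return a Python codec name for a declared charset label."""
    name = label.strip().lower()
    name = CHARSET_ALIASES.get(name, name)
    try:
        return codecs.lookup(name).name
    except LookupError:
        # Assumption: fall back to a default instead of failing the page.
        return default


def decode_page(raw, declared_charset):
    """Decode raw page bytes with the declared charset and drop a leading BOM."""
    text = raw.decode(resolve_charset(declared_charset), errors="replace")
    return text.lstrip("\ufeff")
```

With this, decode_page(raw_bytes, "unicode") decodes the bytes as UTF-16LE, while labels such as shift_jis or utf-8 go through the normal codec lookup unchanged.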

discuss · June 16, 2017, 4:58am

(from github.com/Kizuna-Fukami)
marevol san,
Thank you for your info.
I understand the situation.
I am looking forward to the next release.
Thanks.