Problem with encoding Polish chars in URL (ę,ó,ł,ś,ą,ż,ź,ć,ń)

discuss · June 7, 2017, 9:16pm

(from github.com/kamil0414)
org.codelibs.fess.crawler.exception.CrawlingAccessException: The url may not be valid: http://someURL/Uk�ady wej��__wyj��/JET-I__O/Firmware
at org.codelibs.fess.crawler.client.http.HcHttpClient.doGet(HcHttpClient.java:580)
at org.codelibs.fess.crawler.client.AbstractCrawlerClient.execute(AbstractCrawlerClient.java:135)
at org.codelibs.fess.crawler.client.FaultTolerantClient.execute(FaultTolerantClient.java:67)
at org.codelibs.fess.crawler.CrawlerThread.run(CrawlerThread.java:164)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Illegal character in path at index 84: http://someURL/Uk�ady wej��__wyj��/JET-I__O/Firmware
at java.net.URI.create(URI.java:852)
at org.apache.http.client.methods.HttpGet.<init>(HttpGet.java:69)
at org.codelibs.fess.crawler.client.http.HcHttpClient.doGet(HcHttpClient.java:578)
… 4 more
Caused by: java.net.URISyntaxException: Illegal character in path at index 84: http://someURL/Uk�ady wej��__wyj��/JET-I__O/Firmware
at java.net.URI$Parser.fail(URI.java:2848)
at java.net.URI$Parser.checkChars(URI.java:3021)
at java.net.URI$Parser.parseHierarchical(URI.java:3105)
at java.net.URI$Parser.parse(URI.java:3053)
at java.net.URI.<init>(URI.java:588)
at java.net.URI.create(URI.java:850)
… 6 more

discuss · June 7, 2017, 11:30pm

(from github.com/marevol)
Is that URL set in the URLs field on the Web Crawling Config page?

discuss · June 8, 2017, 5:34am

(from github.com/kamil0414)
Yes, it is (domain url).

discuss · June 8, 2017, 5:42am

(from github.com/marevol)
URLs entered on the Web Crawling Config page need to be URL-encoded (percent-encoded) as UTF-8.
At crawl time, URLs found in crawled HTML pages are encoded automatically.
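For readers unsure what "URL-encoded as UTF-8" means in practice, here is a minimal sketch using only the JDK. It is not Fess's own code. The class name `PathEncodingSketch`, the helper `encode`, and the host `example.com` are made up for illustration. The multi-argument `java.net.URI` constructor quotes characters that are illegal in a path (such as a space), and `toASCIIString()` then percent-encodes the non-ASCII characters as UTF-8.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class PathEncodingSketch {

    // Hypothetical helper: builds an http URL from a host and a raw,
    // unencoded path, then returns it with the path percent-encoded as UTF-8.
    static String encode(String host, String rawPath) throws URISyntaxException {
        // The multi-argument constructor quotes illegal characters (e.g. space).
        URI uri = new URI("http", host, rawPath, null, null);
        // toASCIIString() percent-encodes the remaining non-ASCII characters.
        return uri.toASCIIString();
    }

    public static void main(String[] args) throws URISyntaxException {
        // "/Układy wejść/Firmware", written with Unicode escapes so the
        // example does not depend on the source file encoding.
        String rawPath = "/Uk\u0142ady wej\u015B\u0107/Firmware";
        System.out.println(encode("example.com", rawPath));
    }
}
```

The resulting string is plain ASCII and can be pasted into the URLs field. `java.net.URLEncoder` is a poor fit here because it targets form data and turns spaces into `+` rather than `%20`.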

discuss · June 8, 2017, 7:45pm

(from github.com/kamil0414)
Please remember that Polish characters are included in UTF-8. The problem occurs when the crawler finds a new address under the domain and tries to fetch it. I'm sure the URL is correct, and it opens normally in the browser.
My site has about 2000 pages, and because of this error the crawler cannot crawl 700 of them.

discuss · June 8, 2017, 9:27pm

(from github.com/marevol)
I see. In the current release, to crawl an IDN site, please encode the non-ASCII domain with ASCII Compatible Encoding (ACE), put the result in URLs on the Config page, and then use the Duplicate Host setting to replace the non-ASCII domain with the ACE-encoded domain. I think crawled URLs are URL-encoded before the Duplicate Host replacement.
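As a rough illustration of the ACE step only, the JDK's `java.net.IDN` class can convert a non-ASCII host name to its Punycode (`xn--`) form. This is a sketch, not part of Fess. The class name `AceDomainSketch`, the helper `toAce`, and the domain `łódź.example` are assumptions for the example. The Duplicate Host setting itself is configured in the Fess admin UI and is not shown here.

```java
import java.net.IDN;

public class AceDomainSketch {

    // Hypothetical helper: converts an internationalized host name to its
    // ASCII Compatible Encoding (each non-ASCII label becomes "xn--...").
    static String toAce(String unicodeHost) {
        return IDN.toASCII(unicodeHost);
    }

    public static void main(String[] args) {
        // "łódź.example", written with Unicode escapes so the example does
        // not depend on the source file encoding. The domain is made up.
        String host = "\u0142\u00F3d\u017A.example";
        String ace = toAce(host);
        System.out.println(host + " -> " + ace);
        // IDN.toUnicode reverses the conversion for display purposes.
        System.out.println(ace + " -> " + IDN.toUnicode(ace));
    }
}
```

The ACE form would go into the URLs field, and the Duplicate Host entry would map the original non-ASCII host to that same ACE form, as described above.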