(from github.com/swiftredvette)
For some reason, Fess is struggling to index a WordPress site. All other sites in the URL list are crawled successfully.
The logs show that the crawler is attempting to fetch each URL twice in a single request, with the two copies separated by a space. The full stack trace for one such log entry is at the end of this post.
We’ve tried pointing the crawler at the website itself and at its sitemap. Both produce the same error.
How can we configure Fess to crawl a WordPress site successfully?
org.codelibs.fess.crawler.exception.CrawlingAccessException: The url may not be valid: http://www2.ncte.org/groups/cel/http://www2.ncte.org/groups/cel/
at org.codelibs.fess.crawler.client.http.HcHttpClient.doGet(HcHttpClient.java:579)
at org.codelibs.fess.crawler.client.AbstractCrawlerClient.execute(AbstractCrawlerClient.java:142)
at org.codelibs.fess.crawler.client.FaultTolerantClient.execute(FaultTolerantClient.java:67)
at org.codelibs.fess.crawler.CrawlerThread.run(CrawlerThread.java:164)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.IllegalArgumentException: Illegal character in path at index 32: http://www2.ncte.org/groups/cel/http://www2.ncte.org/groups/cel/
at java.net.URI.create(Unknown Source)
at org.apache.http.client.methods.HttpGet.<init>(HttpGet.java:66)
at org.codelibs.fess.crawler.client.http.HcHttpClient.doGet(HcHttpClient.java:577)
... 4 more
Caused by: java.net.URISyntaxException: Illegal character in path at index 32: http://www2.ncte.org/groups/cel/http://www2.ncte.org/groups/cel/
at java.net.URI$Parser.fail(Unknown Source)
at java.net.URI$Parser.checkChars(Unknown Source)
at java.net.URI$Parser.parseHierarchical(Unknown Source)
at java.net.URI$Parser.parse(Unknown Source)
at java.net.URI.<init>(Unknown Source)
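
For anyone trying to reproduce this outside Fess, here is a minimal Java sketch. It assumes the two copies of the URL are joined by a single space, as described in the post. Under that assumption `java.net.URI` rejects the string at index 32, which is exactly the length of the first copy, so the reported index points at the separator rather than at anything inside the URL itself. The class and method names are hypothetical and are not part of Fess.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration only; not part of Fess or fess-crawler.
class DuplicatedUrlCheck {

    // Returns the index of the first character java.net.URI rejects,
    // or -1 if the string parses as a URI.
    static int illegalCharIndex(String url) {
        try {
            new URI(url);
            return -1;
        } catch (URISyntaxException e) {
            return e.getIndex();
        }
    }

    public static void main(String[] args) {
        String single = "http://www2.ncte.org/groups/cel/";
        // Assumption: the crawler received two copies joined by one space.
        String doubled = single + " " + single;

        System.out.println("length of one copy:   " + single.length());
        System.out.println("single URL bad index: " + illegalCharIndex(single));
        System.out.println("doubled URL bad index: " + illegalCharIndex(doubled));
    }
}
```

If the index printed for the doubled string matches the one in the log, the crawler is most likely receiving the whole space-separated string as one URL, so the place to look is wherever that string originates (the page's links, the sitemap entries, or the crawl configuration) rather than the HTTP client.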