Fess Cannot Crawl WordPress Site

(from github.com/swiftredvette)
Fess is failing to index a WordPress site. All other sites in the URL list are crawled successfully.

The logs show that Fess is attempting to fetch each URL twice in a single request string: the URL, a space, and then the same URL again. Example log entry:

org.codelibs.fess.crawler.exception.CrawlingAccessException: The url may not be valid: http://www2.ncte.org/groups/cel/ http://www2.ncte.org/groups/cel/
at org.codelibs.fess.crawler.client.http.HcHttpClient.doGet(HcHttpClient.java:579)
at org.codelibs.fess.crawler.client.AbstractCrawlerClient.execute(AbstractCrawlerClient.java:142)
at org.codelibs.fess.crawler.client.FaultTolerantClient.execute(FaultTolerantClient.java:67)
at org.codelibs.fess.crawler.CrawlerThread.run(CrawlerThread.java:164)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.IllegalArgumentException: Illegal character in path at index 32: http://www2.ncte.org/groups/cel/ http://www2.ncte.org/groups/cel/

We’ve tried pointing the crawler at the website itself as well as at the sitemap. Both produce the same error.

How can we configure Fess to crawl a WordPress site successfully?

(from github.com/marevol)
Could you provide the entire stack trace?

(from github.com/swiftredvette)
URL: http://www2.ncte.org/groups/cel/ http://www2.ncte.org/groups/cel/ << notice the duplicate

org.codelibs.fess.crawler.exception.CrawlingAccessException: The url may not be valid: http://www2.ncte.org/groups/cel/ http://www2.ncte.org/groups/cel/
at org.codelibs.fess.crawler.client.http.HcHttpClient.doGet(HcHttpClient.java:579)
at org.codelibs.fess.crawler.client.AbstractCrawlerClient.execute(AbstractCrawlerClient.java:142)
at org.codelibs.fess.crawler.client.FaultTolerantClient.execute(FaultTolerantClient.java:67)
at org.codelibs.fess.crawler.CrawlerThread.run(CrawlerThread.java:164)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.IllegalArgumentException: Illegal character in path at index 32: http://www2.ncte.org/groups/cel/ http://www2.ncte.org/groups/cel/
at java.net.URI.create(Unknown Source)
at org.apache.http.client.methods.HttpGet.<init>(HttpGet.java:66)
at org.codelibs.fess.crawler.client.http.HcHttpClient.doGet(HcHttpClient.java:577)
… 4 more
Caused by: java.net.URISyntaxException: Illegal character in path at index 32: http://www2.ncte.org/groups/cel/ http://www2.ncte.org/groups/cel/
at java.net.URI$Parser.fail(Unknown Source)
at java.net.URI$Parser.checkChars(Unknown Source)
at java.net.URI$Parser.parseHierarchical(Unknown Source)
at java.net.URI$Parser.parse(Unknown Source)
at java.net.URI.<init>(Unknown Source)
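
For reference, the innermost exception can be reproduced outside Fess with nothing but java.net.URI. The following is a minimal sketch, not Fess code: the class name DuplicatedUrlCheck and the helper firstIllegalIndex are made up for illustration. It shows that index 32 is the space between the two copies of the URL.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class DuplicatedUrlCheck {
    // Hypothetical helper: returns the index of the first character that
    // java.net.URI rejects, or -1 if the string parses as a URI.
    static int firstIllegalIndex(String url) {
        try {
            new URI(url);
            return -1;
        } catch (URISyntaxException e) {
            return e.getIndex();
        }
    }

    public static void main(String[] args) {
        String single = "http://www2.ncte.org/groups/cel/";
        // Same shape as the string in the log: URL, space, URL again.
        String doubled = single + " " + single;

        System.out.println("single:  " + firstIllegalIndex(single));
        System.out.println("doubled: " + firstIllegalIndex(doubled));
    }
}
```

The single URL is 32 characters long, so the index reported for the doubled string is the position of the separating space. This suggests the crawler received the two URLs as one string rather than rejecting a URL that is malformed on its own.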

(from github.com/marevol)
I think the link exists in the href attribute of an <a> tag.
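
One way to test this idea is to scan a page's HTML for href values that contain whitespace. The sketch below is only a rough heuristic under stated assumptions: it uses a regular expression rather than a real HTML parser, it only sees quoted attribute values, and the class name HrefWhitespaceScan and method hrefsWithWhitespace are hypothetical. It is not how Fess extracts links.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefWhitespaceScan {
    // Matches href="..." or href='...'. Quoted values only; a heuristic,
    // not a substitute for an HTML parser.
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns every href value that still contains whitespace after trimming.
    static List<String> hrefsWithWhitespace(String html) {
        List<String> suspects = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            // Assumption: leading/trailing whitespace is harmless, so only
            // interior whitespace is flagged.
            String value = m.group(2).trim();
            if (value.chars().anyMatch(Character::isWhitespace)) {
                suspects.add(value);
            }
        }
        return suspects;
    }

    public static void main(String[] args) {
        // Illustrative HTML, not taken from the site in question.
        String html = "<a href=\"http://example.com/a/ http://example.com/a/\">bad</a>"
                + "<a href='http://example.com/b/'>ok</a>";
        for (String s : hrefsWithWhitespace(html)) {
            System.out.println("suspect href: " + s);
        }
    }
}
```

Running this against the saved source of one of the failing pages would show whether the doubled URL is already present in the markup or is introduced later in the crawl.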

(from github.com/swiftredvette)
That’s what we thought as well. However, it isn’t just this one page; it happens on every page. We are crawling this sitemap:

http://www2.ncte.org/sitemap_index.xml

Every page crawled from it shows the same behavior in Fess. We’ve also tried crawling the site directly rather than the sitemap, with the same results.

(from github.com/marevol)
Did you check the Failure URL page?