Fess Cannot Crawl WordPress Site

discuss · October 4, 2018, 10:55pm

(from github.com/swiftredvette)
For some reason, Fess is struggling to index a WordPress site. All other sites in the URL list succeed.

The logs show that Fess is attempting to fetch each URL in duplicated form: the same URL twice, separated by a space. Example log entry:

org.codelibs.fess.crawler.exception.CrawlingAccessException: The url may not be valid: http://www2.ncte.org/groups/cel/ http://www2.ncte.org/groups/cel/
at org.codelibs.fess.crawler.client.http.HcHttpClient.doGet(HcHttpClient.java:579)
at org.codelibs.fess.crawler.client.AbstractCrawlerClient.execute(AbstractCrawlerClient.java:142)
at org.codelibs.fess.crawler.client.FaultTolerantClient.execute(FaultTolerantClient.java:67)
at org.codelibs.fess.crawler.CrawlerThread.run(CrawlerThread.java:164)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.IllegalArgumentException: Illegal character in path at index 32: http://www2.ncte.org/groups/cel/ http://www2.ncte.org/groups/cel/

We’ve tried pointing the crawler at the website itself, as well as at the sitemap. Both result in the same issue.

How can we configure Fess to successfully crawl WordPress?

discuss · October 4, 2018, 11:09pm

(from github.com/marevol)
Could you provide the entire stack trace?

discuss · October 5, 2018, 12:26am

(from github.com/swiftredvette)
URL: http://www2.ncte.org/groups/cel/ http://www2.ncte.org/groups/cel/ << notice the duplicate

org.codelibs.fess.crawler.exception.CrawlingAccessException: The url may not be valid: http://www2.ncte.org/groups/cel/ http://www2.ncte.org/groups/cel/
at org.codelibs.fess.crawler.client.http.HcHttpClient.doGet(HcHttpClient.java:579)
at org.codelibs.fess.crawler.client.AbstractCrawlerClient.execute(AbstractCrawlerClient.java:142)
at org.codelibs.fess.crawler.client.FaultTolerantClient.execute(FaultTolerantClient.java:67)
at org.codelibs.fess.crawler.CrawlerThread.run(CrawlerThread.java:164)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.IllegalArgumentException: Illegal character in path at index 32: http://www2.ncte.org/groups/cel/ http://www2.ncte.org/groups/cel/
at java.net.URI.create(Unknown Source)
at org.apache.http.client.methods.HttpGet.<init>(HttpGet.java:66)
at org.codelibs.fess.crawler.client.http.HcHttpClient.doGet(HcHttpClient.java:577)
... 4 more
Caused by: java.net.URISyntaxException: Illegal character in path at index 32: http://www2.ncte.org/groups/cel/ http://www2.ncte.org/groups/cel/
at java.net.URI$Parser.fail(Unknown Source)
at java.net.URI$Parser.checkChars(Unknown Source)
at java.net.URI$Parser.parseHierarchical(Unknown Source)
at java.net.URI$Parser.parse(Unknown Source)
at java.net.URI.<init>(Unknown Source)
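
The bottom of the trace can be reproduced outside Fess with the JDK alone, which suggests the crawler is simply being handed a string that is not a valid URI. The following is a minimal sketch, not Fess code: the class and method names (DuplicatedUrlCheck, isValidUri) are made up for illustration, and only java.net.URI.create is taken from the stack trace.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of Fess or fess-crawler.
class DuplicatedUrlCheck {

    // Returns true if java.net.URI accepts the string, false if it rejects it.
    static boolean isValidUri(String url) {
        try {
            URI.create(url);
            return true;
        } catch (IllegalArgumentException e) {
            // URI.create wraps URISyntaxException in IllegalArgumentException,
            // matching the "Caused by" chain in the log above.
            System.out.println("Rejected: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        String single = "http://www2.ncte.org/groups/cel/";
        String duplicated = single + " " + single;

        // The separating space sits right after the first copy of the URL.
        System.out.println("Space at index: " + duplicated.indexOf(' '));

        System.out.println("single valid?     " + isValidUri(single));
        System.out.println("duplicated valid? " + isValidUri(duplicated));
    }
}
```

The first copy of the URL is 32 characters long, so the space lands at index 32, which is the position reported in the exception message.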

discuss · October 5, 2018, 12:47pm

(from github.com/marevol)
I think that link exists in the href attribute of an anchor (a) tag on the page.
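
One way to check this is to scan the page source for href values that contain whitespace. The sketch below is only an assumption about how one might do that: the class name HrefWhitespaceCheck, the regular expression, and the sample HTML are all hypothetical, and the regex is a rough approximation rather than a real HTML parser. The actual markup of the site is not shown in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical diagnostic for illustration; not part of Fess.
class HrefWhitespaceCheck {

    // Rough match for single- or double-quoted href values.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*([\"'])(.*?)\\1",
                    Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private static final Pattern WHITESPACE = Pattern.compile("\\s");

    // Returns href values that still contain whitespace after trimming.
    static List<String> suspiciousHrefs(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String value = m.group(2).trim();
            if (WHITESPACE.matcher(value).find()) {
                result.add(value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample markup; replace with the saved source of a real page.
        String html = "<a href=\"http://example.com/a/ http://example.com/a/\">A</a>"
                + "<a href=\"http://example.com/b/\">B</a>";
        for (String href : suspiciousHrefs(html)) {
            System.out.println("Suspicious href: " + href);
        }
    }
}
```

If a saved copy of a failing page yields no matches, the duplicated URL is probably coming from somewhere other than an href attribute.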

discuss · October 6, 2018, 11:52am

(from github.com/swiftredvette)
That’s what we thought as well. However, it isn’t just this one page: every page is affected. We are scanning this sitemap:

http://www2.ncte.org/sitemap_index.xml

Every page crawled from it shows the same behavior in Fess. We’ve also tried crawling the site directly rather than the sitemap, but we get the same results.

discuss · October 6, 2018, 12:03pm

(from github.com/marevol)
Did you check the Failure URL page?