Unable to crawl when robots.txt not found

(from github.com/clauded)
I have a SharePoint web site with no robots.txt. When the crawler tries to fetch robots.txt, it gets an HTTP/1.1 404 NOT FOUND.

Also, the landing page redirects to another page with HTTP/1.1 301 Moved Permanently (redirected to www.mysite.org/fr/Pages/accueil.aspx). Processing stops there for the whole site, so no index is created. Included URLs For Crawling is set to: http://www.mysite.org.*

How should I set up my crawling job?

(from github.com/marevol)
Could you check logs/fess-crawler.log?

(from github.com/clauded)
I have logs set to debug. Processing actually stops after getting "Object moved to <a href="/Pages/accueil-redir.aspx">here</a>".

(from github.com/clauded)
I edited some of my settings, and here is my problem now:

  • http://www.mysite.org is redirected (301) to /Pages/accueil-redir.aspx
  • /Pages/accueil-redir.aspx is redirected (301) to /fr/
  • /fr/ is redirected (301) to /fr/Pages/accueil.aspx

Here’s an edited extract of the log:

2017-05-29 14:42:30,023 [WebFsCrawler] INFO  Target URL: http://www.mysite.org
2017-05-29 14:42:30,023 [WebFsCrawler] INFO  Included URL: http://www.mysite.org.*
2017-05-29 14:42:30,023 [WebFsCrawler] INFO  Included URL: /Pages.*
2017-05-29 14:42:30,023 [WebFsCrawler] INFO  Included URL: /fr.*
2017-05-29 14:42:30,023 [WebFsCrawler] INFO  Included URL: /en.*
2017-05-29 14:42:30,023 [WebFsCrawler] INFO  Excluded URL: http://www\.somesite\.org/.*
2017-05-29 14:42:30,024 [WebFsCrawler] INFO  Excluded URL: .*\.png
2017-05-29 14:42:30,024 [WebFsCrawler] INFO  Excluded URL: .*\.jpg
2017-05-29 14:42:30,024 [WebFsCrawler] INFO  Excluded URL: .*\.gif
2017-05-29 14:42:30,024 [WebFsCrawler] INFO  Excluded URL: .*\.ico
2017-05-29 14:42:30,024 [WebFsCrawler] INFO  Excluded URL: .*\.css
2017-05-29 14:42:30,024 [WebFsCrawler] INFO  Excluded URL: .*\.js
2017-05-29 14:42:30,024 [WebFsCrawler] DEBUG Crawling http://www.mysite.org
2017-05-29 14:42:30,033 [IndexUpdater] DEBUG Starting indexUpdater.
2017-05-29 14:42:30,180 [Crawler-20170529144221-1-1] DEBUG Queued URL: [UrlQueueImpl [id=20170529144221-1.aHR0cDovL3d3dy5yZXRyYWl0ZX
...
IuYXNweA, sessionId=20170529143758-1, method=GET, url=/Pages/accueil-redir.aspx, encoding=null, parentUrl=http://www.mysite.org, depth=1, lastModified=null, createTime=1496083086433]]
2017-05-29 14:38:16,498 [Crawler-20170529143758-1-2] INFO  Crawling URL: /Pages/accueil-redir.aspx
2017-05-29 14:38:16,498 [Crawler-20170529143758-1-2] INFO  Unsupported URL: /Pages/accueil-redir.aspx
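
One possible reading of the "Unsupported URL" line is that the relative Location value was queued without being resolved against the page that returned it. This is only a guess from the log, not a confirmed description of the crawler's internals. A minimal Python sketch of resolving the redirect chain listed above into absolute URLs (the function name `resolve_redirects` is hypothetical):

```python
from urllib.parse import urljoin


def resolve_redirects(start_url, locations):
    """Resolve each Location value against the URL that returned it."""
    chain = [start_url]
    for location in locations:
        # urljoin turns a relative Location such as "/fr/" into an
        # absolute URL, using the previous hop as the base.
        chain.append(urljoin(chain[-1], location))
    return chain


# The three 301 hops reported in this thread.
chain = resolve_redirects(
    "http://www.mysite.org",
    ["/Pages/accueil-redir.aspx", "/fr/", "/fr/Pages/accueil.aspx"],
)
for url in chain:
    print(url)
```

Every entry in the resolved chain starts with http://www.mysite.org, so the first include pattern alone would cover all of them once the URLs are absolute.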

(from github.com/marevol)
I think it’s this commit: https://github.com/codelibs/fess-crawler/commit/c0dd033422369bede2a80e8cc7634e8540cbda18
I’ll release fixed versions this week.

(from github.com/clauded)
I recompiled everything from source. The error is gone but I still can’t index my site:

2017-05-30 17:14:26,312 [main] DEBUG Connection manager shut down
2017-05-30 17:14:26,312 [Crawler-20170530161406-1-1] DEBUG Connection released: [id: 0][route: {}->http://www.retraitequebec.gouv.qc.ca:80][total kept alive: 0; route allocated: 0 of 20; total allocated: 0 of 200]
2017-05-30 17:14:26,314 [Crawler-20170530161406-1-1] INFO  I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}->http://www.retraitequebec.gouv.qc.ca:80: The target server failed to respond

My setup:

URLs 	
http://www.retraitequebec.gouv.qc.ca
Included URLs For Crawling 	
http://www.retraitequebec.gouv.qc.ca.*
/Pages.*
/fr.*
/en.*
Excluded URLs For Crawling 	
/robots.txt
.*\.png
.*\.jpg
.*\.gif
.*\.ico
.*\.css
.*\.js
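
To sanity-check a setup like the one above, the patterns can be tried against sample URLs outside of Fess. The Python sketch below assumes each pattern must match the whole URL and that a URL is crawled when it matches at least one include and no exclude. That matching rule is my assumption, not something confirmed in this thread, and `is_allowed` is a hypothetical helper:

```python
import re

# Patterns copied from the setup above.
INCLUDES = [
    r"http://www.retraitequebec.gouv.qc.ca.*",
    r"/Pages.*",
    r"/fr.*",
    r"/en.*",
]
EXCLUDES = [
    r"/robots.txt",
    r".*\.png", r".*\.jpg", r".*\.gif",
    r".*\.ico", r".*\.css", r".*\.js",
]


def is_allowed(url, includes=INCLUDES, excludes=EXCLUDES):
    """Assumed rule: full match on any include and on no exclude."""
    included = any(re.fullmatch(p, url) for p in includes)
    excluded = any(re.fullmatch(p, url) for p in excludes)
    return included and not excluded


base = "http://www.retraitequebec.gouv.qc.ca"
for candidate in [
    base + "/fr/Pages/accueil.aspx",
    base + ":80/fr/Pages/accueil.aspx",
    base + "/images/logo.png",
    "/fr/Pages/accueil.aspx",
]:
    print(candidate, is_allowed(candidate))
```

Under this assumption the relative patterns (/Pages.*, /fr.*, /en.*) only ever match unresolved relative URLs, and the /robots.txt exclude would not match the absolute robots.txt URL.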

(from github.com/marevol)
It seems that it’s a network problem in your environment.

(from github.com/clauded)
I did some more testing and I don’t think it’s a network issue: it works up to this point, and then I keep getting “The url is null.”:

2017-05-31 12:02:22,245 [Crawler-20170531120142-1-2] DEBUG Executing request GET /fr/Pages/accueil.aspx HTTP/1.1
2017-05-31 12:02:22,245 [Crawler-20170531120142-1-2] DEBUG Target auth state: UNCHALLENGED
2017-05-31 12:02:22,245 [Crawler-20170531120142-1-2] DEBUG Proxy auth state: UNCHALLENGED
2017-05-31 12:02:22,245 [Crawler-20170531120142-1-2] DEBUG http-outgoing-0 >> GET /fr/Pages/accueil.aspx HTTP/1.1
2017-05-31 12:02:22,245 [Crawler-20170531120142-1-2] DEBUG http-outgoing-0 >> Host: www.retraitequebec.gouv.qc.ca:80
2017-05-31 12:02:22,245 [Crawler-20170531120142-1-2] DEBUG http-outgoing-0 >> Connection: Keep-Alive
2017-05-31 12:02:22,245 [Crawler-20170531120142-1-2] DEBUG http-outgoing-0 >> User-Agent: Mozilla/5.0 (compatible; Fess/11.2; +http://fess.codelibs.org/bot.html)
2017-05-31 12:02:22,245 [Crawler-20170531120142-1-2] DEBUG http-outgoing-0 >> Cookie: Langue=Francais; RRQ_SupporteCookie=oui; TS016ba385=01298634be706c66bdb08f9542a48ba3d33972b7c547f4f0ef611861700990ab236ab764b03fda205989e3b7463ea6d8fbafdca1f48e715dc3aceac0baa2b73fa1334f9961
2017-05-31 12:02:22,245 [Crawler-20170531120142-1-2] DEBUG http-outgoing-0 >> Accept-Encoding: gzip,deflate
2017-05-31 12:02:22,245 [Crawler-20170531120142-1-2] DEBUG http-outgoing-0 >> "GET /fr/Pages/accueil.aspx HTTP/1.1[\r][\n]"
2017-05-31 12:02:22,246 [Crawler-20170531120142-1-2] DEBUG http-outgoing-0 >> "Host: www.retraitequebec.gouv.qc.ca:80[\r][\n]"
2017-05-31 12:02:22,246 [Crawler-20170531120142-1-2] DEBUG http-outgoing-0 >> "Connection: Keep-Alive[\r][\n]"
2017-05-31 12:02:22,246 [Crawler-20170531120142-1-2] DEBUG http-outgoing-0 >> "User-Agent: Mozilla/5.0 (compatible; Fess/11.2; +http://fess.codelibs.org/bot.html)[\r][\n]"
2017-05-31 12:02:22,246 [Crawler-20170531120142-1-2] DEBUG http-outgoing-0 >> "Cookie: Langue=Francais; RRQ_SupporteCookie=oui; TS016ba385=01298634be706c66bdb08f9542a48ba3d33972b7c547f4f0ef611861700990ab236ab764b03fda205989e3b7463ea6d8fbafdca1f48e715dc3aceac0baa2b73fa1334f9961[\r][\n]"
2017-05-31 12:02:22,246 [Crawler-20170531120142-1-2] DEBUG http-outgoing-0 >> "Accept-Encoding: gzip,deflate[\r][\n]"
2017-05-31 12:02:22,246 [Crawler-20170531120142-1-2] DEBUG http-outgoing-0 >> "[\r][\n]"
2017-05-31 12:02:22,611 [Crawler-20170531120142-1-4] DEBUG The url is null. (3)
2017-05-31 12:02:22,614 [Crawler-20170531120142-1-5] DEBUG The url is null. (3)
2017-05-31 12:02:22,734 [Crawler-20170531120142-1-1] DEBUG The url is null. (0)
2017-05-31 12:02:22,933 [Crawler-20170531120142-1-3] DEBUG The url is null. (2)
2017-05-31 12:02:26,160 [CoreLib-TimeoutManager] DEBUG Closing expired connections
2017-05-31 12:02:26,161 [CoreLib-TimeoutManager] DEBUG Closing connections idle longer than 60000 MILLISECONDS

(from github.com/marevol)
“The url is null.” means that the URL queue is empty, so it’s not a problem.

If “Crawler-20170531120142-1-2” did not print any other messages, the target URL did not return a response (the server seems to be blocking the request).

(from github.com/clauded)
Fess is indexing other sites so it’s probably not a low level network problem. Strangely, it went a bit further last night as it got a 200 response from the server. Here’s a good and a bad attempt:

-------------------- GOOD ATTEMPT ------------------------------------------------------------------------
2017-06-01 00:00:32,757 [Crawler-20170601000000-1-5] DEBUG Queued URL: [UrlQueueImpl [id=20170601000000-1.aHR0cDovL3d3dy5yZXRyYWl0ZXF1ZWJlYy5nb3V2LnFjLmNhL2ZyL1BhZ2VzL2FjY3VlaWwuYXNweA, sessionId=20170601000000-1, method=GET, url=http://www.retraitequebec.gouv.qc.ca/fr/Pages/accueil.aspx, encoding=null, parentUrl=http://www.retraitequebec.gouv.qc.ca/fr/, depth=3, lastModified=null, createTime=1496289632483]]
2017-06-01 00:00:32,760 [Crawler-20170601000000-1-1] INFO  Crawling URL: http://www.retraitequebec.gouv.qc.ca/fr/Pages/accueil.aspx
2017-06-01 00:00:32,760 [Crawler-20170601000000-1-1] DEBUG Getting the content from URL: http://www.retraitequebec.gouv.qc.ca/fr/Pages/accueil.aspx
2017-06-01 00:00:32,760 [Crawler-20170601000000-1-1] DEBUG Accessing http://www.retraitequebec.gouv.qc.ca/fr/Pages/accueil.aspx
2017-06-01 00:00:32,761 [Crawler-20170601000000-1-1] DEBUG CookieSpec selected: default
2017-06-01 00:00:32,761 [Crawler-20170601000000-1-1] DEBUG Cookie [version: 0][name: Langue][value: Francais][domain: www.retraitequebec.gouv.qc.ca][path: /][expiry: Fri Dec 01 00:00:21 EST 2017] match [www.retraitequebec.gouv.qc.ca:80/fr/Pages/accueil.aspx]
2017-06-01 00:00:32,762 [Crawler-20170601000000-1-1] DEBUG Cookie [version: 0][name: RRQ_SupporteCookie][value: oui][domain: www.retraitequebec.gouv.qc.ca][path: /][expiry: null] match [www.retraitequebec.gouv.qc.ca:80/fr/Pages/accueil.aspx]
2017-06-01 00:00:32,762 [Crawler-20170601000000-1-1] DEBUG Cookie [version: 0][name: TS016ba385][value: 01298634beea246d40eb4f91e89adb2a5064a378dad4285622a4bebc3a88595561c85cbbf446e7435841593e1cdfe4dea27f87ca6629e72f85c70dc36c01079e144d06cb0d][domain: www.retraitequebec.gouv.qc.ca][path: /][expiry: null] match [www.retraitequebec.gouv.qc.ca:80/fr/Pages/accueil.aspx]
2017-06-01 00:00:32,762 [Crawler-20170601000000-1-1] DEBUG Connection request: [route: {}->http://www.retraitequebec.gouv.qc.ca:80][total kept alive: 1; route allocated: 1 of 20; total allocated: 1 of 200]

2017-06-01 00:00:32,763 [Crawler-20170601000000-1-1] DEBUG Connection leased: [id: 0][route: {}->http://www.retraitequebec.gouv.qc.ca:80][total kept alive: 0; route allocated: 1 of 20; total allocated: 1 of 200]
2017-06-01 00:00:32,763 [Crawler-20170601000000-1-1] DEBUG Executing request GET /fr/Pages/accueil.aspx HTTP/1.1
2017-06-01 00:00:32,764 [Crawler-20170601000000-1-1] DEBUG Target auth state: UNCHALLENGED
2017-06-01 00:00:32,764 [Crawler-20170601000000-1-1] DEBUG Proxy auth state: UNCHALLENGED
2017-06-01 00:00:32,764 [Crawler-20170601000000-1-1] DEBUG http-outgoing-0 >> GET /fr/Pages/accueil.aspx HTTP/1.1
2017-06-01 00:00:32,764 [Crawler-20170601000000-1-1] DEBUG http-outgoing-0 >> Host: www.retraitequebec.gouv.qc.ca
2017-06-01 00:00:32,764 [Crawler-20170601000000-1-1] DEBUG http-outgoing-0 >> Connection: Keep-Alive
2017-06-01 00:00:32,765 [Crawler-20170601000000-1-1] DEBUG http-outgoing-0 >> User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
2017-06-01 00:00:32,765 [Crawler-20170601000000-1-1] DEBUG http-outgoing-0 >> Cookie: Langue=Francais; RRQ_SupporteCookie=oui; TS016ba385=01298634beea246d40eb4f91e89adb2a5064a378dad4285622a4bebc3a88595561c85cbbf446e7435841593e1cdfe4dea27f87ca6629e72f85c70dc36c01079e144d06cb0d
2017-06-01 00:00:32,765 [Crawler-20170601000000-1-1] DEBUG http-outgoing-0 >> Accept-Encoding: gzip,deflate
2017-06-01 00:00:32,765 [Crawler-20170601000000-1-1] DEBUG http-outgoing-0 >> "GET /fr/Pages/accueil.aspx HTTP/1.1[\r][\n]"
2017-06-01 00:00:32,765 [Crawler-20170601000000-1-1] DEBUG http-outgoing-0 >> "Host: www.retraitequebec.gouv.qc.ca[\r][\n]"
2017-06-01 00:00:32,765 [Crawler-20170601000000-1-1] DEBUG http-outgoing-0 >> "Connection: Keep-Alive[\r][\n]"
2017-06-01 00:00:32,765 [Crawler-20170601000000-1-1] DEBUG http-outgoing-0 >> "User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)[\r][\n]"
2017-06-01 00:00:32,766 [Crawler-20170601000000-1-1] DEBUG http-outgoing-0 >> "Cookie: Langue=Francais; RRQ_SupporteCookie=oui; TS016ba385=01298634beea246d40eb4f91e89adb2a5064a378dad4285622a4bebc3a88595561c85cbbf446e7435841593e1cdfe4dea27f87ca6629e72f85c70dc36c01079e144d06cb0d[\r][\n]"
2017-06-01 00:00:32,766 [Crawler-20170601000000-1-1] DEBUG http-outgoing-0 >> "Accept-Encoding: gzip,deflate[\r][\n]"
2017-06-01 00:00:32,766 [Crawler-20170601000000-1-1] DEBUG http-outgoing-0 >> "[\r][\n]"
2017-06-01 00:00:32,772 [Crawler-20170601000000-1-5] DEBUG The url is null. (0)
2017-06-01 00:00:32,805 [Crawler-20170601000000-1-1] DEBUG http-outgoing-0 << "HTTP/1.1 200 OK[\r][\n]"

-------------------- BAD ATTEMPT  ------------------------------------------------------------------------

2017-06-01 09:57:55,672 [Crawler-20170601095706-1-1] DEBUG Queued URL: [UrlQueueImpl [id=20170601095706-1.aHR0cDovL3d3dy5yZXRyYWl0ZXF1ZWJlYy5nb3V2LnFjLmNhOjgwL2ZyL1BhZ2VzL2FjY3VlaWwuYXNweA, sessionId=20170601095706-1, method=GET, url=http://www.retraitequebec.gouv.qc.ca:80/fr/Pages/accueil.aspx, encoding=null, parentUrl=http://www.retraitequebec.gouv.qc.ca/fr/Pages/accueil.aspx, depth=4, lastModified=null, createTime=1496325465648]]
2017-06-01 09:57:55,690 [Crawler-20170601095706-1-1] INFO  Crawling URL: http://www.retraitequebec.gouv.qc.ca:80/fr/Pages/accueil.aspx
2017-06-01 09:57:55,691 [Crawler-20170601095706-1-1] DEBUG Getting the content from URL: http://www.retraitequebec.gouv.qc.ca:80/fr/Pages/accueil.aspx
2017-06-01 09:57:55,691 [Crawler-20170601095706-1-1] DEBUG Accessing http://www.retraitequebec.gouv.qc.ca:80/fr/Pages/accueil.aspx
2017-06-01 09:57:55,691 [Crawler-20170601095706-1-1] DEBUG CookieSpec selected: default
2017-06-01 09:57:55,691 [Crawler-20170601095706-1-1] DEBUG Cookie [version: 0][name: Langue][value: Francais][domain: www.retraitequebec.gouv.qc.ca][path: /][expiry: Fri Dec 01 09:57:24 EST 2017] match [www.retraitequebec.gouv.qc.ca:80/fr/Pages/accueil.aspx]
2017-06-01 09:57:55,691 [Crawler-20170601095706-1-1] DEBUG Cookie [version: 0][name: RRQ_SupporteCookie][value: oui][domain: www.retraitequebec.gouv.qc.ca][path: /][expiry: null] match [www.retraitequebec.gouv.qc.ca:80/fr/Pages/accueil.aspx]
2017-06-01 09:57:55,692 [Crawler-20170601095706-1-1] DEBUG Cookie [version: 0][name: TS016ba385][value: 01298634be077a5e19d3acca129644a15fa316175b5b3761dbea3b9808ea559033b5fda8d44ba40f8318023cdaaf16b0c193626c06d4e294ff38fc17e0f20cf47346ab3394][domain: www.retraitequebec.gouv.qc.ca][path: /][expiry: null] match [www.retraitequebec.gouv.qc.ca:80/fr/Pages/accueil.aspx]
2017-06-01 09:57:55,692 [Crawler-20170601095706-1-1] DEBUG Connection request: [route: {}->http://www.retraitequebec.gouv.qc.ca:80][total kept alive: 1; route allocated: 1 of 20; total allocated: 1 of 200]

2017-06-01 09:57:55,693 [Crawler-20170601095706-1-1] DEBUG http-outgoing-0 << "[read] I/O error: Read timed out"

2017-06-01 09:57:55,693 [Crawler-20170601095706-1-1] DEBUG Connection leased: [id: 0][route: {}->http://www.retraitequebec.gouv.qc.ca:80][total kept alive: 0; route allocated: 1 of 20; total allocated: 1 of 200]
2017-06-01 09:57:55,693 [Crawler-20170601095706-1-1] DEBUG Executing request GET /fr/Pages/accueil.aspx HTTP/1.1
2017-06-01 09:57:55,693 [Crawler-20170601095706-1-1] DEBUG Target auth state: UNCHALLENGED
2017-06-01 09:57:55,693 [Crawler-20170601095706-1-1] DEBUG Proxy auth state: UNCHALLENGED
2017-06-01 09:57:55,693 [Crawler-20170601095706-1-1] DEBUG http-outgoing-0 >> GET /fr/Pages/accueil.aspx HTTP/1.1
2017-06-01 09:57:55,693 [Crawler-20170601095706-1-1] DEBUG http-outgoing-0 >> Host: www.retraitequebec.gouv.qc.ca:80
2017-06-01 09:57:55,694 [Crawler-20170601095706-1-1] DEBUG http-outgoing-0 >> Connection: Keep-Alive
2017-06-01 09:57:55,694 [Crawler-20170601095706-1-1] DEBUG http-outgoing-0 >> User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
2017-06-01 09:57:55,694 [Crawler-20170601095706-1-1] DEBUG http-outgoing-0 >> Cookie: Langue=Francais; RRQ_SupporteCookie=oui; TS016ba385=01298634be077a5e19d3acca129644a15fa316175b5b3761dbea3b9808ea559033b5fda8d44ba40f8318023cdaaf16b0c193626c06d4e294ff38fc17e0f20cf47346ab3394
2017-06-01 09:57:55,694 [Crawler-20170601095706-1-1] DEBUG http-outgoing-0 >> Accept-Encoding: gzip,deflate
2017-06-01 09:57:55,694 [Crawler-20170601095706-1-1] DEBUG http-outgoing-0 >> "GET /fr/Pages/accueil.aspx HTTP/1.1[\r][\n]"
2017-06-01 09:57:55,694 [Crawler-20170601095706-1-1] DEBUG http-outgoing-0 >> "Host: www.retraitequebec.gouv.qc.ca:80[\r][\n]"
2017-06-01 09:57:55,694 [Crawler-20170601095706-1-1] DEBUG http-outgoing-0 >> "Connection: Keep-Alive[\r][\n]"
2017-06-01 09:57:55,694 [Crawler-20170601095706-1-1] DEBUG http-outgoing-0 >> "User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36[\r][\n]"
2017-06-01 09:57:55,694 [Crawler-20170601095706-1-1] DEBUG http-outgoing-0 >> "Cookie: Langue=Francais; RRQ_SupporteCookie=oui; TS016ba385=01298634be077a5e19d3acca129644a15fa316175b5b3761dbea3b9808ea559033b5fda8d44ba40f8318023cdaaf16b0c193626c06d4e294ff38fc17e0f20cf47346ab3394[\r][\n]"
2017-06-01 09:57:55,694 [Crawler-20170601095706-1-1] DEBUG http-outgoing-0 >> "Accept-Encoding: gzip,deflate[\r][\n]"
2017-06-01 09:57:55,694 [Crawler-20170601095706-1-1] DEBUG http-outgoing-0 >> "[\r][\n]"
2017-06-01 09:57:59,427 [CoreLib-TimeoutManager] DEBUG Closing expired connections
2017-06-01 09:57:59,428 [CoreLib-TimeoutManager] DEBUG Closing connections idle longer than 60000 MILLISECONDS
2017-06-01 09:58:04,319 [IndexUpdater] DEBUG Processing documents in IndexUpdater queue.
2017-06-01 09:58:04,320 [IndexUpdater] DEBUG Getting documents in IndexUpdater queue.
2017-06-01 09:58:04,324 [IndexUpdater] INFO  Processing no docs (Doc:{access 4ms}, Mem:{used 148MB, heap 193MB, max 494MB})
2017-06-01 09:58:04,325 [IndexUpdater] DEBUG Processed documents in IndexUpdater queue.
2017-06-01 09:58:04,429 [CoreLib-TimeoutManager] DEBUG Closing expired connections

I wonder how the port (:80) gets added to the host. Is it done by Fess or instructed by the server? Could this cause the timeout error?

(from github.com/marevol)

parentUrl=http://www.retraitequebec.gouv.qc.ca/fr/Pages/accueil.aspx

The above URL returned the next URL with port 80 in it.
I don’t know why the server returned it…
It might be better to use Path Mapping as a workaround.
Path Mapping can replace www.retraitequebec.gouv.qc.ca:80 with www.retraitequebec.gouv.qc.ca at crawling time.
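
As a rough illustration of the replacement such a Path Mapping entry would perform, here is a Python sketch. The regular expression and replacement strings are assumptions written for this thread's host, and `apply_path_mapping` is a hypothetical stand-in for what Fess does internally:

```python
import re

# Assumed values for the Path Mapping "Regular Expression" and
# "Replacement" fields; adjust them to the actual site.
PATH_MAPPING_REGEX = r"^http://www\.retraitequebec\.gouv\.qc\.ca:80/"
PATH_MAPPING_REPLACEMENT = "http://www.retraitequebec.gouv.qc.ca/"


def apply_path_mapping(url):
    """Strip the explicit default port from URLs on this host."""
    return re.sub(PATH_MAPPING_REGEX, PATH_MAPPING_REPLACEMENT, url)


print(apply_path_mapping(
    "http://www.retraitequebec.gouv.qc.ca:80/fr/Pages/accueil.aspx"
))
```

URLs that do not start with the host and explicit :80 are left untouched, so the same rule is safe to apply to every queued URL.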

(from github.com/clauded)
I ran Wireshark on the host: the crawler sends GET /fr/Pages/accueil.aspx HTTP/1.1 to host www.retraitequebec.gouv.qc.ca and receives an ACK from the web server, then nothing happens; the requested page is never sent. I can retrieve the same page with wget on the server. I’m stuck, as this Fess server can crawl other sites without problems.
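
To narrow down which part of the request the server objects to, the crawler's request could be replayed outside Fess with and without the :80 suffix in the Host header. The Python sketch below uses the standard `http.client` module; `header_variants` and `probe` are hypothetical helper names, and the header values are copied from the logs in this thread:

```python
import http.client


def header_variants(host, user_agent):
    """Build two header sets that differ only in the Host value."""
    base = {
        "Connection": "Keep-Alive",
        "User-Agent": user_agent,
        "Accept-Encoding": "gzip,deflate",
    }
    return {
        "host-with-port": {"Host": f"{host}:80", **base},
        "host-without-port": {"Host": host, **base},
    }


def probe(host, path, headers, timeout=10.0):
    """Send one GET with exactly these headers and report the outcome."""
    conn = http.client.HTTPConnection(host, 80, timeout=timeout)
    try:
        # skip_host / skip_accept_encoding stop http.client from adding
        # its own Host and Accept-Encoding headers.
        conn.putrequest("GET", path, skip_host=True,
                        skip_accept_encoding=True)
        for name, value in headers.items():
            conn.putheader(name, value)
        conn.endheaders()
        return str(conn.getresponse().status)
    except (OSError, http.client.HTTPException) as exc:
        return f"failed: {exc!r}"
    finally:
        conn.close()


# Example use (performs real network requests):
# ua = "Mozilla/5.0 (compatible; Fess/11.2; +http://fess.codelibs.org/bot.html)"
# for label, headers in header_variants("www.retraitequebec.gouv.qc.ca", ua).items():
#     print(label, probe("www.retraitequebec.gouv.qc.ca", "/fr/Pages/accueil.aspx", headers))
```

If only the variant with the port times out, that would point at the Host header; if both behave the same, the User-Agent or cookies would be the next things to vary.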

(from github.com/marevol)
IIS might log messages about this problem.

I don’t think there is a workaround in the current releases… I’ll address this issue in the next release.