(from github.com/davidhoelzel)
Hello,
we are using a corporate proxy, configured in the web crawler configuration with proxyHost and proxyPort. This worked fine for months. After our admins made some changes to the proxy, it now returns a 503 when we try to fetch certain domains (the ones configured for the web crawler).
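For reference, this kind of proxy setting usually goes into the crawl config's Config Parameters field (key names as documented for Fess's crawler client; the host and port below are the ones from our curl command, adjust as needed):

```
client.proxyHost=our.proxy.com
client.proxyPort=3128
```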
When fetching one of the affected URLs with curl directly, curl reports the error from the proxy:
"curl: (56) Received HTTP code 503 from proxy after CONNECT"
The curl command used:
curl --proxy http://our.proxy.com:3128 https://www.example.com/de/
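curl's exit code already distinguishes the proxy-side failures here. The helper below is purely illustrative (explain_curl_exit is not part of curl; the code meanings are taken from the curl man page, and code 56 is exactly what curl returned in the failing CONNECT case above):

```shell
#!/bin/sh
# Map the curl exit codes most relevant to proxy debugging to a short
# explanation (codes from the curl man page).
explain_curl_exit() {
  case "$1" in
    0)  echo "success" ;;
    5)  echo "could not resolve proxy host" ;;
    7)  echo "failed to connect to proxy or host" ;;
    56) echo "failure receiving network data (e.g. proxy rejected CONNECT)" ;;
    *)  echo "curl failed with exit code $1" ;;
  esac
}

# Usage after a crawl-style fetch, e.g.:
#   curl --proxy http://our.proxy.com:3128 https://www.example.com/de/
#   explain_curl_exit "$?"
explain_curl_exit 56
```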
However, even after raising the log level to debug (or even to all), no crawling problems appear in the Fess crawler log:
2019-10-14 13:35:54,874 [WebFsCrawler] INFO Target URL: https://www.example.com/de/
2019-10-14 13:35:54,888 [WebFsCrawler] INFO Target URL: https://www.example.com/en/
2019-10-14 13:35:54,904 [WebFsCrawler] INFO Target URL: https://www.example2.com/
2019-10-14 13:35:54,905 [WebFsCrawler] INFO Included URL: https://www.example.com/de/.*
2019-10-14 13:35:54,905 [WebFsCrawler] INFO Included URL: https://www.example.com/en/.*
2019-10-14 13:35:54,905 [WebFsCrawler] INFO Included URL: https://www.example2.com/.*
2019-10-14 13:35:55,023 [Crawler-20191014133549-1-3] INFO Crawling URL: https://www.example.com/en/
2019-10-14 13:35:55,023 [Crawler-20191014133549-1-1] INFO Crawling URL: https://www.example2.com/
2019-10-14 13:35:55,024 [Crawler-20191014133549-1-2] INFO Crawling URL: https://www.example.com/de/
2019-10-14 13:35:55,088 [Crawler-20191014133549-1-2] INFO Checking URL: https://www.example.com/robots.txt
2019-10-14 13:35:55,088 [Crawler-20191014133549-1-1] INFO Checking URL: https://www.example2.com/robots.txt
2019-10-14 13:36:04,938 [IndexUpdater] INFO Processing no docs in indexing queue (Doc:{access 5ms}, Mem:{used 142MB, heap 512MB, max 512MB})
2019-10-14 13:36:14,921 [IndexUpdater] INFO Processing no docs in indexing queue (Doc:{access 4ms}, Mem:{used 144MB, heap 512MB, max 512MB})
2019-10-14 13:36:24,921 [IndexUpdater] INFO Processing no docs in indexing queue (Doc:{access 4ms}, Mem:{used 146MB, heap 512MB, max 512MB})
2019-10-14 13:36:26,251 [WebFsCrawler] INFO [EXEC TIME] crawling time: 31472ms
2019-10-14 13:36:34,922 [IndexUpdater] INFO Processing no docs in indexing queue (Doc:{access 5ms}, Mem:{used 146MB, heap 512MB, max 512MB})
2019-10-14 13:36:34,922 [IndexUpdater] INFO [EXEC TIME] index update time: 34ms
2019-10-14 13:36:34,968 [main] INFO Finished Crawler
It would be nice if the log contained such errors, at least at debug level.
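Fess's crawler uses its own HTTP client internally, but the underlying failure mode can be reproduced stand-alone. The sketch below (plain JDK HttpClient plus a hypothetical stub proxy on loopback that answers every CONNECT with 503, not Fess code) shows that the proxy's 503 surfaces as an IOException on the client side, i.e. there is a concrete exception available that the crawler could log:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

public class ProxyTunnel503Demo {

    /** Tries an HTTPS fetch through a stub proxy that answers every CONNECT
     *  with 503, and returns the error message a crawler could log. */
    static String fetchViaBrokenProxy() throws Exception {
        ServerSocket stubProxy = new ServerSocket(0, 5, InetAddress.getLoopbackAddress());
        Thread responder = new Thread(() -> {
            while (true) {
                try (Socket s = stubProxy.accept()) {
                    BufferedReader in = new BufferedReader(
                            new InputStreamReader(s.getInputStream(), StandardCharsets.US_ASCII));
                    String line;
                    while ((line = in.readLine()) != null && !line.isEmpty()) {
                        // drain the CONNECT request headers
                    }
                    s.getOutputStream().write(
                            "HTTP/1.1 503 Service Unavailable\r\nContent-Length: 0\r\n\r\n"
                                    .getBytes(StandardCharsets.US_ASCII));
                } catch (IOException e) {
                    return; // server socket closed, stop responding
                }
            }
        });
        responder.setDaemon(true);
        responder.start();

        HttpClient client = HttpClient.newBuilder()
                .proxy(ProxySelector.of((InetSocketAddress) stubProxy.getLocalSocketAddress()))
                .connectTimeout(Duration.ofSeconds(5))
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://www.example.com/de/"))
                .timeout(Duration.ofSeconds(5))
                .build();
        try {
            client.send(request, HttpResponse.BodyHandlers.discarding());
            return "unexpected: request succeeded";
        } catch (IOException e) {
            // The 503 never reaches the application as an HTTP response; it
            // surfaces here -- this is the error that should reach the log.
            return "proxy CONNECT failed: " + e.getMessage();
        } finally {
            stubProxy.close();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fetchViaBrokenProxy());
    }
}
```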
Fess version: 13.4, running in a Docker container (image codelibs/fess:13.4)