(from github.com/qmaxquique)
Hello. First and foremost, thank you for developing such an amazing tool!
I'm using the Fess Docker image, version 11.0.1, and I was able to reproduce the issue with codelibs/fess:latest (11.2) as well.
I can crawl and index several sites without any issues, but for this particular site Fess only fetches the base path and the robots.txt file, and then the job ends.
This is the crawler configuration:
ID AVzIY5P0GSBWSHlT4_Uo
Name www.durect.com
URLs http://www.durect.com/
Included URLs For Crawling http://www.durect.com/.*
Excluded URLs For Crawling
Included URLs For Indexing
Excluded URLs For Indexing
Config Parameters
Depth
Max Access Count
User Agent Mozilla/5.0 (compatible; Fess/11.0; +http://fess.codelibs.org/bot.html)
The number of Thread 3
Interval time 1500 ms
Boost 1.0
Permissions {role}www.durect.com
Label
Status Enabled
Description
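To make explicit what the "Included URLs For Crawling" pattern above lets through, here is a minimal Python sketch. It assumes the crawler requires the whole URL to match the pattern (an assumption about Fess's behavior, not something confirmed here), and every candidate URL other than the start URL is a hypothetical example. If the site links or redirects to an https or non-www address, those URLs would not match this pattern.

```python
import re

# Pattern copied verbatim from "Included URLs For Crawling" above.
INCLUDE_PATTERN = "http://www.durect.com/.*"


def is_included(url, pattern=INCLUDE_PATTERN):
    """Return True if the whole URL matches the include pattern.

    Assumption: full-match semantics, which may differ from what Fess does.
    """
    return re.fullmatch(pattern, url) is not None


candidates = [
    "http://www.durect.com/",            # start URL from the config
    "http://www.durect.com/robots.txt",  # robots.txt on the same host
    "https://www.durect.com/",           # hypothetical: https variant
    "http://durect.com/about",           # hypothetical: host without "www."
]

results = {url: is_included(url) for url in candidates}
for url, ok in results.items():
    print(("included" if ok else "skipped ") + "  " + url)
```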
This is the job:
Name Web Crawler - www.durect.com
Target all
Schedule 10 5 * * 1,3,5
Executor groovy
Script return container.getComponent("crawlJob").logLevel("info").sessionId("AVzIY5P0GSBWSHlT4_Uo").webConfigIds(["AVzIY5P0GSBWSHlT4_Uo"] as String[]).fileConfigIds([] as String[]).dataConfigIds([] as String[]).jobExecutor(executor).execute();
Logging Enabled
Crawler Job Enabled
Status Enabled
Display Order 10
And this is what the logs say (pasting only from the first warning onward):
2017-06-21 02:04:34,675 [main] WARN Failed to find a usable hardware address from the network interfaces; using random bytes: 25:95:84:bc:bf:40:a8:c7
2017-06-21 02:04:38,896 [main] INFO Lasta Di boot successfully.
2017-06-21 02:04:38,898 [main] INFO SmartDeploy Mode: Warm Deploy
2017-06-21 02:04:38,899 [main] INFO Smart Package: org.codelibs.fess.app
2017-06-21 02:04:38,945 [main] INFO Starting Crawler..
2017-06-21 02:04:38,998 [WebFsCrawler] INFO no modules loaded
2017-06-21 02:04:38,998 [WebFsCrawler] INFO loaded plugin [org.elasticsearch.index.reindex.ReindexPlugin]
2017-06-21 02:04:38,998 [WebFsCrawler] INFO loaded plugin [org.elasticsearch.percolator.PercolatorPlugin]
2017-06-21 02:04:38,998 [WebFsCrawler] INFO loaded plugin [org.elasticsearch.script.mustache.MustachePlugin]
2017-06-21 02:04:38,998 [WebFsCrawler] INFO loaded plugin [org.elasticsearch.transport.Netty3Plugin]
2017-06-21 02:04:38,999 [WebFsCrawler] INFO loaded plugin [org.elasticsearch.transport.Netty4Plugin]
2017-06-21 02:04:39,078 [WebFsCrawler] INFO Connected to localhost:9301
2017-06-21 02:04:39,163 [WebFsCrawler] INFO Target URL: http://www.durect.com/
2017-06-21 02:04:39,163 [WebFsCrawler] INFO Included URL: http://www.durect.com/.*
2017-06-21 02:04:39,273 [Crawler-AVzIY5P0GSBWSHlT4_Uo-1-1] INFO Crawling URL: http://www.durect.com/
2017-06-21 02:04:39,353 [Crawler-AVzIY5P0GSBWSHlT4_Uo-1-1] INFO Checking URL: http://www.durect.com/robots.txt
2017-06-21 02:04:49,191 [IndexUpdater] INFO Processing no docs (Doc:{access 3ms}, Mem:{used 160MB, heap 239MB, max 494MB})
2017-06-21 02:04:59,184 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms}, Mem:{used 160MB, heap 239MB, max 494MB})
2017-06-21 02:05:09,185 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms}, Mem:{used 160MB, heap 239MB, max 494MB})
2017-06-21 02:05:19,186 [IndexUpdater] INFO Processing no docs (Doc:{access 3ms}, Mem:{used 160MB, heap 239MB, max 494MB})
2017-06-21 02:05:22,267 [WebFsCrawler] INFO [EXEC TIME] crawling time: 43289ms
2017-06-21 02:05:29,186 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms}, Mem:{used 160MB, heap 239MB, max 494MB})
2017-06-21 02:05:29,186 [IndexUpdater] INFO [EXEC TIME] index update time: 19ms
2017-06-21 02:05:29,205 [main] INFO Finished Crawler
2017-06-21 02:05:29,233 [main] INFO [CRAWL INFO] CrawlerEndTime=2017-06-21T02:05:29.205+0000,WebFsCrawlExecTime=43289,CrawlerStatus=true,CrawlerStartTime=2017-06-21T02:04:38.945+0000,WebFsCrawlEndTime=2017-06-21T02:05:29.204+0000,WebFsIndexExecTime=19,WebFsIndexSize=0,CrawlerExecTime=50260,WebFsCrawlStartTime=2017-06-21T02:04:38.963+0000
2017-06-21 02:05:34,255 [main] INFO Disconnected to elasticsearch:localhost:9301
2017-06-21 02:05:35,790 [main] INFO Destroyed LaContainer.
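Since the last URL touched in the log is robots.txt, one way to rule it out as the cause would be to test its rules against the configured User Agent. The sketch below uses Python's standard `urllib.robotparser`. The robots.txt contents in it are made-up placeholders, not the file actually served by www.durect.com, and whether Fess interprets the rules the same way as this parser is an assumption.

```python
from urllib.robotparser import RobotFileParser

# User Agent copied from the crawler configuration above.
USER_AGENT = "Mozilla/5.0 (compatible; Fess/11.0; +http://fess.codelibs.org/bot.html)"

# Hypothetical robots.txt bodies, used only to illustrate the check.
SAMPLE_PARTIAL_BLOCK = "User-agent: *\nDisallow: /private/\n"
SAMPLE_FULL_BLOCK = "User-agent: *\nDisallow: /\n"


def allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


start_url = "http://www.durect.com/"
partial_ok = allowed(SAMPLE_PARTIAL_BLOCK, USER_AGENT, start_url)
full_ok = allowed(SAMPLE_FULL_BLOCK, USER_AGENT, start_url)
print("partial block allows start URL:", partial_ok)
print("full block allows start URL:", full_ok)
```

To check the real site, the placeholder text would be replaced with the actual contents of http://www.durect.com/robots.txt.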
Can you please help me to figure out what might be happening?
Thanks in advance,
Enrique