Processing no docs while crawling

(from github.com/GeetaLakhwani-1)

Please see the configuration below:
Name - newsroom_article
URLs - https://www.example.org/allexamples/
Included URLs For Crawling - https://www.example.org/.*
Excluded URLs For Crawling - (?i).*(css|js|jpeg|jpg|gif|png|bmp|wmv|exe|mp4|pdf|doc|docx|ppt|pptx|xls|xlsx)$
Excluded URLs For Indexing - https://www.example.org/allexamples/
This configuration is not working.
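
One way to sanity-check patterns like these before running a crawl is to test a few sample URLs against them offline. The sketch below assumes the crawler requires the whole URL to match a pattern (as Java's `Matcher.matches()` does); that assumption, the `is_crawl_target` helper, and the sample URLs are illustrative and not taken from Fess itself. The exclude pattern is written with `.*`, as it appears in the crawler log.

```python
import re

# Patterns copied from the configuration; the matching rules are an assumption.
INCLUDE = [r"https://www.example.org/.*"]
EXCLUDE = [r"(?i).*(css|js|jpeg|jpg|gif|png|bmp|wmv|exe|mp4|pdf|doc|docx|ppt|pptx|xls|xlsx)$"]


def is_crawl_target(url, include=INCLUDE, exclude=EXCLUDE):
    """Return True if url fully matches an include pattern and no exclude pattern."""
    if include and not any(re.fullmatch(p, url) for p in include):
        return False
    return not any(re.fullmatch(p, url) for p in exclude)


# Hypothetical sample URLs for a quick offline check.
for url in [
    "https://www.example.org/allexamples/",
    "https://www.example.org/style.css",
    "https://example.org/allexamples/",
    "https://www.example.org/nodejs",
]:
    print(url, is_crawl_target(url))
```

Because the extension group is not preceded by an escaped dot, the exclude pattern also rejects any URL that merely ends in one of those letter sequences, such as the last sample URL.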
The fess-crawler.log shows the following:
Connected to localhost:9300
2019-10-16 12:21:09,076 [WebFsCrawler] INFO org.codelibs.fess.helper.WebFsIndexHelper - Target URL: https://www.example.org/allexample/
2019-10-16 12:21:09,077 [WebFsCrawler] INFO org.codelibs.fess.helper.WebFsIndexHelper - Included URL: https://www.example.org/.*
2019-10-16 12:21:09,078 [WebFsCrawler] INFO org.codelibs.fess.helper.WebFsIndexHelper - Excluded URL: (?i).*(css|js|jpeg|jpg|gif|png|bmp|wmv|exe|mp4|pdf|doc|docx|ppt|pptx|xls|xlsx)$
2019-10-16 12:21:19,126 [IndexUpdater] INFO org.codelibs.fess.indexer.IndexUpdater - Processing no docs (Doc:{access 5ms}, Mem:{used 161MB, heap 245MB, max 494MB})
2019-10-16 12:21:29,113 [IndexUpdater] INFO org.codelibs.fess.indexer.IndexUpdater - Processing no docs (Doc:{access 3ms}, Mem:{used 161MB, heap 245MB, max 494MB})
2019-10-16 12:21:39,116 [IndexUpdater] INFO org.codelibs.fess.indexer.IndexUpdater - Processing no docs (Doc:{access 5ms}, Mem:{used 162MB, heap 245MB, max 494MB})
2019-10-16 12:21:49,112 [IndexUpdater] INFO org.codelibs.fess.indexer.IndexUpdater - Processing no docs (Doc:{access 3ms}, Mem:{used 162MB, heap 245MB, max 494MB})
2019-10-16 12:21:59,112 [IndexUpdater] INFO org.codelibs.fess.indexer.IndexUpdater - Processing no docs (Doc:{access 4ms}, Mem:{used 162MB, heap 245MB, max 494MB})
2019-10-16 12:22:09,111 [IndexUpdater] INFO org.codelibs.fess.indexer.IndexUpdater - Processing no docs (Doc:{access 3ms}, Mem:{used 163MB, heap 245MB, max 494MB})
2019-10-16 12:22:10,187 [CoreLib-TimeoutManager] INFO org.codelibs.fess.timer.SystemMonitorTarget - [SYSTEM MONITOR] {"os":{"memory":{"physical":{"free":7409065984,"total":17034055680},"swap_space":{"free":6570721280,"total":19584192512}},"cpu":{"percent":100},"load_averages":null},"process":{"file_descriptor":{"open":-1,"max":-1},"cpu":{"percent":15,"total":39734},"virtual_memory":{"total":556933120}},"jvm":{"memory":{"heap":{"used":170686888,"committed":257490944,"max":518979584,"percent":32},"non_heap":{"used":83951536,"committed":87646208}},"pools":{"direct":{"count":40,"used":85991441,"capacity":85991440},"mapped":{"count":0,"used":0,"capacity":0}},"gc":{"young":{"count":10,"time":136},"old":{"count":2,"time":51}},"threads":{"count":61,"peak":62},"classes":{"loaded":10492,"total_loaded":10495,"unloaded":3},"uptime":92371},"elasticsearch":null,"timestamp":1571208730187}
2019-10-16 12:22:19,112 [IndexUpdater] INFO org.codelibs.fess.indexer.IndexUpdater - Processing no docs (Doc:{access 3ms}, Mem:{used 167MB, heap 245MB, max 494MB})
2019-10-16 12:22:29,110 [IndexUpdater] INFO org.codelibs.fess.indexer.IndexUpdater - Processing no docs (Doc:{access 2ms}, Mem:{used 167MB, heap 245MB, max 494MB})
2019-10-16 12:22:39,112 [IndexUpdater] INFO org.codelibs.fess.indexer.IndexUpdater - Processing no docs (Doc:{access 3ms}, Mem:{used 167MB, heap 245MB, max 494MB})
2019-10-16 12:22:49,112 [IndexUpdater] INFO org.codelibs.fess.indexer.IndexUpdater - Processing no docs (Doc:{access 2ms}, Mem:{used 167MB, heap 245MB, max 494MB})
2019-10-16 12:22:59,111 [IndexUpdater] INFO org.codelibs.fess.indexer.IndexUpdater - Processing no docs (Doc:{access 0ms}, Mem:{used 168MB, heap 245MB, max 494MB})
2019-10-16 12:23:09,113 [IndexUpdater] INFO org.codelibs.fess.indexer.IndexUpdater - Processing no docs (Doc:{access 2ms}, Mem:{used 168MB, heap 245MB, max 494MB})
2019-10-16 12:23:11,323 [CoreLib-TimeoutManager] INFO org.codelibs.fess.timer.SystemMonitorTarget - [SYSTEM MONITOR] {"os":{"memory":{"physical":{"free":7526273024,"total":17034055680},"swap_space":{"free":6591778816,"total":19584192512}},"cpu":{"percent":29},"load_averages":null},"process":{"file_descriptor":{"open":-1,"max":-1},"cpu":{"percent":0,"total":41312},"virtual_memory":{"total":560922624}},"jvm":{"memory":{"heap":{"used":176175192,"committed":257490944,"max":518979584,"percent":33},"non_heap":{"used":85597432,"committed":89546752}},"pools":{"direct":{"count":40,"used":85991441,"capacity":85991440},"mapped":{"count":0,"used":0,"capacity":0}},"gc":{"young":{"count":10,"time":136},"old":{"count":2,"time":51}},"threads":{"count":61,"peak":62},"classes":{"loaded":10632,"total_loaded":10635,"unloaded":3},"uptime":153592},"elasticsearch":null,"timestamp":1571208791323}
2019-10-16 12:23:19,115 [IndexUpdater] INFO org.codelibs.fess.indexer.IndexUpdater - Processing no docs (Doc:{access 3ms}, Mem:{used 168MB, heap 245MB, max 494MB})
2019-10-16 12:23:29,117 [IndexUpdater] INFO org.codelibs.fess.indexer.IndexUpdater - Processing no docs (Doc:{access 4ms}, Mem:{used 168MB, heap 245MB, max 494MB})
2019-10-16 12:23:39,114 [IndexUpdater] INFO org.codelibs.fess.indexer.IndexUpdater - Processing no docs (Doc:{access 3ms}, Mem:{used 168MB, heap 245MB, max 494MB})
2019-10-16 12:23:49,114 [IndexUpdater] INFO org.codelibs.fess.indexer.IndexUpdater - Processing no docs (Doc:{access 3ms}, Mem:{used 169MB, heap 245MB, max 494MB})
2019-10-16 12:23:59,115 [IndexUpdater] INFO org.codelibs.fess.indexer.IndexUpdater - Processing no docs (Doc:{access 3ms}, Mem:{used 169MB, heap 245MB, max 494MB})
2019-10-16 12:24:09,115 [IndexUpdater] INFO org.codelibs.fess.indexer.IndexUpdater - Processing no docs (Doc:{access 3ms}, Mem:{used 169MB, heap 245MB, max 494MB})
2019-10-16 12:24:12,472 [CoreLib-TimeoutManager] INFO org.codelibs.fess.timer.SystemMonitorTarget - [SYSTEM MONITOR] {"os":{"memory":{"physical":{"free":7372025856,"total":17034055680},"swap_space":{"free":6533300224,"total":19584192512}},"cpu":{"percent":29},"load_averages":null},"process":{"file_descriptor":{"open":-1,"max":-1},"cpu":{"percent":0,"total":41765},"virtual_memory":{"total":562368512}},"jvm":{"memory":{"heap":{"used":177432408,"committed":257490944,"max":518979584,"percent":34},"non_heap":{"used":85916928,"committed":89808896}},"pools":{"direct":{"count":40,"used":85991441,"capacity":85991440},"mapped":{"count":0,"used":0,"capacity":0}},"gc":{"young":{"count":10,"time":136},"old":{"count":2,"time":51}},"threads":{"count":61,"peak":62},"classes":{"loaded":10632,"total_loaded":10635,"unloaded":3},"uptime":214725},"elasticsearch":null,"timestamp":1571208852472}
2019-10-16 12:24:19,115 [IndexUpdater] INFO org.codelibs.fess.indexer.IndexUpdater - Processing no docs (Doc:{access 2ms}, Mem:{used 169MB, heap 245MB, max 494MB})
2019-10-16 12:24:29,116 [IndexUpdater] INFO org.codelibs.fess.indexer.IndexUpdater - Processing no docs (Doc:{access 3ms}, Mem:{used 170MB, heap 245MB, max 494MB})
2019-10-16 12:24:39,116 [IndexUpdater] INFO org.codelibs.fess.indexer.IndexUpdater - Processing no docs (Doc:{access 3ms}, Mem:{used 170MB, heap 245MB, max 494MB})
2019-10-16 12:24:39,708 [WebFsCrawler] INFO org.codelibs.fess.helper.WebFsIndexHelper - [EXEC TIME] crawling time: 210961ms
2019-10-16 12:24:49,116 [IndexUpdater] INFO org.codelibs.fess.indexer.IndexUpdater - Processing no docs (Doc:{access 2ms}, Mem:{used 171MB, heap 245MB, max 494MB})
2019-10-16 12:24:49,116 [IndexUpdater] INFO org.codelibs.fess.indexer.IndexUpdater - [EXEC TIME] index update time: 85ms
2019-10-16 12:24:49,307 [main] INFO org.codelibs.fess.exec.Crawler - Finished Crawler
2019-10-16 12:24:49,416 [main] INFO org.codelibs.fess.exec.Crawler - [CRAWL INFO] DataCrawlEndTime=2019-10-16T12:21:08.723+0530,CrawlerEndTime=2019-10-16T12:24:49.307+0530,WebFsCrawlExecTime=210961,CrawlerStatus=true,CrawlerStartTime=2019-10-16T12:21:08.673+0530,WebFsCrawlEndTime=2019-10-16T12:24:49.307+0530,WebFsIndexExecTime=85,WebFsIndexSize=0,CrawlerExecTime=220634,DataCrawlStartTime=2019-10-16T12:21:08.704+0530,WebFsCrawlStartTime=2019-10-16T12:21:08.703+0530
2019-10-16 12:24:54,491 [main] INFO org.codelibs.fess.crawler.client.EsClient - Disconnected to elasticsearch:localhost:9300
2019-10-16 12:25:09,430 [main] INFO org.codelibs.fess.exec.Crawler - Destroyed LaContainer.
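
The `[CRAWL INFO]` line near the end of the run packs the crawl summary into comma-separated `key=value` pairs. A small sketch for pulling them into a dictionary, assuming that format holds (the `parse_crawl_info` helper is hypothetical, and the sample line is shortened from the log above):

```python
def parse_crawl_info(line):
    """Split the key=value pairs that follow the [CRAWL INFO] marker into a dict."""
    # Everything before the marker (timestamp, thread, logger) is discarded.
    _, _, payload = line.partition("[CRAWL INFO] ")
    return dict(pair.split("=", 1) for pair in payload.split(",") if "=" in pair)


# Shortened copy of the summary line from the log.
sample = (
    "2019-10-16 12:24:49,416 [main] INFO org.codelibs.fess.exec.Crawler - "
    "[CRAWL INFO] CrawlerStatus=true,WebFsCrawlExecTime=210961,"
    "WebFsIndexExecTime=85,WebFsIndexSize=0"
)
info = parse_crawl_info(sample)
print(info["WebFsIndexSize"])
```

Here `WebFsIndexSize=0` is consistent with the repeated "Processing no docs" messages.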

Could you please advise how to resolve this issue?

(from github.com/marevol)
It’s better to check debug level logs.

(from github.com/ghost)
I have a similar issue where the crawler does not process any docs…

The fess-crawler.log file reads:

2019-10-22 05:19:24,456 [WebFsCrawler] INFO Target URL: https://www.example.com/
2019-10-22 05:19:24,460 [WebFsCrawler] INFO Included URL: https://www.example.com/*
2019-10-22 05:19:25,693 [Crawler-20191022051918-1-1] INFO Crawling URL: https://www.example.com/
2019-10-22 05:19:25,779 [Crawler-20191022051918-1-1] INFO Checking URL: https://www.example.com/robots.txt
2019-10-22 05:19:34,509 [IndexUpdater] INFO Processing 1/1 docs (Doc:{access 10ms}, Mem:{used 151MB, heap 512MB, max 512MB})
2019-10-22 05:19:34,588 [IndexUpdater] INFO Processing no docs in indexing queue (Doc:{access 2ms, cleanup 21ms}, Mem:{used 153MB, heap 512MB, max 512MB})

The robots.txt file mentioned there does not even exist on the server…
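
Crawlers that honor the robots exclusion convention typically request `/robots.txt` before fetching pages, whether or not the file exists, so the `Checking URL` line by itself does not indicate a problem. As an illustration only (this uses Python's standard `urllib.robotparser`, not Fess's own implementation), an empty rule set permits every URL:

```python
from urllib.robotparser import RobotFileParser

# Parse an empty rule set, standing in for a robots.txt with no rules.
parser = RobotFileParser()
parser.parse([])

# With no Disallow rules present, any path may be fetched.
allowed = parser.can_fetch("*", "https://www.example.com/index.php")
print(allowed)
```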

Now, where are those debug level logs located? I’ll post them so that the issue can be fixed.

(from github.com/marevol)
See https://fess.codelibs.org/13.4/config/crawler.html#log-level-setting

(from github.com/ghost)
OK, the debug log produced is HUGE, too much to post here…

I can tell you, however, that the site I’m trying to crawl uses pages ending with .php. I could see in the debug log that Fess is adding a lot of these URLs as a “child”. But at the end of the crawling process only 5 pages are crawled. In the Web crawling configuration, I set Depth = 30 and Max Access Count to 999…

The site itself is entered as follows:

URLs https://example.com/
Included URLs For Crawling https://example.com/.*
Excluded URLs For Crawling
Included URLs For Indexing https://example.com/.*
Excluded URLs For Indexing

What could be wrong?

(from github.com/ghost)
I think I figured out part of the problem:
the URL of the website that I want indexed and crawled reads https://example.com, and I entered https://www.example.com

Now the thing is: that website uses www.example.com and example.com (without the www prefix) interchangeably, so I need to tell the crawler to start at https://example.com but ALSO crawl and index items beginning with http://www.example.com/

How would I configure the crawler here?

(from github.com/marevol)
See Duplicate Host.
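
The general idea behind treating two hostnames as one site can be sketched as a rewrite applied before URL filtering. This is only an illustration of the concept, assuming `example.com` is the canonical name and `www.example.com` the duplicate; the `normalize_host` helper and the mapping are hypothetical and do not reflect how Fess implements the Duplicate Host setting.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical mapping: duplicate hostname -> canonical hostname.
DUPLICATE_HOSTS = {"www.example.com": "example.com"}
INCLUDE = r"https://example.com/.*"


def normalize_host(url):
    """Rewrite a duplicate hostname to its canonical form, leaving the rest of the URL intact."""
    parts = urlsplit(url)
    if parts.hostname is None:
        return url  # relative or malformed URL: nothing to rewrite
    host = DUPLICATE_HOSTS.get(parts.hostname, parts.hostname)
    # Keep an explicit port if present; user info is dropped in this sketch.
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


raw = "https://www.example.com/page.php?id=1"
print(bool(re.fullmatch(INCLUDE, raw)))                  # include filter on the raw URL
print(bool(re.fullmatch(INCLUDE, normalize_host(raw))))  # include filter after the rewrite
```

Without the rewrite, child links on the `www` host fail an include pattern written for the bare host, which would match the symptom of many discovered links but only a handful of crawled pages.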

(from github.com/GeetaLakhwani-1)
After changing the log setting, I got the exception below:
[CoreLib-TimeoutManager] DEBUG org.codelibs.fess.timer.SystemMonitorTarget - Failed to access Elasticsearch stats.
com.fasterxml.jackson.core.JsonGenerationException: Can not write a field name, expecting a value
at com.fasterxml.jackson.core.JsonGenerator._reportError(JsonGenerator.java:1897) ~[jackson-core-2.8.11.jar:2.8.11]
at com.fasterxml.jackson.core.json.UTF8JsonGenerator.writeFieldName(UTF8JsonGenerator.java:185) ~[jackson-core-2.8.11.jar:2.8.11]
at org.elasticsearch.common.xcontent.json.JsonXContentGenerator.writeFieldName(JsonXContentGenerator.java:181) ~[elasticsearch-x-content-6.3.2.jar:6.3.2]
at org.elasticsearch.common.xcontent.XContentBuilder.field(XContentBuilder.java:281) ~[elasticsearch-x-content-6.3.2.jar:6.3.2]
at org.elasticsearch.common.xcontent.XContentBuilder.startObject(XContentBuilder.java:257) ~[elasticsearch-x-content-6.3.2.jar:6.3.2]
at org.elasticsearch.action.admin.cluster.node.stats.NodesStatsResponse.toXContent(NodesStatsResponse.java:56) ~[elasticsearch-6.3.2.jar:6.3.2]
at org.codelibs.fess.timer.SystemMonitorTarget.appendElasticsearchStats(SystemMonitorTarget.java:190) [classes/:?]
at org.codelibs.fess.timer.SystemMonitorTarget.expired(SystemMonitorTarget.java:82) [classes/:?]
at org.codelibs.core.timer.TimeoutTask.expired(TimeoutTask.java:113) [corelib-0.4.0.jar:?]
at org.codelibs.core.timer.TimeoutManager.run(TimeoutManager.java:141) [corelib-0.4.0.jar:?]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]

There is actually a huge amount of log output, but the exception above is the main one I found.
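
When the debug log is too large to read or post in full, one option is to cut it down to the entries most likely to matter. The sketch below assumes the line layout seen in the excerpts above (timestamp, `[thread]`, level, message) and keeps WARN/ERROR entries plus any entry containing a given keyword, together with continuation lines such as stack traces. The `filter_log` helper and the sample lines are hypothetical.

```python
import re

# Start of a log entry: "YYYY-MM-DD HH:MM:SS,mmm [thread] LEVEL ..."
ENTRY_START = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} \[[^\]]+\] (\w+) "
)


def filter_log(lines, keyword=None, levels=("WARN", "ERROR")):
    """Keep entries at the given levels or containing keyword, with their continuation lines."""
    kept, keep_current = [], False
    for line in lines:
        match = ENTRY_START.match(line)
        if match:
            # A new entry begins: decide whether to keep it and what follows.
            keep_current = match.group(1) in levels or (
                keyword is not None and keyword in line
            )
        if keep_current:
            kept.append(line)
    return kept


# Invented sample lines in the same layout as the real log.
sample_lines = [
    "2019-10-16 12:21:19,126 [IndexUpdater] INFO Processing no docs",
    "2019-10-16 12:21:20,000 [Crawler-1] WARN Failed to access a URL",
    "java.lang.RuntimeException: sample",
    "\tat Sample.method(Sample.java:1)",
    "2019-10-16 12:21:21,000 [Crawler-1] DEBUG Crawling URL: https://www.example.org/a",
]
for line in filter_log(sample_lines, keyword="Crawling URL"):
    print(line)
```

For a real file, the same function can be fed the result of `open(path).read().splitlines()`.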

(from github.com/marevol)
Partial information is not helpful…