Indexed documents vs crawled documents

discuss · September 8, 2016, 12:02pm

Any idea why I have only a small number of indexed documents compared to crawled documents?
Fess shows 200k+ documents, but in reality only 12k+ are in my index and searchable. I cannot search across all 200k documents.

The crawling job has finished and nothing else is running. How can I check what happened to the other documents? I have already looked at the logs. At this point this is the biggest issue I'm facing: how can I get the indexed count closer to the number of crawled documents?

discuss · September 8, 2016, 12:55pm

(from github.com/marevol)
To check the other dot-prefixed indices, enable the special checkbox.

discuss · September 8, 2016, 12:57pm

(from github.com/attibalazs)

The issue is that when I search, I only search across 12k documents. That's how many are in the fess index; I have only one fess.* index. The crawler index has 200k documents, which is also what the dashboard shows.

Why don't I have more documents in the fess index? Why only 12k?
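
One way to confirm the two numbers independently of the dashboard is to ask Elasticsearch directly through its `_count` API. The following is only a minimal sketch: the endpoint `http://localhost:9201` is an assumption about where the Elasticsearch instance used by Fess listens, the helper names are hypothetical, and the patterns `fess.*` and `.crawler*` are guesses based on the index names mentioned in this thread.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: HTTP endpoint of the Elasticsearch instance used by Fess.
# Adjust host and port to match your installation.
ES_BASE = "http://localhost:9201"


def count_url(base, index_pattern):
    """Build the _count URL for an index name or wildcard pattern."""
    # Keep '*' and ',' unescaped so wildcard and multi-index patterns work.
    return "{}/{}/_count".format(base.rstrip("/"), quote(index_pattern, safe="*,"))


def parse_count(body):
    """Extract the document count from a _count response body."""
    return int(json.loads(body)["count"])


def fetch_count(base, index_pattern):
    """Query Elasticsearch and return the number of matching documents."""
    with urlopen(count_url(base, index_pattern)) as resp:
        return parse_count(resp.read().decode("utf-8"))
```

Calling `fetch_count(ES_BASE, "fess.*")` and `fetch_count(ES_BASE, ".crawler*")` against a running instance would then give the searchable count and the crawler-side count to compare.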

discuss · September 8, 2016, 1:41pm

(from github.com/marevol)
Could you try deleting the .crawler index before starting the crawler?

discuss · September 8, 2016, 1:45pm

(from github.com/attibalazs)
Will do. Is there a way to see what the crawler is doing? How do you debug the crawler?

discuss · September 8, 2016, 1:55pm

(from github.com/marevol)
To do a remote debug, go to Admin > System > Scheduler > Default Crawler and change the script to:

return container.getComponent("crawlJob").logLevel("info").remoteDebug().execute(executor);

and also change the log level to “debug”.

discuss · September 8, 2016, 3:32pm

(from github.com/attibalazs)
I've deleted the crawler index and applied the settings. Let's see what happens; it will take a few hours to run.

discuss · September 9, 2016, 8:07am

(from github.com/attibalazs)
This is the error I got from the crawler. It looks like it ran out of memory. I am running Fess with an increased heap space of 10GB, but is there a way to increase the heap space for the crawler?

2016-09-08 21:25:34,917 [IndexUpdater] ERROR IndexUpdater is terminated.
java.lang.OutOfMemoryError: Java heap space
at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:68)
at java.lang.StringBuilder.<init>(StringBuilder.java:101)
at com.fasterxml.jackson.core.util.TextBuffer.contentsAsString(TextBuffer.java:346)
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishAndReturnString(UTF8StreamJsonParser.java:2415)
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:285)
at org.elasticsearch.common.xcontent.json.JsonXContentParser.text(JsonXContentParser.java:84)
at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readValue(AbstractXContentParser.java:299)
at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readMap(AbstractXContentParser.java:274)
at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readValue(AbstractXContentParser.java:314)
at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readMap(AbstractXContentParser.java:274)
at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readMap(AbstractXContentParser.java:245)
at org.elasticsearch.common.xcontent.support.AbstractXContentParser.map(AbstractXContentParser.java:208)
at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:83)
at org.elasticsearch.search.lookup.SourceLookup.sourceAsMapAndType(SourceLookup.java:88)
at org.elasticsearch.search.lookup.SourceLookup.sourceAsMap(SourceLookup.java:92)
at org.elasticsearch.index.get.GetResult.sourceAsMap(GetResult.java:177)
at org.elasticsearch.index.get.GetResult.getSource(GetResult.java:182)
at org.elasticsearch.action.get.GetResponse.getSource(GetResponse.java:133)
at org.codelibs.fess.crawler.service.impl.AbstractCrawlerService.get(AbstractCrawlerService.java:308)
at org.codelibs.fess.crawler.service.impl.EsDataService.getAccessResult(EsDataService.java:76)
at org.codelibs.fess.crawler.entity.EsAccessResult.getAccessResultData(EsAccessResult.java:77)
at org.codelibs.fess.indexer.IndexUpdater.processAccessResults(IndexUpdater.java:346)
at org.codelibs.fess.indexer.IndexUpdater.run(IndexUpdater.java:230)
2016-09-08 21:25:34,918 [IndexUpdater] INFO [EXEC TIME] index update time: 604404ms
2016-09-08 21:25:34,986 [Crawler-20160908163050-1-4] ERROR Crawling Exception at file:////temp/
> java.lang.IllegalStateException: Future got interrupted
at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:72)
at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:62)
at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:52)
at org.codelibs.fess.es.client.FessEsClient.get(FessEsClient.java:595)
at org.codelibs.fess.es.client.FessEsClient.getDocument(FessEsClient.java:709)
at org.codelibs.fess.es.client.FessEsClient.getDocument(FessEsClient.java:681)
at org.codelibs.fess.helper.IndexingHelper.getDocument(IndexingHelper.java:133)
at org.codelibs.fess.crawler.FessCrawlerThread.isContentUpdated(FessCrawlerThread.java:113)
at org.codelibs.fess.crawler.CrawlerThread.run(CrawlerThread.java:155)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:259)
at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:94)
at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:68)
… 9 common frames omitted
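
For a crawl that runs for hours, it can help to pull just the ERROR entries and their first exception out of the crawler log instead of reading it by eye. The sketch below assumes the log line layout seen in the excerpt above (timestamp, thread name in brackets, level, message); `summarize_errors` is a hypothetical helper, not part of Fess.

```python
import re

# Matches log lines shaped like the excerpt above, e.g.
# 2016-09-08 21:25:34,917 [IndexUpdater] ERROR IndexUpdater is terminated.
LOG_LINE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[(?P<thread>[^\]]+)\] (?P<level>[A-Z]+)\s+(?P<message>.*)$"
)
# Matches the first line of an exception, optionally prefixed by "> ",
# e.g. "java.lang.OutOfMemoryError: Java heap space".
EXCEPTION_LINE = re.compile(
    r"^(?:>\s*)?(?P<cls>[\w.$]+(?:Exception|Error))(?::.*)?$"
)


def summarize_errors(lines):
    """Return (time, thread, message, exception_class) for each ERROR entry."""
    results = []
    current = None  # the ERROR entry still waiting for its exception line
    for raw in lines:
        line = raw.rstrip("\n")
        entry = LOG_LINE.match(line)
        if entry:
            current = None
            if entry.group("level") == "ERROR":
                current = [entry.group("time"), entry.group("thread"),
                           entry.group("message"), None]
                results.append(current)
            continue
        # Only the first exception line after an ERROR entry is recorded.
        if current is not None and current[3] is None:
            exc = EXCEPTION_LINE.match(line.strip())
            if exc:
                current[3] = exc.group("cls")
    return [tuple(r) for r in results]
```

Passing an open log file to `summarize_errors` (for example `summarize_errors(open(path))`, where `path` points at your crawler log) yields one tuple per ERROR entry, which makes an `OutOfMemoryError` like the one above easy to spot.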

discuss · September 9, 2016, 1:39pm

(from github.com/marevol)
JVM options for the crawler are in fess_config.properties (the -Xmx value sets its maximum heap size):

jvm.crawler.options=\
-Djava.awt.headless=true\n\
-server\n\
-Xmx512m\n\
-XX:MaxMetaspaceSize=128m\n\
-XX:CompressedClassSpaceSize=32m\n\
-XX:-UseGCOverheadLimit\n\
-XX:+UseConcMarkSweepGC\n\
-XX:CMSInitiatingOccupancyFraction=75\n\
-XX:+UseParNewGC\n\
-XX:+UseTLAB\n\
-XX:+DisableExplicitGC\n\
-XX:-OmitStackTraceInFastThrow\n\
-Djcifs.smb.client.connTimeout=60000\n\
-Djcifs.smb.client.soTimeout=35000\n\
-Djcifs.smb.client.responseTimeout=30000\n\
-Dgroovy.use.classvalue=true\n\