Fess filesystem crawler: "Processing no docs in indexing queue"

Hi,

I am having trouble crawling some files. I am using the latest version of Fess, 14.16.0, with OpenSearch 2.16.0 on a Windows 11 PC.

Originally I wanted to crawl and index some files from a test directory on my NAS, but the log only showed "INFO Processing no docs in indexing queue".

So I copied the folder to the local PC where Fess is running and tried again with a new filesystem crawler called "Filesystem_test_local".

The folder with all my files (pdf, txt, xlsx, etc.) is at "C:\Fess_test_local", so I configured the "Filesystem_test_local" crawler as follows:

  • ID: bCz3eJEB5jT5m4cO_QC_

  • Name: Filesystem_test_local

  • Paths: file:///C:/Fess_test_local/*

  • Included Paths For Crawling:
    file:///C:/Fess_test_local/*
    .*/$
    .*pdf$
    .*txt$
    .*xlsx$
    .*docx$

  • Included Paths For Indexing:
    .*/$
    .*pdf$
    .*txt$
    .*xlsx$
    .*docx$

Then I created a Job in the scheduler with the following parameters:


Name: Filesystem_test_local

Target: all

Schedule:

Executor: groovy

Script: return container.getComponent("crawlJob").logLevel("info").webConfigIds([] as String[]).fileConfigIds(["bCz3eJEB5jT5m4cO_QC_"] as String[]).dataConfigIds([] as String[]).jobExecutor(executor).execute();

Logging: Enabled

Crawler Job: Enabled

Status: Enabled

Display Order: 0


I started the job manually and after a short time the job status was OK, but the index size under System Info → Crawling Info was 0.

And in the fess-crawler.log I got the following:

2024-08-22 09:52:28,466 [WebFsCrawler] INFO Target Path: file:///C:/Fess_test_local/*
2024-08-22 09:52:28,466 [WebFsCrawler] INFO Included Path: file:///C:/Fess_test_local/*
2024-08-22 09:52:28,466 [WebFsCrawler] INFO Included Path: .*/$
2024-08-22 09:52:28,466 [WebFsCrawler] INFO Included Path: .*pdf$
2024-08-22 09:52:28,467 [WebFsCrawler] INFO Included Path: .*txt$
2024-08-22 09:52:28,467 [WebFsCrawler] INFO Included Path: .*xlsx$
2024-08-22 09:52:28,467 [WebFsCrawler] INFO Included Path: .*docx$
2024-08-22 09:52:38,485 [IndexUpdater] INFO Processing no docs in indexing queue (Doc:{access 6ms}, Mem:{used 144.348MB, heap 372.736MB, max 524.288MB})
2024-08-22 09:52:48,479 [IndexUpdater] INFO Processing no docs in indexing queue (Doc:{access 4ms}, Mem:{used 147.408MB, heap 372.736MB, max 524.288MB})
2024-08-22 09:52:58,481 [IndexUpdater] INFO Processing no docs in indexing queue (Doc:{access 5ms}, Mem:{used 149.831MB, heap 372.736MB, max 524.288MB})
2024-08-22 09:52:58,923 [WebFsCrawler] INFO [EXEC TIME] crawling time: 30518ms
2024-08-22 09:53:08,480 [IndexUpdater] INFO Processing no docs in indexing queue (Doc:{access 2ms}, Mem:{used 149.998MB, heap 372.736MB, max 524.288MB})
2024-08-22 09:53:08,480 [IndexUpdater] INFO [EXEC TIME] index update time: 26ms
2024-08-22 09:53:08,553 [main] INFO Finished Crawler
2024-08-22 09:53:08,599 [main] INFO [CRAWL INFO] CrawlerEndTime=2024-08-22T09:53:08.553+0200,WebFsCrawlExecTime=30518,CrawlerStatus=true,CrawlerStartTime=2024-08-22T09:52:28.385+0200,WebFsCrawlEndTime=2024-08-22T09:53:08.552+0200,WebFsIndexExecTime=26,WebFsIndexSize=0,CrawlerExecTime=40168,WebFsCrawlStartTime=2024-08-22T09:52:28.397+0200
2024-08-22 09:53:08,603 [main] INFO Disconnected to http://localhost:9201
2024-08-22 09:53:08,606 [main] INFO Destroyed LaContainer.

I do not know why it says "Processing no docs in indexing queue", since there are files in my Fess_test_local folder, and if I open the URL file:///C:/Fess_test_local/name_of_pdf.pdf in the browser I can see a PDF from that folder.

I also tried to check the debug-level logs by changing .logLevel("info") to .logLevel("debug") in the job and got this:

and so on… it is a long log file

2024-08-22 09:52:28,466 [WebFsCrawler] INFO Target Path: file:///C:/Fess_test_local/*

The target path should be “file:///C:/Fess_test_local/”.
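
For reference, a sketch of the corrected crawler settings, assuming only the Paths field needs to change and the regular-expression filters stay in Included Paths as before (the wildcard entry under Included Paths is probably unnecessary too, since the extension patterns already cover the files):

Paths: file:///C:/Fess_test_local/
Included Paths For Crawling:
  .*/$
  .*pdf$
  .*txt$
  .*xlsx$
  .*docx$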


Thank you very much! I can now crawl and search files, but now I have the problem that only 6 of the 7 files get crawled…

For the last docx, fess-crawler.log says: "java.lang.OutOfMemoryError: Java heap space"
Here is the log:


ERROR Crawling Exception at file:/C:/Fess_test_local/test.docx
java.lang.OutOfMemoryError: Java heap space
at java.base/java.util.HashMap$Values.iterator(HashMap.java:1043) ~[?:?]
at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.close(ZipInputStreamZipEntrySource.java:117) ~[poi-ooxml-5.2.5.jar:5.2.5]
at org.apache.poi.openxml4j.opc.ZipPackage.revertImpl(ZipPackage.java:556) ~[poi-ooxml-5.2.5.jar:5.2.5]
at org.apache.poi.openxml4j.opc.OPCPackage.revert(OPCPackage.java:524) ~[poi-ooxml-5.2.5.jar:5.2.5]
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:262) ~[tika-parser-microsoft-module-2.9.2.jar:2.9.2]
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:118) ~[tika-parser-microsoft-module-2.9.2.jar:2.9.2]
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-core-2.9.2.jar:2.9.2]
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-core-2.9.2.jar:2.9.2]
at org.codelibs.fess.crawler.extractor.impl.TikaExtractor$TikaDetectParser.parse(TikaExtractor.java:533) ~[fess-crawler-14.16.0.jar:?]
at org.codelibs.fess.crawler.extractor.impl.TikaExtractor.lambda$getText$0(TikaExtractor.java:204) ~[fess-crawler-14.16.0.jar:?]
at org.codelibs.fess.crawler.extractor.impl.TikaExtractor$$Lambda$1082/0x0000023cca632f00.accept(Unknown Source) ~[?:?]
at org.codelibs.fess.crawler.extractor.impl.TikaExtractor.getContent(TikaExtractor.java:430) ~[fess-crawler-14.16.0.jar:?]
at org.codelibs.fess.crawler.extractor.impl.TikaExtractor.getText(TikaExtractor.java:193) ~[fess-crawler-14.16.0.jar:?]
at org.codelibs.fess.crawler.extractor.impl.TikaExtractor.getText(TikaExtractor.java:149) ~[fess-crawler-14.16.0.jar:?]
at org.codelibs.fess.crawler.transformer.AbstractFessFileTransformer.getExtractData(AbstractFessFileTransformer.java:391) ~[classes/:?]
at org.codelibs.fess.crawler.transformer.AbstractFessFileTransformer.generateData(AbstractFessFileTransformer.java:101) ~[classes/:?]
at org.codelibs.fess.crawler.transformer.AbstractFessFileTransformer.transform(AbstractFessFileTransformer.java:82) ~[classes/:?]
at org.codelibs.fess.crawler.processor.impl.DefaultResponseProcessor.process(DefaultResponseProcessor.java:74) ~[fess-crawler-14.16.0.jar:?]
at org.codelibs.fess.crawler.CrawlerThread.processResponse(CrawlerThread.java:291) [fess-crawler-14.16.0.jar:?]
at org.codelibs.fess.crawler.FessCrawlerThread.processResponse(FessCrawlerThread.java:251) ~[classes/:?]
at org.codelibs.fess.crawler.CrawlerThread.run(CrawlerThread.java:162) [fess-crawler-14.16.0.jar:?]
at java.base/java.lang.Thread.run(Thread.java:842) [?:?]
2024-08-23 08:30:06,063 [IndexUpdater] INFO Processing no docs in indexing queue (Doc:{access 9ms, cleanup 136ms}, Mem:{used 132.304MB, heap 447.488MB, max 524.288MB})
2024-08-23 08:30:16,064 [IndexUpdater] INFO Processing no docs in indexing queue (Doc:{access 10ms, cleanup 136ms}, Mem:{used 135.049MB, heap 447.488MB, max 524.288MB})
2024-08-23 08:30:26,072 [IndexUpdater] INFO Processing no docs in indexing queue (Doc:{access 4ms, cleanup 136ms}, Mem:{used 136.949MB, heap 447.488MB, max 524.288MB})
2024-08-23 08:30:29,830 [WebFsCrawler] INFO [EXEC TIME] crawling time: 43879ms
2024-08-23 08:30:36,079 [IndexUpdater] INFO Processing no docs in indexing queue (Doc:{access 10ms, cleanup 136ms}, Mem:{used 137.332MB, heap 447.488MB, max 524.288MB})
2024-08-23 08:30:36,079 [IndexUpdater] INFO [EXEC TIME] index update time: 387ms
2024-08-23 08:30:36,230 [main] INFO Finished Crawler


Is it a problem with the maximum memory? If so, in which file should I change the -Xmx setting if I am running Fess as a service? (I am running Fess through service.bat.)

Oh, and I am also having trouble connecting to the NAS… the same problem as in the original post, but this time the folder is located at "Z:\NAS_03\Fess_test".

Trying to use the same settings as for the local file crawler, I changed the path to "smb:///Z:/NAS_03/Fess_test/".

Then I configured the File Authentication as follows:


Hostname: IP_of_NAS
Port:
Scheme: Samba
Username: MeUserNAS3
Password: ****
Parameter:
Configuration: Filesystem_test


but in the log I got:


INFO Target Path: smb:///Z:/NAS_03/Fess_test/
2024-08-23 13:39:56,675 [WebFsCrawler] INFO Included Path: smb:///Z:/NAS_03/Fess_test/
2024-08-23 13:39:56,675 [WebFsCrawler] INFO Included Path: .*/$
2024-08-23 13:39:56,675 [WebFsCrawler] INFO Included Path: .*pdf$
2024-08-23 13:39:56,675 [WebFsCrawler] INFO Included Path: .*txt$
2024-08-23 13:39:56,675 [WebFsCrawler] INFO Included Path: .*xlsx$
2024-08-23 13:39:56,780 [Crawler-20240823133954-1-1] INFO Crawling URL: smb:///Z:/NAS_03/Fess_test/
2024-08-23 13:40:02,953 [Crawler-20240823133954-1-1] INFO [A7_5fpEBhuGM7pkiNFor] Could not access smb:///Z:/NAS_03/Fess_test/
2024-08-23 13:40:06,695 [IndexUpdater] INFO Processing no docs in indexing queue (Doc:{access 6ms}, Mem:{used 157.312MB, heap 357.376MB, max 524.288MB})
2024-08-23 13:40:16,688 [IndexUpdater] INFO Processing no docs in indexing queue (Doc:{access 4ms}, Mem:{used 159.998MB, heap 357.376MB, max 524.288MB})
2024-08-23 13:40:26,689 [IndexUpdater] INFO Processing no docs in indexing queue (Doc:{access 5ms}, Mem:{used 162.785MB, heap 357.376MB, max 524.288MB})
2024-08-23 13:40:34,113 [WebFsCrawler] INFO [EXEC TIME] crawling time: 37502ms
2024-08-23 13:40:36,690 [IndexUpdater] INFO Processing no docs in indexing queue (Doc:{access 6ms}, Mem:{used 163.226MB, heap 357.376MB, max 524.288MB})
2024-08-23 13:40:36,691 [IndexUpdater] INFO [EXEC TIME] index update time: 30ms
2024-08-23 13:40:36,770 [main] INFO Finished Crawler


How should I write the path for my NAS, which is visible on the network under its network address?
Thank you in advance!

Please increase the Java heap memory for the crawler or decrease the number of threads.
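
As a minimal sketch of where the crawler heap is usually raised, assuming a default installation where the crawler JVM options are defined by jvm.crawler.options in app/WEB-INF/classes/fess_config.properties (paths and values may differ in your setup):

# excerpt of fess_config.properties; other lines of jvm.crawler.options stay unchanged
jvm.crawler.options=\
-Xms1g\n\
-Xmx2g\n\
...

The Fess service typically needs a restart before the crawler picks up the new options.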

For the SMB protocol, the format should be smb://hostname/folder/.
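
As a sketch, the Windows drive letter (Z:) never appears in the SMB URL; it is replaced by the NAS host name and the name of the share that Z: is mapped to (both placeholders below are assumptions to fill in for your environment):

smb://<nas-hostname>/<share-name>/Fess_test/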

Thank you very much for your help! I have solved the memory problem and can now search all files in the local folder.

However, I still have a problem with the SMB folder… I have now changed the path to "smb://RS3621xs/NAS_03/Fess_test/" and also entered the correct hostname in the File Authentication, but I still can't connect. Instead of the usual message that the path can't be found, I now get:


org.codelibs.fess.crawler.exception.CrawlingAccessException: Could not access smb://RS3621xs/NAS_03/Fess_test/
at org.codelibs.fess.crawler.client.smb.SmbClient.getResponseData(SmbClient.java:321)
at org.codelibs.fess.crawler.client.smb.SmbClient.processRequest(SmbClient.java:161)
at org.codelibs.fess.crawler.client.smb.SmbClient.doGet(SmbClient.java:144)
at org.codelibs.fess.crawler.client.AbstractCrawlerClient.execute(AbstractCrawlerClient.java:128)
at org.codelibs.fess.crawler.CrawlerThread.run(CrawlerThread.java:154)
at java.base/java.lang.Thread.run(Thread.java:842)
Caused by: jcifs.smb.SmbAuthException: Logon failure: unknown user name or bad password.


I have checked the username and password and even created a new user with full rights on my NAS, but without success. I have also checked the firewall settings and allowed everything there for an SMB connection.

Is there anything I could do to find out more about the problem, or can you think of anything else I could try? Thank you very much for all your help!

I think the username or password might be incorrect. You can verify this by using other tools, such as the smbmount command.
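
For example, from a Linux machine the same credentials can be tested directly against the share; the host, share, user name, and mount point below are placeholders based on this thread, so adjust them to your environment:

# list the share contents with smbclient (prompts for the password)
smbclient //<nas-hostname>/<share-name> -U <username>

# or mount it with CIFS
sudo mount -t cifs //<nas-hostname>/<share-name> /mnt/fess_test -o username=<username>

If either command fails with an authentication error, the problem is on the NAS/credentials side rather than in Fess.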

I still have not solved the problem… could you maybe give me an example of the script for the job in my scheduler? Maybe I have an error there, so that it does not pick up the File Authentication.

At the moment I have the following:


return container.getComponent("crawlJob").logLevel("info").webConfigIds([] as String[]).fileConfigIds(["RizVdJEB5jT5m4cOXAAN"] as String[]).dataConfigIds([] as String[]).jobExecutor(executor).execute();


where "RizVdJEB5jT5m4cOXAAN" is the ID of my file crawler.

Thank you very much!

Could you provide more details to reproduce it?

I still have a problem with the SMB folder.

My file crawler is:


ID: RizVdJEB5jT5m4cOXAAN
Name: Filesystem_test
Path: smb://RS3621xs/RS3621xs/NAS_03/Fess_test/
Included Path: .*/$
.*pdf$
.*txt$
.*xlsx$


File Auth:


Hostname: RS3621xs
Port: 445
Scheme: Samba
Username: Fess
Password: ****
Parameter:
Configuration: Filesystem_test


And at the Job I have:


Name: Filesystem_test
Target: all
Schedule:
Executor: groovy

Script: return container.getComponent("crawlJob").logLevel("info").webConfigIds([] as String[]).fileConfigIds(["RizVdJEB5jT5m4cOXAAN"] as String[]).dataConfigIds([] as String[]).jobExecutor(executor).execute();

Logging: Enabled
Crawler Job: Enabled
Status: Enabled
Display Order: 0


But in the Logs I still get:


org.codelibs.fess.crawler.exception.CrawlingAccessException: Could not access smb://RS3621xs/RS3621xs/NAS_03/Fess_test/

at org.codelibs.fess.crawler.client.smb.SmbClient.getResponseData(SmbClient.java:321)
at org.codelibs.fess.crawler.client.smb.SmbClient.processRequest(SmbClient.java:161)
at org.codelibs.fess.crawler.client.smb.SmbClient.doGet(SmbClient.java:144)
at org.codelibs.fess.crawler.client.AbstractCrawlerClient.execute(AbstractCrawlerClient.java:128)
at org.codelibs.fess.crawler.CrawlerThread.run(CrawlerThread.java:154)
at java.base/java.lang.Thread.run(Thread.java:842)
Caused by: jcifs.smb.SmbAuthException: Logon failure: unknown user name or bad password.


But the username and password are correct; I tested them with other software, and I can even access the NAS in Windows Explorer.

So I thought that maybe the job script could be wrong, so that it does not use the File Authentication, which would then result in "unknown user name or bad password".
Maybe you could give me another script for the SMB crawler that is different from mine.

I don’t know where else the error might be.

Did you check it using the smbmount command or the mount command with CIFS?

Yes, I can mount it with the mount.cifs command.

Port: 445

Please try to set the port to empty.


It works now!!! Thank you very very much :pray: I’m so happy right now :smiley:

Do you maybe know what the maximum file size can be? I have set the maximum file size in contentlength.xml to 107374182400.

But with one big file I still get the error:

java.lang.OutOfMemoryError: Requested array size exceeds VM limit

A large file consumes a significant amount of heap memory, so you need to adjust the crawler settings accordingly.
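
For illustration, the per-type size limits live in the contentlength.xml file mentioned above; a sketch of what the relevant section typically looks like (the MIME type and values below are placeholders, and the general idea is to keep the per-file limit within what the crawler heap can realistically hold):

<component name="contentLengthHelper" class="org.codelibs.fess.crawler.helper.ContentLengthHelper" instance="singleton">
  <property name="defaultMaxLength">10485760</property><!-- 10 MB -->
  <postConstruct name="addMaxLength">
    <arg>"application/pdf"</arg>
    <arg>52428800</arg><!-- 50 MB -->
  </postConstruct>
</component>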

Is there a specific ratio between the size of a file and the amount of heap the crawler needs for it?

I have the heap memory set to:

-Xms32g\n
-Xmx64g\n\

but I still get this exception with a 13 GB SQL file:
java.lang.OutOfMemoryError: Requested array size exceeds VM limit

And for a 5 GB SQL file, this exception:
org.codelibs.fess.crawler.exception.EsAccessException: Failed to insert 20240910000…

Is there anything else in general that I could try to change so that these files get crawled, or is it specific to SQL or JSON files?

It's not a problem with heap size. It might work if you add -XX:+UseCompressedOops.
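
Assuming the crawler JVM options live in the same jvm.crawler.options property shown earlier (an assumption based on a default install), the flag would be appended as one more continuation line in app/WEB-INF/classes/fess_config.properties:

# added to the existing jvm.crawler.options value; other lines stay unchanged
-XX:+UseCompressedOops\n\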

I have added -XX:+UseCompressedOops\n\ to fess_config.properties, but unfortunately I still get the same errors:


java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.base/java.util.Arrays.copyOf(Arrays.java:3537)
at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:228)
at java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:802)
at java.base/java.lang.StringBuilder.append(StringBuilder.java:246)
at java.base/java.lang.StringBuilder.append(StringBuilder.java:91)
at java.base/java.lang.AbstractStringBuilder.appendCodePoint(AbstractStringBuilder.java:947)
at java.base/java.lang.StringBuilder.appendCodePoint(StringBuilder.java:280)
at org.codelibs.fess.crawler.util.TextUtil$TextNormalizeContext.execute(TextUtil.java:91)
at org.codelibs.fess.crawler.extractor.impl.TikaExtractor.getContent(TikaExtractor.java:440)
at org.codelibs.fess.crawler.extractor.impl.TikaExtractor.getText(TikaExtractor.java:193)
at org.codelibs.fess.crawler.extractor.impl.TikaExtractor.getText(TikaExtractor.java:149)
at org.codelibs.fess.crawler.transformer.AbstractFessFileTransformer.getExtractData(AbstractFessFileTransformer.java:391)
at org.codelibs.fess.crawler.transformer.AbstractFessFileTransformer.generateData(AbstractFessFileTransformer.java:101)
at org.codelibs.fess.crawler.transformer.AbstractFessFileTransformer.transform(AbstractFessFileTransformer.java:82)
at org.codelibs.fess.crawler.processor.impl.DefaultResponseProcessor.process(DefaultResponseProcessor.java:74)
at org.codelibs.fess.crawler.CrawlerThread.processResponse(CrawlerThread.java:291)
at org.codelibs.fess.crawler.FessCrawlerThread.processResponse(FessCrawlerThread.java:251)
at org.codelibs.fess.crawler.CrawlerThread.run(CrawlerThread.java:162)
at java.base/java.lang.Thread.run(Thread.java:842)


and


org.codelibs.fess.crawler.exception.EsAccessException: Failed to insert 20240911085619-1.Zmletc
at org.codelibs.fess.crawler.service.impl.AbstractCrawlerService.insert(AbstractCrawlerService.java:240)
at org.codelibs.fess.crawler.service.impl.EsDataService.store(EsDataService.java:60)
at org.codelibs.fess.crawler.service.impl.EsDataService.store(EsDataService.java:41)
at org.codelibs.fess.crawler.processor.impl.DefaultResponseProcessor.processResult(DefaultResponseProcessor.java:124)
at org.codelibs.fess.crawler.processor.impl.DefaultResponseProcessor.process(DefaultResponseProcessor.java:79)
at org.codelibs.fess.crawler.CrawlerThread.processResponse(CrawlerThread.java:291)
at org.codelibs.fess.crawler.FessCrawlerThread.processResponse(FessCrawlerThread.java:251)
at org.codelibs.fess.crawler.CrawlerThread.run(CrawlerThread.java:162)
at java.base/java.lang.Thread.run(Thread.java:842)



Additional JVM options tuning might be needed. If you need more help, it’s better to contact commercial support.