Disk space

(from github.com/Anders-Bergqvist)
After crawling about 21 000 docs I have 1.19 GB in the store, but the disk reports only 3.72 GB free of 14.68 GB total. Fess and ES are the only things running on this server. Is this normal?
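A quick way to see what is actually using the space (the paths here assume the default deb locations for Fess and Elasticsearch, so adjust them if your install differs):

df -h
du -xh --max-depth=1 /usr/share/fess /var/lib/elasticsearch 2>/dev/null | sort -h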

(from github.com/marevol)
What is the problem?
It depends on your environment.

(from github.com/zackhorvath)
@Anders-Bergqvist This might be related: I ran into a similar problem that filled up a 100 GB disk. It turned out to be the thumbnail service. If you aren’t using it, turn off the thumbnail scheduler and clean out your thumbnails directory (deb package location: /usr/share/fess/app/WEB-INF/thumbnails).
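A quick size check on that directory (same deb path as above) should tell you whether thumbnails are the culprit:

du -sh /usr/share/fess/app/WEB-INF/thumbnails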

(from github.com/Anders-Bergqvist)
@zackhorvath Yes, that is a good lead! Today I had the red bar in the Dashboard, and Elasticsearch reports “3 unassigned shards” referring to crawler.data and .fess_config.thumbnail_queue.

The fess.log says:
2019-01-09 00:01:34,648 [job_thumbnail_generate] ERROR Failed to generate thumbnails. org.codelibs.fess.exception.FessSystemException: Exit Code: 1 Output: Jan 09, 2019 12:01:34 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem WARNING: org.xerial's sqlite-jdbc is not loaded. Please provide the jar on your classpath to parse sqlite files. See tika-parsers/pom.xml for the correct version.
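Something like this should list the unassigned shards and the reason for each (assuming Elasticsearch is listening on the default localhost:9200):

curl -s 'http://localhost:9200/_cat/shards?v' | grep UNASSIGNED
curl -s 'http://localhost:9200/_cluster/allocation/explain?pretty'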

(from github.com/Anders-Bergqvist)
When I looked into /usr/share/fess/app/WEB-INF/thumbnails it was empty. So that was a dead end!

(from github.com/burple6)
You might check for Java heap dump files. These can get created if you run out of allocated memory during the crawl, the suggest indexer, or the thumbnail generator. On my system, executing:

ls -l /usr/share/fess/app/java_pid*

showed a LOT of these large files before I started tuning the Java memory parameters in /etc/fess/fess_config.properties.
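To see how much space they take in total, and to remove them once you are sure you won’t need them, something like:

du -ch /usr/share/fess/app/java_pid* | tail -1
rm /usr/share/fess/app/java_pid*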

(from github.com/Anders-Bergqvist)
@burple6 What parameters did you end up with that stabilized the environment?

(from github.com/burple6)
For me it was dependent on a variety of things, so I don’t think a canned answer would serve you well.

Basically, to arrive at the settings I have now, I took into account roughly how large the whole “crawl” was going to be (based on experiments crawling it in earlier efforts), and I estimated how large the biggest indexed document would be. Taking those things into consideration, on a server with 32 GB of RAM I arrived at the following settings, which seem stable for our content:
/etc/sysconfig/fess => FESS_HEAP_SIZE=2048m

/etc/fess/fess_config.properties => jvm.crawler.options should contain -Xmx3g -XX:MaxMetaspaceSize=256m -XX:CompressedClassSpaceSize=128m

These settings work well for me. If anyone else out there has any suggestions on how to tweak these parameters appropriately, they would be appreciated.
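As a rough sanity check that the values actually took effect, jps (which ships with the JDK) can print the JVM arguments of the running processes; look for the -Xmx value on the Fess process, and on the crawler process while a crawl is running:

jps -lvm | grep -i fess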

Additionally, I commented out the portion that causes Java to leave a heap dump (-XX:+HeapDumpOnOutOfMemoryError), since I do not intend to examine the heap dump files. As mentioned, I have encountered these dumps mostly from running out of memory during a crawl. If and when that happens, I will tweak those parameters slightly until the dumps go away again.
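If you want to do the same, the flag is easy to locate before commenting it out (same config file path as above):

grep -n 'HeapDumpOnOutOfMemoryError' /etc/fess/fess_config.properties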

IMPORTANT: I strongly recommend against just throwing a bunch of RAM at the issue right up front. That can let Java put off garbage collection and ultimately lead to poor performance and crashes. Gradually bumping up the memory in small increments as needed has served me well. YMMV.

(from github.com/Anders-Bergqvist)
Thanks @burple6! We made some adjustments based on this. And the disk space lead was not a dead end after all (thanks @zackhorvath): our thumbnails were stored in a different location, so I stopped the generator and cleaned it out. The thing is, I would like the crawler to run faster. Are “The number of Threads” and “Interval Time” the only settings that control this, given that system resources are sufficient? I have now set 3 simultaneous threads on one crawl and 2 on the other (we have only two). The crawl rate is about 10 docs per minute and a half. We have maybe 21 000 docs, and I want to crawl that amount every 24 hours.
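Rough arithmetic on those numbers (back-of-the-envelope only, using the figures above):

echo "scale=1; 10/1.5" | bc          # about 6.6 docs/min at the current rate, i.e. more than two days for 21 000 docs
echo "scale=1; 21000/(24*60)" | bc   # about 14.5 docs/min needed to finish 21 000 docs in 24 hours

So I need a bit more than double the current throughput, assuming the target servers can keep up.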

(from github.com/zackhorvath)
@Anders-Bergqvist glad to hear you managed to track that folder location down! Out of curiosity, are you running the zip version? I’m working on some guides that might help some people out :slight_smile:


(from github.com/Anders-Bergqvist)
@zackhorvath I initially ran both a Windows server with the bundle from the zip file and a Linux server with separate Elasticsearch and Fess 12.4.3. Now that we are about to launch, we are proceeding with the Linux installation, and that Linux environment is what this thread has been about. If you are about to publish some guides, I’m definitely interested!