How to crawl 100MB files

For more than a week I have been working on getting Fess to handle 100MB PDF files. Now it works. The following settings are needed to run Fess stably, without Java OutOfMemory errors or crashes, on a low-end Intel Celeron J1900 with 8GB RAM under Ubuntu.

  1. You need RAM. 8GB is not enough. An additional 4GB of RAM or virtual memory (swap) is the minimum; more is better.

  2. Set Threads to 3 under File Crawler Configuration, and set the number of parallel crawlers to 3 under General Configuration.

  3. Deactivate all logging features

  4. Deactivate Thumbnail creation

  5. Configure the crawler schedules so that no more than 2 crawlers run in parallel. I have three crawlers for different folders, and they run at different times.

  6. Under Crawler configuration parameter insert: max_size=104857600

  7. In fess/app/WEB-INF/classes/crawler/contentlength.xml change the default maximum length
    to 104857600 and the "text/html" limit to 20971520.

  8. In /fess/app/WEB-INF/classes/fess_config.properties change the JVM heap options to -Xms512m\n\ and -Xmx2g\n\

  9. In /fess/bin/fess.in.sh change lines 13 and 16 to
    if [ "x$FESS_MIN_MEM" = "x" ]; then FESS_MIN_MEM=512m and if [ "x$FESS_MAX_MEM" = "x" ]; then FESS_MAX_MEM=3g
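
As a quick sanity check on the byte values in steps 6 and 7, here is a minimal sketch that converts MiB to bytes. The function name mib_to_bytes is made up for illustration and is not part of Fess.

```shell
#!/bin/bash
# Hypothetical helper (not part of Fess): convert a size in MiB to bytes.
mib_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

mib_to_bytes 100   # prints 104857600 (the 100MB limit)
mib_to_bytes 20    # prints 20971520 (the "text/html" limit)
```

So 104857600 is exactly 100 MiB and 20971520 is exactly 20 MiB.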

I also closed all other applications, such as Firefox, to save RAM, and I start the following script from the root crontab (sudo crontab -e) after each reboot to give the Java processes the lowest CPU priority:

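#!/bin/bash
# Every 60 seconds, set all running java processes to the lowest CPU priority.
while true
do
  # Only call renice when at least one java process was found.
  pids=$(pidof java) && renice -n 19 -p $pids
  sleep 60
done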

Start it from sudo crontab -e with:

SHELL=/bin/sh
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
@reboot sleep 120 && bash /home/xyz/bin/javalowcpuprio.sh

Good luck!

  9. In /fess/bin/fess.in.sh change to
    if [ "x$FESS_MIN_MEM" = "x" ]; then FESS_MIN_MEM=512m and if [ "x$FESS_MAX_MEM" = "x" ]; then FESS_MAX_MEM=3g in line 13 and 16.

This setting is for the web app, not the crawler.
So I do not think it is needed.

Hi Shinsuke,

These values change the heap size shown in the Dashboard. On lines 19 and 20 the same variables are also set from FESS_HEAP_SIZE:

FESS_MIN_MEM=$FESS_HEAP_SIZE
FESS_MAX_MEM=$FESS_HEAP_SIZE
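
For readers unfamiliar with the `[ "x$VAR" = "x" ]` test from step 9: it is a common shell idiom for "apply this default only if the variable is empty or unset". The sketch below mimics that pattern in isolation; the function name pick_max_mem is hypothetical and the code is not copied from fess.in.sh.

```shell
#!/bin/bash
# Illustration only: mimics the default-if-empty pattern, not the real fess.in.sh.
pick_max_mem() {
  local max_mem="$1"              # value coming from the environment, may be empty
  if [ "x$max_mem" = "x" ]; then
    max_mem=3g                    # default taken from step 9 of this thread
  fi
  echo "$max_mem"
}

pick_max_mem ""      # prints 3g (default applied)
pick_max_mem "4g"    # prints 4g (existing value wins)
```

If the script follows this pattern, a value that is already set before the test runs is kept, and the default applies only when nothing was set.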

Is there a better way to increase the heap size and reduce the web app's memory usage? At the moment Fess is a really heavyweight application for my low-end PC. For example, DocFetcher (http://docfetcher.sourceforge.net/) does the same job without such high memory requirements, but it has no web user interface.

FESS_*_MEM is for the web app process and should not be changed.
When crawling a large file, the memory is consumed by the crawler and Elasticsearch processes.

If I don’t change this value, I only have 2GB of heap memory (according to the Dashboard), and from time to time Fess needs more than 2GB. Then Fess crashes, or I get a Java OutOfMemory error message and the crawler stops.

Do you have a better idea for increasing the maximum heap memory?

That is the heap memory of Elasticsearch, not Fess.
You need to change it in Elasticsearch.
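
Elasticsearch reads its heap size from the -Xms and -Xmx flags, which can be set in its jvm.options file or through the ES_JAVA_OPTS environment variable. The following is only a sketch of editing such a file: the helper name set_es_heap and the example path are assumptions, and it expects the file to contain uncommented -Xms and -Xmx lines (it relies on GNU sed, as found on Ubuntu).

```shell
#!/bin/bash
# Sketch: set both heap flags in an Elasticsearch jvm.options file.
# The file path is an argument because its location depends on the installation.
set_es_heap() {
  local file="$1" size="$2"
  # Rewrite only lines that start with -Xms or -Xmx; leave everything else alone.
  sed -i -e "s/^-Xms.*/-Xms${size}/" -e "s/^-Xmx.*/-Xmx${size}/" "$file"
}

# Example call (the path is an assumption, adjust it to your installation):
# set_es_heap /etc/elasticsearch/jvm.options 2g
```

Elasticsearch has to be restarted afterwards for the new heap size to take effect.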

Ah, OK. I didn’t install it because I thought Fess alone would be enough for me. Thank you!