Enable OCR only on specific file crawling

(from github.com/freestyle68)
Is it possible to disable/enable OCR for specific file crawling using config parameters?

At the moment the only option is to enable or disable OCR globally.

(from github.com/marevol)
Fess Crawler can invoke any command to extract text from a crawled file.
So, if you have an executable command for OCR, it will work.

For example, to add a CommandExtractor:

  1. Copy extractor.xml to app/WEB-INF/classes/crawler (take the file from the Fess Crawler version that matches your installation)
  2. Add CommandExtractor to extractor.xml
	<component name="ocrExtractor"
		<property name="command">"ocrcmd $INPUT_FILE $OUTPUT_FILE"</property>
  3. Modify extractor.xml
	<component name="extractorFactory"
		<postConstruct name="addExtractor">
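
As a rough sketch, steps 2 and 3 could look like this when written out in full. The class attributes and the "image/png" key are assumptions based on the Fess Crawler sources, so check them against the extractor.xml you copied. ocrcmd is the placeholder command from step 2, not a real tool.

```xml
<!-- Sketch only: class names and the MIME type are assumptions; verify against your copied extractor.xml -->

<!-- Step 2: define the extractor. "ocrcmd" stands in for your own OCR command,
     which reads $INPUT_FILE and writes the extracted text to $OUTPUT_FILE -->
<component name="ocrExtractor" class="org.codelibs.fess.crawler.extractor.impl.CommandExtractor">
	<property name="command">"ocrcmd $INPUT_FILE $OUTPUT_FILE"</property>
</component>

<!-- Step 3: inside the existing extractorFactory component, map a MIME type to the new extractor -->
<component name="extractorFactory" class="org.codelibs.fess.crawler.extractor.ExtractorFactory">
	<!-- existing addExtractor entries stay as they are -->
	<postConstruct name="addExtractor">
		<arg>"image/png"</arg>
		<arg>ocrExtractor</arg>
	</postConstruct>
</component>
```

Add one addExtractor entry per MIME type that should go through the OCR command. Other types keep using their existing extractors.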

(from github.com/freestyle68)
OK, I'll try it in the next few days.

Let me explain my questions about OCR better:
up to version 11.0.3 (and perhaps 11.1), OCR ran automatically whenever Tesseract was installed. So disabling OCR meant uninstalling Tesseract (or using the solution at the bottom of this page: https://wiki.apache.org/tika/TikaOCR )

That release used Tika 1.14.
Fess now ships Tika 1.16, and OCR never starts. Even editing the config in


setting enableImageProcessing=1, rebuilding, and copying the newly generated tika-parsers-1.16.jar to the Fess libs has no effect on images.
Tesseract itself works correctly: running tika-app-1.16.jar (http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.16.jar ) from the command line produces the OCR text.

Did Fess disable OCR in the Tika library (tika-parsers-1.16.jar) by design?

And regarding the initial question,

Is it possible to disable/enable OCR for specific file crawling using config parameters?

I was asking whether it is possible to enable or disable OCR for a single crawling configuration through its Config Parameters. That way OCR would be enabled or disabled only for specific paths rather than globally, and it would still be performed by Tika as usual.

With your proposed solution I would be forced to edit extractor.xml path by path to enable OCR.

(from github.com/marevol)
Fess does not pass a TesseractOCRConfig instance to Tika at the moment (Fess does not use Tika directly…).
I'll support it in a future release…