Enable OCR only on specific file crawling

discuss · August 19, 2017, 8:25am

(from github.com/freestyle68)
Is it possible to disable/enable OCR for specific file crawling using config parameters?

At the moment the only option is to have OCR enabled or disabled globally.

discuss · August 19, 2017, 12:58pm

(from github.com/marevol)
Fess Crawler can invoke any command to extract text from a crawled file.
So, if you have an executable command for OCR, it will work.

For example, to add a CommandExtractor:

Copy extractor.xml to app/WEB-INF/classes/crawler (take it from the matching Fess Crawler version)
Add a CommandExtractor component to extractor.xml

	<component name="ocrExtractor"
		class="org.codelibs.fess.crawler.extractor.impl.CommandExtractor">
		<property name="command">"ocrcmd $INPUT_FILE $OUTPUT_FILE"</property>
	</component>

Register the new extractor with extractorFactory in extractor.xml

	<component name="extractorFactory"
		class="org.codelibs.fess.crawler.extractor.ExtractorFactory">
		<postConstruct name="addExtractor">
			<arg>["image/jpeg"]</arg>
			<arg>ocrExtractor</arg>
		</postConstruct>
...
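
The ocrcmd placeholder above could look roughly like the following. This is only a minimal sketch, assuming Tesseract is installed and on the PATH. The script itself, the build_command and run_ocr helpers, and the eng language choice are illustrative assumptions and not part of Fess. The only thing taken from the configuration above is that the command receives the input file and the output file as its two arguments.

```python
#!/usr/bin/env python3
"""Hypothetical ocrcmd wrapper, called as: ocrcmd INPUT_FILE OUTPUT_FILE"""
import subprocess
import sys


def build_command(input_file, lang="eng"):
    # Passing "stdout" as the output base makes tesseract print the
    # recognized text instead of writing <base>.txt.
    return ["tesseract", input_file, "stdout", "-l", lang]


def run_ocr(input_file, output_file, runner=subprocess.run):
    # check=True raises CalledProcessError if tesseract exits non-zero.
    result = runner(build_command(input_file), capture_output=True, check=True)
    # Write the raw bytes exactly to the path the crawler asked for.
    with open(output_file, "wb") as out:
        out.write(result.stdout)


if __name__ == "__main__":
    if len(sys.argv) == 3:
        run_ocr(sys.argv[1], sys.argv[2])
    else:
        print("usage: ocrcmd INPUT_FILE OUTPUT_FILE", file=sys.stderr)
```

Capturing standard output and writing it to the requested path avoids depending on the .txt suffix that tesseract appends when it is given an output base name.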

discuss · August 21, 2017, 2:25pm

(from github.com/freestyle68)
OK, I’ll try it in the next few days.

Let me explain my questions about OCR more clearly:
up to version 11.0.3 (and perhaps 11.1), OCR ran automatically whenever Tesseract was installed. So disabling OCR meant uninstalling Tesseract (or applying the solution at the bottom of the TikaOCR page on the Apache TIKA wiki).

That release shipped Tika 1.14.
Fess now ships Tika 1.16, and OCR never starts. Editing the config in

https://github.com/apache/tika/blob/master/tika-parsers/src/main/resources/org/apache/tika/parser/ocr/TesseractOCRConfig.properties

to set enableImageProcessing=1, rebuilding, and copying the newly generated tika-parsers-1.16.jar into the Fess libs has no effect on images either.
Tesseract itself works correctly: running tika-app-1.16.jar (from the Apache download mirrors) from the command line produces the OCR text.

Did Fess disable OCR in the Tika library (tika-parsers-1.16.jar) by design?

And regarding the initial question,

Is it possible to disable/enable OCR for specific file crawling using config parameters?

I was asking whether it is possible to enable or disable OCR for a single crawling configuration through its Config Parameters. That way OCR would be enabled or disabled only for specific paths rather than globally, and it would still be performed by Tika as usual.

With your proposed solution I would be forced to edit extractor.xml path by path to enable OCR.

discuss · August 22, 2017, 12:59pm

(from github.com/marevol)
Fess does not pass a TesseractOCRConfig instance to Tika at the moment (Fess does not use Tika directly…).
I’ll support it in a future release…