(from github.com/freestyle68)
Is it possible to disable/enable OCR for specific file crawling using config parameters?
Currently the only option is to have OCR globally enabled or disabled.
(from github.com/marevol)
Fess Crawler can invoke an external command to extract text from a crawled file.
So, if you have an executable command for OCR, it will work.
For example, to add a CommandExtractor:
<component name="ocrExtractor"
    class="org.codelibs.fess.crawler.extractor.impl.CommandExtractor">
  <property name="command">"ocrcmd $INPUT_FILE $OUTPUT_FILE"</property>
</component>
<component name="extractorFactory"
    class="org.codelibs.fess.crawler.extractor.ExtractorFactory">
  <postConstruct name="addExtractor">
    <arg>["image/jpeg"]</arg>
    <arg>ocrExtractor</arg>
  </postConstruct>
...
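The `ocrcmd` in the snippet above is only a placeholder name. As a minimal sketch of what such a command could look like, assuming Tesseract is installed and that CommandExtractor reads the extracted text from `$OUTPUT_FILE`, a small wrapper might be written as below. The script name, the `build_command` and `run_ocr` helpers, and the default language `eng` are all hypothetical; only the `tesseract IMAGE stdout -l LANG` invocation comes from the Tesseract command line itself.

```python
#!/usr/bin/env python3
"""Hypothetical `ocrcmd` wrapper: ocrcmd INPUT_FILE OUTPUT_FILE."""
import subprocess
import sys


def build_command(input_file, lang="eng"):
    # Passing "stdout" as the output base makes tesseract print the
    # recognized text instead of writing <base>.txt next to the image.
    return ["tesseract", input_file, "stdout", "-l", lang]


def run_ocr(input_file, output_file, lang="eng", runner=subprocess.run):
    # check=True raises CalledProcessError if tesseract exits with an error,
    # so a failed OCR run is not silently turned into an empty text file.
    result = runner(build_command(input_file, lang),
                    stdout=subprocess.PIPE, check=True)
    # Write the raw bytes exactly where the caller asked for them.
    with open(output_file, "wb") as out:
        out.write(result.stdout)


if __name__ == "__main__":
    if len(sys.argv) == 3:
        run_ocr(sys.argv[1], sys.argv[2])
    else:
        print("usage: ocrcmd INPUT_FILE OUTPUT_FILE", file=sys.stderr)
```

Saved as an executable named `ocrcmd` on the PATH of the Fess process, this would match the `"ocrcmd $INPUT_FILE $OUTPUT_FILE"` command string used in the component definition. Capturing standard output is a deliberate choice here: Tesseract otherwise appends `.txt` to the output base, which would not match the exact output path passed in.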
(from github.com/freestyle68)
OK, I’ll try it in the next few days.
Let me explain my questions about OCR better:
Until version 11.0.3 (and perhaps 11.1), OCR ran automatically whenever Tesseract was installed. So disabling OCR required uninstalling Tesseract (or applying the solution at the bottom of: TikaOCR - TIKA - Apache Software Foundation).
That release used Tika 1.14.
Now Fess has 1.16, and OCR never starts. Editing the config to set enableImageProcessing=1, recompiling, and copying the newly generated tika-parsers-1.16.jar into the Fess libs also has no effect on images.
Tesseract itself works correctly: launching tika-app-1.16.jar (Apache Download Mirrors) from the command line generates the OCR text.
Did Fess disable OCR through the Tika library (tika-parsers-1.16.jar) by design?
And regarding the initial question,
Is it possible to disable/enable OCR for specific file crawling using config parameters?
I was asking whether it is possible to disable or enable OCR for a single crawling configuration through its Config Parameters. That way OCR would be enabled or disabled only for specific paths rather than globally, and it would still be done by Tika as usual.
With your proposed solution I would be forced to edit extractor.xml path by path to enable OCR.
(from github.com/marevol)
Fess does not pass a TesseractOCRConfig instance to Tika at the moment (Fess does not use Tika directly…).
I’ll support it in a future release…
© 2020. All Rights Reserved - CodeLibs, Inc.