enabling OCR with tesseract

(from github.com/biffbaxter)
It appears this issue has not been revisited for a while, but I am trying to get OCR working for all files. I have tesseract installed and a good config file. Previously closed tickets suggest OCR works as long as tesseract is installed, but I have tried passing the config path in the crawler parameters and also tried some changes in extractor.xml, to no avail so far. Do I need to add an arg for each file type, or is there a global setting?

A tip, an example, or anything along those lines would be helpful. Thanks much.

(from github.com/marevol)
Did you create TesseractOCRConfig.properties and set it in the crawling config?

(from github.com/biffbaxter)
Thank you for the response. Yes, I did, but I had renamed it to something different, which seems to have been a problem. I also fixed a path typo and some OCR parameters in the config. (I was also only testing PDFs, which I have since determined are not handled the same way as images.) I can now OCR images (PNG, TIFF), so I am getting closer. Now to OCR PDFs. I also passed parameters via PDFParser.properties. Is this the proper method, or do I need to do something different?

Thank you.
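
For anyone hitting the same two problems described above (a renamed config file and a path typo), a quick check outside the crawler can rule them out. This is only a minimal sketch: `check_ocr_prerequisites` and the example path are hypothetical names, not part of Fess or Tika, and the PATH seen by this script may differ from the one the crawler process sees.

```python
import shutil
from pathlib import Path


def check_ocr_prerequisites(binary="tesseract", config_path=None):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    # shutil.which searches PATH for an executable, much as a shell would.
    if shutil.which(binary) is None:
        problems.append(f"'{binary}' was not found on PATH")
    # Only check the config file if a path was supplied.
    if config_path is not None and not Path(config_path).is_file():
        problems.append(f"config file not found: {config_path}")
    return problems


if __name__ == "__main__":
    # Hypothetical path; substitute the one actually passed to the crawler.
    for problem in check_ocr_prerequisites(
        config_path="/path/to/TesseractOCRConfig.properties"
    ):
        print(problem)
```

An empty result only shows that the binary and the file exist; it says nothing about whether the crawler actually loads that file.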


(from github.com/biffbaxter)
Hi there, following up on the PDF OCR. Is there a particular method I should use? Any guidance would be appreciated. I saw some old info suggesting additional steps might be needed, but I am unsure whether that still applies in 2018. OCR works with images, and other documents (Word, Excel, etc.) work too, but so far I have not gotten scanned PDFs to read.

(from github.com/marevol)
Try to remove the following lines:

(from github.com/biffbaxter)
So far I have not had success with removing the lines mentioned. I am also a little unsure which config parameter to pass. I have tried what was mentioned previously as well as the one below, and have not seen a difference. I also reduced the .properties file to the minimum items and tried a default configuration. Thank you for helping.


(from github.com/biffbaxter)
I believe the path and syntax below are correct, because I can make it produce an error in the log files (that it cannot load the file) when I format the file improperly. So I know it's reading the file. (I tried both PDFParser.properties and PDFParserConfig.properties.) The file being used is the default generic PDFParser.properties with the option:
ocrStrategy ocr_and_text_extraction

sample here - https://github.com/apache/tika/blob/master/tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties
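
To confirm which `ocrStrategy` value a given properties file actually carries, independent of the crawler, something like the following could be used. It is a rough sketch, not Tika's own loader: `parse_properties` and `ocr_strategy` are hypothetical helpers that handle only a simplified subset of the Java `.properties` format (no line continuations or escapes), and the sample text contains just the one option quoted above.

```python
import re


def parse_properties(text):
    """Parse a simplified subset of the Java .properties format into a dict."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines ('#' or '!').
        if not line or line[0] in "#!":
            continue
        # The key ends at the first '=', ':' or run of whitespace.
        parts = re.split(r"\s*[=:]\s*|\s+", line, maxsplit=1)
        props[parts[0]] = parts[1] if len(parts) > 1 else ""
    return props


def ocr_strategy(text):
    """Return the ocrStrategy value, or None if the key is absent."""
    return parse_properties(text).get("ocrStrategy")


if __name__ == "__main__":
    sample = "# excerpt: only the option quoted in this thread\nocrStrategy ocr_and_text_extraction\n"
    print(ocr_strategy(sample))  # prints: ocr_and_text_extraction
```

This only shows what the file says; whether the crawler picks up that file is a separate question.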

Path used:

(from github.com/biffbaxter)
Hi @marevol, I am unfortunately still stuck. Any other suggestions? If I need to donate some $ to get this solved, I am happy to do so. I love all the other things Fess does, but I have to be able to OCR PDFs. Thanks for the work.

(from github.com/marevol)
If you need more support, please contact Commercial Support.