OCR: "not modified" unavailable

discuss · March 23, 2019, 12:27am

Using the latest Fess commit, I followed your instructions at https://qiita.com/shinsuke_sugaya/items/a02b102b02fcb8e6c922 for the OCR settings.

Everything works correctly, but I am facing a problem that makes OCR practically useless.

On every daily recrawl, all PDF files are fully reprocessed by tesseract. With 300000 files it is impossible to get through all of them daily, because the whole task takes several days.

In fess-crawler.log, the usual “Not modified” line for recrawls does not appear for PDFs.
Non-PDF files do not have this problem and are not indexed again.

In the Fess settings, “Check Last Modified” is enabled.

Is there a way to restore the usual behavior when using OCR, so that PDFs are processed only the first time?

Thank you

discuss · March 23, 2019, 7:45am

(from github.com/marevol)
Try removing ModDate=last_modified:pdf_date\n\ from fess_config.properties.

crawler.metadata.name.mapping=\
title=title:string\n\
Title=title:string\n\
Last-Save-Date=last_modified:date\n\
Last-Modified=last_modified:date\n\
ModDate=last_modified:pdf_date\n\

discuss · March 23, 2019, 4:01pm

(from github.com/freestyle68)
I tried that and deleted the index, but it doesn’t change anything.

discuss · March 24, 2019, 10:53pm

(from github.com/marevol)
Could you check the last_modified field of the indexed document on the Admin Search page, and the Last-Modified response header (or the timestamp of the file)?
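
A minimal sketch of that comparison for a file on local disk, assuming the indexed value is copied by hand from the Admin Search page in a form like 2017-12-03T23:38:50.000Z. The function names are made up for illustration and are not part of Fess.

```python
import os
from datetime import datetime, timezone


def file_mtime_utc(path):
    """Return the file's modification time as a UTC datetime, truncated to seconds."""
    return datetime.fromtimestamp(int(os.stat(path).st_mtime), tz=timezone.utc)


def parse_indexed(value):
    """Parse a last_modified string such as 2017-12-03T23:38:50.000Z (assumed format)."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    # Drop sub-second precision so both sides are compared at one-second resolution.
    return parsed.replace(microsecond=0, tzinfo=timezone.utc)


def same_timestamp(path, indexed_last_modified):
    """True if the indexed last_modified equals the file's timestamp."""
    return file_mtime_utc(path) == parse_indexed(indexed_last_modified)
```

If same_timestamp returns False for a PDF, the indexed value probably comes from somewhere other than the file system, for example from metadata embedded in the document.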

discuss · March 25, 2019, 4:32pm

(from github.com/freestyle68)
With the factory (unmodified) Fess version:

from admin search page:

last_modified: 2017-12-03T23:38:50.000Z
timestamp: 2017-12-03T23:38:50.000Z

File timestamp (from stat file.pdf on the command line): 2017-12-03 23:38:50.000000000

With OCR, without removing “ModDate=last_modified:pdf_date\n”:

from admin search page:

last_modified: 2004-06-07T16:39:54.000Z
timestamp: 2004-06-07T16:39:54.000Z

(these are also the metadata values I get from tika-app)

With OCR, after removing “ModDate=last_modified:pdf_date\n”:

from admin search page:

last_modified: 2004-06-07T16:39:54.000Z
timestamp: 2004-06-07T16:39:54.000Z
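
For reference, PDF modification dates are normally stored as strings of the form D:YYYYMMDDHHmmSS followed by an optional offset. Below is a rough sketch of converting such a string to UTC so it can be compared with the last_modified values above. The sample string is hypothetical (chosen to match the value shown above), the parser treats a missing offset as UTC as a simplifying assumption, and this is not Fess's or Tika's own parser.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS plus an optional offset: Z, or +HH'mm' / -HH'mm'.
# Every field after the year is optional.
_PDF_DATE = re.compile(
    r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def parse_pdf_date(value):
    """Convert a PDF date string to a timezone-aware UTC datetime."""
    m = _PDF_DATE.match(value.strip())
    if m is None:
        raise ValueError("not a PDF date string: %r" % value)
    year = int(m.group(1))
    month = int(m.group(2) or 1)
    day = int(m.group(3) or 1)
    hour, minute, second = (int(m.group(i) or 0) for i in (4, 5, 6))
    tz = timezone.utc  # assumption: no offset given means UTC
    if m.group(8):
        offset = timedelta(hours=int(m.group(9)), minutes=int(m.group(10) or 0))
        tz = timezone(offset if m.group(8) == "+" else -offset)
    local = datetime(year, month, day, hour, minute, second, tzinfo=tz)
    return local.astimezone(timezone.utc)


# Hypothetical ModDate value, not taken from the actual file in this thread.
print(parse_pdf_date("D:20040607163954Z").isoformat())
```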

discuss · March 26, 2019, 6:38pm

(from github.com/freestyle68)
I ran the same tests after the #2065 fix.

Using your OCR settings as described in https://qiita.com/shinsuke_sugaya/items/a02b102b02fcb8e6c922, the problem is the same: last_modified is taken from the PDF metadata (not from the file timestamp), and all PDFs are fully reprocessed by tesseract on every recrawl.

But removing

Last-Save-Date=last_modified:date\n\
Last-Modified=last_modified:date\n\
ModDate=last_modified:pdf_date\n\

from fess_config.properties, last_modified becomes equal to the file timestamp and tesseract does not run on recrawl. So this way things also work well for OCR.

However, removing these lines is a regression, because the #2019 fix was introduced to support last_modified metadata for non-PDF files, so this is the wrong way to do OCR.

discuss · March 26, 2019, 8:29pm

(from github.com/marevol)
This issue is not related to #2065.
#2019 uses metadata as last_modified. So, for PDF and MS Office files, incremental crawling will not work if the metadata does not match the file timestamp.
Since it may confuse users, I’ll remove these settings…
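
To illustrate the mismatch, here is a simplified model of the behavior described in this thread. It is not Fess source code, and all function names are hypothetical. It assumes that a mapped metadata key, when present, replaces the file timestamp as last_modified, and that a recrawl is skipped only when the stored value equals the file timestamp.

```python
from datetime import datetime, timezone


def stored_last_modified(file_mtime, metadata, mapped_keys):
    """Value written to the index: the first mapped metadata key wins (assumed)."""
    for key in mapped_keys:
        if key in metadata:
            return metadata[key]
    return file_mtime


def is_not_modified(file_mtime, stored):
    """Incremental crawl check: skip the file only if both timestamps match."""
    return stored == file_mtime


# Values taken from the test results earlier in the thread.
file_mtime = datetime(2017, 12, 3, 23, 38, 50, tzinfo=timezone.utc)
pdf_metadata = {"ModDate": datetime(2004, 6, 7, 16, 39, 54, tzinfo=timezone.utc)}

with_mapping = stored_last_modified(
    file_mtime, pdf_metadata, ["Last-Save-Date", "Last-Modified", "ModDate"]
)
without_mapping = stored_last_modified(file_mtime, pdf_metadata, [])

print(is_not_modified(file_mtime, with_mapping))     # False: reprocessed on every recrawl
print(is_not_modified(file_mtime, without_mapping))  # True: skipped as not modified
```

Under this model, any document whose embedded date differs from its file timestamp is reprocessed on every crawl, which is why the OCR cost repeats daily.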