PDF attachments not indexed

discuss · January 29, 2019, 12:30am

(from github.com/freestyle68)
Hi,

Currently, attachments embedded in a PDF are not indexed.
Is there a workaround?

Thank you

discuss · January 29, 2019, 4:05am

(from github.com/marevol)
It depends on Tika and PDFBox.

discuss · January 29, 2019, 4:57pm

(from github.com/freestyle68)
But the standalone Tika app does extract the content of attachments.

Download the Tika app from https://www.apache.org/dyn/closer.cgi/tika/tika-app-1.20.jar

Create two folders, in and out, and run:

java -jar tika-app-1.20.jar -T -i /"path for in" -o /"path for out"

In the out folder I can see all the attachments extracted: PDF, Excel, etc.

For the attached sample.pdf I get the output sample.pdf.txt.
I have attached both documents.

sample.pdf
sample.pdf.txt

Why is this not possible with Fess?

discuss · February 3, 2019, 9:56pm

(from github.com/marevol)
I’ll fix it in a future release…

discuss · February 6, 2019, 11:16am

(from github.com/freestyle68)
With this commit:

I get a java.lang.ClassCastException when crawling the file system:

Path: file:/pdfs/

Log:
org.codelibs.fess.crawler.exception.CrawlingAccessException: Could not serialize object
    at org.codelibs.fess.crawler.transformer.AbstractFessFileTransformer.transform(AbstractFessFileTransformer.java:84)
    at org.codelibs.fess.crawler.processor.impl.DefaultResponseProcessor.process(DefaultResponseProcessor.java:77)
    at org.codelibs.fess.crawler.CrawlerThread.processResponse(CrawlerThread.java:330)
    at org.codelibs.fess.crawler.FessCrawlerThread.processResponse(FessCrawlerThread.java:240)
    at org.codelibs.fess.crawler.CrawlerThread.run(CrawlerThread.java:176)
    at java.base/java.lang.Thread.run(Thread.java:844)
Caused by: java.lang.ClassCastException: java.base/java.lang.String cannot be cast to java.base/[Ljava.lang.Object;
    at org.codelibs.fess.crawler.transformer.FessTransformer.putResultDataBody(FessTransformer.java:117)
    at org.codelibs.fess.crawler.transformer.AbstractFessFileTransformer.generateData(AbstractFessFileTransformer.java:244)
    at org.codelibs.fess.crawler.transformer.AbstractFessFileTransformer.transform(AbstractFessFileTransformer.java:82)
…
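
For readers unfamiliar with the "String cannot be cast to [Ljava.lang.Object;" message in the log above: it means that code expecting an Object[] received a plain String. The sketch below reproduces that kind of failure using only the JDK. It is not the Fess source; the class CastSketch and the helpers blindCastFails and toStringArray are hypothetical names made up for illustration, and the defensive conversion shown is just one possible approach, not necessarily how the fix in Fess works.

```java
// Hypothetical illustration only; none of these names exist in Fess.
class CastSketch {

    // Returns true when casting the value to Object[] throws a
    // ClassCastException, which is what the "Caused by" line reports.
    static boolean blindCastFails(Object value) {
        try {
            Object[] items = (Object[]) value; // fails if value is a String
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    // Defensive alternative: accept either a single value or an array
    // and always hand back a String[].
    static String[] toStringArray(Object value) {
        if (value == null) {
            return new String[0];
        }
        if (value instanceof Object[]) {
            Object[] items = (Object[]) value;
            String[] result = new String[items.length];
            for (int i = 0; i < items.length; i++) {
                result[i] = String.valueOf(items[i]);
            }
            return result;
        }
        return new String[] { String.valueOf(value) };
    }

    public static void main(String[] args) {
        Object body = "extracted text"; // a single String, not an array
        System.out.println(blindCastFails(body));       // prints true
        System.out.println(toStringArray(body).length); // prints 1
    }
}
```

Checking the runtime type before casting avoids the exception whether a field arrives as a single value or as an array.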

discuss · February 6, 2019, 1:15pm

(from github.com/marevol)
I think you used the wrong versions.

discuss · February 6, 2019, 4:23pm

(from github.com/freestyle68)
It happens with several docs: pdf, doc, pptx, etc.

I have attached sample documents that trigger this error; they come from https://openpreservation.org/technology/corpora/govdocs/

My Java version:

openjdk 10.0.2 2018-07-17
OpenJDK Runtime Environment (build 10.0.2+13-Ubuntu-1ubuntu0.18.04.4)
OpenJDK 64-Bit Server VM (build 10.0.2+13-Ubuntu-1ubuntu0.18.04.4, mixed mode)

This problem was introduced by commit https://github.com/codelibs/fess/tree/f341a4e2b29d7130bab5b058d1d989c6e3f1634f; everything worked before it.

docs.zip

discuss · February 6, 2019, 8:31pm

(from github.com/marevol)
How did you build Fess?

discuss · February 6, 2019, 9:02pm

(from github.com/freestyle68)
mvn antrun:run
mvn package -DskipTests

Then I used fess-13.0.0-SNAPSHOT.zip

discuss · February 6, 2019, 9:58pm

(from github.com/marevol)
Thanks, I found it.
Fixed in #2009.

discuss · April 13, 2019, 3:59pm

(from github.com/freestyle68)
Hi,

Regarding the original question: Fess now indexes attachment content as well. Tested with pdf, msg, eml.
So thanks for your commit.

However, the attachment file name is still missing from the Fess index, although Tika can extract it as well. For example, with

java -jar tika-app-1.20.jar -x sample.pdf

I get

<div source="attachment" class="embedded" id="attachment.pdf"/>
<div class="acroform"><ol/>
</div>

and with a msg file I get a similar output:

<div class="attachment-entry"><h1>attachment.pdf</h1>
<div class="package-entry"><h1>attachment.pdf</h1>
<div class="page"><p/>
</div>

Please do not forget to add this feature in the future.
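
To illustrate what "extracting the attachment filename" could look like, here is a minimal sketch that reads the id attribute of each div with class "embedded" from XHTML like the PDF output above. It uses only the JDK's XML parser and assumes the input is well-formed XML; the class name EmbeddedNames, the method embeddedNames, and the body wrapper around the sample are assumptions for illustration. It does not cover the attachment-entry markup from the msg output, and it is not how Fess or Tika do this internally.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper; the names here are not part of Tika or Fess.
class EmbeddedNames {

    // Collects the id attribute of every <div class="embedded" id="...">.
    static List<String> embeddedNames(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            NodeList divs = doc.getElementsByTagName("div");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < divs.getLength(); i++) {
                Element div = (Element) divs.item(i);
                if ("embedded".equals(div.getAttribute("class"))
                        && div.hasAttribute("id")) {
                    names.add(div.getAttribute("id"));
                }
            }
            return names;
        } catch (Exception e) {
            // Malformed input or parser setup failure.
            throw new IllegalArgumentException("Could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        // The fragment from the post, wrapped in one root element
        // so that it is a well-formed document.
        String sample = "<body>"
                + "<div source=\"attachment\" class=\"embedded\" id=\"attachment.pdf\"/>"
                + "<div class=\"acroform\"><ol/></div>"
                + "</body>";
        System.out.println(embeddedNames(sample)); // prints [attachment.pdf]
    }
}
```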

discuss · April 19, 2019, 3:22pm

(from github.com/freestyle68)
Perhaps it was my mistake, or perhaps it was your commit, but with the current version (12.6, latest commit) I can also search by attachment file name.

So this problem is fixed.

Thanks