(from github.com/freestyle68)
Hi,
currently, an attachment embedded in a PDF is not indexed.
Is there a workaround for this?
Thank you
(from github.com/freestyle68)
But the standalone Tika app does extract the content of attachments.
Download the Tika app from https://www.apache.org/dyn/closer.cgi/tika/tika-app-1.20.jar,
create two folders, in and out, and run:
java -jar tika-app-1.20.jar -T -i /"path for in" -o /"path for out"
In the out folder I can see the extracted text of all attachments: pdf, excel, etc.
For the attached sample.pdf I get the output file sample.pdf.txt.
I attach the documents.
Why is this not possible with Fess?
(from github.com/freestyle68)
With this commit:
I get a java.lang.ClassCastException when crawling the file system:
Path: file:/pdfs/
Log:
org.codelibs.fess.crawler.exception.CrawlingAccessException: Could not serialize object
    at org.codelibs.fess.crawler.transformer.AbstractFessFileTransformer.transform(AbstractFessFileTransformer.java:84)
    at org.codelibs.fess.crawler.processor.impl.DefaultResponseProcessor.process(DefaultResponseProcessor.java:77)
    at org.codelibs.fess.crawler.CrawlerThread.processResponse(CrawlerThread.java:330)
    at org.codelibs.fess.crawler.FessCrawlerThread.processResponse(FessCrawlerThread.java:240)
    at org.codelibs.fess.crawler.CrawlerThread.run(CrawlerThread.java:176)
    at java.base/java.lang.Thread.run(Thread.java:844)
Caused by: java.lang.ClassCastException: java.base/java.lang.String cannot be cast to java.base/[Ljava.lang.Object;
    at org.codelibs.fess.crawler.transformer.FessTransformer.putResultDataBody(FessTransformer.java:117)
    at org.codelibs.fess.crawler.transformer.AbstractFessFileTransformer.generateData(AbstractFessFileTransformer.java:244)
    at org.codelibs.fess.crawler.transformer.AbstractFessFileTransformer.transform(AbstractFessFileTransformer.java:82)
    …
(from github.com/freestyle68)
It happens with several document types: pdf, doc, pptx, etc.
I attach a sample of documents that trigger this error. They come from https://openpreservation.org/technology/corpora/govdocs/
My Java version:
openjdk 10.0.2 2018-07-17
OpenJDK Runtime Environment (build 10.0.2+13-Ubuntu-1ubuntu0.18.04.4)
OpenJDK 64-Bit Server VM (build 10.0.2+13-Ubuntu-1ubuntu0.18.04.4, mixed mode)
This problem was introduced by commit https://github.com/codelibs/fess/tree/f341a4e2b29d7130bab5b058d1d989c6e3f1634f ; before that commit everything worked.
(from github.com/freestyle68)
mvn antrun:run
mvn package -DskipTests
Then I used fess-13.0.0-SNAPSHOT.zip.
(from github.com/freestyle68)
Hi,
regarding the original question, Fess now also indexes the content of attachments. Tested with pdf, msg, eml.
So thanks for your commit.
But the attachment filename is still missing from the Fess index, although Tika can extract it too. For example, with
java -jar tika-app-1.20.jar -x sample.pdf
I get
<div source="attachment" class="embedded" id="attachment.pdf"/>
<div class="acroform"><ol/>
</div>
and with a msg file I get similar output:
<div class="attachment-entry"><h1>attachment.pdf</h1>
<div class="package-entry"><h1>attachment.pdf</h1>
<div class="page"><p/>
</div>
Please do not forget to add this feature in the future.
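
As a rough illustration of what is being asked for, the sketch below pulls attachment filenames out of Tika's XHTML output (the `-x` option used above). It relies only on the two fragments quoted in this comment: a `div` with class `embedded` whose `id` holds the filename, and an `h1` inside a `div` with class `attachment-entry` or `package-entry`. Whether other file types produce the same markup is an assumption, and the names `AttachmentNameParser` and `attachment_names` are hypothetical. This is not how Fess does it internally.

```python
from html.parser import HTMLParser

# Class names taken from the output fragments quoted in this thread;
# other document types may use different markup (assumption).
ENTRY_CLASSES = {"attachment-entry", "package-entry"}


class AttachmentNameParser(HTMLParser):
    """Collect attachment filenames from Tika XHTML output."""

    def __init__(self):
        super().__init__()
        self.names = []               # unique names, in order of appearance
        self._expect_heading = False  # last div opened was an entry div
        self._in_heading = False      # currently inside that entry's <h1>

    def _add(self, name):
        name = name.strip()
        if name and name not in self.names:
            self.names.append(name)

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = set((attr.get("class") or "").split())
        if tag == "div":
            # PDF case: <div class="embedded" id="attachment.pdf"/>
            if "embedded" in classes and attr.get("id"):
                self._add(attr["id"])
            # MSG case: the filename follows in an <h1>
            self._expect_heading = bool(classes & ENTRY_CLASSES)
        elif tag == "h1" and self._expect_heading:
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_heading = False
            self._expect_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self._add(data)


def attachment_names(xhtml):
    """Return the attachment filenames found in a Tika XHTML string."""
    parser = AttachmentNameParser()
    parser.feed(xhtml)
    parser.close()
    return parser.names


if __name__ == "__main__":
    pdf_fragment = (
        '<div source="attachment" class="embedded" id="attachment.pdf"/>\n'
        '<div class="acroform"><ol/>\n</div>'
    )
    print(attachment_names(pdf_fragment))
```

The names returned this way could then be stored in an extra index field, so that a search for the attachment filename finds the parent document.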
(from github.com/freestyle68)
Perhaps it was my mistake, or perhaps it was your commit, but with the current version (12.6, latest commit) I can also search by attachment filename.
So this problem is fixed.
Thanks
© 2020. All Rights Reserved - CodeLibs, Inc.