(from osdn.net/users/sanophoto8)
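When I crawled a long PDF (around 100 pages, with a transparent text layer), only the text from roughly the first 10 pages became searchable.
I tried to apply the fix described in the following FAQ entry at http://fess.codelibs.org/ja/faq.html :
"In documents with a lot of text, words near the end do not seem to be searchable...
This can be handled in the Solr configuration. Increase maxFieldLength in solr/core1/conf/solrconfig.xml. Note that increasing it too much consumes memory."
Looking at solrconfig.xml, however, maxFieldLength appears to have been removed in 4.0, and the file says to define maxTokenCount instead: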
<filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="50000"/>
I increased maxTokenCount, but nothing improved: words near the end of the document still do not seem to be searchable.
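For context, as far as I understand, a filter like this takes effect only when it is part of a field type's analyzer chain. Below is a minimal sketch of what that could look like in schema.xml. The field type name text_limited and the use of StandardTokenizerFactory are assumptions for illustration only and are not copied from the Fess configuration.

```xml
<!-- Hypothetical field type; the name and tokenizer are illustrative assumptions. -->
<fieldType name="text_limited" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Tokens beyond maxTokenCount are dropped at index time,
         so text past this limit is not searchable. -->
    <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="50000"/>
  </analyzer>
</fieldType>
```

A change of this kind would presumably only affect documents indexed after Solr reloads the configuration, so the PDFs would need to be crawled again before checking the result.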
The fess_crawler.out log is as follows.
2013-10-06 19:32:47,851 [Robot-20131006192831-2-3] INFO org.apache.http.impl.client.DefaultHttpClient - Retrying request
2013-10-06 19:32:50,873 [Robot-20131006192831-2-4] INFO org.apache.pdfbox.util.PDFStreamEngine - unsupported/disabled operation: EI
2013-10-06 19:32:56,396 [IndexUpdater] INFO jp.sf.fess.solr.IndexUpdater - Sent 8 documents. The execution time is 172580ms.
2013-10-06 19:32:56,463 [IndexUpdater] INFO jp.sf.fess.solr.IndexUpdater - The number of a crawled document is 8. The processing size is 7. The execution time is 67ms.
2013-10-06 19:32:56,526 [Robot-20131006192831-2-4] INFO org.seasar.robot.helper.impl.LogHelperImpl - Crawling URL: file:/home/share/E08_01.pdf
2013-10-06 19:32:56,578 [IndexUpdater] INFO jp.sf.fess.solr.IndexUpdater - The number of a crawled document is 1. The processing size is 0. The execution time is 14ms.
2013-10-06 19:32:56,579 [IndexUpdater] INFO jp.sf.fess.solr.IndexUpdater - Sending 7 document to a server.
2013-10-06 19:33:09,737 [Robot-20131006192831-2-5] INFO org.apache.pdfbox.util.PDFStreamEngine - unsupported/disabled operation: EI
2013-10-06 19:33:12,699 [Robot-20131006192831-2-5] INFO org.seasar.robot.helper.impl.LogHelperImpl - Crawling URL: file:/home/share/E08_03.pdf
2013-10-06 19:33:13,888 [IndexUpdater] INFO jp.sf.fess.solr.IndexUpdater - Sent 7 documents. The execution time is 17309ms.
2013-10-06 19:33:25,026 [Robot-20131006192831-3-1] INFO org.apache.pdfbox.util.PDFStreamEngine - unsupported/disabled operation: EI
2013-10-06 19:33:27,918 [Robot-20131006192831-3-1] INFO org.seasar.robot.helper.impl.LogHelperImpl - Crawling URL: file:/home/share/E08_01.pdf
2013-10-06 19:33:41,088 [Robot-20131006192831-3-3] INFO org.apache.pdfbox.util.PDFStreamEngine - unsupported/disabled operation: EI
2013-10-06 19:33:43,958 [Robot-20131006192831-3-3] INFO org.seasar.robot.helper.impl.LogHelperImpl - Crawling URL: file:/home/share/E08_03.pdf
2013-10-06 19:33:53,829 [Robot-20131006192831-3-4] INFO org.apache.pdfbox.util.PDFStreamEngine - unsupported/disabled operation: EI
2013-10-06 19:33:56,309 [Robot-20131006192831-3-4] INFO org.seasar.robot.helper.impl.LogHelperImpl - Crawling URL: file:/home/share/E08_05.pdf
2013-10-06 19:33:56,483 [IndexUpdater] INFO jp.sf.fess.solr.IndexUpdater - The number of a crawled document is 5. The processing size is 4. The execution time is 82ms.
2013-10-06 19:33:56,822 [IndexUpdater] INFO jp.sf.fess.solr.IndexUpdater - The number of a crawled document is 1. The processing size is 0. The execution time is 5ms.
2013-10-06 19:33:56,822 [IndexUpdater] INFO jp.sf.fess.solr.IndexUpdater - Sending 4 document to a server.
2013-10-06 19:34:08,169 [IndexUpdater] INFO jp.sf.fess.solr.IndexUpdater - Sent 4 documents. The execution time is 11347ms.
2013-10-06 19:34:12,472 [Robot-20131006192831-3-2] INFO org.apache.pdfbox.util.PDFStreamEngine - unsupported/disabled operation: EI
2013-10-06 19:34:28,017 [Robot-20131006192831-2-2] INFO org.apache.pdfbox.util.PDFStreamEngine - unsupported/disabled operation: EI
2013-10-06 19:34:31,020 [Robot-20131006192831-2-2] INFO org.seasar.robot.helper.impl.LogHelperImpl - Crawling URL: file:/home/share/E08_05.pdf
2013-10-06 19:34:41,701 [Robot-20131006192831-3-5] INFO org.apache.pdfbox.util.PDFStreamEngine - unsupported/disabled operation: EI
2013-10-06 19:34:54,993 [Robot-20131006192831-2-1] INFO org.apache.pdfbox.util.PDFStreamEngine - unsupported/disabled operation: EI
2013-10-06 19:34:56,464 [IndexUpdater] INFO jp.sf.fess.solr.IndexUpdater - The number of a crawled document is 5. The processing size is 4. The execution time is 57ms.
2013-10-06 19:34:56,512 [IndexUpdater] INFO jp.sf.fess.solr.IndexUpdater - The number of a crawled document is 1. The processing size is 0. The execution time is 6ms.
2013-10-06 19:34:56,512 [IndexUpdater] INFO jp.sf.fess.solr.IndexUpdater - Sending 4 document to a server.
2013-10-06 19:34:57,275 [IndexUpdater] INFO jp.sf.fess.solr.IndexUpdater - Sent 4 documents. The execution time is 762ms.
2013-10-06 19:35:07,187 [Robot-20131006192831-2-2] INFO org.apache.pdfbox.util.PDFStreamEngine - unsupported/disabled operation: EI
2013-10-06 19:35:19,808 [Robot-20131006192831-3-4] INFO org.apache.pdfbox.util.PDFStreamEngine - unsupported/disabled operation: EI
2013-10-06 19:35:31,873 [Robot-20131006192831-3-3] INFO org.apache.pdfbox.util.PDFStreamEngine - unsupported/disabled operation: EI
2013-10-06 19:35:45,970 [Robot-20131006192831-3-1] INFO org.apache.pdfbox.util.PDFStreamEngine - unsupported/disabled operation: EI
2013-10-06 19:35:56,425 [IndexUpdater] INFO jp.sf.fess.solr.IndexUpdater - The number of a crawled document is 5. The processing size is 4. The execution time is 12ms.
2013-10-06 19:35:56,527 [IndexUpdater] INFO jp.sf.fess.solr.IndexUpdater - The number of a crawled document is 1. The processing size is 0. The execution time is 6ms.
2013-10-06 19:35:56,527 [IndexUpdater] INFO jp.sf.fess.solr.IndexUpdater - Sending 4 document to a server.
2013-10-06 19:35:57,601 [IndexUpdater] INFO jp.sf.fess.solr.IndexUpdater - Sent 4 documents. The execution time is 1074ms.
2013-10-06 19:35:58,639 [Robot-20131006192831-2-5] INFO org.apache.pdfbox.util.PDFStreamEngine - unsupported/disabled operation: EI
2013-10-06 19:36:15,112 [Robot-20131006192831-2-4] INFO org.apache.pdfbox.util.PDFStreamEngine - unsupported/disabled operation: EI
2013-10-06 19:36:33,096 [Robot-20131006192831-2-3] INFO org.apache.pdfbox.util.PDFStreamEngine - unsupported/disabled operation: EI
2013-10-06 19:36:56,440 [IndexUpdater] INFO jp.sf.fess.solr.IndexUpdater - The number of a crawled document is 4. The processing size is 4. The execution time is 20ms.
2013-10-06 19:36:56,510 [IndexUpdater] INFO jp.sf.fess.solr.IndexUpdater - The number of a crawled document is 0. The processing size is 0. The execution time is 3ms.
2013-10-06 19:36:56,510 [IndexUpdater] INFO jp.sf.fess.solr.IndexUpdater - Sending 4 document to a server.
2013-10-06 19:37:00,893 [IndexUpdater] INFO jp.sf.fess.solr.IndexUpdater - Sent 4 documents. The execution time is 4383ms.
2013-10-06 19:37:56,430 [IndexUpdater] INFO jp.sf.fess.solr.IndexUpdater - The number of a crawled document is 0. The processing size is 0. The execution time is 4ms.
2013-10-06 19:38:56,436 [IndexUpdater] INFO jp.sf.fess.solr.IndexUpdater - The number of a crawled document is 0. The processing size is 0. The execution time is 3ms.
2013-10-06 19:38:59,320 [IndexUpdater] INFO jp.sf.fess.solr.IndexUpdater - Deleted completed document data. The execution time is 2884ms.
2013-10-06 19:39:56,441 [IndexUpdater] INFO jp.sf.fess.solr.IndexUpdater - The number of a crawled document is 0. The processing size is 0. The execution time is 2ms.
2013-10-06 19:40:40,537 [Web Crawling Process] INFO jp.sf.fess.helper.WebFsIndexHelper - [EXEC TIME] crawling time: 712896ms
2013-10-06 19:40:56,448 [IndexUpdater] INFO jp.sf.fess.solr.IndexUpdater - The number of a crawled document is 0. The processing size is 0. The execution time is 2ms.
2013-10-06 19:40:56,461 [IndexUpdater] INFO jp.sf.fess.solr.IndexUpdater - [EXEC TIME] index update time: 227220ms
2013-10-06 19:40:56,634 [main] INFO jp.sf.fess.exec.Crawler - Deleted segment:20131006185158 in solrGroup1
2013-10-06 19:40:58,045 [main] INFO jp.sf.fess.exec.Crawler - Deleted segment:20131006175249 in solrGroup1
2013-10-06 19:40:58,125 [main] INFO jp.sf.fess.exec.Crawler - Deleted segment:20131006190143 in solrGroup1
2013-10-06 19:41:03,570 [main] INFO jp.sf.fess.exec.Crawler - [EXEC TIME] index commit time: 5286ms
2013-10-06 19:41:03,570 [main] INFO jp.sf.fess.exec.Crawler - Finished Crawler
If you have any idea what is going on, I would appreciate your advice.
Thank you in advance.