How to re-index failure URLs only (SMB FS crawler)


I crawled 6 TB of PDF files (which took a long time) and encountered some errors.
The Failure URL section shows about 10K files with Samba errors.
Is there a way to re-index the files listed in the Failure URL log without having to re-crawl the entire SMB share?


I think you need to create a script that retrieves the URLs from the Failure URL logs via the Admin API and then crawls them using CSV list crawling.
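As a rough sketch of that script, the step of turning the Admin API response into a one-column CSV for list crawling could look like the following. Note the endpoint path, the response shape, and the field names (`response`, `logs`, `url`) are assumptions for illustration; check the Admin API documentation for your Fess version before relying on them.

```python
import csv
import io
import json

# Hypothetical sample of what a failure-URL Admin API response might look
# like. In a real script you would fetch this with an authenticated HTTP GET,
# e.g. something like GET /api/admin/failureurl/logs (endpoint is an
# assumption -- verify against your version's Admin API docs).
sample_response = json.dumps({
    "response": {
        "logs": [
            {"url": "smb://fileserver/share/report1.pdf", "errorCount": 3},
            {"url": "smb://fileserver/share/report2.pdf", "errorCount": 1},
        ]
    }
})

def failure_urls_to_csv(response_text: str) -> str:
    """Extract the URLs from a failure-URL API response and render them
    as a one-column CSV suitable for list crawling."""
    logs = json.loads(response_text)["response"]["logs"]
    buf = io.StringIO()
    writer = csv.writer(buf)
    for entry in logs:
        writer.writerow([entry["url"]])
    return buf.getvalue()

print(failure_urls_to_csv(sample_response))
```

You would then save the output to a file and point a list-crawling config at it, so only the failed files get re-crawled instead of the whole share.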

I was able to export a CSV with elastic-query-export (GitHub - pteich/elastic-query-export: 🚚 Export data from Elasticsearch to CSV/JSON using a Lucene query, e.g. from Kibana, or a raw JSON query string).
Still, there should be a better way for an admin to handle failed URLs.
It would be nice if I could create a re-crawl job directly from the Failure URL page.