How to minimize re-crawling of sites?

(from github.com/abolotnov)
I have a task of crawling a few thousand sites. The way I am doing this now is by adding additional web crawler configs and starting the default crawler. Apparently, this is a bad strategy for me, because I don’t actually need all the sites re-crawled every time - keeping the indexed content fully up to date is not important - and it takes forever, too.

I thought of creating an individual crawling job for each of the new crawler configs that I create. Is this a valid approach? At this point I have about 5K sites to crawl. 5K web crawler configs and 5K jobs - is Fess going to handle this? Do I need to sequence their starts (e.g., start them in batches of 10)?

Many thanks. Fess seems like a great system; I am trying to use it properly now.

(from github.com/marevol)
The Fess crawler is just a Java process.
If you create and run a crawling job, one process is executed per job.
Currently, Fess does not manage the number of crawler processes,
so you need to handle that yourself.

In a future release, I’ll add a threshold to manage the number of executed processes.
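
Until such a threshold exists, you can throttle the starts yourself. Below is a minimal sketch of running the jobs in fixed-size batches; `start_job` and `running_jobs` are hypothetical helpers you would implement against however your Fess installation exposes job control (admin UI scheduling, the admin API of your version, etc.):

```python
import time

BATCH_SIZE = 10    # concurrent crawler processes (tune to your hardware)
POLL_SECONDS = 60  # how often to check for finished jobs


def start_job(job_id):
    """Hypothetical helper: trigger one Fess crawling job
    (via whatever mechanism your Fess version provides)."""
    raise NotImplementedError


def running_jobs():
    """Hypothetical helper: return the ids of jobs still running
    (e.g., derived from admin/joblog)."""
    raise NotImplementedError


def crawl_in_batches(job_ids):
    """Start jobs so that at most BATCH_SIZE crawler processes run at once."""
    pending = list(job_ids)
    while pending or running_jobs():
        # Top the pool back up to BATCH_SIZE, then wait and re-check.
        while pending and len(running_jobs()) < BATCH_SIZE:
            start_job(pending.pop(0))
        time.sleep(POLL_SECONDS)
```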

(from github.com/abolotnov)
So the “Simultaneous Crawler Config” is not it then?

How do I view the status of jobs? admin/joblog seems close, but it just gives me a log of everything executed, with start/end times, rather than the status of individual jobs.

(from github.com/marevol)

So the “Simultaneous Crawler Config” is not it then?

You can set multiple crawling configs in one crawling job.
That parameter controls how many crawling configs are crawled simultaneously within a job.
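
So instead of 5K jobs, the 5K configs can be grouped into a much smaller number of jobs. A rough sketch of the grouping (the config ids here are placeholders; how a job references its configs - e.g., in the job script - is described in the Fess docs):

```python
def chunk(ids, size):
    """Split a list of crawling-config ids into groups of `size`."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

web_config_ids = [f"config-{n}" for n in range(5000)]  # placeholder ids
job_groups = chunk(web_config_ids, 100)  # 50 jobs of 100 configs each
for i, group in enumerate(job_groups):
    print(f"job-{i}: {len(group)} configs")
```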

For the details of crawler activity, see fess-crawler.log.

(from github.com/abolotnov)
Is there a way to create/delete a crawling job via the admin REST API?

(from github.com/marevol)
See the API docs and past questions in Issues.
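
For illustration, a minimal sketch of calling the admin API from Python. It assumes the endpoint layout and response shape of recent Fess releases (an access token for authorization, `GET /api/admin/scheduler/settings` to list jobs, `DELETE /api/admin/scheduler/setting/{id}` to delete one); verify the exact paths and payload fields against the API doc for your version:

```python
import requests

FESS = "http://localhost:8080"
# Access token created under admin > Access Token (assumed auth scheme;
# check the API doc for your version)
HEADERS = {"Authorization": "YOUR_ACCESS_TOKEN"}

# List scheduler jobs.
resp = requests.get(f"{FESS}/api/admin/scheduler/settings", headers=HEADERS)
resp.raise_for_status()
for job in resp.json().get("response", {}).get("settings", []):
    print(job.get("id"), job.get("name"))

# Delete one job by id (taken from the listing above).
job_id = "..."
requests.delete(f"{FESS}/api/admin/scheduler/setting/{job_id}", headers=HEADERS)
```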

(from github.com/abolotnov)
Thanks