(from github.com/abolotnov)
I have a task of crawling a few thousand sites. The way I am doing this now is by adding additional web crawl configs and starting the default crawler. Apparently, this is a bad strategy for me, because I don’t really need all the sites re-crawled every time - maintaining high accuracy of the indexed content is not important. And it just takes forever, too.
I thought of creating an individual crawling job for each new crawler config that I create. Is this a valid approach? At this point I have about 5K sites to crawl. 5K web crawlers and 5K jobs - is FESS going to handle this? Do I need to sequence their start (e.g. start them in batches of 10 or something)?
Many thanks. FESS seems like a great system, I am trying to use it properly now.
(from github.com/marevol)
Fess Crawler is just a Java process.
If you create and run a crawling job, one process is executed.
Currently, Fess does not manage the number of crawler processes.
So, you need to do that…
In a future release, I’ll add a threshold to manage the number of executed processes.
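Since Fess does not currently limit how many crawler processes run at once, one way to do the batching the original poster suggests is an external throttle that never launches more than N jobs at a time. Here is a minimal sketch; the `echo` commands are placeholders for whatever actually triggers each crawl job (that part is hypothetical and depends on your setup):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_job(cmd):
    # Launch one crawler process and block until it exits.
    return subprocess.run(cmd, shell=True).returncode

def run_in_batches(commands, max_concurrent=10):
    # A fixed-size worker pool ensures at most max_concurrent
    # processes run at once; as one finishes, the next starts.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(run_job, commands))

# Placeholder commands - substitute the real invocation that
# starts each Fess crawl job in your environment.
jobs = [f"echo crawl-site-{i}" for i in range(5)]
exit_codes = run_in_batches(jobs, max_concurrent=2)
```

A pool of OS threads is fine here even in CPython, because each worker spends its time blocked on a child process rather than executing Python code.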
(from github.com/abolotnov)
So the “Simultaneous Crawler Config” is not it then?
How do I view the status of jobs? admin/joblog seems close, but it only gives me a log of everything executed with start/end times, rather than the status of distinct jobs.