Scaling the Crawler

Thanks for all of your work on FESS!

I am working on a project that requires crawling + indexing about 6,000 sites. They need to be crawled for new content every 2 weeks. I am trying to think through the best way to set this up:

It appears that I can only add roughly 150 URLs to be crawled/indexed for one crawler configuration. With this being the case, would it be best to use 40 crawlers with 150 sites per configuration, or 6,000 crawlers with 1 site per crawler configuration?

Also, I have read through the documentation, but I am still unclear on the relationship between TTL and incremental scanning. If my goal is to check for new/deleted content every 2 weeks, should I set the TTL to 14 days, schedule the crawler to run every 14 days, and just disable incremental scanning?


would it be best to use 40 crawlers with 150 sites per configuration,

Although it depends on your requirements, I think the above (40 crawler configurations with 150 sites each) is preferable.
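As a rough illustration of the batching, here is a small sketch (plain Python, nothing Fess-specific; the URL list is a placeholder) that splits a flat list of 6,000 site URLs into configuration-sized batches of 150:

```python
def batch_urls(urls, batch_size=150):
    """Yield successive batches of at most `batch_size` URLs,
    one batch per crawler configuration."""
    for i in range(0, len(urls), batch_size):
        yield urls[i:i + batch_size]

# 6,000 placeholder URLs -> 40 batches of 150
urls = [f"https://site{n}.example.com/" for n in range(6000)]
batches = list(batch_urls(urls))
print(len(batches))     # 40 configurations
print(len(batches[0]))  # 150 URLs each
```

Each batch would then become the URL list of one crawler configuration, rather than creating 6,000 single-site configurations.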

should I set the TTL for 14 days

It should be 14 days plus the crawler's running time.
If you set the TTL to exactly 14 days, documents may expire and be removed from the index before the next crawl finishes re-indexing them.
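In other words, the TTL needs some headroom on top of the crawl interval. A back-of-the-envelope calculation (plain Python; the 12-hour crawl duration is a made-up example you would replace with a measured value):

```python
import math

CRAWL_INTERVAL_DAYS = 14   # how often the crawler is scheduled to run
crawl_duration_hours = 12  # hypothetical: measured length of one full crawl

# Round the crawl time up to whole days and add it on top of the interval,
# so documents cannot expire while the next crawl is still running.
ttl_days = CRAWL_INTERVAL_DAYS + math.ceil(crawl_duration_hours / 24)
print(ttl_days)  # 15
```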

Thanks! If I do not set a TTL and just enable incremental scanning, would this make it more efficient?

It depends on your requirements.
If you disable incremental crawling, pages that have been removed from the websites will not be removed from the index.

But they would eventually be deleted after the TTL expires, correct?


Thanks - last question related to this:

In my testing, the initial crawl of a site consumes a lot of bandwidth. If I scan again with incremental scanning enabled, significantly less bandwidth is used.

So for performance reasons, I am thinking of disabling the TTL and just using incremental scanning every 21 days. Would there be any issues with this?

It seems good to me.
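If it helps to sanity-check that cadence, a tiny sketch (plain Python, nothing Fess-specific; the start date is arbitrary) listing the crawl start dates implied by a 21-day interval:

```python
from datetime import date, timedelta

start = date(2024, 1, 1)  # arbitrary first crawl date
interval = timedelta(days=21)

# The first few crawl start dates implied by a 21-day schedule.
runs = [start + n * interval for n in range(5)]
for d in runs:
    print(d.isoformat())
```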