Scaling the Crawler

discuss · January 1, 2018, 7:59pm

(from github.com/looper976)
Thanks for all of your work on FESS!

I am working on a project that requires crawling + indexing about 6,000 sites. They need to be crawled for new content every 2 weeks. I am trying to think through the best way to set this up:

It appears that I can only add roughly 150 URLs to be crawled/indexed for one crawler configuration. With this being the case, would it be best to use 40 crawlers with 150 sites per configuration, or 6,000 crawlers with 1 site per crawler configuration?

Also, I have read through the documentation, but I am still unclear on the relationship between TTL and incremental scanning. If my goal is to check for new/deleted content every 2 weeks, should I set the TTL to 14 days, schedule the crawler to run every 14 days, and just disable incremental scanning?

discuss · January 2, 2018, 1:27pm

(from marevol (Shinsuke Sugaya) · GitHub)

would it be best to use 40 crawlers with 150 sites per configuration,

It depends on your requirements, but I think the above (40 crawler configurations with 150 sites each) is preferable.
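
For illustration, here is a minimal sketch of splitting a site list into groups of 150, one group per crawler configuration. It is plain Python and not part of Fess. The function name `batch_sites` and the `example.com` URLs are hypothetical, and the batch size of 150 is simply the figure mentioned in the question.

```python
from typing import Iterator, List


def batch_sites(sites: List[str], batch_size: int = 150) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `batch_size` site URLs."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(sites), batch_size):
        yield sites[start:start + batch_size]


# Hypothetical placeholder URLs standing in for the real 6,000 sites.
sites = [f"https://site{i}.example.com/" for i in range(6000)]
batches = list(batch_sites(sites))

# Each batch would then be entered as the URL list of one crawler configuration.
print(len(batches), "configurations,", len(batches[0]), "sites in the first one")
```

If the site count is not a multiple of the batch size, the last group is simply smaller.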

should I set the TTL for 14 days

It should be 14 days plus the running time of the crawler.
If you set the TTL to exactly 14 days, documents may expire and be removed before the next crawl has re-indexed them.
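
As a rough worked example of that arithmetic, the sketch below adds the crawl run time, rounded up to whole days, to the crawl interval. The function `ttl_days` is hypothetical and not a Fess API, and the 30-hour run time is only an assumed figure; measure your own crawl duration before choosing a value.

```python
import math


def ttl_days(crawl_interval_days: int, crawl_runtime_hours: float) -> int:
    """Return a TTL in whole days: the crawl interval plus the run time, rounded up."""
    if crawl_interval_days <= 0 or crawl_runtime_hours < 0:
        raise ValueError("interval must be positive and run time non-negative")
    return crawl_interval_days + math.ceil(crawl_runtime_hours / 24)


# Assumption: a full crawl of all sites takes about 30 hours.
ttl = ttl_days(14, 30)
print("Set the TTL to at least", ttl, "days")
```

Rounding up errs on the side of keeping documents slightly longer rather than expiring them while the crawl is still running.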

discuss · January 2, 2018, 9:52pm

(from github.com/looper976)
Thanks! If I do not set a TTL and just enable incremental scanning, would this make it more efficient?

discuss · January 3, 2018, 2:22am

(from github.com/marevol)
It depends on your requirements.
If you disable incremental crawling, pages that have been removed from the web sites will not be removed from the index.

discuss · January 3, 2018, 2:32am

(from github.com/looper976)
But they would eventually be deleted after the TTL expires, correct?

discuss · January 3, 2018, 7:18am

(from github.com/marevol)
Yes.

discuss · January 4, 2018, 2:58pm

(from github.com/looper976)
Thanks - last question related to this:

In my testing, the initial crawl of a site consumes a lot of bandwidth. If I scan again with incremental scanning enabled, significantly less bandwidth is used.

So for performance reasons, I am thinking of disabling the TTL and just using incremental scanning every 21 days. Would there be any issues with this?

discuss · January 4, 2018, 9:52pm

(from github.com/marevol)
That sounds good to me.