Thanks and questions

unifynz · September 14, 2020, 11:02pm

Hi,

Firstly, a massive thanks for the work you have put into FESS. It's an excellent tool and has saved me a lot of time and money.

We are using FESS to cache and search across every domain in New Zealand. Our TLD is .nz, so I have regular expressions in our crawlers to only crawl these domains, e.g.

..co.nz
..co.nz/.*
..org.nz
..org.nz/.*

This works very well.
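
For anyone wanting to test this kind of include filter outside the crawler, here is a minimal Python sketch. The patterns in it are assumptions written for illustration (they are not copied from the crawler config above), and `is_allowed` is a hypothetical helper name.

```python
import re

# Assumed, illustrative patterns for .co.nz and .org.nz hosts.
# They are not the exact expressions used in the crawler configuration.
INCLUDE_PATTERNS = [
    re.compile(r"https?://([^/]+\.)?co\.nz(/.*)?"),
    re.compile(r"https?://([^/]+\.)?org\.nz(/.*)?"),
]


def is_allowed(url: str) -> bool:
    """Return True if the whole URL matches one of the include patterns."""
    return any(p.fullmatch(url) for p in INCLUDE_PATTERNS)


if __name__ == "__main__":
    for candidate in (
        "https://example.co.nz/about",
        "http://sub.example.org.nz",
        "https://example.com/",
        "https://example.co.nz.evil.com/",
    ):
        print(candidate, is_allowed(candidate))
```

Matching the full URL rather than a substring keeps look-alike hosts such as `example.co.nz.evil.com` out of the crawl.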

I have 5 crawlers, each pointing at one of NZ’s largest directories. I have the depth set to 10,000 and the max access count set to 99,999,999 to capture as many .NZ domains as possible. These directories contain hundreds of thousands of URLs.

To date, we have crawled just over 1M pages since the 1st Sept. These pages are from 21,000 .NZ domains so far, and there are 5M URLs in the crawl queue to work through. It’s working away great.

I have a few questions:

Once the scheduled default crawler has finished crawling these 5 sites, I will run it again in a month. If these sites block my IP address, will the default crawler still update the index of other sites it has found?

Is there a way I can run a separate crawler to only update the URLs in the current index?

Would it maybe be better to export all of the domains and add them to their own crawler so I'm not relying on the 5 directory websites?

If two different crawlers find the same URL, will it cache the page twice, or will it update the existing entry? Does FESS check whether the URL is already indexed and update it?

Again, many thanks, and hopefully you can help me structure this better. I have also emailed your sales team about professional support for this application.

Thanks

shinsuke · September 15, 2020, 3:25am

will the default crawler still update the index of other sites it has found?

If the other sites are linked from pages the crawler can still reach, they will be crawled. If not, they will not be crawled.

Is there a way I can run a separate crawler to only update the URLs in the current index?

You might use CSV List DataStore crawling.
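
As a rough sketch of preparing input for that approach, the Python below writes a de-duplicated list of URLs to a CSV file, one URL per row. It assumes the URLs have already been exported from the index by some other means; `write_url_csv` is a hypothetical helper, and the exact file layout the data store expects should be checked against the Fess documentation.

```python
import csv
from typing import Iterable


def write_url_csv(urls: Iterable[str], path: str) -> int:
    """Write unique, non-empty URLs to `path`, one per row; return the row count."""
    seen = set()
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for url in urls:
            url = url.strip()
            # Skip blanks and URLs that were already written.
            if not url or url in seen:
                continue
            seen.add(url)
            writer.writerow([url])
            count += 1
    return count
```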

Would it maybe be better to export all of the domains and add them to their own crawler so I'm not relying on the 5 directory websites?

You can use the Admin API for Fess.
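
However the URLs are exported, reducing them to a list of unique .nz domains might look like the sketch below. It uses only the Python standard library and makes no Fess API calls; `unique_domains` is a hypothetical helper name.

```python
from typing import Iterable, List
from urllib.parse import urlsplit


def unique_domains(urls: Iterable[str], suffix: str = ".nz") -> List[str]:
    """Return the sorted, unique hostnames ending in `suffix` found in `urls`."""
    domains = set()
    for url in urls:
        # urlsplit lowercases the hostname; it is None for strings without a host.
        host = urlsplit(url).hostname
        if host and host.endswith(suffix):
            domains.add(host)
    return sorted(domains)
```

The resulting list could then be used as the start URLs of a dedicated crawler, so a re-crawl does not depend on the directory sites staying reachable.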

If the two different crawlers find the same URL will it cache the page twice?

It will be updated.
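
To illustrate what "updated" means here, a toy Python sketch of an index keyed by URL follows. This is not Fess's actual implementation, only a model of the behaviour described: a second crawler finding the same URL replaces the stored entry instead of adding a duplicate. The `upsert` function and the entry fields are hypothetical.

```python
from typing import Dict


def upsert(index: Dict[str, dict], url: str, content: str, crawler: str) -> bool:
    """Store a page under its URL; return True if an existing entry was replaced."""
    replaced = url in index
    # The URL is the key, so a repeat visit overwrites the earlier entry.
    index[url] = {"content": content, "crawler": crawler}
    return replaced
```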