Web crawler crawled pages outside of the defined domain

(from github.com/AllenHan5)
Hello all

As the title indicates, I need your expert advice on how to debug this issue further, or how to configure the crawler so it works properly.

I created a web crawler with only the Included URLs For Crawling and Included URLs For Indexing parameters configured. Both were confined to a particular domain. However, the crawled pages included items from other domains, such as twitter.com and www.facebook.com.

Any particular configuration I should check?

Thanks in advance,
Allen

(from github.com/marevol)
You need to configure proper Included/Excluded URLs For Crawling settings.
e.g. exclusion patterns such as .*twitter.com.* and .*www.facebook.com.*.
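A minimal sketch of how exclusion patterns like these behave. Fess evaluates them as Java regular expressions, but for patterns this simple the semantics are the same as Python's re module, so a quick local check is possible. The URLs below are hypothetical examples, not from the original crawl:

```python
import re

# Exclusion patterns from the suggestion above. The dots are unescaped,
# so "." matches any character, which is harmless for patterns this broad.
exclude_patterns = [
    re.compile(p) for p in (r".*twitter.com.*", r".*www.facebook.com.*")
]

def excluded(url):
    """Return True if any exclusion pattern matches the URL."""
    return any(p.match(url) for p in exclude_patterns)

print(excluded("https://twitter.com/someaccount"))    # True  -> skipped
print(excluded("https://www.mysite.com/index.html"))  # False -> crawled
```

The drawback, as noted below, is that a blocklist like this only works if you already know every external domain the pages link to.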

(from github.com/AllenHan5)
Thanks @marevol for the tip!

Is there a way to avoid enumerating all the domains that should be excluded? Most of the time, we would have no idea which domains a set of HTML pages will link out to.

Thanks

(from github.com/marevol)
Try using Included URLs For Crawling instead.

(from github.com/AllenHan5)
Thanks @marevol for the tip!
It turned out that in my first attempt to fix the issue, I had put in an incorrect included configuration, which stopped the whole crawl (even though I had a correct include rule to crawl only items from the domain). After I fixed that part, the site was re-crawled, and there were no items from outside the domain.

Thanks again for your help. Really appreciate it!
Allen

(from github.com/eltonkent)
@AllenHan5 Could you please share your settings? I’m seeing a similar problem

(from github.com/AllenHan5)
@eltonkent
Sorry about the late response.

Here is what I noticed. Suppose you are trying to crawl https://www.mysite.com; then in the two Included URLs parameters, you can put https://www.mysite.com.*. You don’t have to put anything in the Excluded URLs fields, unless you want to exclude some pages from the domain www.mysite.com itself.
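A quick sketch of how that include pattern behaves as an allowlist. Again, Fess uses Java regular expressions, but this simple pattern works identically under Python's re module. The pattern and URLs below are illustrative assumptions based on the example domain in this thread:

```python
import re

# The include pattern as written above (dots left unescaped, as in the
# thread); only URLs matching it from the start are crawled.
include_pattern = re.compile(r"https://www.mysite.com.*")

urls = [
    "https://www.mysite.com/docs/page1.html",  # same domain  -> crawled
    "https://twitter.com/someaccount",         # other domain -> skipped
    "https://www.facebook.com/somepage",       # other domain -> skipped
]

for url in urls:
    crawl = bool(include_pattern.match(url))
    print(f"{url} -> {'crawl' if crawl else 'skip'}")
```

Because the allowlist matches from the start of the URL, any link branching out to an unknown external domain is skipped automatically, with no need to enumerate exclusions.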

Hope it helps.
Allen