Crawling Wikipedia?

discuss · February 9, 2018, 1:25pm

Is it possible to crawl Wikipedia using the Fess crawler? I have reduced the boost and interval time because Wikipedia has some restrictions, but I haven’t been able to crawl the site: only the main page is crawled and indexed.

Thanks!

discuss · February 9, 2018, 1:36pm

(from github.com/marevol)
Need more info… e.g., what are your crawling configs?

discuss · February 9, 2018, 3:51pm

(from github.com/ArthurBV)
Here is the configuration:

I also tried setting “Included URLs For Crawling” to https://es.wikipedia.org/wiki/.* but that didn’t work either.

I also created this job to schedule the crawling:

discuss · February 10, 2018, 1:02am

(from github.com/marevol)
Your interval time is too long.
I tried it myself and Wikipedia pages were indexed.

discuss · February 10, 2018, 1:34am

(from github.com/ArthurBV)
What interval time are you using?

discuss · February 10, 2018, 1:49am

(from github.com/marevol)
The settings I used to check it in my environment are:

URL: https://es.wikipedia.org/wiki/
Include URL: https://es.wikipedia.org/wiki/.*
Interval time: 1000
Max Access Count: 10
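
As a quick sanity check on the include pattern above, here is a minimal Python sketch of which URLs it would let through. It assumes the pattern is treated as a regular expression that must match the whole URL; Fess is a Java application, so Python's `re` only approximates its matching. The helper name `is_included` and the sample URLs are hypothetical, chosen for illustration.

```python
import re

# Include pattern copied verbatim from the settings above.
# Assumption: the crawler requires the pattern to match the entire URL.
INCLUDE_PATTERN = re.compile(r"https://es.wikipedia.org/wiki/.*")


def is_included(url: str) -> bool:
    """Return True if the URL fully matches the include pattern (hypothetical helper)."""
    return INCLUDE_PATTERN.fullmatch(url) is not None


if __name__ == "__main__":
    # Sample URLs for illustration only.
    samples = [
        "https://es.wikipedia.org/wiki/",
        "https://es.wikipedia.org/wiki/Madrid",
        "https://en.wikipedia.org/wiki/Madrid",
        "https://es.wikipedia.org/w/index.php?title=Madrid",
    ]
    for url in samples:
        print(url, "->", "included" if is_included(url) else "excluded")
```

Under these assumptions, only URLs below https://es.wikipedia.org/wiki/ pass the filter, so links to other language editions or to /w/ paths would not be followed.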

discuss · February 13, 2018, 12:35am

(from github.com/ArthurBV)
Thanks a lot, everything appears to be working correctly.