Unable to crawl a Shopify store

discuss · April 5, 2019, 1:25am

(from github.com/qmaxquique)
Thank you again for developing such a great tool!

I’m using Fess 12.4.0.
I can crawl and index several sites without any issues, but when I try to crawl this particular site, Fess only crawls a few URLs (47 in total) and then ends the job as if it had finished crawling the site.

I’ve tried several crawler configurations, but none of them seems to work.

This is the crawler configuration:

ID	whiH6mkBbO5aJ2YtZEsC
Name	store.repligen.com
URLs	https://store.repligen.com/
Included URLs For Crawling	https://store.repligen.com/.*
Excluded URLs For Crawling	
Included URLs For Indexing	
Excluded URLs For Indexing	.*oembed
.*css
Config Parameters	
Depth	
Max Access Count	
User Agent	Mozilla/5.0 (compatible; Fess/12.4; +http://fess.codelibs.org/bot.html)
The number of Thread	3
Interval time	1200 ms
Boost	1.0
Permissions	{role}www.repligen.com
Virtual Hosts	
Status	Enabled
Description

In the logs, I see the site has several sitemaps.
One of them contains the product URLs (this is what I want to index), and it seems to be generated dynamically for crawling purposes.

2019-04-05 01:10:31,470 [Crawler-whiH6mkBbO5aJ2YtZEsC-1-1] INFO  Crawling URL: https://store.repligen.com/sitemap_products_1.xml?from=1675528208441&to=2143411404857

This sitemap XML file seems to be valid and well populated, as expected, but Fess is not processing it, nor is it showing any errors.

For testing purposes, I’ve changed the URL parameter in the crawler configuration to point directly to the sitemap shown above, but the results are the same.

Is there anything obviously wrong here?

Any clue or help is more than welcome!
Thank you.

discuss · April 7, 2019, 10:48pm

(from marevol (Shinsuke Sugaya) · GitHub)

2019-04-05 01:10:31,470 [Crawler-whiH6mkBbO5aJ2YtZEsC-1-1] INFO Crawling URL: https://store.repligen.com/sitemap_products_1.xml?from=1675528208441&to=2143411404857

Fess handles the above URL as a plain XML file, not as a sitemap file.
You need to modify rule.xml.

discuss · April 8, 2019, 9:41am

(from github.com/qmaxquique)
@marevol Thank you so much for your help!

I’ve modified the sitemap URL pattern line in the file ./app/WEB-INF/classes/crawler/rule.xml to look like this:

<arg>"http[s]?:.*sitemap[^/]*.xml$|http[s]?:.*sitemap[^/]*.gz$|http[s]?:.*sitemap[^/]*.txt$|http[s]?:.*sitemap[^/]*.xml.*"</arg>

It solved the problem.

Thank you again!
Enrique
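
For anyone hitting the same problem, here is a minimal sketch of why the extra alternative helps. It is not Fess code: it only replays the regular expression in Python. Two things are assumptions rather than facts from this thread: that the original pattern was the posted line minus its last alternative, and that Fess requires the pattern to match the whole URL (modelled here with `re.fullmatch`). The helper name `is_sitemap` is hypothetical.

```python
import re

# URL copied from the crawler log quoted in this thread.
URL = ("https://store.repligen.com/sitemap_products_1.xml"
       "?from=1675528208441&to=2143411404857")

# ASSUMPTION: the original pattern is the modified line without its
# last alternative. Each alternative is anchored at the end with `$`.
ORIGINAL = (r"http[s]?:.*sitemap[^/]*.xml$"
            r"|http[s]?:.*sitemap[^/]*.gz$"
            r"|http[s]?:.*sitemap[^/]*.txt$")

# Pattern as posted above: one more alternative that tolerates
# anything (such as a query string) after ".xml".
MODIFIED = ORIGINAL + r"|http[s]?:.*sitemap[^/]*.xml.*"


def is_sitemap(pattern: str, url: str) -> bool:
    """Return True if the whole URL matches the pattern.

    ASSUMPTION: whole-string matching mirrors how the rule is applied.
    """
    return re.fullmatch(pattern, url) is not None


# The query string after ".xml" defeats the `$`-anchored alternatives.
print(is_sitemap(ORIGINAL, URL))   # False
# The added alternative accepts the trailing "?from=...&to=...".
print(is_sitemap(MODIFIED, URL))   # True
# A plain sitemap URL without a query string matches either way.
print(is_sitemap(ORIGINAL, "https://store.repligen.com/sitemap.xml"))  # True
```

Note that the `.` before `xml` is unescaped in the posted pattern, so it matches any character, and the trailing `.*` accepts any suffix. If that turns out to be too broad for a given site, a tighter alternative such as `\.xml(\?.*)?$` may be worth trying, though that variant is untested against Fess here.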