crawl xhtml

(from github.com/cernadasjuan)
Hi,

I need to crawl a page that is in XHTML format, but when I run the crawler's job, this page is not crawled. Is this configuration correct?

ID AVhUCyaPiNth7RnHqMUS
Name Dos Ideas - cursos
URLs http://www.dosideas.com/cursos/
Included URLs For Crawling http://www.dosideas.com/cursos/.*
Excluded URLs For Crawling
Included URLs For Indexing http://www.dosideas.com/cursos/.*
Excluded URLs For Indexing
Config Parameters
Depth 100
Max Access Count
User Agent Mozilla/5.0 (compatible; Fess/10.2; +http://fess.codelibs.org/bot.html)
The number of Thread 10
Interval time 0 ms
Boost 1.0
Permissions {role}guest
Label
Status Enabled

Thanks!

(from github.com/marevol)
Please check crawled urls in fess-crawler.log.

(from github.com/MajidSafari)
Check the canonical link in the page's meta tags.

Your Included URLs For Indexing is http://www.dosideas.com/cursos/.*

so the canonical URL in the page must be http://www.dosideas.com/cursos/

(from github.com/cernadasjuan)
Thanks for the help.

In the end, the problem was that the page had the following link:

<link href="http://www.dosideas.com/java" rel="canonical" />

which is incorrect, because the correct URL is http://www.dosideas.com/cursos

It seems that Fess skips the entire page when its canonical URL is incorrect.
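
In case it helps anyone else hitting this, here is a minimal Python sketch for checking a page's canonical link against the include pattern before running the crawler. It is not Fess code and does not reproduce Fess's own logic. The names `CanonicalFinder`, `find_canonical` and `canonical_matches` are made up for this example, and it assumes the include pattern has to match the whole canonical URL.

```python
import re
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <link ... /> are also routed here by HTMLParser.
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if "canonical" in rel:
            self.canonical = attr.get("href")


def find_canonical(html):
    """Return the canonical URL declared in the page, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


def canonical_matches(html, include_pattern):
    """True if the page has no canonical link, or if the canonical URL matches
    the include pattern. Assumption: the pattern must match the whole URL."""
    canonical = find_canonical(html)
    if canonical is None:
        return True
    return re.fullmatch(include_pattern, canonical) is not None


if __name__ == "__main__":
    pattern = "http://www.dosideas.com/cursos/.*"
    bad_page = '<head><link href="http://www.dosideas.com/java" rel="canonical" /></head>'
    good_page = '<head><link href="http://www.dosideas.com/cursos/" rel="canonical" /></head>'
    print(find_canonical(bad_page))               # http://www.dosideas.com/java
    print(canonical_matches(bad_page, pattern))   # False
    print(canonical_matches(good_page, pattern))  # True
```

With the link from the page above, the check returns False, which is consistent with the page being skipped. To test a live page, fetch its HTML first (for example with `urllib.request`) and pass the text to `canonical_matches`.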