How to debug "Processing no docs"

discuss · March 14, 2018, 4:02pm

(from github.com/rustyx)
For some reason, only a few of the pages crawled from our intranet portal are actually added to the index.
I checked that /robots.txt is empty, that the downloaded pages do have content, and that the crawler has no URL or other restrictions.
How can I debug this further?

[Crawler-xxxxx-1] INFO  Crawling URL: https://portal.intranet/web/human-resources/studieregeling
[Crawler-xxxxx-1] INFO  Redirect to URL: https://portal.intranet/web/human-resources/studieregeling
[Crawler-xxxxx-1] INFO  Crawling URL: https://portal.intranet/web/human-resources/vervoersvergoeding
[Crawler-xxxxx-1] INFO  Redirect to URL: https://portal.intranet/web/human-resources/vervoersvergoeding
[Crawler-xxxxx-1] INFO  Crawling URL: https://portal.intranet/web/human-resources/aansprakelijkheids-verzekering
[Crawler-xxxxx-1] INFO  Redirect to URL: https://portal.intranet/web/human-resources/aansprakelijkheids-verzekering
[Crawler-xxxxx-1] INFO  Crawling URL: https://portal.intranet/web/human-resources/noodadres
[Crawler-xxxxx-1] INFO  Redirect to URL: https://portal.intranet/web/human-resources/noodadres
[Crawler-xxxxx-1] INFO  Crawling URL: https://portal.intranet/web/human-resources/ik-heb-een-tweede-werkgever
[Crawler-xxxxx-1] INFO  Redirect to URL: https://portal.intranet/web/human-resources/ik-heb-een-tweede-werkgever
[Crawler-xxxxx-1] INFO  Crawling URL: https://portal.intranet/web/human-resources/zorgverlof-kort
[Crawler-xxxxx-1] INFO  Redirect to URL: https://portal.intranet/web/human-resources/zorgverlof-kort
[Crawler-xxxxx-1] INFO  Crawling URL: https://portal.intranet/web/human-resources/vakantie-buiten-schoolvakantie-schoolverklaring
[Crawler-xxxxx-1] INFO  Redirect to URL: https://portal.intranet/web/human-resources/vakantie-buiten-schoolvakantie-schoolverklaring
[IndexUpdater] INFO  Processing no docs (Doc:{access 1ms, cleanup 31ms}, Mem:{used 185MB, heap 247MB, max 494MB})

discuss · March 14, 2018, 8:39pm

(from github.com/marevol)
See https://github.com/codelibs/fess/issues/1073#issuecomment-304397187

[Crawler-xxxxx-1] INFO  Crawling URL: https://portal.intranet/web/human-resources/vakantie-buiten-schoolvakantie-schoolverklaring
[Crawler-xxxxx-1] INFO  Redirect to URL: https://portal.intranet/web/human-resources/vakantie-buiten-schoolvakantie-schoolverklaring

These pages are redirected, so they are not indexed.
Changing the user agent (UA) in the crawling config might help.
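
To get a quick picture of how many crawled URLs ended in a redirect, the crawler log can be scanned with a small script. This is only a rough sketch: it assumes the log lines look exactly like the ones quoted in this thread, and the names `LINE` and `summarize` are made up for illustration and are not part of Fess.

```python
import re

# Assumed line format, taken from the log excerpt in this thread:
# [Crawler-xxxxx-1] INFO  Crawling URL: https://portal.intranet/home
LINE = re.compile(
    r"^\[(?P<thread>[^\]]+)\]\s+INFO\s+"
    r"(?P<kind>Crawling URL|Redirect to URL): (?P<url>\S+)"
)

def summarize(lines):
    """Return (crawled, redirected) URL lists, pairing lines per crawler thread."""
    crawled, redirected = [], []
    last = {}  # thread name -> URL most recently crawled on that thread
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue  # e.g. IndexUpdater lines
        thread, kind, url = m.group("thread", "kind", "url")
        if kind == "Crawling URL":
            crawled.append(url)
            last[thread] = url
        elif thread in last:
            # A redirect line counts against the URL this thread crawled last.
            redirected.append(last.pop(thread))
    return crawled, redirected

sample = [
    "[Crawler-xxxxx-1] INFO  Crawling URL: https://portal.intranet/web/human-resources/noodadres",
    "[Crawler-xxxxx-1] INFO  Redirect to URL: https://portal.intranet/web/human-resources/noodadres",
    "[Crawler-xxxxx-1] INFO  Crawling URL: https://portal.intranet/web/human-resources/zorgverlof-kort",
    "[Crawler-xxxxx-1] INFO  Redirect to URL: https://portal.intranet/web/human-resources/zorgverlof-kort",
    "[IndexUpdater] INFO  Processing no docs",
]
crawled, redirected = summarize(sample)
print(f"{len(redirected)} of {len(crawled)} crawled URLs were redirected")
# prints: 2 of 2 crawled URLs were redirected
```

If nearly every crawled URL shows up in the redirected list, that would be consistent with "Processing no docs" appearing in the IndexUpdater output.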

discuss · March 15, 2018, 12:37pm

(from github.com/rustyx)
It looks like there is a bug in redirect reporting: the original URL, rather than the redirect target, is logged as the redirect URL.

INFO  Crawling URL: https://portal.intranet/home
DEBUG Accessing https://portal.intranet/home
DEBUG http-outgoing-0 >> GET /home HTTP/1.1
. . .
DEBUG http-outgoing-0 << HTTP/1.1 302 Found
DEBUG http-outgoing-0 << Location: https://portal.intranet/en/home
. . .
INFO  Redirect to URL: https://portal.intranet/home

The last line should contain /en/home instead of /home.
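
The same check can be run over a longer DEBUG log with a short script. This is a sketch only: it assumes the log has the shape of the excerpt above, that requests are not interleaved across threads, and that the Location header holds an absolute URL. The names `LOCATION`, `REDIRECT` and `find_mismatches` are hypothetical.

```python
import re

# Patterns based on the DEBUG excerpt in this post (assumed format).
LOCATION = re.compile(r"<< Location: (\S+)")
REDIRECT = re.compile(r"INFO\s+Redirect to URL: (\S+)")

def find_mismatches(lines):
    """Return (location_header, logged_redirect) pairs that disagree."""
    mismatches = []
    location = None  # most recent Location header not yet matched
    for line in lines:
        m = LOCATION.search(line)
        if m:
            location = m.group(1)
            continue
        m = REDIRECT.search(line)
        if m and location is not None:
            if m.group(1) != location:
                mismatches.append((location, m.group(1)))
            location = None  # each header is compared at most once
    return mismatches

sample = [
    "INFO  Crawling URL: https://portal.intranet/home",
    "DEBUG Accessing https://portal.intranet/home",
    "DEBUG http-outgoing-0 >> GET /home HTTP/1.1",
    ". . .",
    "DEBUG http-outgoing-0 << HTTP/1.1 302 Found",
    "DEBUG http-outgoing-0 << Location: https://portal.intranet/en/home",
    ". . .",
    "INFO  Redirect to URL: https://portal.intranet/home",
]
for header, logged in find_mismatches(sample):
    print(f"Location header says {header} but log says {logged}")
# prints: Location header says https://portal.intranet/en/home but log says https://portal.intranet/home
```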

discuss · March 15, 2018, 12:49pm

(from github.com/marevol)
I need more info. Partial logs are not helpful.

discuss · March 15, 2018, 9:23pm

(from github.com/marevol)
Thanks. Fixed.

discuss · March 16, 2018, 11:27am

(from github.com/rustyx)
Thanks!