(from github.com/rustyx)
For some reason, only a few of the pages crawled from our intranet portal are actually added to the index.
I checked that /robots.txt is empty and that the downloaded pages do have content, and I have no URL or other restrictions configured on the crawler.
How can I debug this further? An excerpt from the crawler log:
[Crawler-xxxxx-1] INFO Crawling URL: https://portal.intranet/web/human-resources/studieregeling
[Crawler-xxxxx-1] INFO Redirect to URL: https://portal.intranet/web/human-resources/studieregeling
[Crawler-xxxxx-1] INFO Crawling URL: https://portal.intranet/web/human-resources/vervoersvergoeding
[Crawler-xxxxx-1] INFO Redirect to URL: https://portal.intranet/web/human-resources/vervoersvergoeding
[Crawler-xxxxx-1] INFO Crawling URL: https://portal.intranet/web/human-resources/aansprakelijkheids-verzekering
[Crawler-xxxxx-1] INFO Redirect to URL: https://portal.intranet/web/human-resources/aansprakelijkheids-verzekering
[Crawler-xxxxx-1] INFO Crawling URL: https://portal.intranet/web/human-resources/noodadres
[Crawler-xxxxx-1] INFO Redirect to URL: https://portal.intranet/web/human-resources/noodadres
[Crawler-xxxxx-1] INFO Crawling URL: https://portal.intranet/web/human-resources/ik-heb-een-tweede-werkgever
[Crawler-xxxxx-1] INFO Redirect to URL: https://portal.intranet/web/human-resources/ik-heb-een-tweede-werkgever
[Crawler-xxxxx-1] INFO Crawling URL: https://portal.intranet/web/human-resources/zorgverlof-kort
[Crawler-xxxxx-1] INFO Redirect to URL: https://portal.intranet/web/human-resources/zorgverlof-kort
[Crawler-xxxxx-1] INFO Crawling URL: https://portal.intranet/web/human-resources/vakantie-buiten-schoolvakantie-schoolverklaring
[Crawler-xxxxx-1] INFO Redirect to URL: https://portal.intranet/web/human-resources/vakantie-buiten-schoolvakantie-schoolverklaring
[IndexUpdater] INFO Processing no docs (Doc:{access 1ms, cleanup 31ms}, Mem:{used 185MB, heap 247MB, max 494MB})
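
One thing that stands out in the excerpt is that every "Crawling URL" line is followed by a "Redirect to URL" line with the identical address. Whether that is related to the pages not being indexed is not known. To see how widespread the pattern is in a full log, a minimal Python sketch could list such self-redirects. It assumes the line format shown above; the function name `find_self_redirects` is hypothetical and not part of the crawler.

```python
import re

# Line shape assumed from the excerpt above:
#   [<thread>] INFO Crawling URL: <url>
#   [<thread>] INFO Redirect to URL: <url>
_LINE = re.compile(r"\[(\S+)\] INFO (Crawling|Redirect to) URL: (\S+)")


def find_self_redirects(lines):
    """Return URLs whose redirect target equals the URL the same thread just crawled."""
    last_crawled = {}  # thread name -> most recent "Crawling URL" value
    self_redirects = []
    for line in lines:
        match = _LINE.search(line)
        if match is None:
            continue  # unrelated line, e.g. the IndexUpdater message
        thread, kind, url = match.groups()
        if kind == "Crawling":
            last_crawled[thread] = url
        elif last_crawled.get(thread) == url:
            self_redirects.append(url)
    return self_redirects
```

Calling `find_self_redirects(open("crawler.log", encoding="utf-8"))` on a saved log (the file name is a placeholder) returns the affected URLs in log order.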