Improve crawling results, especially WordPress sites

I’m new to Fess. I would like to crawl about 100-200 Websites I like.

Most of the 30 sites I entered so far are WordPress sites, and the crawler crawls only a few of their pages or none at all. There are two non-WordPress sites which get crawled well.
There are no errors in fess.log or fess_crawler.log.

What helps:

  • write the URL of the sitemap into the URLs field (usually /sitemap.xml or /sitemap_index.xml)
  • Excluded URLs when Crawling: (?i).*(css|js|jpeg|jpg|gif|png|bmp|wmv|ico|exe)
  • Excluded URLs when Indexing: (?i).*(xml)
  • crawling depth: 4 (sitemap links often point to another sitemap first…)
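Fess evaluates these patterns as Java regular expressions; assuming it anchors them against the whole URL (as Java's `matches()` does), the two exclude patterns above behave the same under Python's `re` module and can be sanity-checked like this (the URLs are made-up examples):

```python
import re

# Exclude patterns from the settings above (written as Java regexes for
# Fess, but these particular ones behave identically in Python).
crawl_exclude = re.compile(r"(?i).*(css|js|jpeg|jpg|gif|png|bmp|wmv|ico|exe)")
index_exclude = re.compile(r"(?i).*(xml)")

# Asset URLs are excluded from crawling:
print(bool(crawl_exclude.fullmatch("https://example.com/theme/style.css")))  # True
# Caveat: without an escaped dot (\.) before the extension list, any URL
# that merely ends in one of the tokens is excluded too:
print(bool(crawl_exclude.fullmatch("https://example.com/learn-nodejs")))     # True
print(bool(crawl_exclude.fullmatch("https://example.com/hello-world")))      # False
# The sitemap itself is still crawled (followed) but kept out of the index:
print(bool(index_exclude.fullmatch("https://example.com/sitemap.xml")))      # True
```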

I haven’t encountered any issues crawling WordPress sites. Could you provide more details so I can try to reproduce the problem?

Thank you for your fast response! :smiley:

Some websites crawl well, some don’t.

For example, https://tkp.at is a big German news site. Only 53 pages were added to the index.
Here are my crawl settings:
URLs: https://tkp.at/
https://tkp.at/sitemap.xml
Included URLs when Crawling: (empty)
Excluded URLs when Crawling: (?i).*(css|js|jpeg|jpg|gif|png|bmp|wmv|ico|exe)
Included URLs when Indexing: (empty)
Excluded URLs when Indexing: (?i).*(xml)
Depth: 6
Max. count: 200000
User Agent: Mozilla/5.0 (compatible; Fess/14.17; +http://fess.codelibs.org/bot.html)
Thread count: 3
Interval: 10000 ms
Boost: 1.0
Permissions: {role}guest
Virtual Hosts: (empty)
Status: active

Second example: https://transition-news.org/. Many of its pages were crawled, as I saw in the logs, but only 61 were added to the index; it should be many thousands.
URLs: https://transition-news.org/
https://transition-news.org/sitemap.xml
Included URLs when Crawling: (empty)
Excluded URLs when Crawling: (?i).*(css|js|jpeg|jpg|gif|png|bmp|wmv|ico|exe)
Included URLs when Indexing: (empty)
Excluded URLs when Indexing: (?i).*(xml)
Depth: 4
Max. count: 150000
User Agent: Mozilla/5.0 (compatible; Fess/14.17; +http://fess.codelibs.org/bot.html)
Thread count: 3
Interval: 10000 ms
Boost: 1.0
Permissions: {role}guest
Virtual Hosts: (empty)
Status: active

There were no errors in fess-crawler.log.

Adding a sitemap helps in many cases, but not in all.
Maybe you could try to crawl these domains yourself?

The crawler queue has 4 million docs; the index has 266,673 at the moment and is growing more slowly than the queue.
Maybe the crawling isn’t the problem but the indexer, which is another process? But it looks like I can’t run the indexer without the crawler, right?
I now took a snapshot of the queue, deleted it, and let Fess create a new one. First impression: it did not help; the queue is growing again, but the result of “GET fess.search/_count” is not. :frowning:

Looking forward to your answer. :blush:

I tried crawling https://tkp.at, and it seems to be working without major issues. However, when I checked the fess-crawler.log, I noticed that URLs like https://tkp.at/_static and others that aren’t useful are also being crawled. It might be a good idea to exclude these URLs by adding them to the “Excluded URLs” section using Java regular expressions.
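As a concrete sketch of that suggestion, here is a hypothetical exclude pattern for the /_static URLs (the path comes from the log observation above; the pattern is valid both as a Java regex for Fess and as a Python regex, checked here with Python):

```python
import re

# Hypothetical "Excluded URLs when Crawling" pattern for the unhelpful
# /_static URLs mentioned above (also valid as a Java regex for Fess).
static_exclude = re.compile(r"https://tkp\.at/_static.*")

print(bool(static_exclude.fullmatch("https://tkp.at/_static/some/asset")))  # True
print(bool(static_exclude.fullmatch("https://tkp.at/some-article")))        # False
```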

How can I see how many documents per domain are in the index?
When I do in Dev Tools:
GET fess.search/_count
{
  "query": {
    "wildcard": {
      "url": {
        "value": "*tkp.at*"
      }
    }
  }
}
then I get:
{
  "count": 66,
  "_shards": {
    "total": 10,
    "successful": 10,
    "skipped": 0,
    "failed": 0
  }
}
So it’s 66, right? It should be much higher. :cry:
And again I have 1 million docs in the queue, and it keeps growing…
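For per-domain counts, one option is a terms aggregation instead of repeated wildcard `_count` queries. This assumes the Fess index exposes a `host` field per document (check your index mapping first; field names are not guaranteed across Fess versions). The sketch below only builds the request body to paste after `GET fess.search/_search` in Dev Tools:

```python
import json

# Build a terms-aggregation body that buckets indexed docs by domain.
# Assumption: the Fess index stores the document's domain in a "host"
# field; verify this against your own index mapping.
body = {
    "size": 0,  # suppress hits; we only want the aggregation buckets
    "aggs": {
        "docs_per_domain": {
            "terms": {"field": "host", "size": 50}  # top 50 domains
        }
    },
}

# Paste the printed JSON after "GET fess.search/_search" in Dev Tools.
print(json.dumps(body, indent=2))
```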

Did you set a value for Included URLs, like https://tkp.at/.*?

In the “URLs” field I have:

https://tkp.at/
https://tkp.at/sitemap.xml

In the “Included URLs when Crawling” field and the “Included URLs when Indexing” field I have nothing.

Now I added https://tkp.at/.* in all three fields, started the crawler, and now the indexed page count for tkp.at is growing. :grinning: Thank you very much! Do I need to add it in all three?
Maybe the documentation could be a little more detailed here.

I think there should be a simple way inside Fess to see how many pages per domain are in the index, without OpenSearch Dashboards. Besides this, and if the crawling problem is solved now, I like the software very much!
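For what it’s worth, the include pattern from above can be checked the same way as the excludes; note that `.*` also matches the empty string, so the bare URL with a trailing slash passes too (Python is used here as a stand-in for Fess’s Java regex matching):

```python
import re

# The include pattern added to the three fields above.
include = re.compile(r"https://tkp.at/.*")

print(bool(include.fullmatch("https://tkp.at/some-article")))       # True
print(bool(include.fullmatch("https://tkp.at/")))                   # True (".*" matches empty)
print(bool(include.fullmatch("https://transition-news.org/page")))  # False
```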

Do I need to add it in all three?

Yes, but it depends on your requirements. It’s best to check fess-crawler.log to see where Fess is crawling.

Your advice helps with many websites, thank you very much again, but it does not help with:

URLs: https://transition-news.org/.*
https://transition-news.org/sitemap.xml
Included URLs (both fields): https://transition-news.org/.*

fess_crawler.log

2024-10-25 12:44:18,387 [Crawler-20241025123620-1-1] INFO  Crawling URL: https://transition-news.org/sucht-der-westen-einen-ausweg-aus-dem-ukraine-krieg
2024-10-25 12:44:19,794 [Crawler-20241025123620-1-3] INFO  Crawling URL: https://transition-news.org/apolut-stiller-abschied-von-der-ukraine
2024-10-25 12:44:19,839 [IndexUpdater] INFO  Processing 3/3 docs (Doc:{access 9ms, cleanup 10ms}, Mem:{used 170.389MB, heap 1.024GB, max 4.096GB})
2024-10-25 12:44:19,864 [IndexUpdater] INFO  Processing no docs in indexing queue (Doc:{access 1ms, cleanup 11ms}, Mem:{used 172.592MB, heap 1.024GB, max 4.096GB})
2024-10-25 12:44:19,884 [IndexUpdater] INFO  Sent 3 docs (Doc:{process 13ms, send 19ms, size 128.45KB}, Mem:{used 175.001MB, heap 1.024GB, max 4.096GB})

but it only has 100 documents in the index and the count isn’t growing; it should be thousands.

You need to create the crawling configuration based on your requirements. Alternatively, if you’d like assistance with this setup, please contact commercial support.