Improve crawling results, especially WordPress sites

I’m new to Fess. I would like to crawl about 100-200 Websites I like.

Most of the 30 sites I entered so far are WordPress sites, and the crawler crawls only a few of their pages or none at all. There are two non-WordPress sites which get crawled well.
There are no errors in fess.log or fess_crawler.log.

What helps:

  • write the URL of the sitemap into the URLs field (usually /sitemap.xml or /sitemap_index.xml)
  • Excluded URLs when Crawling: (?i).*(css|js|jpeg|jpg|gif|png|bmp|wmv|ico|exe)
  • Excluded URLs when Indexing: (?i).*(xml)
  • crawling depth: 4 (sitemap links often point to another sitemap first…)
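Fess evaluates these patterns as Java regular expressions; assuming it anchors them against the whole URL (as Java's `matches()` does), the two exclude patterns above behave the same under Python's `re` module and can be sanity-checked like this (the URLs are made-up examples):

```python
import re

# Exclude patterns from the settings above (written as Java regexes for
# Fess, but these particular ones behave identically in Python).
crawl_exclude = re.compile(r"(?i).*(css|js|jpeg|jpg|gif|png|bmp|wmv|ico|exe)")
index_exclude = re.compile(r"(?i).*(xml)")

# Asset URLs are excluded from crawling:
print(bool(crawl_exclude.fullmatch("https://example.com/theme/style.css")))  # True
# Caveat: without an escaped dot (\.) before the extension list, any URL
# that merely ends in one of the tokens is excluded too:
print(bool(crawl_exclude.fullmatch("https://example.com/learn-nodejs")))     # True
print(bool(crawl_exclude.fullmatch("https://example.com/hello-world")))      # False
# The sitemap itself is still crawled (followed) but kept out of the index:
print(bool(index_exclude.fullmatch("https://example.com/sitemap.xml")))      # True
```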

I haven’t encountered any issues crawling WordPress sites. Could you provide more details so I can try to reproduce the problem?

Thank you for your fast response! :smiley:

Some websites crawl well, some don’t.

For example, https://tkp.at is a big German news site. Only 53 pages were added to the index.
Here are my crawl settings:
URLs: https://tkp.at/
https://tkp.at/sitemap.xml
Included URLs when Crawling: (empty)
Excluded URLs when Crawling: (?i).*(css|js|jpeg|jpg|gif|png|bmp|wmv|ico|exe)
Included URLs when Indexing: (empty)
Excluded URLs when Indexing: (?i).*(xml)
Depth: 6
Max. count: 200000
User Agent: Mozilla/5.0 (compatible; Fess/14.17; +http://fess.codelibs.org/bot.html)
Thread count: 3
Interval: 10000 ms
Boost: 1.0
Permissions: {role}guest
Virtual Hosts: (empty)
Status: active

Second example: https://transition-news.org/. Many of its pages were crawled, as I saw in the logs, but only 61 were added to the index; it should be many thousands.
URLs: https://transition-news.org/
https://transition-news.org/sitemap.xml
Included URLs when Crawling: (empty)
Excluded URLs when Crawling: (?i).*(css|js|jpeg|jpg|gif|png|bmp|wmv|ico|exe)
Included URLs when Indexing: (empty)
Excluded URLs when Indexing: (?i).*(xml)
Depth: 4
Max. count: 150000
User Agent: Mozilla/5.0 (compatible; Fess/14.17; +http://fess.codelibs.org/bot.html)
Thread count: 3
Interval: 10000 ms
Boost: 1.0
Permissions: {role}guest
Virtual Hosts: (empty)
Status: active

There were no errors in fess-crawler.log.

Adding a sitemap helps in many cases, but not in all.
Maybe you could try to crawl these domains yourself?

The crawler queue has 4 million docs; the index has 266,673 at the moment and is growing more slowly than the queue.
Maybe the crawling isn’t the problem but the indexer, which is another process? But it looks like I can’t run the indexer without the crawler, right?
I now took a snapshot of the queue, deleted it, and let Fess create a new one. First impression: it did not help; the queue is growing again, but the result of “GET fess.search/_count” is not. :frowning:

Looking forward to your answer. :blush:

I tried crawling https://tkp.at, and it seems to be working without major issues. However, when I checked the fess-crawler.log, I noticed that URLs like https://tkp.at/_static and others that aren’t useful are also being crawled. It might be a good idea to exclude these URLs by adding them to the “Excluded URLs” section using Java regular expressions.
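As a concrete sketch of that suggestion, here is a hypothetical exclude pattern for the /_static URLs (the path comes from the log observation above; the pattern is valid both as a Java regex for Fess and as a Python regex, checked here with Python):

```python
import re

# Hypothetical "Excluded URLs when Crawling" pattern for the unhelpful
# /_static URLs mentioned above (also valid as a Java regex for Fess).
static_exclude = re.compile(r"https://tkp\.at/_static.*")

print(bool(static_exclude.fullmatch("https://tkp.at/_static/some/asset")))  # True
print(bool(static_exclude.fullmatch("https://tkp.at/some-article")))        # False
```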

How can I see how many documents per domain are in the index?
When I do in Dev Tools:
GET fess.search/_count
{
  "query": {
    "wildcard": {
      "url": {
        "value": "*tkp.at*"
      }
    }
  }
}
then I get:
{
  "count": 66,
  "_shards": {
    "total": 10,
    "successful": 10,
    "skipped": 0,
    "failed": 0
  }
}
So it’s 66, right? It should be much higher. :cry:
And again I have 1 million docs in the queue, and it keeps growing…
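For per-domain counts, one option is a terms aggregation instead of repeated wildcard `_count` queries. This assumes the Fess index exposes a `host` field per document (check your index mapping first; field names are not guaranteed across Fess versions). The sketch below only builds the request body to paste after `GET fess.search/_search` in Dev Tools:

```python
import json

# Build a terms-aggregation body that buckets indexed docs by domain.
# Assumption: the Fess index stores the document's domain in a "host"
# field; verify this against your own index mapping.
body = {
    "size": 0,  # suppress hits; we only want the aggregation buckets
    "aggs": {
        "docs_per_domain": {
            "terms": {"field": "host", "size": 50}  # top 50 domains
        }
    },
}

# Paste the printed JSON after "GET fess.search/_search" in Dev Tools.
print(json.dumps(body, indent=2))
```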

Did you set a value for Included URLs, like https://tkp.at/.*?

In the “URLs” field I have:

https://tkp.at/
https://tkp.at/sitemap.xml

In the “Included URLs when Crawling” field and the “Included URLs when Indexing” field I have nothing.

Now I added https://tkp.at/.* in all three fields, started the crawler, and now the indexed page count for tkp.at is growing. :grinning: Thank you very much! Do I need to add it in all three?
Maybe the documentation could be a little more detailed here.

I think there should be a simple way inside Fess to see how many pages per domain are in the index, without OpenSearch Dashboards. Besides this, and if the crawling problem is solved now, I like the software very much!
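For what it’s worth, the include pattern from above can be checked the same way as the excludes; note that `.*` also matches the empty string, so the bare URL with a trailing slash passes too (Python is used here as a stand-in for Fess’s Java regex matching):

```python
import re

# The include pattern added to the three fields above.
include = re.compile(r"https://tkp.at/.*")

print(bool(include.fullmatch("https://tkp.at/some-article")))       # True
print(bool(include.fullmatch("https://tkp.at/")))                   # True (".*" matches empty)
print(bool(include.fullmatch("https://transition-news.org/page")))  # False
```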

Do I need to add it in all three?

Yes, but it depends on your requirements. It’s best to check fess-crawler.log to see where Fess is crawling.

Your advice helps with many websites, thank you very much again, but it does not help with:

URLs: https://transition-news.org/.*
https://transition-news.org/sitemap.xml
Included URLs (both fields): https://transition-news.org/.*

fess_crawler.log

2024-10-25 12:44:18,387 [Crawler-20241025123620-1-1] INFO  Crawling URL: https://transition-news.org/sucht-der-westen-einen-ausweg-aus-dem-ukraine-krieg
2024-10-25 12:44:19,794 [Crawler-20241025123620-1-3] INFO  Crawling URL: https://transition-news.org/apolut-stiller-abschied-von-der-ukraine
2024-10-25 12:44:19,839 [IndexUpdater] INFO  Processing 3/3 docs (Doc:{access 9ms, cleanup 10ms}, Mem:{used 170.389MB, heap 1.024GB, max 4.096GB})
2024-10-25 12:44:19,864 [IndexUpdater] INFO  Processing no docs in indexing queue (Doc:{access 1ms, cleanup 11ms}, Mem:{used 172.592MB, heap 1.024GB, max 4.096GB})
2024-10-25 12:44:19,884 [IndexUpdater] INFO  Sent 3 docs (Doc:{process 13ms, send 19ms, size 128.45KB}, Mem:{used 175.001MB, heap 1.024GB, max 4.096GB})

but it only has 100 documents in the index and the count isn’t growing; it should be thousands.

You need to create the crawling configuration based on your requirements. Alternatively, if you’d like assistance with this setup, please contact commercial support.