fess_crawler.queue not making progress

We could use some help. The fess_crawler.queue index holds approximately 1.6 million documents that do not appear to be processed. Should this queue shrink over time, and are there settings we need to adjust to make more progress?

The size of the queue depends on the target being crawled. If you stop a crawler, its data remains in the queue. You can delete it with Clear Crawler Indices on the Maintenance page.

Thank you Shinsuke. It appears that the queue is not resuming. If we delete the crawler queue, will it automatically re-index and start downloading again?

The crawler indices are temporary indices and are not used for searching, so you can delete them on the Maintenance page before starting the crawler. They are created again when the crawler starts.

To resume a crawl, add the sessionId method to the crawler job, as below.

...container.getComponent("crawlJob").sessionId("default_crawler").logLevel("info")...

I have the same problem.

Many URLs appear many times in fess_crawler.queue and fess_crawler.data, but not in the search index.

Is sessionId("default_crawler") documented anywhere? I can't find it. I also noticed that when I start a second default crawler with sessionId("default_crawler"), the first one stops. Why?

DEV TOOLS:
GET /_all/_search
{
  "query": {
    "match": {
      "url": "Journalisten und Journaktivisten"
    }
  }
}
Result:
{
  "took": 19,
  "timed_out": false,
  "_shards": {
    "total": 108,
    "successful": 108,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 23,
      "relation": "eq"
    },
    "max_score": 13.776005,
    "hits": [
      {
        "_index": "fess_crawler.queue",
        "_id": "20241026162608-23.aHR0cHM6Ly9qdW5nZWZyZWloZWl0LmRlL2RlYmF0dGUva29tbWVudGFyLzIwMTcvam91cm5hbGlzdGVuLXVuZC1qb3VybmFrdGl2aXN0ZW4v",
        "_score": 13.776005,
        "_source": {
          "sessionId": "20241026162608-23",
          "method": "GET",
          "url": "Journalisten und Journaktivisten",
          "parentUrl": "https://jungefreiheit.de/sitemap-posttype-post.2017.xml",
          "depth": 2,
          "createTime": 1729961390284
        }
      },
      {
        "_index": "fess_crawler.queue",
        "_id": "20241025214659-27.aHR0cHM6Ly9qdW5nZWZyZWloZWl0LmRlL2RlYmF0dGUva29tbWVudGFyLzIwMTcvam91cm5hbGlzdGVuLXVuZC1qb3VybmFrdGl2aXN0ZW4v",
        "_score": 13.712889,
        "_source": {
          "sessionId": "20241025214659-27",
          "method": "GET",
          "url": "Journalisten und Journaktivisten",
          "parentUrl": "https://jungefreiheit.de/sitemap-posttype-post.2017.xml",
          "depth": 2,
          "createTime": 1729899724879
        }
      },
      {
        "_index": "fess_crawler.queue",
        "_id": "20241027221225-23.aHR0cHM6Ly9qdW5nZWZyZWloZWl0LmRlL2RlYmF0dGUva29tbWVudGFyLzIwMTcvam91cm5hbGlzdGVuLXVuZC1qb3VybmFrdGl2aXN0ZW4v",
        "_score": 13.298319,
        "_source": {
          "sessionId": "20241027221225-23",
          "method": "GET",
          "url": "Journalisten und Journaktivisten",
          "parentUrl": "https://jungefreiheit.de/sitemap-posttype-post.2017.xml",
          "depth": 2,
          "createTime": 1730068099271
        }
      },
      {
        "_index": "fess_crawler.queue",
        "_id": "20241103195312-29.aHR0cHM6Ly9qdW5nZWZyZWloZWl0LmRlL2RlYmF0dGUva29tbWVudGFyLzIwMTcvam91cm5hbGlzdGVuLXVuZC1qb3VybmFrdGl2aXN0ZW4v",
        "_score": 13.269122,
        "_source": {
          "sessionId": "20241103195312-29",
          "method": "GET",
          "url": "Journalisten und Journaktivisten",
          "parentUrl": "https://jungefreiheit.de/sitemap-posttype-post.2017.xml",
          "depth": 2,
          "createTime": 1730677132288
        }
      },
      {
        "_index": "fess_crawler.queue",
        "_id": "20241027133111-23.aHR0cHM6Ly9qdW5nZWZyZWloZWl0LmRlL2RlYmF0dGUva29tbWVudGFyLzIwMTcvam91cm5hbGlzdGVuLXVuZC1qb3VybmFrdGl2aXN0ZW4v",
        "_score": 13.201987,
        "_source": {
          "sessionId": "20241027133111-23",
          "method": "GET",
          "url": "Journalisten und Journaktivisten",
          "parentUrl": "https://jungefreiheit.de/sitemap-posttype-post.2017.xml",
          "depth": 2,
          "createTime": 1730036866513
        }
      },
      {
        "_index": "fess_crawler.queue",
        "_id": "20241027221227-23.aHR0cHM6Ly9qdW5nZWZyZWloZWl0LmRlL2RlYmF0dGUva29tbWVudGFyLzIwMTcvam91cm5hbGlzdGVuLXVuZC1qb3VybmFrdGl2aXN0ZW4v",
        "_score": 13.201987,
        "_source": {
          "sessionId": "20241027221227-23",
          "method": "GET",
          "url": "Journalisten und Journaktivisten",
          "parentUrl": "https://jungefreiheit.de/sitemap-posttype-post.2017.xml",
          "depth": 2,
          "createTime": 1730068252380
        }
      },
      {
        "_index": "fess_crawler.data",
        "_id": "20241031092011-28.aHR0cHM6Ly9qdW5nZWZyZWloZWl0LmRlL2RlYmF0dGUva29tbWVudGFyLzIwMTcvam91cm5hbGlzdGVuLXVuZC1qb3VybmFrdGl2aXN0ZW4v",
        "_score": 12.81908,
        "_source": {
          "sessionId": "20241031092011-28",
          "ruleId": "webHtmlRule",
          "url": "Journalisten und Journaktivisten",
          "parentUrl": "https://jungefreiheit.de/sitemap-posttype-post.2017.xml",
          "status": 9999,
          "httpStatusCode": 200,
          "method": "GET",
          "mimeType": "text/html",
          "createTime": 1730502705642,
          "executionTime": 11070,
          "contentLength": 236080
        }
      },
      {
        "_index": "fess_crawler.data",
        "_id": "20241108161625-29.aHR0cHM6Ly9qdW5nZWZyZWloZWl0LmRlL2RlYmF0dGUva29tbWVudGFyLzIwMTcvam91cm5hbGlzdGVuLXVuZC1qb3VybmFrdGl2aXN0ZW4v",
        "_score": 12.814385,
        "_source": {
          "sessionId": "20241108161625-29",
          "ruleId": "webHtmlRule",
          "url": "Journalisten und Journaktivisten",
          "parentUrl": "https://jungefreiheit.de/sitemap-posttype-post.2017.xml",
          "status": 9999,
          "httpStatusCode": 200,
          "method": "GET",
          "mimeType": "text/html",
          "createTime": 1731145696819,
          "executionTime": 455,
          "contentLength": 236357,
          "lastModified": 1731095985000
        }
      },
      {
        "_index": "fess_crawler.data",
        "_id": "20241105005307-29.aHR0cHM6Ly9qdW5nZWZyZWloZWl0LmRlL2RlYmF0dGUva29tbWVudGFyLzIwMTcvam91cm5hbGlzdGVuLXVuZC1qb3VybmFrdGl2aXN0ZW4v",
        "_score": 12.807099,
        "_source": {
          "sessionId": "20241105005307-29",
          "ruleId": "webHtmlRule",
          "url": "Journalisten und Journaktivisten",
          "parentUrl": "https://jungefreiheit.de/sitemap-posttype-post.2017.xml",
          "status": 9999,
          "httpStatusCode": 200,
          "method": "GET",
          "mimeType": "text/html",
          "createTime": 1730842427379,
          "executionTime": 11199,
          "contentLength": 237565
        }
      },
      {
        "_index": "fess_crawler.data",
        "_id": "20241028110000-1.aHR0cHM6Ly9qdW5nZWZyZWloZWl0LmRlL2RlYmF0dGUva29tbWVudGFyLzIwMTcvam91cm5hbGlzdGVuLXVuZC1qb3VybmFrdGl2aXN0ZW4v",
        "_score": 12.805907,
        "_source": {
          "sessionId": "20241028110000-1",
          "ruleId": "webHtmlRule",
          "url": "Journalisten und Journaktivisten",
          "parentUrl": "https://jungefreiheit.de/sitemap-posttype-post.2017.xml",
          "status": 9999,
          "httpStatusCode": 200,
          "method": "GET",
          "mimeType": "text/html",
          "createTime": 1730143429161,
          "executionTime": 84,
          "contentLength": 236428,
          "lastModified": 1729897573000
        }
      }
    ]
  }
}
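
The "url" values in the pasted result show a page title rather than an address, probably because the forum rewrote the link. Judging only from the IDs above, each "_id" looks like the sessionId, a dot, and the Base64-encoded URL. The following is a minimal Python sketch under that assumption; decode_crawler_id and summarize_hits are hypothetical helper names, not part of Fess. It recovers the real URL and counts how many hits come from each index.

```python
import base64
from collections import Counter


def decode_crawler_id(doc_id):
    """Split '<sessionId>.<base64 url>' and decode the URL part.

    Assumption: the part after the first dot is Base64 of the UTF-8 URL,
    as the IDs pasted in this thread suggest.
    """
    session_id, _, encoded = doc_id.partition(".")
    # Add '=' padding in case the encoded part is not a multiple of 4 long.
    padded = encoded + "=" * (-len(encoded) % 4)
    url = base64.urlsafe_b64decode(padded).decode("utf-8")
    return session_id, url


def summarize_hits(hits):
    """Count hits per index and collect the distinct session IDs."""
    per_index = Counter(hit["_index"] for hit in hits)
    sessions = {decode_crawler_id(hit["_id"])[0] for hit in hits}
    return per_index, sessions


# Two abbreviated hits copied from the result above.
ENCODED = (
    "aHR0cHM6Ly9qdW5nZWZyZWloZWl0LmRlL2RlYmF0dGUva29tbWVudGFyLzIwMTcv"
    "am91cm5hbGlzdGVuLXVuZC1qb3VybmFrdGl2aXN0ZW4v"
)
sample_hits = [
    {"_index": "fess_crawler.queue", "_id": "20241026162608-23." + ENCODED},
    {"_index": "fess_crawler.data", "_id": "20241031092011-28." + ENCODED},
]

if __name__ == "__main__":
    print(decode_crawler_id(sample_hits[0]["_id"]))
    print(summarize_hits(sample_hits))
```

Every hit in the pasted result belongs to a fess_crawler.* index and carries a different sessionId, which would be consistent with the same URL having been queued by several separate crawl sessions.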

The fess_crawler.* indices are used by the crawler itself, while indexed documents are stored in the fess.* index. If you encounter any issues, please check fess-crawler.log with the debug log level enabled.

Here are some logs at debug level:
https://schwurbeltreff.de/test/fess-crawler.log

I have now stopped the crawlers, deleted the crawler indices, and added sessionId(…) to the default crawlers. We'll see if that helps.

The crawled URLs seem to be redirected.

2024-11-12 17:06:19,266 [Crawler-20241026162608_23-17-3] INFO  Crawling URL: https://deutsche-wirtschafts-nachrichten.de/75265/Allianz-Chef-Diekmann-will-Pimco-in-ruhige-Fahrwasser-bringen
2024-11-12 17:06:19,380 [Crawler-20241026162608_23-17-3] INFO  Redirect to URL: https://deutsche-wirtschafts-nachrichten.de/2014/05/07/allianz-chef-diekmann-will-pimco-in-ruhige-fahrwasser-bringen
2024-11-12 17:06:29,384 [Crawler-20241026162608_23-17-3] INFO  Crawling URL: https://deutsche-wirtschafts-nachrichten.de/75018/Ukrainische-Armee-setzt-Offensive-im-Osten-fort
2024-11-12 17:06:29,495 [Crawler-20241026162608_23-17-3] INFO  Redirect to URL: https://deutsche-wirtschafts-nachrichten.de/2014/05/04/ukrainische-armee-setzt-offensive-im-osten-fort
2024-11-12 17:06:39,500 [Crawler-20241026162608_23-17-3] INFO  Crawling URL: https://deutsche-wirtschafts-nachrichten.de/74824/Neue-Studie-Roundup-von-Monsanto-greift-Verdauung-an
2024-11-12 17:06:39,643 [Crawler-20241026162608_23-17-3] INFO  Redirect to URL: https://deutsche-wirtschafts-nachrichten.de/2014/05/03/neue-studie-roundup-von-monsanto-greift-verdauung-an
2024-11-12 17:06:49,647 [Crawler-20241026162608_23-17-3] INFO  Crawling URL: https://deutsche-wirtschafts-nachrichten.de/75234/Chinas-Online-Riese-Alibaba-geht-in-den-USA-an-die-Boerse
2024-11-12 17:06:49,733 [Crawler-20241026162608_23-17-3] INFO  Redirect to URL: https://deutsche-wirtschafts-nachrichten.de/2014/05/07/chinas-online-riese-alibaba-geht-in-den-usa-an-die-boerse
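
To see how many requested URLs end up redirected, the log lines above can be paired up with a small script. This is a rough Python sketch that assumes the log keeps exactly the format shown in this thread; LINE_RE and pair_redirects are hypothetical helpers, not part of Fess.

```python
import re

# Matches the two kinds of INFO lines shown in the excerpt above.
LINE_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} "
    r"\[(?P<thread>[^\]]+)\] INFO\s+"
    r"(?P<kind>Crawling URL|Redirect to URL): (?P<url>\S+)$"
)


def pair_redirects(lines):
    """Return (requested_url, redirect_target) pairs, matched per thread."""
    pending = {}  # thread name -> last requested URL
    pairs = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # ignore lines in any other format
        thread = match["thread"]
        if match["kind"] == "Crawling URL":
            pending[thread] = match["url"]
        elif thread in pending:
            pairs.append((pending.pop(thread), match["url"]))
    return pairs


sample_log = [
    "2024-11-12 17:06:19,266 [Crawler-20241026162608_23-17-3] INFO  Crawling URL: https://deutsche-wirtschafts-nachrichten.de/75265/Allianz-Chef-Diekmann-will-Pimco-in-ruhige-Fahrwasser-bringen",
    "2024-11-12 17:06:19,380 [Crawler-20241026162608_23-17-3] INFO  Redirect to URL: https://deutsche-wirtschafts-nachrichten.de/2014/05/07/allianz-chef-diekmann-will-pimco-in-ruhige-fahrwasser-bringen",
]

if __name__ == "__main__":
    for requested, target in pair_redirects(sample_log):
        print(requested, "->", target)
```

If most requested URLs show up in such pairs, it may be worth checking whether the redirect targets still match the include patterns of the crawl configuration.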

I think I finally found my mistake.

In the crawl settings, the URLs field must contain the plain start URL, for example https://www.test.com/ (WITHOUT .*).
The two Included URLs fields must contain the URL as a pattern, for example https://www.test.com/.* (WITH .*).
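
The difference between the two kinds of fields can be illustrated with a short Python sketch. It assumes that the include fields are treated as regular expressions that have to match the whole URL; that is an assumption about Fess, and Python's re.fullmatch only stands in for whatever matching Fess actually does. The helper is_included is hypothetical.

```python
import re


def is_included(url, include_patterns):
    """True if any pattern matches the entire URL (assumed matching rule)."""
    return any(re.fullmatch(pattern, url) for pattern in include_patterns)


# The start URL is a literal address the crawler fetches first.
start_url = "https://www.test.com/"

# The include pattern ends in .* so that every page below the site matches.
with_wildcard = ["https://www.test.com/.*"]
without_wildcard = ["https://www.test.com/"]

if __name__ == "__main__":
    page = "https://www.test.com/news/1"
    print(is_included(page, with_wildcard))
    print(is_included(page, without_wildcard))
```

Under this assumption, an include pattern without .* would only ever match the start page itself, while a start URL that ends in .* is not a real address the crawler could fetch.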