fess_crawler.queue not making progress

ggriffin · April 18, 2022, 5:30pm

We could use some help. The fess_crawler.queue index holds approximately 1.6 million documents that do not appear to be processing. Should this queue shrink over time, and are there settings we need to adjust to make more progress?

shinsuke · April 19, 2022, 11:45am

The size of the queue depends on the source being crawled. If you stop a crawler, its data remains in the queue. You can delete it with Clear Crawler Indices on the Maintenance page.

ggriffin · April 19, 2022, 9:27pm

Thank you Shinsuke. It appears that the queue is not resuming. If we delete the crawler queue, will it automatically re-index and start downloading again?

shinsuke · April 19, 2022, 10:42pm

The crawler indices are temporary indices, not used for searching. So you can delete them on the Maintenance page before starting the crawler. They will be created again when the crawler starts.

To resume a crawl, you need to add the sessionId method to the crawler job, as below.

...container.getComponent("crawlJob").sessionId("default_crawler").logLevel("info")...

CaptainFuture · November 11, 2024, 8:44pm

I have the same problem.

Many pages appear many times in fess_crawler.queue and fess_crawler.data, but not in the search index.

Is sessionId("default_crawler") documented anywhere? I can't find it. I also notice that when I try to run a second default crawler with sessionId("default_crawler"), the first one stops. Why?

DEV TOOLS:
GET /_all/_search
{
  "query": {
    "match": {
      "url": "Journalisten und Journaktivisten"
    }
  }
}
Result:
{
  "took": 19,
  "timed_out": false,
  "_shards": {
    "total": 108,
    "successful": 108,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 23,
      "relation": "eq"
    },
    "max_score": 13.776005,
    "hits": [
      {
        "_index": "fess_crawler.queue",
        "_id": "20241026162608-23.aHR0cHM6Ly9qdW5nZWZyZWloZWl0LmRlL2RlYmF0dGUva29tbWVudGFyLzIwMTcvam91cm5hbGlzdGVuLXVuZC1qb3VybmFrdGl2aXN0ZW4v",
        "_score": 13.776005,
        "_source": {
          "sessionId": "20241026162608-23",
          "method": "GET",
          "url": "Journalisten und Journaktivisten",
          "parentUrl": "https://jungefreiheit.de/sitemap-posttype-post.2017.xml",
          "depth": 2,
          "createTime": 1729961390284
        }
      },
      {
        "_index": "fess_crawler.queue",
        "_id": "20241025214659-27.aHR0cHM6Ly9qdW5nZWZyZWloZWl0LmRlL2RlYmF0dGUva29tbWVudGFyLzIwMTcvam91cm5hbGlzdGVuLXVuZC1qb3VybmFrdGl2aXN0ZW4v",
        "_score": 13.712889,
        "_source": {
          "sessionId": "20241025214659-27",
          "method": "GET",
          "url": "Journalisten und Journaktivisten",
          "parentUrl": "https://jungefreiheit.de/sitemap-posttype-post.2017.xml",
          "depth": 2,
          "createTime": 1729899724879
        }
      },
      {
        "_index": "fess_crawler.queue",
        "_id": "20241027221225-23.aHR0cHM6Ly9qdW5nZWZyZWloZWl0LmRlL2RlYmF0dGUva29tbWVudGFyLzIwMTcvam91cm5hbGlzdGVuLXVuZC1qb3VybmFrdGl2aXN0ZW4v",
        "_score": 13.298319,
        "_source": {
          "sessionId": "20241027221225-23",
          "method": "GET",
          "url": "Journalisten und Journaktivisten",
          "parentUrl": "https://jungefreiheit.de/sitemap-posttype-post.2017.xml",
          "depth": 2,
          "createTime": 1730068099271
        }
      },
      {
        "_index": "fess_crawler.queue",
        "_id": "20241103195312-29.aHR0cHM6Ly9qdW5nZWZyZWloZWl0LmRlL2RlYmF0dGUva29tbWVudGFyLzIwMTcvam91cm5hbGlzdGVuLXVuZC1qb3VybmFrdGl2aXN0ZW4v",
        "_score": 13.269122,
        "_source": {
          "sessionId": "20241103195312-29",
          "method": "GET",
          "url": "Journalisten und Journaktivisten",
          "parentUrl": "https://jungefreiheit.de/sitemap-posttype-post.2017.xml",
          "depth": 2,
          "createTime": 1730677132288
        }
      },
      {
        "_index": "fess_crawler.queue",
        "_id": "20241027133111-23.aHR0cHM6Ly9qdW5nZWZyZWloZWl0LmRlL2RlYmF0dGUva29tbWVudGFyLzIwMTcvam91cm5hbGlzdGVuLXVuZC1qb3VybmFrdGl2aXN0ZW4v",
        "_score": 13.201987,
        "_source": {
          "sessionId": "20241027133111-23",
          "method": "GET",
          "url": "Journalisten und Journaktivisten",
          "parentUrl": "https://jungefreiheit.de/sitemap-posttype-post.2017.xml",
          "depth": 2,
          "createTime": 1730036866513
        }
      },
      {
        "_index": "fess_crawler.queue",
        "_id": "20241027221227-23.aHR0cHM6Ly9qdW5nZWZyZWloZWl0LmRlL2RlYmF0dGUva29tbWVudGFyLzIwMTcvam91cm5hbGlzdGVuLXVuZC1qb3VybmFrdGl2aXN0ZW4v",
        "_score": 13.201987,
        "_source": {
          "sessionId": "20241027221227-23",
          "method": "GET",
          "url": "Journalisten und Journaktivisten",
          "parentUrl": "https://jungefreiheit.de/sitemap-posttype-post.2017.xml",
          "depth": 2,
          "createTime": 1730068252380
        }
      },
      {
        "_index": "fess_crawler.data",
        "_id": "20241031092011-28.aHR0cHM6Ly9qdW5nZWZyZWloZWl0LmRlL2RlYmF0dGUva29tbWVudGFyLzIwMTcvam91cm5hbGlzdGVuLXVuZC1qb3VybmFrdGl2aXN0ZW4v",
        "_score": 12.81908,
        "_source": {
          "sessionId": "20241031092011-28",
          "ruleId": "webHtmlRule",
          "url": "Journalisten und Journaktivisten",
          "parentUrl": "https://jungefreiheit.de/sitemap-posttype-post.2017.xml",
          "status": 9999,
          "httpStatusCode": 200,
          "method": "GET",
          "mimeType": "text/html",
          "createTime": 1730502705642,
          "executionTime": 11070,
          "contentLength": 236080
        }
      },
      {
        "_index": "fess_crawler.data",
        "_id": "20241108161625-29.aHR0cHM6Ly9qdW5nZWZyZWloZWl0LmRlL2RlYmF0dGUva29tbWVudGFyLzIwMTcvam91cm5hbGlzdGVuLXVuZC1qb3VybmFrdGl2aXN0ZW4v",
        "_score": 12.814385,
        "_source": {
          "sessionId": "20241108161625-29",
          "ruleId": "webHtmlRule",
          "url": "Journalisten und Journaktivisten",
          "parentUrl": "https://jungefreiheit.de/sitemap-posttype-post.2017.xml",
          "status": 9999,
          "httpStatusCode": 200,
          "method": "GET",
          "mimeType": "text/html",
          "createTime": 1731145696819,
          "executionTime": 455,
          "contentLength": 236357,
          "lastModified": 1731095985000
        }
      },
      {
        "_index": "fess_crawler.data",
        "_id": "20241105005307-29.aHR0cHM6Ly9qdW5nZWZyZWloZWl0LmRlL2RlYmF0dGUva29tbWVudGFyLzIwMTcvam91cm5hbGlzdGVuLXVuZC1qb3VybmFrdGl2aXN0ZW4v",
        "_score": 12.807099,
        "_source": {
          "sessionId": "20241105005307-29",
          "ruleId": "webHtmlRule",
          "url": "Journalisten und Journaktivisten",
          "parentUrl": "https://jungefreiheit.de/sitemap-posttype-post.2017.xml",
          "status": 9999,
          "httpStatusCode": 200,
          "method": "GET",
          "mimeType": "text/html",
          "createTime": 1730842427379,
          "executionTime": 11199,
          "contentLength": 237565
        }
      },
      {
        "_index": "fess_crawler.data",
        "_id": "20241028110000-1.aHR0cHM6Ly9qdW5nZWZyZWloZWl0LmRlL2RlYmF0dGUva29tbWVudGFyLzIwMTcvam91cm5hbGlzdGVuLXVuZC1qb3VybmFrdGl2aXN0ZW4v",
        "_score": 12.805907,
        "_source": {
          "sessionId": "20241028110000-1",
          "ruleId": "webHtmlRule",
          "url": "Journalisten und Journaktivisten",
          "parentUrl": "https://jungefreiheit.de/sitemap-posttype-post.2017.xml",
          "status": 9999,
          "httpStatusCode": 200,
          "method": "GET",
          "mimeType": "text/html",
          "createTime": 1730143429161,
          "executionTime": 84,
          "contentLength": 236428,
          "lastModified": 1729897573000
        }
      }
    ]
  }
}
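
A side note on reading these hits: in every hit, the part of the _id before the dot equals the sessionId, and the part after the dot looks like the Base64-encoded page URL. A minimal Python sketch for decoding it, assuming that observation holds for all entries (url_from_crawler_id is a name made up for this example, not part of Fess):

```python
import base64


def url_from_crawler_id(doc_id):
    """Split '<sessionId>.<base64 url>' and decode the URL part.

    Assumption: the _id layout seen in the hits above applies to every document.
    """
    session_id, _, encoded = doc_id.partition(".")
    # Re-add any '=' padding that may have been stripped from the encoded part.
    padded = encoded + "=" * (-len(encoded) % 4)
    url = base64.urlsafe_b64decode(padded).decode("utf-8")
    return session_id, url


doc_id = (
    "20241026162608-23."
    "aHR0cHM6Ly9qdW5nZWZyZWloZWl0LmRlL2RlYmF0dGUva29tbWVudGFyLzIwMTcv"
    "am91cm5hbGlzdGVuLXVuZC1qb3VybmFrdGl2aXN0ZW4v"
)
print(url_from_crawler_id(doc_id))
```

This shows that all ten hits point at the same page, queued or fetched under ten different session IDs.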

shinsuke · November 12, 2024, 12:39pm

The fess_crawler.* indices are used by the crawler, while indexed documents are stored in the fess.* indices. If you encounter any issues, please check fess-crawler.log with the debug log level enabled.

CaptainFuture · November 12, 2024, 5:27pm

Here are some logs at debug level:
https://schwurbeltreff.de/test/fess-crawler.log

I have now stopped the crawlers, deleted the crawler indices, and set sessionId(…) on the default crawlers. We'll see if that helps.

shinsuke · November 13, 2024, 12:56pm

The crawled URLs seem to be redirected.

2024-11-12 17:06:19,266 [Crawler-20241026162608_23-17-3] INFO  Crawling URL: https://deutsche-wirtschafts-nachrichten.de/75265/Allianz-Chef-Diekmann-will-Pimco-in-ruhige-Fahrwasser-bringen
2024-11-12 17:06:19,380 [Crawler-20241026162608_23-17-3] INFO  Redirect to URL: https://deutsche-wirtschafts-nachrichten.de/2014/05/07/allianz-chef-diekmann-will-pimco-in-ruhige-fahrwasser-bringen
2024-11-12 17:06:29,384 [Crawler-20241026162608_23-17-3] INFO  Crawling URL: https://deutsche-wirtschafts-nachrichten.de/75018/Ukrainische-Armee-setzt-Offensive-im-Osten-fort
2024-11-12 17:06:29,495 [Crawler-20241026162608_23-17-3] INFO  Redirect to URL: https://deutsche-wirtschafts-nachrichten.de/2014/05/04/ukrainische-armee-setzt-offensive-im-osten-fort
2024-11-12 17:06:39,500 [Crawler-20241026162608_23-17-3] INFO  Crawling URL: https://deutsche-wirtschafts-nachrichten.de/74824/Neue-Studie-Roundup-von-Monsanto-greift-Verdauung-an
2024-11-12 17:06:39,643 [Crawler-20241026162608_23-17-3] INFO  Redirect to URL: https://deutsche-wirtschafts-nachrichten.de/2014/05/03/neue-studie-roundup-von-monsanto-greift-verdauung-an
2024-11-12 17:06:49,647 [Crawler-20241026162608_23-17-3] INFO  Crawling URL: https://deutsche-wirtschafts-nachrichten.de/75234/Chinas-Online-Riese-Alibaba-geht-in-den-USA-an-die-Boerse
2024-11-12 17:06:49,733 [Crawler-20241026162608_23-17-3] INFO  Redirect to URL: https://deutsche-wirtschafts-nachrichten.de/2014/05/07/chinas-online-riese-alibaba-geht-in-den-usa-an-die-boerse
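
To see how many crawled URLs are affected, each "Crawling URL" line can be paired with the "Redirect to URL" line that follows it on the same thread. This is only a rough Python sketch, assuming the log layout shown above; LINE and find_redirects are names made up for this example and are not part of Fess.

```python
import re

# Matches the two INFO message types seen in the excerpt above.
LINE = re.compile(
    r"\[(?P<thread>[^\]]+)\]\s+INFO\s+"
    r"(?P<kind>Crawling URL|Redirect to URL): (?P<url>\S+)"
)


def find_redirects(lines):
    """Return (requested_url, redirect_target) pairs, matched per thread."""
    pending = {}  # thread name -> URL most recently requested by that thread
    redirects = []
    for line in lines:
        m = LINE.search(line)
        if m is None:
            continue
        if m["kind"] == "Crawling URL":
            pending[m["thread"]] = m["url"]
        else:
            source = pending.pop(m["thread"], None)
            if source is not None:
                redirects.append((source, m["url"]))
    return redirects


sample = [
    "2024-11-12 17:06:29,384 [Crawler-20241026162608_23-17-3] INFO  Crawling URL: "
    "https://deutsche-wirtschafts-nachrichten.de/75018/Ukrainische-Armee-setzt-Offensive-im-Osten-fort",
    "2024-11-12 17:06:29,495 [Crawler-20241026162608_23-17-3] INFO  Redirect to URL: "
    "https://deutsche-wirtschafts-nachrichten.de/2014/05/04/ukrainische-armee-setzt-offensive-im-osten-fort",
]
for source, target in find_redirects(sample):
    print(source, "->", target)
```

On a real log file, pass the open file object instead of the sample list, for example find_redirects(open("fess-crawler.log", encoding="utf-8")).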

CaptainFuture · November 14, 2024, 6:55pm

I think I finally found my mistake.

In the crawl settings, the URLs field must contain the plain URL, like https://www.test.com/
WITHOUT .*
The two "Included URLs" fields must contain patterns like https://www.test.com/.*
WITH .*
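
The difference can be illustrated with a small Python sketch. It assumes the include entries are treated as regular expressions that must match the whole URL; Fess itself runs on Java, so this only mimics the idea. START_URL, INCLUDE_PATTERNS and is_included are names made up for this example.

```python
import re

# The URLs field holds a plain start URL: a literal address to fetch.
START_URL = "https://www.test.com/"

# The include fields hold patterns; ".*" means "anything after the prefix".
INCLUDE_PATTERNS = ["https://www.test.com/.*"]


def is_included(url, patterns=INCLUDE_PATTERNS):
    """True if the whole URL matches at least one include pattern."""
    return any(re.fullmatch(pattern, url) for pattern in patterns)


print(is_included(START_URL))
print(is_included("https://www.test.com/news/article-1"))
print(is_included("https://other.example/"))
```

With the pattern above, the start URL and every page below it are included, while other hosts are not. Putting .* into the URLs field instead would ask the crawler to fetch an address that does not exist as written.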