Best practice for limiting crawl time or number of URLs

Hi!

I have set up a crawler that should index ca. 3500 pages. The pages are linked from list pages, and the list pages are set to no-index. It takes the crawler 14 hours to finish, and it ends with status fail.

The last lines are:

2023-04-23 03:54:07,567 [IndexUpdater] INFO  Processing no docs (Doc:{access 5ms, cleanup 34ms}, Mem:{used 236MB, heap 512MB, max 512MB})
2023-04-23 03:54:07,570 [IndexUpdater] INFO  Terminating indexUpdater. emptyListCount is over 3600.
2023-04-23 03:54:07,575 [WebFsCrawler] INFO  [EXEC TIME] crawling time: 50040386ms

I can see that the same pages are being crawled ca. 30 times during that period. Is there a way to limit the crawler's run time or set a maximum number of indexed pages?

I have set the Max Access Count to 5000.

Terminating indexUpdater. emptyListCount is over 3600.

It means that the IndexUpdater timed out. If you have a lot of list pages, the crawler crawls them without sending any documents to the IndexUpdater, and then the timeout occurs.

If “3500 pages” is the correct total, it’s better to check fess-crawler.log, and you should update the crawler settings to exclude the unexpected pages.
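For example, you can count how often each URL appears in the crawl log to spot pages that are being fetched repeatedly (a hypothetical one-liner, assuming a Unix shell and the default log file name):

grep -oE 'https?://[^ "]+' fess-crawler.log | sort | uniq -c | sort -rn | head -20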

If the set of crawled pages is correct, you can change the timeout in fess_config.properties:

indexer.webfs.max.empty.list.count=3600
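Judging from the log above, emptyListCount appears to count consecutive checks in which the IndexUpdater received no documents to process, so raising this value gives the crawler more of these empty checks before the IndexUpdater terminates.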

My URLs are being crawled ca. 30 times each. The 3500 pages that I want to have in the index are done in about 2 hours; for the remaining 12 hours the crawler just repeats them.

Is there a way to get the crawler to access each URL only once?

What does the 3600 in indexer.webfs.max.empty.list.count=3600 mean? Is it a time or a number of URLs?

If I put the list pages in Excluded URLs For Crawling, then the crawler will not access the pages that are listed on them, right? Right now I have the list pages in Excluded URLs For Indexing.

It’s better to set Included/Excluded URLs For Crawling/Indexing; which to use depends on your requirements.
Excluded URLs For Crawling means the crawler ignores those URLs entirely.
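These fields accept regular expression patterns, one per line. As a minimal sketch (assuming, hypothetically, that your list pages live under a /list/ path), putting

https?://www\.example\.com/list/.*

in Excluded URLs For Indexing keeps the list pages out of the index while still letting the crawler follow the links on them, whereas putting the same pattern in Excluded URLs For Crawling stops the crawler from fetching those pages at all, so the pages they link to would never be discovered.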