Keep data forever

(from github.com/geodawg)
What is the correct configuration to keep all the data the web crawler collects? Data seems to be purged from the main index after random amounts of time. I’ve tried setting the crawler to -1 days, 0 days, and then 365 days. The documents all seem to purge at the same time. I want to keep everything, even pages that have been modified. How should I handle this?

(from github.com/marevol)

  1. Disable Check Last Modified
  2. Set Remove Documents Before to -1

That’s it.

(from github.com/geodawg)
Thank you! That fixed it…

(from github.com/micakovic)
Is this setting going to remove documents that no longer exist from the index? Will this not leave many broken links in the index?

(from github.com/marevol)
“Remove Documents Before” is a TTL (time to live).
Which value to use depends on your requirements.
With TTL=-1, documents are never removed from the index, even if they have been removed from the file system.
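
As a rough illustration of the TTL behaviour described here, the following Python sketch models the purge decision. It is not Fess code: the function names `expires_at` and `should_purge` are hypothetical, and the only rule taken from this thread is that a TTL of -1 means documents are never removed.

```python
from datetime import datetime, timedelta
from typing import Optional


def expires_at(indexed_at: datetime, ttl_days: int) -> Optional[datetime]:
    """Return the expiry time, or None when the TTL is negative (keep forever)."""
    if ttl_days < 0:
        # Assumption from the thread: -1 disables removal entirely.
        return None
    return indexed_at + timedelta(days=ttl_days)


def should_purge(indexed_at: datetime, ttl_days: int, now: datetime) -> bool:
    """A document is purged only if it has an expiry time and that time has passed."""
    expiry = expires_at(indexed_at, ttl_days)
    return expiry is not None and expiry < now


indexed = datetime(2024, 1, 1)
now = datetime(2024, 3, 1)  # 60 days after indexing
print(should_purge(indexed, 10, now))   # TTL of 10 days has passed
print(should_purge(indexed, -1, now))   # never expires
```

Note that in this model a document's fate depends only on its TTL, not on whether the source page or file still exists, which is why TTL=-1 keeps deleted files in the index.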

(from github.com/micakovic)
What would be the correct setting to keep all documents which exist (web crawlers, or file system crawler), regardless of how old they are, but remove documents which have disappeared in the meantime?

I have the same question.

To check whether a document still exists, Fess needs to crawl all documents.
So in that case, it’s better to use the default settings.

So if my crawling task takes 5 days to finish, and I set Remove Documents Before=10 and enable Check Last Modified, will it remove the documents that no longer exist and reset the TTL on the ones that do, or will it delete all files so that everything has to be reindexed?

TTL will be updated.
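
To make that answer concrete, here is a small simulation of the scenario in the question: a 5-day crawl with a 10-day TTL. It is only a simplified model, not how Fess is implemented. The function `run_crawl`, the URLs, and the choice to refresh every found document's expiry at the end of the crawl are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def run_crawl(index, live_urls, crawl_end, ttl_days):
    """Refresh the expiry of every URL the crawler found, then drop expired entries.

    index maps a URL to its expiry time (hypothetical structure).
    """
    ttl = timedelta(days=ttl_days)
    for url in live_urls:
        # Documents that still exist get their TTL reset.
        index[url] = crawl_end + ttl
    # Documents that were not found keep their old expiry and eventually age out.
    return {url: expiry for url, expiry in index.items() if expiry > crawl_end}


start = datetime(2024, 1, 1)
index = {}

# First crawl finishes on day 5 and finds both pages.
index = run_crawl(index, {"a", "b"}, start + timedelta(days=5), 10)

# Page "b" disappears; the second crawl (day 10) only finds "a".
index = run_crawl(index, {"a"}, start + timedelta(days=10), 10)
print(sorted(index))  # "b" is still indexed because its TTL has not run out

# Two more crawls without "b" (days 15 and 20).
for day in (15, 20):
    index = run_crawl(index, {"a"}, start + timedelta(days=day), 10)
print(sorted(index))  # only "a" remains; it never had to be reindexed from scratch
```

Under these assumptions, a TTL longer than the crawl duration means existing documents are always refreshed before they expire, while documents that have disappeared are removed once their last TTL runs out.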