Keep data forever

discuss · September 25, 2017, 10:49am

(from github.com/geodawg)
What is the correct configuration to keep all the data the web crawler collects? Documents seem to be purged from the main index after seemingly random amounts of time. I’ve tried setting the crawler to -1 days, 0 days, and then 365 days, but the documents all seem to be purged at the same time. I want to keep everything, even pages that have been modified. How should I handle this?

discuss · September 25, 2017, 1:41pm

(from github.com/marevol)

Disable Check Last Modified
Set Remove Documents Before to -1

That’s it.

discuss · September 28, 2017, 1:50am

(from github.com/geodawg)
Thank you! That fixed it…

discuss · September 28, 2017, 8:00am

(from github.com/micakovic)
Is this setting going to remove documents that no longer exist from the index? Will this not leave many broken links in the index?

discuss · September 28, 2017, 8:09am

(from github.com/marevol)
“Remove Documents Before” is a TTL (time to live).
Which value to use depends on your requirements.
With TTL=-1, documents are not removed from the index even if they have been removed from the file system.
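
As a rough illustration of the TTL behavior described here, a toy model in Python might look like the following. This is only a sketch and not Fess's actual implementation: `is_expired` and its parameters are hypothetical names, and treating every negative value as "never expire" is an assumption based on the -1 case above.

```python
from datetime import datetime, timedelta


def is_expired(last_indexed: datetime, ttl_days: int, now: datetime) -> bool:
    """Toy TTL check (hypothetical helper, not a Fess API).

    Assumption: a negative TTL means the document never expires,
    which models "Remove Documents Before = -1".
    """
    if ttl_days < 0:
        return False
    # A document expires once more than ttl_days have passed since indexing.
    return now - last_indexed > timedelta(days=ttl_days)


now = datetime(2017, 9, 28)
old_doc = datetime(2016, 1, 1)      # indexed well over a year ago
recent_doc = datetime(2017, 9, 1)   # indexed a few weeks ago

keep_forever = is_expired(old_doc, -1, now)        # never purged
purged_after_year = is_expired(old_doc, 365, now)  # older than the TTL
still_fresh = is_expired(recent_doc, 365, now)     # within the TTL
```

In this model a document deleted from the source is never noticed when the TTL is -1, which matches the trade-off mentioned above.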

discuss · September 28, 2017, 8:27am

(from github.com/micakovic)
What would be the correct setting to keep all documents which exist (web crawlers, or file system crawler), regardless of how old they are, but remove documents which have disappeared in the meantime?

rafael · August 31, 2023, 2:38pm

I have the same question:

What would be the correct setting to keep all documents which exist (web crawlers, or file system crawler), regardless of how old they are, but remove documents which have disappeared in the meantime?

shinsuke · August 31, 2023, 8:52pm

To check whether a document still exists, Fess needs to crawl all documents.
So in that case, it’s better to use the default settings.

rafael · September 1, 2023, 12:37pm

So if my crawling task takes 5 days to finish, and I set Remove Documents Before=10 and enable Check Last Modified, will it remove the documents that no longer exist and reset the TTL on the ones that still exist, or will it delete all files so that everything has to be reindexed?

shinsuke · September 1, 2023, 12:57pm

The TTL will be updated.
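
To make the scenario in this exchange concrete, here is a small Python simulation. It is a sketch under stated assumptions, not how Fess is implemented: the `recrawl` and `purge` helpers and the dictionary-based index are hypothetical, and the model simply assumes that every document the crawler still finds gets its expiry pushed forward, while documents it no longer finds keep their old expiry and are eventually purged.

```python
from datetime import date, timedelta


def recrawl(index: dict, found_urls: list, today: date, ttl_days: int) -> dict:
    """Refresh the expiry date of every URL the crawler found (toy model)."""
    expires = today + timedelta(days=ttl_days)
    updated = dict(index)
    for url in found_urls:
        updated[url] = expires  # TTL is reset for documents that still exist
    return updated


def purge(index: dict, today: date) -> dict:
    """Drop every document whose expiry date has passed (toy model)."""
    return {url: exp for url, exp in index.items() if exp > today}


start = date(2023, 9, 1)
ttl_days = 10  # stands in for "Remove Documents Before = 10"

# First crawl indexes two pages.
index = recrawl({}, ["a.html", "b.html"], start, ttl_days)

# A crawl finishing 5 days later finds only a.html; b.html has disappeared.
index = recrawl(index, ["a.html"], start + timedelta(days=5), ttl_days)

# On day 11 the purge removes b.html (expired on day 10) and keeps a.html.
remaining = purge(index, start + timedelta(days=11))
```

Under these assumptions, a TTL longer than the crawl duration lets existing documents be refreshed before they expire, so nothing has to be reindexed from scratch.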