(from github.com/anticommander)
I have a small script running that periodically exports records from the fess.search index and pipes them through another application. I'm seeing that the same document (based on _id) is exported multiple times, with different values for the created and doc_id fields; all other fields are exactly the same.
This is my export query:
{
  "_source": ["doc_id", "created", "content", "host", "url", "filetype", "digest", "label", "title"],
  "query": {
    "range": {
      "created": {
        "gte": "_LAST_EXPORT_RUN_",
        "lt": "_NOW_"
      }
    }
  }
}
Here _NOW_ is the time of the current script execution and _LAST_EXPORT_RUN_ is the time of the previous script execution.
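For context, the substitution step in the script looks roughly like this (a minimal Python sketch, not the exact script; the timestamp format and the helper name are illustrative):

```python
import json
from datetime import datetime, timedelta, timezone

def build_export_query(last_run: str, now: str) -> dict:
    """Build the export search body with the time window substituted in."""
    return {
        "_source": ["doc_id", "created", "content", "host", "url",
                    "filetype", "digest", "label", "title"],
        "query": {
            "range": {
                "created": {"gte": last_run, "lt": now}
            }
        }
    }

# Example: a window covering the last 24 hours.
now = datetime.now(timezone.utc)
body = build_export_query(
    (now - timedelta(days=1)).strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
    now.strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
)
print(json.dumps(body, indent=2))
```

The resulting body is then POSTed against the fess.search index's _search endpoint.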
Say the first time a document is exported it might have these properties (I’ve dropped fields that aren’t applicable to this question):
{
  "_index": "fess.20180409",
  "_type": "doc",
  "_id": "https:%2F%2Fwww.foobar.com%2Ffizz%2Fbuzz",
  "_score": 0,
  "_source": {
    "created": "2019-01-23T00:09:09.082Z",
    "doc_id": "bd9fa0d9e46443f1a0c79d81c123b9d2",
    "url": "https://www.foobar.com/fizz/buzz",
    "content": "fizz buzz foo bar"
  }
}
Then, when the export script runs again the next day, I get the same document (based on _id) but with different created and doc_id values:
{
  "_index": "fess.20180409",
  "_type": "doc",
  "_id": "https:%2F%2Fwww.foobar.com%2Ffizz%2Fbuzz",
  "_score": 0,
  "_source": {
    "created": "2019-01-25T00:08:04.349Z",
    "doc_id": "93e61538d94045f8b2bed90b6c82b962",
    "url": "https://www.foobar.com/fizz/buzz",
    "content": "fizz buzz foo bar"
  }
}
The content field is exactly the same, yet created and doc_id differ. Is there a way to configure the crawler to crawl and index a particular URL only once? Alternatively, is there another way to make sure a document associated with a particular URL is exported only once, or a way to disable updating of already-indexed URLs?
Please let me know if you require further clarification.