How to crawl in real time

Hi Marevol,

I wish I could do something like this:
When I define a datastore crawler, I insert a string in the Parameter field, for example:
sql = select * from MA_ARTIC
The scheduler then updates the Elasticsearch indexes at a later time, adding the information for the new records inserted into the MA_ARTIC table and updating the entries for records that are already indexed.
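For context, the Parameter field of a database datastore crawler usually holds the JDBC connection settings together with the SQL, roughly like this (the driver, URL, and credentials here are placeholders for my actual values):

    driver = com.mysql.jdbc.Driver
    url = jdbc:mysql://localhost:3306/erp
    username = erp_user
    password = erp_password
    sql = select * from MA_ARTIC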
I would like to be able to define a datastore crawler where the string:
sql = select * from MA_ARTIC
takes the form:
sql = select {variable} from MA_ARTIC
where {variable} represents the key of a new record not yet indexed in Elasticsearch.
Immediately afterwards I would trigger the scheduler configured for just this crawler.
That way the information for the new record would be available in the search indexes right away.
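To illustrate, here is a small Python sketch of the substitution I have in mind; the ID column and the helper are only examples of mine, since Fess itself does not expand variables in the Parameter field (in practice the key would presumably go into a WHERE clause):

    # Hypothetical template: the record key goes into a WHERE clause
    # so that only the new row is crawled.
    SQL_TEMPLATE = "select * from MA_ARTIC where ID = {variable}"

    def build_parameter(key):
        # Build the Parameter-field SQL line for a single new record.
        return "sql = " + SQL_TEMPLATE.format(variable=key)

    print(build_parameter(42))  # sql = select * from MA_ARTIC where ID = 42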
I hope I’ve been sufficiently clear.

Many thanks
Luigi

Why don’t you use a WHERE clause with something like an update-time column?
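For example, assuming MA_ARTIC has a last-update column such as UPDATED_AT:

    sql = select * from MA_ARTIC where UPDATED_AT > '2021-01-01 00:00:00'

Each scheduled run would then pick up only the rows changed after the given time, instead of the whole table.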

Hi Marevol,

I understand I have not explained my need well.
In practice I have an ERP and, for the same table (for example MA_ARTIC), I need to define crawlers with very different script profiles, based on needs that I cannot fully define in advance.
I would therefore like to achieve some objectives:

  1. crawl only part of a database table rather than the whole table
  2. run the crawler immediately, instead of waiting for its scheduled run, so that new information is available in searches right away
  3. run the crawler only for the new records that I insert, without waiting for it to process the whole table, which could be very large
So my question is about having crawlers for the same table whose script profiles can change, making use of variables set according to the needs of the ERP day by day.
In the last few days I have implemented some programs that use the Fess API, and I believe that, at this point, I should develop programs that, through the API, create from time to time the crawlers that my ERP needs and the schedulers that run those newly created crawlers.
After checking that a new crawler has completed correctly, I will delete the crawler and its scheduler; otherwise, after a while I would find myself with hundreds of crawlers working on the same table.
Do you think this approach is right, or are there better ways than what I have described?
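For reference, this is a rough Python sketch of the flow I have in mind with the Admin API. The endpoint paths, payload fields, response shapes, and the job script are assumptions on my part and would need to be checked against the API documentation of the installed Fess version:

    import time
    import requests

    # Assumed base URL and access token; Fess admin APIs are
    # authenticated with a token created in the admin UI.
    BASE = "http://localhost:8080/api/admin"
    HEADERS = {"Authorization": "MY_ACCESS_TOKEN"}

    def index_new_record(key):
        # 1. Create a temporary datastore config that selects only the
        #    newly inserted record (endpoint path, payload fields, and
        #    the ID column are assumptions for illustration).
        config = {
            "name": "ma_artic_%s" % key,
            "handler_name": "DatabaseDataStore",
            "handler_parameter": "sql=select * from MA_ARTIC where ID = %s" % key,
        }
        r = requests.put(BASE + "/dataconfig/setting",
                         json=config, headers=HEADERS)
        r.raise_for_status()
        config_id = r.json()["response"]["id"]

        # 2. Create a scheduler job bound to that config and start it.
        #    The script below is illustrative; the real job script
        #    should be copied from an existing crawler job in the
        #    admin UI and restricted to the new config id.
        job = {
            "name": "crawl_ma_artic_%s" % key,
            "script_data": "return container.getComponent(\"crawlJob\").execute(executor);",
        }
        r = requests.put(BASE + "/scheduler/setting",
                         json=job, headers=HEADERS)
        r.raise_for_status()
        job_id = r.json()["response"]["id"]
        requests.post(BASE + "/scheduler/" + job_id + "/start",
                      headers=HEADERS).raise_for_status()

        # 3. Wait for the job to finish, then delete the job and the
        #    config so crawlers do not pile up on the same table.
        while requests.get(BASE + "/scheduler/setting/" + job_id,
                           headers=HEADERS).json()["response"]["running"]:
            time.sleep(5)
        requests.delete(BASE + "/scheduler/setting/" + job_id, headers=HEADERS)
        requests.delete(BASE + "/dataconfig/setting/" + config_id, headers=HEADERS)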

Many thanks
Luigi

It’s a specific use case, so it’s better to create your own program with the Admin API.