indexing time question

Hi, I have a question. If Fess stops in the middle of an indexing task, will it begin all over again? I ask because I have 1,500,000 files and 500,000 of them were already indexed, but we had to start the crawl again and it seems to be taking the same amount of time to get back to the point where it stopped.
Will Fess go over all the files again up to the point where it stopped? Shouldn't it at least be faster to get back to that point, or does it have to calculate the hash and other things for each file again? Will every indexing run take that amount of time? Can it be improved?

Thank you.

If you want to resume a crawl, you need to add a sessionId() call to the script of the Default Crawler job, as below. The string argument can be anything.

return container.getComponent("crawlJob").logLevel("info").gcLogging().sessionId("default_crawler").execute(executor);

I don't have a Default Crawler job; I have two separate ones, one for a filesystem and the other for a datastore. They are configured as below:

return container.getComponent("crawlJob").logLevel("info").sessionId("4EnG5IkBVq0bdBbnzIVl").webConfigIds([] as String[]).fileConfigIds(["4EnG5IkBVq0bdBbnzIVl"] as String[]).dataConfigIds([] as String[]).jobExecutor(executor).execute();
return container.getComponent("crawlJob").logLevel("info").sessionId("LnDEA4oBvxD_9Z3XqUm_").jobExecutor(executor).execute();

It's been almost two days since I started the jobs again, and the number of indexed documents has only increased by 10.

And now I see… it's deleting a lot of files, which it's not supposed to do… I just stopped and started Fess, and the TTL of the files is set to 30 days.

Although I'm not sure exactly what you want to do, the crawling target of the second crawler contains that of the first one. So it's better for the second crawler to exclude 4EnG5IkBVq0bdBbnzIVl.

Hi Shinsuke, I didn't understand what I should delete.
I have two crawler jobs:


SETORES is a filesystem:

return container.getComponent("crawlJob").logLevel("info").sessionId("4EnG5IkBVq0bdBbnzIVl").webConfigIds([] as String[]).fileConfigIds(["4EnG5IkBVq0bdBbnzIVl"] as String[]).dataConfigIds([] as String[]).jobExecutor(executor).execute();

SGED is a datastore:

return container.getComponent("crawlJob").logLevel("info").sessionId("LnDEA4oBvxD_9Z3XqUm_").jobExecutor(executor).execute();

I was crawling SETORES and had 543,000 files indexed, 1,000,000 crawled and another 400,000 in the queue. Then we had a power failure and I had to start those jobs again. But instead of continuing from the break point, it looks like it started all over again, because the indexed file count isn't growing and it looks like it's deleting files. Shouldn't it just continue from where it stopped indexing? It's been almost three days without anything being indexed; it's just going through all the files again, I guess.

And what should I delete from the crawler job? I didn't understand.

Thank you.

The setting above still includes the filesystem crawling. To avoid that, you need to add dataConfigIds() to specify the datastore crawling.
You can create a crawler job from the datastore crawling page.

Something like this?

return container.getComponent("crawlJob").logLevel("info").sessionId("LnDEA4oBvxD_9Z3XqUm_").webConfigIds([] as String[]).fileConfigIds(["LnDEA4oBvxD_9Z3XqUm_"] as String[]).dataConfigIds([] as String[]).jobExecutor(executor).execute();

And what about resuming? Shouldn't it continue crawling from where it stopped? Should I configure something?

I think it should be:

return container.getComponent("crawlJob").logLevel("info").sessionId("LnDEA4oBvxD_9Z3XqUm_").dataConfigIds(["LnDEA4oBvxD_9Z3XqUm_"] as String[]).jobExecutor(executor).execute();

If a crawler with a session ID stops, it restarts with the session data.
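
For reference, a sketch pulling together the scripts already posted above (same config IDs and session IDs, nothing new):

// SETORES: filesystem crawling only, with a fixed session ID so an interrupted crawl can resume
return container.getComponent("crawlJob").logLevel("info").sessionId("4EnG5IkBVq0bdBbnzIVl").webConfigIds([] as String[]).fileConfigIds(["4EnG5IkBVq0bdBbnzIVl"] as String[]).dataConfigIds([] as String[]).jobExecutor(executor).execute();

// SGED: datastore crawling only, with its own fixed session ID
return container.getComponent("crawlJob").logLevel("info").sessionId("LnDEA4oBvxD_9Z3XqUm_").dataConfigIds(["LnDEA4oBvxD_9Z3XqUm_"] as String[]).jobExecutor(executor).execute();

The important part is that each job keeps the same sessionId string across restarts, so the crawler can pick up its existing session data instead of starting over.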

Thank you.

Is there a way to stop Fess cleanly, or only by shutting down the opensearch and fess .bat processes?

You can add Fess as a service on Windows. Please see the doc.