newline and mapping

yaniv · September 24, 2020, 7:13pm

Hi,
I want to be able to find newlines in the indexed text (content indexed by the file system crawler).

I tried the following:

go to System -> Dictionary -> mapping.txt
remove the mapping from the source char “\u000A” to the target char “\u0020”
add a new mapping from the source char “\u000A” to the target char " ___NEWLINE___ "
save, then go to System Info -> Maintenance -> reindex

For some reason, I cannot find my new delimiter in the content.
What am I missing?

Thanks

shinsuke · September 24, 2020, 10:10pm

Fess preprocesses text to reduce its size, so the newline might be removed at that stage.
The setting for this is in fess_config.properties:

crawler.document.space.chars=u0009u000Au000Bu000Cu000Du001Cu001Du001Eu001Fu0020u00A0u1680u180Eu2000u2001u2002u2003u2004u2005u2006u2007u2008u2009u200Au200Bu200Cu202Fu205Fu3000uFEFFuFFFDu00B6
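
As a rough illustration of why newlines would disappear while u000A is listed in this property, here is a minimal Python sketch. It is not Fess's actual code: parse_space_chars and collapse_spaces are hypothetical helpers, the property value is shortened, and the assumption is that each run of listed characters is reduced to a single space.

```python
import re

# Shortened copy of the property value, for illustration only.
SPACE_CHARS = "u0009u000Au000Bu000Cu000Du0020u00A0u3000"

def parse_space_chars(value):
    """Turn 'u0009u000A...' into the set of characters it names."""
    return {chr(int(code, 16)) for code in re.findall(r"u([0-9A-Fa-f]{4})", value)}

def collapse_spaces(text, space_chars):
    """Hypothetical normalizer: replace each run of listed chars with one space."""
    pattern = "[" + "".join(re.escape(c) for c in sorted(space_chars)) + "]+"
    return re.sub(pattern, " ", text).strip()

chars = parse_space_chars(SPACE_CHARS)
print(collapse_spaces("first line\nsecond line\r\nthird", chars))
# prints: first line second line third
```

Under the same assumption, a value without u000A (for example "u0009u0020") would leave the "\n" characters in the text untouched.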

yaniv · September 25, 2020, 6:35am

Hi,
thanks for the quick response…

I tried removing all chars that might somehow be related to a newline (\u000A \u000B \u000C \u000D \u001C \u001D \u001E \u001F)

fess_config is now: crawler.document.space.chars=u0009u0020u00A0u1680u180Eu2000u2001u2002u2003u2004u2005u2006u2007u2008u2009u200Au200Bu200Cu202Fu205Fu3000uFEFFuFFFDu00B6
Then I mapped all of these chars to my string delimiter “__NEWLINE_DELIM__”
and reindexed from the maintenance page.

I still cannot see my delimiter in the search results.
Any other ideas?

Thanks,

shinsuke · September 25, 2020, 12:52pm

Hmm… Please try adding the following setting to Config Parameters on the File Crawling Configuration page.

config.keep.original.body=true

yaniv · September 26, 2020, 3:55pm

Thanks for the prompt response!! (you are really doing amazing work with FESS…)

Unfortunately, it has not worked yet…
(I tried reindexing, starting a new crawl job in the scheduler, and crawling a new folder with new content; it still did not work…).

I don’t think it has anything to do with the issue, but I should note that I’m working with a Docker instance of Fess 13.8-snapshot, together with Open Distro for Elasticsearch, from which I removed the security components (to simplify the setup; I’ll return to security later on…).

yaniv · September 26, 2020, 6:08pm

On a second attempt after a clean install, it worked !!

I did not manage to change the newline escape chars in the mapping to “__NEWLINE__”.
However, it worked when I made just the two changes you mentioned:

In fess_config.properties

crawler.document.space.chars=u0009u000Au000Bu000Cu000Du001Cu001Du001Eu001Fu0020u00A0u1680u180Eu2000u2001u2002u2003u2004u2005u2006u2007u2008u2009u200Au200Bu200Cu202Fu205Fu3000uFEFFuFFFDu00B6

And in file crawling configuration

config.keep.original.body=true

The content itself now has newlines.
I can do the rest with Python's .splitlines(), instead of looking for a special delimiter (even better than what I was trying to do…)
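
A minimal sketch of that last step, assuming the indexed text has already been fetched into a Python string. The content value below is a made-up stand-in for one document's content field, and content_lines is a hypothetical helper:

```python
# Hypothetical stand-in for the "content" field of one search result.
content = "first line\nsecond line\r\n\nfourth line"

def content_lines(text, skip_blank=True):
    """Split text on line boundaries, optionally dropping blank lines."""
    lines = text.splitlines()
    if skip_blank:
        lines = [line for line in lines if line.strip()]
    return lines

print(content_lines(content))
# prints: ['first line', 'second line', 'fourth line']
```

str.splitlines() splits on more than "\n": it also treats "\r", "\r\n", "\x0b", "\x0c", "\x1c", "\x1d" and "\x1e" as line boundaries, so most of the chars discussed earlier in the thread are covered without any custom delimiter.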

THANKS!!