newline and mapping

Hi,
I want to be able to find newlines in the text (indexed by the file system crawler).

I tried the following:

  • go to System->Dictionary->mapping.txt
  • remove the source char “\u000A” from the mapping to the target char “\u0020”
  • add a new mapping from the source char “\u000A” to the target string " ___NEWLINE___ "
  • save, then go to System Info -> Maintenance and reindex

For some reason, I cannot find my new delimiter in the content.
What am I missing?

Thanks

Fess preprocesses text to reduce its size, so the newline might be removed in that step before your mapping ever sees it.
The relevant setting is in fess_config.properties:

crawler.document.space.chars=\u0009\u000A\u000B\u000C\u000D\u001C\u001D\u001E\u001F\u0020\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u200B\u200C\u202F\u205F\u3000\uFEFF\uFFFD\u00B6
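
To illustrate why the mapping may never see the newline, here is a rough Python sketch. It is not Fess's actual code: `SPACE_CHARS` is a shortened stand-in for the setting above, `normalize_spaces` is a made-up name, and the behavior (every listed character becomes a plain space, then runs of spaces are collapsed) is an assumption.

```python
import re

# Hypothetical, shortened stand-in for crawler.document.space.chars
SPACE_CHARS = "\u0009\u000A\u000B\u000C\u000D\u0020\u00A0\u3000"


def normalize_spaces(text, space_chars=SPACE_CHARS):
    """Replace each listed character with a space, then collapse runs.

    This mirrors the assumed preprocessing, not Fess's real implementation.
    """
    table = str.maketrans({ch: " " for ch in space_chars})
    replaced = text.translate(table)
    return re.sub(r" {2,}", " ", replaced).strip()


sample = "first line\nsecond line\r\n\tthird line"
result = normalize_spaces(sample)
print(repr(result))
```

If something like this runs before the char mapping, there is no “\u000A” left in the text to map.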

Hi,
thanks for the quick response…

I tried removing all the chars that might be related to a newline (\u000A \u000B \u000C \u000D \u001C \u001D \u001E \u001F):

  • fess_config.properties now has: crawler.document.space.chars=\u0009\u0020\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u200B\u200C\u202F\u205F\u3000\uFEFF\uFFFD\u00B6
  • then I mapped all of the removed chars to my string delimiter “__NEWLINE_DELIM__”.
  • and reindexed from the maintenance page.
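
In case it helps anyone repeating these steps, the reduced value can be derived instead of edited by hand. This is only a small Python sketch: `strip_code_points` is a hypothetical helper, not part of Fess, and it accepts entries written with or without the leading backslash, keeping each one as it was written.

```python
import re

# Default value of crawler.document.space.chars as quoted in this thread
DEFAULT = (
    r"\u0009\u000A\u000B\u000C\u000D\u001C\u001D\u001E\u001F\u0020\u00A0"
    r"\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008"
    r"\u2009\u200A\u200B\u200C\u202F\u205F\u3000\uFEFF\uFFFD\u00B6"
)

# Code points this thread treats as newline-related
NEWLINE_LIKE = {"000A", "000B", "000C", "000D", "001C", "001D", "001E", "001F"}


def strip_code_points(value, drop):
    """Return value without the entries whose 4-digit hex code is in drop."""
    kept = []
    for match in re.finditer(r"\\?u([0-9A-Fa-f]{4})", value):
        if match.group(1).upper() not in drop:
            kept.append(match.group(0))  # keep the entry exactly as written
    return "".join(kept)


reduced = strip_code_points(DEFAULT, NEWLINE_LIKE)
print("crawler.document.space.chars=" + reduced)
```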

I still cannot see my delimiter in the search results.
Any other ideas?

Thanks,

Hmm… please try adding the following setting to Config Parameters on the File Crawling Configuration page.

config.keep.original.body=true

Thanks for the prompt response!! (you are really doing amazing work with FESS…)

Unfortunately, it has not worked yet…
(I tried reindexing, starting a new crawl job in the scheduler, and crawling a new folder with new content; it still did not work…)

I don’t think it has anything to do with the issue, but I should note that I’m working with a Docker instance of Fess 13.8-snapshot, together with Elasticsearch Open Distro, from which I removed the security components (to simplify the setup; I’ll return to security later on…).

On a second attempt after a clean install, it worked!!

I did not manage to change the newline escape chars in the mapping to “__NEWLINE__”.
However, when I just made the two changes you mentioned:

  • In fess_config.properties
crawler.document.space.chars=\u0009\u000A\u000B\u000C\u000D\u001C\u001D\u001E\u001F\u0020\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u200B\u200C\u202F\u205F\u3000\uFEFF\uFFFD\u00B6
  • And in file crawling configuration
config.keep.original.body=true

The content itself now has new lines.
I can do the rest with Python's .splitlines() instead of looking for a special delimiter (which is even better than what I was trying to do…)
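
For anyone landing here later, that last step might look roughly like this. The `content` string is a made-up stand-in for the text of one indexed document; how you fetch it from the index is up to you.

```python
# Hypothetical stand-in for the content of one indexed document
content = "Title line\r\nsecond line\n\nlast line"

# str.splitlines() splits on \n, \r\n, \r and also on \v, \f, \x1c, \x1d, \x1e
lines = content.splitlines()

# Drop empty lines and surrounding whitespace
non_empty = [line.strip() for line in lines if line.strip()]
print(non_empty)
```

Since .splitlines() already recognizes several of the separator chars discussed above, no custom delimiter is needed.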

THANKS!!