I want to be able to find newlines in the indexed text (the index comes from the file system crawler).
I tried the following:
- go to System->Dictionary->mapping.txt
- remove the mapping from source char “\u000A” to target char “\u0020”
- add a new mapping from source char “\u000A” to target char " ___NEWLINE___ "
- save, then System Info -> Maintenance -> reindex
For some reason, I cannot find my new delimiter in the content.
What am I missing?
Fess preprocesses crawled text to reduce its size, so the newline characters may be removed during that step.
The setting that controls this is in fess_config.properties.
thanks for the quick response…
I tried removing all chars that might be somehow related to new line (\u000A \u000B \u000C \u000D \u001C \u001D \u001E \u001F)
- fess_config is now: crawler.document.space.chars=u0009u0020u00A0u1680u180Eu2000u2001u2002u2003u2004u2005u2006u2007u2008u2009u200Au200Bu200Cu202Fu205Fu3000uFEFFuFFFDu00B6
- then I mapped all of these chars to my string delimiter “__NEWLINE_DELIM__”.
- and reindex in the maintenance page.
Still cannot see my delimiter in the search.
Any other ideas?
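
For anyone repeating this step, here is a minimal Python sketch of how the newline-related code points could be stripped from a `crawler.document.space.chars` value. It assumes the value is a plain run of `uXXXX` tokens, as in the line quoted above. The helper name `strip_newline_chars` and the sample input are hypothetical and not part of Fess.

```python
import re

# Code points listed in the post as possibly newline-related.
NEWLINE_RELATED = {"000A", "000B", "000C", "000D", "001C", "001D", "001E", "001F"}


def strip_newline_chars(value):
    """Return the uXXXX token string with newline-related tokens removed."""
    tokens = re.findall(r"u([0-9A-Fa-f]{4})", value)
    kept = [t for t in tokens if t.upper() not in NEWLINE_RELATED]
    return "".join("u" + t for t in kept)


# Hypothetical sample value, not the actual Fess default.
sample = "u0009u000Au000Bu000Cu000Du0020u00A0"
print(strip_newline_chars(sample))
```

The result can be pasted back as the property value; the order of the remaining tokens is preserved.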
Hmm… please try adding the following setting to Config Parameters on the File Crawling Configuration page.
Thanks for the prompt response!! (you are really doing amazing work with FESS…)
Unfortunately, it did not work yet…
(tried reindexing, starting a new crawl job in the scheduler, and a new folder with new content to crawl, still did not work…).
I don’t think it has anything to do with the issue, but I should note that I’m working with a Docker instance of Fess 13.8-snapshot, together with Elasticsearch Open Distro, from which I removed the security components (to simplify the setup; I’ll return to security later on…).
On a second attempt after a clean install, it worked!!
I did not manage to change the newline escape chars in the mapping to “__NEWLINE__”.
However, once I made just the two changes you mentioned
- in fess_config.properties
- and in the file crawling configuration
the content itself now has newlines.
I can do the rest with Python’s .splitlines() instead of looking for a special delimiter (even better than what I was trying to do…)
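
A minimal sketch of that last step, assuming the indexed content has already been fetched as a Python string. The helper name `content_lines` and the sample text are hypothetical.

```python
def content_lines(content):
    """Split content into stripped, non-empty lines."""
    return [line.strip() for line in content.splitlines() if line.strip()]


# Hypothetical sample mixing Unix and Windows line endings.
sample = "first line\nsecond line\r\n\nthird line"
print(content_lines(sample))
```

Note that `str.splitlines()` splits on more than `\n` and `\r\n`: it also treats characters such as `\x0b`, `\x0c`, `\x1c`, `\x1d` and `\x1e` as line boundaries, which covers most of the code points removed from the space chars list.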