Crawl files

(from goldenivan · GitHub)
Hi,

I am trying to crawl some files with Fess version 12.1.0. Here is the configuration I used:

In Crawler => File System

  • Name : MyFileCrawl
  • Paths : file:///tmp/crawl/
  • Included Paths For Crawling : file:///tmp/crawl/

The crawl folder has these permissions:

drwxr-xr-x 7 root root 66 Mar 30 15:40 crawl
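
For readers unsure how to interpret that listing, here is a minimal Python sketch that decodes the same mode string. It only covers the top-level directory shown above; the helper name `others_can_list` is made up for illustration, and the thread does not say which user the Fess crawler runs as.

```python
import stat

# Mode bits for "drwxr-xr-x": a directory (S_IFDIR) with permissions 0o755.
mode = stat.S_IFDIR | 0o755

def others_can_list(mode: int) -> bool:
    """True if users other than the owner and group may read and traverse it."""
    return bool(mode & stat.S_IROTH) and bool(mode & stat.S_IXOTH)

# Render the mode the way `ls -l` does, then check the "other" permission bits.
print(stat.filemode(mode))
print(others_can_list(mode))
```

If the check holds for the subfolders and files as well, any local user can read the tree without logging in as root.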

So, I configured a connection in the File Authentication :

  • Scheme : Samba
  • Username : root
  • Password : aBeautifulPassword

I linked this authentication to the file system configuration created previously. Then I created a job in the scheduler and launched the crawl job. This is the output in fess-crawler.log:

2018-03-30 16:06:08,706 [WebFsCrawler] INFO Connected to localhost:9300
2018-03-30 16:06:08,863 [WebFsCrawler] INFO Target Path: file:///tmp/crawl/
2018-03-30 16:06:08,864 [WebFsCrawler] INFO Included Path: file:///tmp/crawl/
2018-03-30 16:06:09,094 [Crawler-aJ_9dmIB4EJ67hIPdwJb-1-3] INFO Crawling URL: file:///tmp/crawl/
2018-03-30 16:06:18,909 [IndexUpdater] INFO Processing no docs (Doc:{access 6ms}, Mem:{used 99MB, heap 153MB, max 495MB})
2018-03-30 16:06:28,893 [IndexUpdater] INFO Processing no docs (Doc:{access 4ms}, Mem:{used 101MB, heap 153MB, max 495MB})
2018-03-30 16:06:38,892 [IndexUpdater] INFO Processing no docs (Doc:{access 3ms}, Mem:{used 102MB, heap 153MB, max 495MB})
2018-03-30 16:06:40,229 [WebFsCrawler] INFO [EXEC TIME] crawling time: 31653ms
2018-03-30 16:06:48,892 [IndexUpdater] INFO Processing no docs (Doc:{access 4ms}, Mem:{used 102MB, heap 153MB, max 495MB})
2018-03-30 16:06:48,892 [IndexUpdater] INFO [EXEC TIME] index update time: 34ms
2018-03-30 16:06:48,942 [main] INFO Finished Crawler

The folder /tmp/crawl contains some subfolders with files, to test the file crawling. But, as the logs show, nothing is crawled. I tried varying the file path value, using one slash, two slashes, and three slashes after file:. Only the triple-slash form got Fess to start crawling the URL.

Am I missing something?
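
A small Python sketch may help explain why the number of slashes matters. It uses the standard library's generic URL parser, not Fess itself, so it only shows how such a URL is commonly split into host and path; how Fess treats each form is not shown in this thread. The helper name `describe` is made up for illustration.

```python
from urllib.parse import urlsplit

def describe(url: str) -> tuple[str, str]:
    """Return (host, path) as a generic URL parser sees them."""
    parts = urlsplit(url)
    return parts.netloc, parts.path

# One, two, and three slashes after the scheme.
for url in ("file:/tmp/crawl/", "file://tmp/crawl/", "file:///tmp/crawl/"):
    host, path = describe(url)
    print(url, "host:", repr(host), "path:", repr(path))
```

With two slashes, the parser reads tmp as a host name and /crawl/ as the path, so that form no longer points at /tmp/crawl/ on the local machine.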

(from marevol (Shinsuke Sugaya) · GitHub)

Included Paths For Crawling : file:///tmp/crawl/

Try file:///tmp/crawl/.*.

So, I configured a connection in the File Authentication :

file:/… is for local file system crawling.
So, authentication is not needed.
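
The reasoning behind the .* suggestion can be sketched as follows, assuming the included paths are regular expressions that must match the whole URL. That assumption is not confirmed in this thread, Python's re module stands in for whatever engine Fess uses, and the file name below is hypothetical.

```python
import re

def is_included(pattern: str, url: str) -> bool:
    # Assumption: the pattern must match the entire URL, not just a prefix.
    return re.fullmatch(pattern, url) is not None

start_url = "file:///tmp/crawl/"
file_url = "file:///tmp/crawl/docs/report.txt"  # hypothetical file under the folder

for pattern in ("file:///tmp/crawl/", "file:///tmp/crawl/.*"):
    print(pattern, is_included(pattern, start_url), is_included(pattern, file_url))
```

Under that assumption, the pattern without .* accepts only the start URL itself, while the pattern with .* also accepts everything below it.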

(from goldenivan · GitHub)
Hi,

Try file:///tmp/crawl/.*.

I removed the authentication, then tested your suggestion. It did not work. Then I tried this:

file:///tmp/crawl

This configuration worked.

In the Fess documentation (File Configuration), you give this information:

This paths are locations to start crawling(ex. file:// or smb://).

Then you specify that a local file system uses file:// and a Windows shared folder uses smb://. Is there any configuration available to crawl a remote Linux server?

(from marevol (Shinsuke Sugaya) · GitHub)
For NFS, use file:// after mounting it to a local server.
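
Once the remote export is mounted, the path to enter in Fess is an ordinary local file URL. A minimal Python sketch for deriving it, assuming a hypothetical mount point /mnt/remote_share (the mount itself is done with the operating system's usual tools and is not shown):

```python
from pathlib import PurePosixPath

def to_file_url(mount_point: str) -> str:
    """Turn an absolute local path into a file:// URL."""
    # as_uri() raises ValueError for relative paths, so the path must be absolute.
    return PurePosixPath(mount_point).as_uri()

# /mnt/remote_share is a made-up mount point for illustration.
print(to_file_url("/mnt/remote_share"))
```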