How to crawl SharePoint documents?

discuss · November 21, 2017, 3:43am

(from github.com/anatomo)
I’m using Fess version 11.4.2.
I tried to crawl SharePoint documents at http://xx.xxx.xxx.xxx/sites/Shared%20Documents/Forms/AllItems.aspx/ , but it doesn’t work.
I set crawler.document.html.canonical.xpath to an empty value in fess_config.properties, and the server returns a 200 response.
Other pages can be crawled (e.g. http://xx.xxx.xxx.xxx/sites/SitePages/xxx.aspx), but the files under Shared%20Documents cannot.

URL				http://xx.xxx.xxx/sites/Shared%20Documents/Forms/AllItems.aspx/
Included URLs For Crawling	http://xx.xxx.xxx/.*
Excluded URLs For Crawling	
Included URLs For Indexing	
Excluded URLs For Indexing

discuss · November 21, 2017, 12:39pm

(from github.com/marevol)
Could you check fess-crawler.log?

discuss · November 22, 2017, 2:18am

(from github.com/anatomo)
I changed the settings (because the log was too big).

URL				http://xx.xxx.xxx/sites/Shared%20Documents/Forms/AllItems.aspx/
Included URLs For Crawling	http://xx.xxx.xxx/sites/Shared%20Documents/.*
Excluded URLs For Crawling	
Included URLs For Indexing	
Excluded URLs For Indexing

fess-crawler.log
(I replaced some words with “xxxx” or “hogehoge”.)

discuss · November 22, 2017, 3:31am

(from github.com/marevol)
Could you try changing the URL to http://xx.xxx.xxx/sites/Shared%20Documents/Forms/AllItems.aspx (without the trailing slash)?

discuss · November 22, 2017, 4:02am

(from github.com/anatomo)
I tried it, but it still doesn’t seem to work…
fess-crawler.log

discuss · November 28, 2017, 4:01am

(from marevol (Shinsuke Sugaya) · GitHub)

2017-11-22 12:33:35,600 [WebFsCrawler] INFO Included URL: http://xx.xxx.xx.xx/sites/Shared%20Documents/.*

Is this setting correct?

discuss · November 28, 2017, 5:48am

(from github.com/anatomo)
The document file URL is “http://xx.xxx.xx.xx/sites/Shared%20Documents/xxxx.pdf”.
So I set the “included URL” to “http://xx.xxx.xx.xx/sites/Shared%20Documents/.*”.
Is that incorrect?

I tried the following “included URL” patterns:

1. http://xx.xxx.xx.xx/sites/Shared%20Documents/.*
2. http://xx.xxx.xx.xx/sites/Shared%20Documents.*
3. http://xx.xxx.xx.xx/sites/.*
4. http://xx.xxx.xx.xx/sites.*
5. http://xx.xxx.xx.xx/.*
6. http://xx.xxx.xx.xx.*
7. (NULL)

but no document files were returned with any of them.
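
A quick way to rule out the patterns themselves is to test them against the document URL outside of Fess. The following is a minimal Python sketch, not Fess code. It assumes that an include pattern is a regular expression that has to match the whole URL, which may differ from what Fess actually does. The helper name `matching_patterns` is made up for this example.

```python
import re

# Document URL quoted in this post (the host is redacted in the original).
DOC_URL = "http://xx.xxx.xx.xx/sites/Shared%20Documents/xxxx.pdf"

# Patterns 1 to 6 from the list above. Pattern 7 (NULL) means "no include
# filter", so there is nothing to test for it.
PATTERNS = [
    "http://xx.xxx.xx.xx/sites/Shared%20Documents/.*",
    "http://xx.xxx.xx.xx/sites/Shared%20Documents.*",
    "http://xx.xxx.xx.xx/sites/.*",
    "http://xx.xxx.xx.xx/sites.*",
    "http://xx.xxx.xx.xx/.*",
    "http://xx.xxx.xx.xx.*",
]


def matching_patterns(url, patterns):
    """Return the patterns that match the whole URL (assumed semantics)."""
    return [p for p in patterns if re.fullmatch(p, url)]


# Print, for each pattern, whether it accepts the document URL.
for pattern in PATTERNS:
    accepted = pattern in matching_patterns(DOC_URL, PATTERNS)
    print(pattern, "->", accepted)
```

Under that assumption every pattern accepts the document URL, so the include setting alone would not explain the missing files.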

discuss · November 28, 2017, 12:48pm

(from github.com/marevol)
In your log file, no new child URLs appear.
I think it would be better to check the child URLs.

$ grep cluded fess-crawler.log 
2017-11-22 12:33:35,600 [WebFsCrawler] INFO  Included URL: http://xx.xxx.xx.xx/sites/Shared%20Documents/.*
$ grep "Add Child" fess-crawler.log | grep 'http://xx.xxx.xx.xx/sites/Shared%20Documents/'
2017-11-22 12:33:36,872 [Crawler-20171122123314-1-5] DEBUG Add Child: http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx
2017-11-22 12:33:36,872 [Crawler-20171122123314-1-5] DEBUG Add Child: http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx
2017-11-22 12:33:36,872 [Crawler-20171122123314-1-5] DEBUG Add Child: http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx
2017-11-22 12:33:36,873 [Crawler-20171122123314-1-5] DEBUG Add Child: http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx
2017-11-22 12:33:36,877 [Crawler-20171122123314-1-5] DEBUG Add Child: http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx
2017-11-22 12:33:36,877 [Crawler-20171122123314-1-5] DEBUG Add Child: http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx
2017-11-22 12:33:36,878 [Crawler-20171122123314-1-5] DEBUG Add Child: http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx
2017-11-22 12:33:36,986 [Crawler-20171122123314-1-5] DEBUG Add Child: http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx
2017-11-22 12:33:36,986 [Crawler-20171122123314-1-5] DEBUG Add Child: http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx
2017-11-22 12:33:36,987 [Crawler-20171122123314-1-5] DEBUG Add Child: http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx
2017-11-22 12:33:36,987 [Crawler-20171122123314-1-5] DEBUG Add Child: http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx
2017-11-22 12:33:36,990 [Crawler-20171122123314-1-5] DEBUG Add Child: http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx
2017-11-22 12:33:36,991 [Crawler-20171122123314-1-5] DEBUG Add Child: http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx
2017-11-22 12:33:36,991 [Crawler-20171122123314-1-5] DEBUG Add Child: http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx
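
The same check can be condensed by counting the distinct URLs in the "Add Child" lines. This is a minimal Python sketch that assumes the log line layout shown above. The function name `count_child_urls` and the two sample lines (copied from the output above) are only for illustration.

```python
import re
from collections import Counter

# Matches the DEBUG "Add Child" lines in the layout shown in the log above.
ADD_CHILD = re.compile(r"DEBUG Add Child: (\S+)")


def count_child_urls(lines):
    """Count how often each URL appears in 'Add Child' log lines."""
    counts = Counter()
    for line in lines:
        match = ADD_CHILD.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Two lines copied from the grep output above, used as sample input.
SAMPLE_LINES = [
    "2017-11-22 12:33:36,872 [Crawler-20171122123314-1-5] DEBUG Add Child: "
    "http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx",
    "2017-11-22 12:33:36,991 [Crawler-20171122123314-1-5] DEBUG Add Child: "
    "http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx",
]

for url, count in count_child_urls(SAMPLE_LINES).most_common():
    print(count, url)
```

To run it on a real log, pass the open file instead of `SAMPLE_LINES`. If only one distinct URL comes back, the crawler is finding nothing but the start page itself.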

discuss · December 15, 2017, 4:39am

(from github.com/anatomo)
Sorry for replying so late.

I read the HTML source and asked the SharePoint admin.

The SharePoint page that can’t be crawled uses JavaScript.
So I think the crawler couldn’t find the child URLs.
(The page is generated dynamically.)
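
To illustrate the idea, here is a minimal Python sketch, not Fess code. It assumes the crawler only follows links that are present in the HTML returned by the server and does not execute scripts. The sample page, `LinkCollector`, and `static_links` are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that exist in the static HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def static_links(html):
    """Return the links visible without running any JavaScript."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


# Hypothetical page: one static link, plus a document link that is only
# created when the script runs in a browser.
SAMPLE = """
<html><body>
<a href="/sites/Shared%20Documents/Forms/AllItems.aspx">All items</a>
<script>
  var a = document.createElement("a");
  a.href = "/sites/Shared%20Documents/xxxx.pdf";
  document.body.appendChild(a);
</script>
</body></html>
"""

print(static_links(SAMPLE))
```

In this sample only the AllItems.aspx link is collected. The PDF link never appears, which would match a log where the only child URL is the list page itself.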

Could you tell me how to crawl JavaScript pages?