discuss
November 21, 2017, 3:43am
1
(from github.com/anatomo )
I’m using Fess version 11.4.2.
I tried to crawl SharePoint documents at http://xx.xxx.xxx.xxx/sites/Shared%20Documents/Forms/AllItems.aspx/ , but it doesn’t work.
I set crawler.document.html.canonical.xpath to empty in fess_config.properties, and the server returns a 200 response.
Other pages can be crawled (e.g. http://xx.xxx.xxx.xxx/sites/SitePages/xxx.aspx ), but the files under Shared%20Documents cannot.
URL: http://xx.xxx.xxx/sites/Shared%20Documents/Forms/AllItems.aspx/
Included URLs For Crawling: http://xx.xxx.xxx/.*
Excluded URLs For Crawling: (empty)
Included URLs For Indexing: (empty)
Excluded URLs For Indexing: (empty)
discuss
November 21, 2017, 12:39pm
2
(from github.com/marevol )
Could you check fess-crawler.log?
discuss
November 22, 2017, 2:18am
3
(from github.com/anatomo )
I changed the settings (because the log was too big).
URL: http://xx.xxx.xxx/sites/Shared%20Documents/Forms/AllItems.aspx/
Included URLs For Crawling: http://xx.xxx.xxx/sites/Shared%20Documents/.*
Excluded URLs For Crawling: (empty)
Included URLs For Indexing: (empty)
Excluded URLs For Indexing: (empty)
fess-crawler.log
(I replaced some words with “xxxx” or “hogehoge”.)
discuss
November 22, 2017, 3:31am
4
discuss
November 22, 2017, 4:02am
5
(from github.com/anatomo )
I tried it, but it doesn’t seem to work…
fess-crawler.log
discuss
November 28, 2017, 4:01am
6
(from github.com/marevol )
2017-11-22 12:33:35,600 [WebFsCrawler] INFO Included URL: http://xx.xxx.xx.xx/sites/Shared%20Documents/.*
Is this setting correct?
discuss
November 28, 2017, 5:48am
7
(from github.com/anatomo )
The document file URL is “http://xx.xxx.xx.xx/sites/Shared%20Documents/xxxx.pdf”.
So I set the “Included URL” to “http://xx.xxx.xx.xx/sites/Shared%20Documents/.*”.
Is it incorrect?
I tried the following “Included URL” patterns:
1. http://xx.xxx.xx.xx/sites/Shared%20Documents/.*
2. http://xx.xxx.xx.xx/sites/Shared%20Documents.*
3. http://xx.xxx.xx.xx/sites/.*
4. http://xx.xxx.xx.xx/sites.*
5. http://xx.xxx.xx.xx/.*
6. http://xx.xxx.xx.xx.*
7. (NULL)
but no document files were returned with any of them.
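
As a rough sanity check of the patterns above, here is a minimal sketch. It assumes each pattern is applied as a regular expression against the whole URL, and it uses Python's re module as a stand-in for Fess's own matching, so treat the result as an approximation. The two URLs are the redacted ones from this thread.

```python
import re

# Patterns 1-6 from the list above (pattern 7, the empty setting, is left out).
patterns = [
    r"http://xx.xxx.xx.xx/sites/Shared%20Documents/.*",
    r"http://xx.xxx.xx.xx/sites/Shared%20Documents.*",
    r"http://xx.xxx.xx.xx/sites/.*",
    r"http://xx.xxx.xx.xx/sites.*",
    r"http://xx.xxx.xx.xx/.*",
    r"http://xx.xxx.xx.xx.*",
]

# URLs mentioned in this thread: the list page and one document.
urls = [
    "http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx/",
    "http://xx.xxx.xx.xx/sites/Shared%20Documents/xxxx.pdf",
]


def matching_patterns(url):
    """Return the patterns that match the whole URL (assumed semantics)."""
    return [p for p in patterns if re.fullmatch(p, url)]


for url in urls:
    matched = matching_patterns(url)
    print(url, "matched by", len(matched), "of", len(patterns), "patterns")
```

Under that assumption every pattern accepts both URLs, which would suggest the include pattern itself is not what keeps the documents out.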
discuss
November 28, 2017, 12:48pm
8
(from github.com/marevol )
In your log file, no new child URL appears: every “Add Child” entry points back to AllItems.aspx.
I think it’s better to check the child URLs.
$ grep cluded fess-crawler.log
2017-11-22 12:33:35,600 [WebFsCrawler] INFO Included URL: http://xx.xxx.xx.xx/sites/Shared%20Documents/.*
$ grep "Add Child" fess-crawler.log | grep 'http://xx.xxx.xx.xx/sites/Shared%20Documents/'
2017-11-22 12:33:36,872 [Crawler-20171122123314-1-5] DEBUG Add Child: http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx
2017-11-22 12:33:36,872 [Crawler-20171122123314-1-5] DEBUG Add Child: http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx
2017-11-22 12:33:36,872 [Crawler-20171122123314-1-5] DEBUG Add Child: http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx
2017-11-22 12:33:36,873 [Crawler-20171122123314-1-5] DEBUG Add Child: http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx
2017-11-22 12:33:36,877 [Crawler-20171122123314-1-5] DEBUG Add Child: http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx
2017-11-22 12:33:36,877 [Crawler-20171122123314-1-5] DEBUG Add Child: http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx
2017-11-22 12:33:36,878 [Crawler-20171122123314-1-5] DEBUG Add Child: http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx
2017-11-22 12:33:36,986 [Crawler-20171122123314-1-5] DEBUG Add Child: http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx
2017-11-22 12:33:36,986 [Crawler-20171122123314-1-5] DEBUG Add Child: http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx
2017-11-22 12:33:36,987 [Crawler-20171122123314-1-5] DEBUG Add Child: http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx
2017-11-22 12:33:36,987 [Crawler-20171122123314-1-5] DEBUG Add Child: http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx
2017-11-22 12:33:36,990 [Crawler-20171122123314-1-5] DEBUG Add Child: http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx
2017-11-22 12:33:36,991 [Crawler-20171122123314-1-5] DEBUG Add Child: http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx
2017-11-22 12:33:36,991 [Crawler-20171122123314-1-5] DEBUG Add Child: http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx
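
The same check can be sketched in Python by counting how many distinct child URLs the log contains. This is only an illustration: the sample text is three lines copied from the grep output above, and the regular expression assumes the "DEBUG Add Child: <url>" layout seen there. In practice the contents of fess-crawler.log would be passed in instead.

```python
import re
from collections import Counter

# Three lines copied from the grep output above, standing in for the full log.
sample_log = """\
2017-11-22 12:33:36,872 [Crawler-20171122123314-1-5] DEBUG Add Child: http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx
2017-11-22 12:33:36,873 [Crawler-20171122123314-1-5] DEBUG Add Child: http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx
2017-11-22 12:33:36,991 [Crawler-20171122123314-1-5] DEBUG Add Child: http://xx.xxx.xx.xx/sites/Shared%20Documents/Forms/AllItems.aspx
"""

# Assumed line layout: "... DEBUG Add Child: <url>"
ADD_CHILD = re.compile(r"DEBUG Add Child: (\S+)")


def count_child_urls(log_text):
    """Count how often each child URL appears in 'Add Child' log lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ADD_CHILD.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


counts = count_child_urls(sample_log)
for url, n in counts.items():
    print(n, url)
```

If the only distinct URL is the list page itself, the crawler has not discovered any document links on that page.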
discuss
December 15, 2017, 4:39am
9
(from github.com/anatomo )
Sorry for the late reply.
I read the HTML source and talked to the SharePoint admin.
The SharePoint page that can’t be crawled uses JavaScript.
So I think the crawler couldn’t find the child URLs.
(The page is generated dynamically.)
Could you tell me how to crawl JavaScript pages?
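
To illustrate why a dynamically generated page yields no child URLs, here is a small sketch that collects the links present in static HTML. Both HTML snippets and the renderDocumentList() function are hypothetical and are not the actual SharePoint markup, and Python's html.parser is only a stand-in for the crawler's own link extraction.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags found in the static HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def static_links(html):
    """Return the links visible without running any JavaScript."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical page whose link is written directly in the HTML.
static_page = '<html><body><a href="/sites/SitePages/xxx.aspx">page</a></body></html>'

# Hypothetical page whose list is filled in by a script at runtime.
dynamic_page = (
    '<html><body><div id="list"></div>'
    "<script>renderDocumentList();</script></body></html>"
)

print(static_links(static_page))
print(static_links(dynamic_page))
```

A parser that only reads the returned HTML sees the link on the first page and nothing on the second, which is consistent with the log showing no document URLs as children.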