Fess Crawling External Domains

akt · October 11, 2022, 5:37am

Hi,

We have configured Fess to index only our company website, but it is also crawling external domains such as microsoft.com.

I have posted a sample config below. Please let me know what the issue is.

Name						Test
URLs						https://www.abc.com/en-us/index.html
Included URLs For Crawling	https://www.abc.com/.*
Excluded URLs For Crawling	
Included URLs For Indexing	https://www.abc.com/.*
Excluded URLs For Indexing	
							(.*)/assets/.*
							(.*)/www/.*
Config Parameters			config.html.canonical.xpath=
							field.xpath.lastModified=//META[@name="lastmodified"]/@content
							field.xpath.releaseDate=//META[@name="releaseDate"]/@content
Depth						10
Max Access Count			50000
User Agent					Mozilla/5.0 (compatible; Fess/13.10; +http://fess.codelibs.org/bot.html)
The number of Thread		1
Interval time				10000 ms
Boost						1.0
Permissions					{role}guest

Status						Enabled
Description

shinsuke · October 11, 2022, 9:17pm

Did you check fess-crawler.log?

akt · October 12, 2022, 5:43am

Yes. I do see crawl entries for external sites:

2022-10-09 06:07:52,504 [Crawler-XaHAI4MBA_J1JKbDyNUS-1-1] INFO  Crawling URL: https://docs.microsoft.com/en-us/sysinternals/downloads/sigcheck
2022-10-09 06:07:52,600 [Crawler-XaHAI4MBA_J1JKbDyNUS-1-1] INFO  Redirect to URL: https://learn.microsoft.com/en-us/sysinternals/downloads/sigcheck
2022-10-09 06:08:12,977 [Crawler-XaHAI4MBA_J1JKbDyNUS-1-1] INFO  Crawling URL: https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-rdpbcgr/023f1e69-cfe8-4ee6-9ee0-7e759fb4e4ee
2022-10-09 06:08:13,044 [Crawler-XaHAI4MBA_J1JKbDyNUS-1-1] INFO  Redirect to URL: https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-rdpbcgr/023f1e69-cfe8-4ee6-9ee0-7e759fb4e4ee
2022-10-09 06:21:11,224 [Crawler-XaHAI4MBA_J1JKbDyNUS-1-1] INFO  Crawling URL: https://www.bleepingcomputer.com/news/microsoft/hands-on-with-windows-11s-new-task-manager/
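
To see how many external hosts show up in a log like this, a small script can help. The following is only a sketch: it assumes every crawled page is logged with the "Crawling URL: " prefix seen above, and the function name `external_hosts` is made up for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches the "Crawling URL: <url>" part of a fess-crawler.log line.
# "Redirect to URL:" lines are deliberately not matched.
CRAWL_LINE = re.compile(r"Crawling URL: (\S+)")


def external_hosts(log_lines, allowed_host):
    """Count crawled hosts that differ from allowed_host (hypothetical helper)."""
    counts = Counter()
    for line in log_lines:
        match = CRAWL_LINE.search(line)
        if match is None:
            continue
        host = urlsplit(match.group(1)).hostname
        if host and host != allowed_host:
            counts[host] += 1
    return counts


# Sample lines copied from the log excerpt above.
sample = [
    "2022-10-09 06:07:52,504 [Crawler-XaHAI4MBA_J1JKbDyNUS-1-1] INFO  Crawling URL: https://docs.microsoft.com/en-us/sysinternals/downloads/sigcheck",
    "2022-10-09 06:07:52,600 [Crawler-XaHAI4MBA_J1JKbDyNUS-1-1] INFO  Redirect to URL: https://learn.microsoft.com/en-us/sysinternals/downloads/sigcheck",
    "2022-10-09 06:21:11,224 [Crawler-XaHAI4MBA_J1JKbDyNUS-1-1] INFO  Crawling URL: https://www.bleepingcomputer.com/news/microsoft/hands-on-with-windows-11s-new-task-manager/",
]

for host, count in external_hosts(sample, "www.abc.com").most_common():
    print(host, count)
```

For a real run, pass an open file object for fess-crawler.log instead of `sample`; where that file lives depends on the installation.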

shinsuke · October 12, 2022, 11:34pm

Please try the following setting.

Included URLs For Crawling	https://www.abc.com/.*
Excluded URLs For Crawling	
							(.*)/assets/.*
							(.*)/www/.*
Included URLs For Indexing
Excluded URLs For Indexing
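
If it helps to check patterns like these before re-running the crawl, here is a rough sketch in Python. It rests on assumptions: that a URL is crawled only when it fully matches at least one include pattern (if any are set) and no exclude pattern, and that Python's `re` behaves like the crawler's regex engine for patterns this simple. The helper name `is_crawl_target` and the `/assets/logo.png` URL are hypothetical.

```python
import re


def is_crawl_target(url, includes, excludes):
    """Return True if url passes the include/exclude patterns.

    Assumed semantics (not taken from the Fess source): full-string match,
    at least one include must match when includes are given, and no
    exclude may match.
    """
    if includes and not any(re.fullmatch(p, url) for p in includes):
        return False
    return not any(re.fullmatch(p, url) for p in excludes)


# Patterns from the suggested setting above.
includes = [r"https://www.abc.com/.*"]
excludes = [r"(.*)/assets/.*", r"(.*)/www/.*"]

urls = [
    "https://www.abc.com/en-us/index.html",
    "https://www.abc.com/assets/logo.png",  # hypothetical URL
    "https://docs.microsoft.com/en-us/sysinternals/downloads/sigcheck",
]

for url in urls:
    print(is_crawl_target(url, includes, excludes), url)
```

Note that an unescaped `.` in a regular expression matches any character, so `https://www\.abc\.com/.*` is the stricter way to write the include pattern.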

guandalf · January 4, 2023, 2:30pm

Something similar is happening to me as well. I included only a specific URL for crawling, but the crawler is going to other (external) URLs. What are we doing wrong?
I am on 14.5.0.

Thank you.

shinsuke · January 4, 2023, 7:22pm

Could you provide the crawling config to reproduce it?