Exclude pages from being indexed

Hi!

I’ve tried to exclude all paginated pages, such as https://www.oru.se/english/news/?page=4
I’ve put a Java regexp in “Excluded URLs For Indexing” and tried:
.*\?page=\d{1,4}$

Is that correct, and how can I see whether it worked? In crawler.log I can only see “Excluded URL:” entries, and those are the URLs excluded from crawling, not the ones excluded from indexing, right?

I also have a problem with noindex: all those paging URLs turn up over 150 times in each crawl. Why does Fess crawl a noindex URL more than once per crawl?

Example from a search of one url in crawler.log:

 F:\Skrivbordet\fess-crawler.log  (382 hits)
	Line 49307: 2021-12-08 04:40:27,977 [Crawler-20211208000000-2-4] INFO  Crawling URL: https://www.oru.se/nyheter/nyhetsarkiv/nyhetsarkiv-2014/?page=10
	Line 49308: 2021-12-08 04:40:28,121 [Crawler-20211208000000-2-4] INFO  META(robots=noindex): https://www.oru.se/nyheter/nyhetsarkiv/nyhetsarkiv-2014/?page=10
	Line 50590: 2021-12-08 04:46:06,826 [Crawler-20211208000000-2-7] INFO  Crawling URL: https://www.oru.se/nyheter/nyhetsarkiv/nyhetsarkiv-2014/?page=10
	Line 50591: 2021-12-08 04:46:06,984 [Crawler-20211208000000-2-7] INFO  META(robots=noindex): https://www.oru.se/nyheter/nyhetsarkiv/nyhetsarkiv-2014/?page=10
	Line 51236: 2021-12-08 04:48:59,341 [Crawler-20211208000000-2-5] INFO  Crawling URL: https://www.oru.se/nyheter/nyhetsarkiv/nyhetsarkiv-2014/?page=10
	Line 51237: 2021-12-08 04:48:59,493 [Crawler-20211208000000-2-5] INFO  META(robots=noindex): https://www.oru.se/nyheter/nyhetsarkiv/nyhetsarkiv-2014/?page=10
	Line 51917: 2021-12-08 04:51:59,819 [Crawler-20211208000000-2-1] INFO  Crawling URL: https://www.oru.se/nyheter/nyhetsarkiv/nyhetsarkiv-2014/?page=10
	Line 51919: 2021-12-08 04:51:59,957 [Crawler-20211208000000-2-1] INFO  META(robots=noindex): https://www.oru.se/nyheter/nyhetsarkiv/nyhetsarkiv-2014/?page=10
	Line 52643: 2021-12-08 04:55:11,056 [Crawler-20211208000000-2-1] INFO  Crawling URL: https://www.oru.se/nyheter/nyhetsarkiv/nyhetsarkiv-2014/?page=10
	Line 52644: 2021-12-08 04:55:11,205 [Crawler-20211208000000-2-1] INFO  META(robots=noindex): https://www.oru.se/nyheter/nyhetsarkiv/nyhetsarkiv-2014/?page=10
	Line 53404: 2021-12-08 04:58:30,004 [Crawler-20211208000000-2-8] INFO  Crawling URL: https://www.oru.se/nyheter/nyhetsarkiv/nyhetsarkiv-2014/?page=10
	Line 53406: 2021-12-08 04:58:30,153 [Crawler-20211208000000-2-8] INFO  META(robots=noindex): https://www.oru.se/nyheter/nyhetsarkiv/nyhetsarkiv-2014/?page=10
	Line 54183: 2021-12-08 05:01:57,121 [Crawler-20211208000000-2-3] INFO  Crawling URL: https://www.oru.se/nyheter/nyhetsarkiv/nyhetsarkiv-2014/?page=10
	Line 54186: 2021-12-08 05:01:57,966 [Crawler-20211208000000-2-3] INFO  META(robots=noindex): https://www.oru.se/nyheter/nyhetsarkiv/nyhetsarkiv-2014/?page=10
	Line 54967: 2021-12-08 05:05:20,998 [Crawler-20211208000000-2-1] INFO  Crawling URL: https://www.oru.se/nyheter/nyhetsarkiv/nyhetsarkiv-2014/?page=10
	Line 54969: 2021-12-08 05:05:21,180 [Crawler-20211208000000-2-1] INFO  META(robots=noindex): https://www.oru.se/nyheter/nyhetsarkiv/nyhetsarkiv-2014/?page=10

And so on…
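
To put a number on the repeats, the "Crawling URL:" entries can be tallied per URL. The following is only a rough sketch based on the log format shown above; the class name CrawlCounter and the method countCrawls are made up for illustration and are not part of Fess.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Fess: counts how often each URL
// appears in "Crawling URL:" log entries.
public class CrawlCounter {
    // Captures the URL that follows "Crawling URL: " in a log line.
    private static final Pattern CRAWLING = Pattern.compile("Crawling URL: (\\S+)");

    static Map<String, Integer> countCrawls(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = CRAWLING.matcher(line);
            if (m.find()) {
                // Add 1 to the running count for this URL.
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path to the crawler log file.
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        for (Map.Entry<String, Integer> e : countCrawls(lines).entrySet()) {
            // Only report URLs that were fetched more than once.
            if (e.getValue() > 1) {
                System.out.println(e.getValue() + "  " + e.getKey());
            }
        }
    }
}
```

Run against the crawler log, it lists every URL that was fetched more than once together with its count.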

You may need to write \d as \\d.

META(robots=noindex) means that the page contains a meta tag with noindex.
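
To check the expression itself outside Fess, here is a minimal sketch using java.util.regex. It assumes the pattern has to match the whole URL; whether Fess applies it that way is not confirmed in this thread. The class name ExcludePatternCheck and the method isPagingUrl are made up for illustration. The doubled backslashes in the sketch are required by Java string literals; the regex that results is the same one posted above.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Fess: tests the paging regex against sample URLs.
public class ExcludePatternCheck {
    // Java string literals need doubled backslashes;
    // the compiled regex is .*\?page=\d{1,4}$
    private static final Pattern PAGING = Pattern.compile(".*\\?page=\\d{1,4}$");

    // Assumption: the whole URL must match the pattern.
    static boolean isPagingUrl(String url) {
        return PAGING.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "https://www.oru.se/english/news/?page=4",
            "https://www.oru.se/nyheter/nyhetsarkiv/nyhetsarkiv-2014/?page=10",
            "https://www.oru.se/english/news/"
        };
        for (String url : urls) {
            System.out.println(isPagingUrl(url) + "  " + url);
        }
    }
}
```

Under that assumption, the two paging URLs match and the plain listing URL does not, so the expression itself looks fine for URLs that end in ?page= followed by one to four digits.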


But why does the crawler crawl the same links with noindex 150 times each crawl?

What are the settings in your web crawling configuration?






I could not reproduce it…

How can I solve this problem?