Included URLs help

(from github.com/jfabales)
Hi Fess team,

I’ve been testing your product for a few months now, and I need some advice on something we want to do.
Basically, we have a website, https://docs.company.com/, which hosts documentation for many software products. See the examples below:

https://docs.company.com/apple/apple-0.0.1/
https://docs.company.com/apple/apple-0.0.2/
https://docs.company.com/apple/apple-latest/
https://docs.company.com/orange-one/orange-one-1.0.0/
https://docs.company.com/orange-one/orange-one-1.0.1/
https://docs.company.com/orange-one/orange-one-1.0.2/
https://docs.company.com/orange-one/orange-one-1.0.3/
https://docs.company.com/orange-one/orange-one-official/
https://docs.company.com/orange-one/orange-one-latest/
...

We have hundreds of these, and we can index everything just fine. But searching gets a bit messy: when we search for “orange”, for example, we cannot predict which version of the documentation comes out on top, and sometimes the documentation for an older version appears first in the search results. So we want URLs matching https://docs.company.com/.*(latest|official)+.* to have a higher doc boost. We tried with two web crawlers, the second of which has this config and a higher boost:

URL: https://docs.company.com/
Included URLs for Crawling: https://docs.company.com/.*(latest|official)+.*
Included URLs for Indexing: https://docs.company.com/.*(latest|official)+.*
Boost: 3.0
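
A quick way to see what this include pattern accepts is to test it against a few URLs. The sketch below uses Python's `re.fullmatch` as a stand-in for a whole-string regex match. That Fess evaluates the filter this way, and that it applies it to the start URL too, are assumptions not confirmed in this thread, and `https://docs.company.com/apple/` is a hypothetical intermediate page.

```python
import re

# Pattern copied from the "Included URLs" fields above.
INCLUDE = r"https://docs.company.com/.*(latest|official)+.*"

# The start URL, a hypothetical intermediate page, and two URLs from the post.
urls = [
    "https://docs.company.com/",
    "https://docs.company.com/apple/",
    "https://docs.company.com/apple/apple-0.0.1/",
    "https://docs.company.com/apple/apple-latest/",
]


def included(url: str) -> bool:
    # Whole-string match (assumed to mirror how the crawler applies the filter).
    return re.fullmatch(INCLUDE, url) is not None


results = {url: included(url) for url in urls}
for url, ok in results.items():
    print(f"{ok!s:5}  {url}")
```

Under those assumptions only the last URL passes. The start URL and the intermediate page are rejected, so the crawler may never be allowed to fetch the pages that link to the versioned documentation.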

Unfortunately this setup doesn’t work, and we keep getting “URL is null” errors in the fess-crawler log. So we ended up with this config to crawl and index only the URLs with “latest” or “official” in the URL string:

URL: https://docs.company.com/
Included URLs for Crawling: https://docs.company.com/.*
Excluded URLs for Crawling: https://docs.example.com/(?!.*?(?:latest|official)).*
Included URLs for Indexing: https://docs.company.com/.*
Excluded URLs for Indexing: https://docs.example.com/(?!.*?(?:latest|official)).*
Boost: 3.0
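
The negative-lookahead exclude pattern can be checked the same way. This is a sketch only: `re.fullmatch` stands in for a whole-string match, which is an assumption about how Fess applies the filter. As posted, the exclude patterns name `docs.example.com` while the other fields use `docs.company.com`. That may just be a redaction slip, so the sketch also tests a hypothetical variant with the host aligned.

```python
import re

# Exclude pattern exactly as posted (note the docs.example.com host).
EXCLUDE_AS_POSTED = r"https://docs.example.com/(?!.*?(?:latest|official)).*"
# Hypothetical variant with the host aligned to the crawl target.
EXCLUDE_ALIGNED = r"https://docs.company.com/(?!.*?(?:latest|official)).*"

urls = [
    "https://docs.company.com/",
    "https://docs.company.com/apple/apple-0.0.1/",
    "https://docs.company.com/apple/apple-latest/",
]


def excluded(pattern: str, url: str) -> bool:
    # True when the whole URL matches the exclude pattern.
    return re.fullmatch(pattern, url) is not None


for url in urls:
    print(url)
    print("  as posted:", excluded(EXCLUDE_AS_POSTED, url))
    print("  aligned:  ", excluded(EXCLUDE_ALIGNED, url))
```

With the host as posted, none of the `docs.company.com` URLs are excluded. With the aligned host, the lookahead keeps the `latest` URL and excludes the older version, but it also excludes the start URL itself, which contains neither keyword.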

Unfortunately this doesn’t work either.

We’ve tried other approaches as well, but they are not that accurate. I would really love to know your suggested approach for this.

Thanks!

(from github.com/jfabales)
Hmm, I think I got this to work with this config (it is still crawling, though):

URL: https://docs.company.com/
Included URLs for Crawling: https://docs.company.com/.*
Included URLs for Indexing: https://docs.company.com/.*

Doc Boost:
Condition: url.matches("^https://docs.company.com/(.*?(?:latest|official)).*")
Boost expr: 100
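
To sanity-check which documents this condition would boost, the regex can be run against sample URLs. The sketch below is an approximation in Python: `re.fullmatch` stands in for the whole-string semantics of `url.matches(...)`, and the last URL in the list is hypothetical, added to show that the keyword is matched anywhere in the path, not just in the version directory.

```python
import re

# Regex copied from the Doc Boost condition above.
CONDITION = r"^https://docs.company.com/(.*?(?:latest|official)).*"

urls = [
    "https://docs.company.com/apple/apple-0.0.2/",
    "https://docs.company.com/apple/apple-latest/",
    "https://docs.company.com/orange-one/orange-one-1.0.3/",
    "https://docs.company.com/orange-one/orange-one-official/",
    # Hypothetical URL: an old version whose file name contains "latest".
    "https://docs.company.com/apple/apple-0.0.1/latest-changes.html",
]


def is_boosted(url: str) -> bool:
    # Whole-string match, approximating url.matches(...) in the condition.
    return re.fullmatch(CONDITION, url) is not None


for url in urls:
    print(f"{is_boosted(url)!s:5}  {url}")
```

If pages like the hypothetical one exist, a tighter pattern that anchors the keyword to the version directory name may be worth considering.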

But I would still love to hear your answer.

(from github.com/marevol)
I think that Doc Boost is the proper solution.

(from github.com/jfabales)
Cool! Thanks, I’ll close this now.