Can't get fess to index a Discourse site

discuss · May 23, 2017, 8:51am

(from github.com/pcolmer)
We are using Discourse as our forum platform, e.g. discuss.96boards.org. I’ve added the root URL to our web crawler configuration and the logs show that fess is finding the top-level categories, but it isn’t finding any of the actual articles.

Looking at the source code for a category (e.g. https://discuss.96boards.org/c/general) shows that the main visible content (i.e. the list of topics) is actually being added via a script on that page. It would seem that fess isn’t crawling this script code and therefore isn’t able to find the referenced topic pages.

Is there a configuration setting I can change that will convince fess to do what I need it to do?

discuss · May 23, 2017, 1:46pm

(from github.com/marevol)
You need to set non-dynamic pages as crawler starting points, such as https://discuss.96boards.org/latest. It’s better to create a sitemap.xml.

discuss · May 23, 2017, 3:21pm

(from github.com/pcolmer)
There are a couple of drawbacks to trying to use /latest:

1. It is also a dynamic page.
2. By the nature of its name, it only returns the latest pages, not all of them.

The drawback to trying to create a sitemap is that a discussion site is, by its very nature, going to be dynamic so I’d end up trying to find a tool that I can use to generate the sitemap on a daily basis …

Update: I have found a plugin for Discourse that generates a sitemap file so I’m currently in the process of installing that to try it out with fess.

discuss · May 23, 2017, 9:15pm

(from github.com/marevol)
/latest is not a JS-driven dynamic page, and it has a link to the next page (/latest?no_definitions=true&page=X).

curl https://discuss.96boards.org/latest

discuss · May 24, 2017, 11:02am

(from github.com/pcolmer)
That’s a fair point - it looks like different output is returned if curl is used, vs a web browser.

However, that almost makes the situation worse in terms of what fess is retrieving …

According to the crawler log, for example, fess is crawling https://discuss.96boards.org/categories. Now, that includes a link to the Products category, https://discuss.96boards.org/c/products. If I use curl on that URL, I get a lot of topics being returned, but fess isn’t crawling any of them.

Furthermore, I now have a full sitemap.xml file on the Discourse server, the crawler is retrieving it but none of the topics are being crawled/indexed.

The included URLs for crawling are:

https://discuss.96boards.org/.*

and the excluded URLs for crawling are:

.*/.*\?.*
.*/.*\.png
.*/.*\.jpg
.*/.*\.gif
.*/.*\.ico
.*/.*\.css
.*/.*\.js

The indexing boxes (included and excluded URLs for indexing) are both empty.

Why isn’t fess finding everything?

discuss · May 24, 2017, 3:02pm

(from marevol (Shinsuke Sugaya) · GitHub)

> it looks like different output is returned if curl is used, vs a web browser.

Try to change UA.
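
One way to check how much the user agent matters before touching the crawler is to fetch the same page with two different UA strings and count the topic links in each response. The following is only a minimal sketch: the two UA strings are hypothetical examples, and counting `/t/` hrefs is a rough heuristic based on the topic URLs quoted in this thread, not something fess does itself.

```python
import re
import urllib.request

# Hypothetical UA strings for illustration; substitute the ones you actually use.
CURL_UA = "curl/7.54.0"
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/58.0 Safari/537.36"


def build_request(url, user_agent):
    """Create a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent, timeout=30):
    """Download the page body as text using the given User-Agent."""
    with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def count_topic_links(html):
    """Count href attributes that point at a /t/ topic path."""
    return len(re.findall(r'href="[^"]*/t/[^"]*"', html))
```

Calling `count_topic_links(fetch(url, CURL_UA))` and `count_topic_links(fetch(url, BROWSER_UA))` for the same URL should show whether the server hands out plain HTML links to one client and a script-only page to the other.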

discuss · May 26, 2017, 8:37am

(from github.com/pcolmer)
Thank you for suggesting changing the user agent string. I’ve updated the crawler for discuss.96boards.org so that it uses the same agent string as curl (since that seems to be effective at getting HTML that fess ought to be able to index) but the pages still aren’t being crawled.

Is there any way of increasing the diagnostics being logged so that I can try and figure out what is going on here?

discuss · May 26, 2017, 9:13am

(from github.com/pcolmer)
Just to be certain I wasn’t overlooking anything in the logs, I’ve created a separate crawler job just to crawl discuss.96boards.org and I’ve attached the crawler log from it. There really isn’t much happening.

fess-crawler.txt

discuss · May 26, 2017, 1:54pm

(from github.com/marevol)
Did you try https://discuss.96boards.org/latest ?

https://discuss.96boards.org/ is a JS-based dynamic page.

curl https://discuss.96boards.org/

discuss · May 27, 2017, 7:16am

(from github.com/pcolmer)
I must be doing something wrong with the configuration.

I’ve just added https://discuss.96boards.org/latest as another URL to crawl, run the crawler and, apart from logging the fact that it is an included URL for crawling, nothing else changes.

fess-crawler.txt

discuss · May 27, 2017, 7:20am

(from github.com/marevol)

2017-05-27 07:02:02,784 [WebFsCrawler] INFO  Target URL: https://discuss.96boards.org/
2017-05-27 07:02:02,785 [WebFsCrawler] INFO  Included URL: https://discuss.96boards.org/.*
2017-05-27 07:02:02,785 [WebFsCrawler] INFO  Included URL: https://discuss.96boards.org/latest

You added …/latest to “Included URL”, not URL.
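
The difference between a starting URL and an included-URL pattern can be illustrated with a toy crawler. This is a sketch under stated assumptions, not fess’s implementation: the link graph is hypothetical, the root page is modelled as exposing no links (as a JS-rendered page would), and include patterns are assumed to act only as a filter on URLs the crawler has already discovered.

```python
import re
from collections import deque

ROOT = "https://discuss.96boards.org/"
LATEST = "https://discuss.96boards.org/latest"
TOPIC = "https://discuss.96boards.org/t/linux-mipi-dsi-panel-support/196"

# Hypothetical link graph: the root yields no links, /latest links to a topic.
LINKS = {ROOT: [], LATEST: [TOPIC], TOPIC: []}

INCLUDES = [r"https://discuss.96boards.org/.*", r"https://discuss.96boards.org/latest"]


def crawl(seeds, links, includes):
    """Breadth-first walk that starts from seeds and filters by include patterns."""
    def allowed(url):
        return any(re.fullmatch(pattern, url) for pattern in includes)

    queue = deque(url for url in seeds if allowed(url))
    seen = set(queue)
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for target in links.get(url, []):
            if target not in seen and allowed(target):
                seen.add(target)
                queue.append(target)
    return visited
```

With `ROOT` as the only seed, the walk never reaches `/latest` even though a pattern allows it, because nothing links to it; seeding with `LATEST` is what lets the topic page be discovered.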

discuss · May 30, 2017, 9:42am

(from github.com/pcolmer)
I’ve modified the crawler configuration so that URL now specifies https://discuss.96boards.org/latest but it still isn’t working properly. I’ve attached the full log but, as an example, fess says it is crawling https://discuss.96boards.org/t/linux-mipi-dsi-panel-support/196 but if I do a search of all documents, there are only six recorded with the label for this crawl and none of them are this URL.

I also don’t understand why the sitemap isn’t working. It is almost as if fess isn’t parsing the content of these pages for some reason.

fess-crawler.txt

discuss · May 30, 2017, 10:31am

(from github.com/pcolmer)
I’ve enabled debug-level logging and rerun the crawler. There are some exceptions getting logged but I don’t know enough to understand if they are affecting the outcome.

fess-crawler.txt

discuss · May 30, 2017, 1:40pm

(from github.com/marevol)

2017-05-30 09:21:07,196 [WebFsCrawler] INFO  Excluded URL: .*/.*\?.*

I think that the above setting removes the following URL:

/latest?no_definitions=true&page=1

You can ignore the exceptions logged at debug level.
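
The effect of that exclude pattern can be checked outside fess. The sketch below reuses the include and exclude patterns quoted earlier in this thread and assumes they are applied as full-string regular-expression matches, with excludes taking priority; that matching behaviour is an assumption for illustration, not something confirmed here.

```python
import re

INCLUDES = [r"https://discuss.96boards.org/.*"]
EXCLUDES = [
    r".*/.*\?.*",
    r".*/.*\.png",
    r".*/.*\.jpg",
    r".*/.*\.gif",
    r".*/.*\.ico",
    r".*/.*\.css",
    r".*/.*\.js",
]


def is_crawlable(url, includes=INCLUDES, excludes=EXCLUDES):
    """Return True if no exclude pattern matches and some include pattern does."""
    if any(re.fullmatch(pattern, url) for pattern in excludes):
        return False
    # Assumption: an empty include list means "allow everything".
    return not includes or any(re.fullmatch(pattern, url) for pattern in includes)


def matching_excludes(url, excludes=EXCLUDES):
    """List the exclude patterns responsible for rejecting a URL."""
    return [pattern for pattern in excludes if re.fullmatch(pattern, url)]
```

Passing `https://discuss.96boards.org/latest?no_definitions=true&page=1` to `matching_excludes` shows which rule would drop the pagination links, while a plain topic URL passes through untouched.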

discuss · June 1, 2017, 8:50am

(from github.com/pcolmer)
Hi

In order to avoid getting dynamic pages back from Discourse, I have - per your suggestion - changed the user agent being specified by fess when crawling the site. The user agent now being used is the same one as used by curl. As a result, the content returned by Discourse no longer uses JavaScript and just returns plain HTML with all the content in it.

So when I go:

curl https://discuss.96boards.org/latest

that doesn’t redirect to anything containing a ? in the URL. Instead, I get this:

latest.txt

That contains a number of A references and the fess-crawler log provided two days ago shows that fess is, indeed, retrieving /latest and parsing it because the log shows that fess then crawls the pages mentioned in /latest. HOWEVER, the content from those linked pages is not showing up in the index.

Furthermore, even though I’ve added the sitemap XML file per another earlier suggestion from you, that is not getting fess to crawl every page in that file.
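
To see which links a crawler could extract from the plain HTML that curl returns, the anchor tags can be pulled out with Python’s standard HTML parser. This is a minimal sketch: `extract_links` is a hypothetical helper, and the sample markup in the usage note is made up for illustration rather than copied from latest.txt.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from every <a href="..."> in a document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative hrefs against the page they were found on.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return the absolute link targets found in html, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Feeding it a fragment such as `<a href="/latest?no_definitions=true&amp;page=1">next</a>` with the `/latest` URL as the base yields the absolute pagination URL with the entity decoded to a plain `&`, which is the form a URL filter would see.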

discuss · June 1, 2017, 8:59am

(from marevol (Shinsuke Sugaya) · GitHub)

> the content from those linked pages is not showing up in the index.

Did you check fess-crawler.log?

> Furthermore, even though I’ve added the sitemap XML file per another earlier suggestion from you, that is not getting fess to crawl every page in that file.

Was …/sitemap.xml included in fess-crawler.log?

discuss · June 1, 2017, 9:35am

(from github.com/pcolmer)
Hi

The linked pages and sitemap.xml are showing up in fess-crawler.log as being crawled:

2017-05-30 09:28:10,620 [Crawler-AVwL2uQ_JiGs9_BBEDat-1-3] INFO  Crawling URL: https://discuss.96boards.org/sitemap.xml

2017-05-30 09:27:39,731 [Crawler-AVwL2uQ_JiGs9_BBEDat-1-1] INFO  Crawling URL: https://discuss.96boards.org/t/linux-mipi-dsi-panel-support/196

but, as I’ve explained, none of the pages referenced in sitemap.xml are being crawled and even when fess does crawl one of the topic pages, the content of that page isn’t being indexed so that it can then be searched.
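
When a sitemap is fetched but its entries are not followed, it can help to confirm what the file actually contains. The sketch below parses a sitemap with the standard sitemaps.org namespace and reports whether the root is a `urlset` or a `sitemapindex`, along with every `<loc>` value. The sample XML in the usage note is hypothetical; the real file generated by the Discourse plugin may be structured differently.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_locations(xml_text):
    """Return (root element name, list of <loc> values) for a sitemap document."""
    root = ET.fromstring(xml_text)
    kind = root.tag.replace(SITEMAP_NS, "")
    locations = [
        element.text.strip()
        for element in root.iter(SITEMAP_NS + "loc")
        if element.text
    ]
    return kind, locations


# Hypothetical sample used only to exercise the function.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://discuss.96boards.org/t/linux-mipi-dsi-panel-support/196</loc></url>
</urlset>"""
```

Running `sitemap_locations` on the downloaded file and comparing the returned URLs against the crawler’s include and exclude patterns would show whether the entries are being filtered out or never parsed in the first place.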