Ignore part of HTML

discuss · November 30, 2015, 1:03pm

(from github.com/fhoubie)
Hi, I’ve installed Fess to index a web site. The site is a software documentation site with a table of contents on the left and text on the right. We do not use an IFRAME; we inject the HTML into a div dynamically, based on an ID in the URL. When I add the root URL to the Fess crawler, it crawls the site, but the table of contents is included in every page and gets indexed. So if we search for a word from the table of contents, every page of the site is returned. I ran the same test by crawling the MySQL developer site (https://dev.mysql.com/doc/refman/5.7/en/) and the behavior is different: the table of contents is not parsed or indexed. What is the trick? Do I have to use a specific class name on the elements I want to ignore?

Thanks

Frédéric

discuss · December 1, 2015, 2:10pm

(from github.com/marevol)
No trick. I checked, and the TOC of the MySQL site was indexed as well.
Fess can ignore specific elements if you specify them in a configuration file, but this is not documented in English at the moment…

discuss · December 1, 2015, 3:02pm

(from github.com/fhoubie)
Is it related to the HTMLMapper of Tika?

discuss · December 1, 2015, 9:08pm

(from github.com/marevol)
No. nekohtml is used.

discuss · November 7, 2016, 11:33am

(from github.com/Ozius)
Hi, I have a similar problem.

I am indexing a website that has a common header and footer on every page. When I search for a word that appears in the header, the results are not good.

Can I configure the crawler to ignore the header and footer HTML tags?

Thanks!

discuss · November 7, 2016, 1:18pm

(from github.com/marevol)
To ignore tags by name, modify the following value in fess_config.properties:

crawler.document.html.pruned.tags=noscript,script,style
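For the header/footer case asked about above, a minimal sketch would be to append the tag names to that list. This assumes the site marks up those regions with the HTML5 `header` and `footer` elements; if they are plain `div` elements, pruning by tag name will not help.

```properties
# fess_config.properties
# Assumption: the common header and footer are <header> and <footer> elements.
# The first three values are the ones already shown in this thread.
crawler.document.html.pruned.tags=noscript,script,style,header,footer
```

Pages that were already crawled would presumably need to be crawled again before the change shows up in search results.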

discuss · November 7, 2016, 1:36pm

(from github.com/Ozius)
Thanks @marevol !!

And to ignore a div with a specific id or class? Is it just as easy?

discuss · November 7, 2016, 1:50pm

(from github.com/marevol)
Matching on id/class attributes is not supported at the moment.
I think it’s better to specify the content to index with an XPath expression in fess_config.properties, as below:

crawler.document.html.content.xpath=//BODY
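To narrow this to a single container instead of the whole body, the same property could in principle point at one element. The following is only a sketch: the id `main-content` is hypothetical, the uppercase element name simply follows the `//BODY` form used above, and whether attribute predicates work in this setting is not confirmed in this thread.

```properties
# fess_config.properties
# Assumption: the page text lives in <div id="main-content"> (hypothetical id).
# Element name is uppercase to match the //BODY form shown above.
crawler.document.html.content.xpath=//DIV[@id='main-content']
```

This selects what to index rather than what to drop, so a table of contents or header outside that container would be left out of the indexed content.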