Prune Child Tags

(from github.com/zackhorvath)
Hello!

I’m running into an issue with important_content, which is generated by //*[self::H1 or self::H2 or self::H3] in transformer.xml. Some of our H1 headers have script child elements, and the script contents show up as plain text in the important_content field after crawling.

I’ve added the following configuration to our fess_config.properties:
crawler.document.html.pruned.tags=noscript,script,style,header,footer,nav,h1.span,h1.script
…but I am still seeing the JavaScript!

This is the specific code block on one of our websites that’s causing this issue:

<h1 class="cluster_title">
      <!-- Lockerz Share BEGIN -->
      <span class="share_options" style="display: inline;">
        <a class="a2a_dd" href="https://example.com">
    		<img src="/etc/designs/public/img/buttons/plus_icon_static.png" class="rolloverImage" data-hover="/etc/designs/public/img/buttons/plus_icon_hover.png" width="27" height="26"><span class="share_link_text">Share</span>
    	</a>
      </span>
      <script type="text/javascript">
      <!--
        a2a_config = {};
        a2a_config.onclick=1;
        a2a_config.show_title=0;
        a2a_config.num_services=10;
        /*a2a_config.linkname=document.title;
        a2a_config.linkurl=location.href;*/
        a2a_config.prioritize=["facebook","twitter","linkedin","digg","delicious","myspace","read_it_later","squidoo","technorati_favorites","care2_news"];
        /*a2a_config.color_main="D7E5ED";*/
        /*a2a_config.color_border="AECADB";*/
        a2a_config.color_link_text="123054";
        a2a_config.color_link_text_hover="24548d";
      //-->
      </script>
      <script type="text/javascript" src="//static.addtoany.com/menu/page.js"></script>
      <!-- Lockerz Share END --> 
PAGE TITLE
</h1>

Thank you so much for your time! I haven’t been able to crack this. I feel like h1.span and h1.script should be excluding these elements, but I’m still seeing all the a2a entries in important_content!

(from github.com/marevol)
The title and important_content fields do not apply crawler.document.html.pruned.tags at the moment, so these elements are not removed from them…
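
Until that changes, one possible stopgap is to clean the heading text outside of Fess. The sketch below is not part of Fess and does not use any Fess API; it only illustrates the pruning the h1.span and h1.script entries were meant to express, using Python's standard html.parser. The names HeadingTextExtractor and heading_text, and the choice of which tags to prune, are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumption: headings of interest match the XPath in transformer.xml.
HEADINGS = {"h1", "h2", "h3"}
# Assumption: child tags to drop, mirroring h1.span and h1.script (plus style).
PRUNED = {"script", "style", "span"}


class HeadingTextExtractor(HTMLParser):
    """Collects heading text while skipping text inside pruned child tags."""

    def __init__(self):
        super().__init__()
        self.heading_depth = 0  # > 0 while inside an h1/h2/h3
        self.pruned_depth = 0   # > 0 while inside a pruned tag within a heading
        self.chunks = []        # text pieces of the heading being read
        self.headings = []      # cleaned text of each completed heading

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self.heading_depth += 1
        elif self.heading_depth and tag in PRUNED:
            self.pruned_depth += 1

    def handle_endtag(self, tag):
        if tag in HEADINGS and self.heading_depth:
            self.heading_depth -= 1
            if self.heading_depth == 0:
                # Collapse runs of whitespace left behind by removed children.
                self.headings.append(" ".join("".join(self.chunks).split()))
                self.chunks = []
                self.pruned_depth = 0
        elif tag in PRUNED and self.pruned_depth:
            self.pruned_depth -= 1

    def handle_data(self, data):
        # Keep text only when inside a heading and outside every pruned tag.
        if self.heading_depth and not self.pruned_depth:
            self.chunks.append(data)


def heading_text(html):
    """Return the cleaned text of every h1/h2/h3 in the given HTML string."""
    parser = HeadingTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.headings
```

Run against a heading shaped like the one above (a share span, an inline script, then the title text), heading_text should return only the title text. Comments are ignored by the parser, and nested span elements are handled by the depth counter.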

(from github.com/zackhorvath)
You rock! Thank you!