(from github.com/zackhorvath)
Hello!
I’m running into an issue where important_content is generated by the XPath //*[self::H1 or self::H2 or self::H3] in transformer.xml, but some of our H1 headers contain script child elements, and the contents of those scripts appear as plain text in the important_content field after crawling.
I’ve added the following configuration to our fess_config.properties:
crawler.document.html.pruned.tags=noscript,script,style,header,footer,nav,h1.span,h1.script
…but I am still seeing the JavaScript!
This is the specific code block on one of our websites that’s causing this issue:
<h1 class="cluster_title">
<!-- Lockerz Share BEGIN -->
<span class="share_options" style="display: inline;">
<a class="a2a_dd" href="https://example.com">
<img src="/etc/designs/public/img/buttons/plus_icon_static.png" class="rolloverImage" data-hover="/etc/designs/public/img/buttons/plus_icon_hover.png" width="27" height="26"><span class="share_link_text">Share</span>
</a>
</span>
<script type="text/javascript">
<!--
a2a_config = {};
a2a_config.onclick=1;
a2a_config.show_title=0;
a2a_config.num_services=10;
/*a2a_config.linkname=document.title;
a2a_config.linkurl=location.href;*/
a2a_config.prioritize=["facebook","twitter","linkedin","digg","delicious","myspace","read_it_later","squidoo","technorati_favorites","care2_news"];
/*a2a_config.color_main="D7E5ED";*/
/*a2a_config.color_border="AECADB";*/
a2a_config.color_link_text="123054";
a2a_config.color_link_text_hover="24548d";
//-->
</script>
<script type="text/javascript" src="//static.addtoany.com/menu/page.js"></script>
<!-- Lockerz Share END -->
PAGE TITLE
</h1>
Thank you so much for your time! I haven’t been able to crack this. I feel like h1.span and h1.script should be excluding these elements, but I’m still seeing all the a2a entries in important_content!
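In case it helps with reproducing the symptom outside of Fess, here is a minimal sketch in Python using only the standard-library html.parser. It is not Fess's code and says nothing about how the pruned tags setting is interpreted; it only illustrates that collecting all text under an h1 also picks up the inline script body unless script content is skipped. The names H1TextCollector, h1_text, and SAMPLE (a trimmed copy of the markup above) are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class H1TextCollector(HTMLParser):
    """Collect text found inside <h1>, optionally ignoring <script> bodies."""

    def __init__(self, skip_script):
        super().__init__()
        self.skip_script = skip_script
        self.in_h1 = False
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True
        elif tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False
        elif tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Only text inside the heading is of interest.
        if not self.in_h1:
            return
        # Drop script bodies when asked to, mimicking a pruned <script> tag.
        if self.in_script and self.skip_script:
            return
        text = data.strip()
        if text:
            self.chunks.append(text)


def h1_text(html, skip_script):
    """Return the whitespace-joined text of all <h1> elements in html."""
    parser = H1TextCollector(skip_script)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Trimmed, hypothetical version of the heading from the page in question.
SAMPLE = """
<h1 class="cluster_title">
<span class="share_options"><a class="a2a_dd" href="https://example.com">
<span class="share_link_text">Share</span></a></span>
<script type="text/javascript">
a2a_config = {};
a2a_config.onclick=1;
</script>
PAGE TITLE
</h1>
"""

if __name__ == "__main__":
    print(h1_text(SAMPLE, skip_script=False))  # includes the a2a_config lines
    print(h1_text(SAMPLE, skip_script=True))   # heading text without the script
```

The first call mirrors what I am seeing in important_content; the second is what I would expect once the script elements inside the h1 are pruned.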