Martin Bähr
2014-11-09 16:32:41 UTC
hi,
i am pretty sure i am reinventing the wheel here, but i could not find any
example of how to remove unwanted html tags.
the common suggestion is to use a whitelist, and that's exactly the problem.
Parser.HTML only calls handlers for tags that have been registered, but to use
a whitelist, all the unknown tags need to be removed.
consequently i wrote this:
string clean_html(string input, array(string)|void keeptags)
{
  if (!keeptags)
    keeptags = ({ "p", "li", "b", "i", "ol", "sup", "a", "strong", "em", "img", "blockquote" });

  mapping tags_found = ([ "tags":([]), "containers":([]) ]);

  void collect_tags(object tag, string text)
  {
    if (tag->tag_name()[0]=='/')
      tags_found->containers[tag->tag_name()[1..]] = "";
    else
      tags_found->tags[tag->tag_name()] = "";
  };

  // record the name of every opening and closing tag
  // closing tags indicate containers
  Parser.HTML()->_set_tag_callback(collect_tags)->feed(input)->read();

  // because collect_tags is called for opening and closing tags separately,
  // the opening tags can't be distinguished between tags and containers,
  // but "tags" should only contain tags that are not containers
  tags_found->tags -= tags_found->containers;
  tags_found->tags -= keeptags;
  tags_found->containers -= keeptags;

  // now we have a list of tags and containers that we want to clean out
  // couldn't find a way to reset _set_tag_callback, so use a new parser object instead.
  return Parser.HTML()->add_containers(tags_found->containers)->add_tags(tags_found->tags)->feed(input)->read();

  // TODO: handle containers where we want to keep the content and only
  //       remove the tags themselves.
  //       handle unwanted tag-attributes
}
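
for the first TODO, one possible direction is to do everything in a single
pass and let the tag callback decide per tag. this is only a sketch: the name
strip_unlisted_tags and the dropcontainers parameter are made up for
illustration, and it assumes the callback return values behave as documented
for add_tag (zero leaves the tag as it is, an array replaces it). unlike
clean_html above, it keeps the content of unlisted containers and removes only
the tags themselves. containers named in dropcontainers are still removed
together with their content. attributes are not touched.

```pike
// sketch only: strip_unlisted_tags and dropcontainers are hypothetical names
string strip_unlisted_tags(string input, array(string)|void keeptags,
                           array(string)|void dropcontainers)
{
  if (!keeptags)
    keeptags = ({ "p", "li", "b", "i", "ol", "sup", "a", "strong", "em", "img", "blockquote" });
  if (!dropcontainers)
    dropcontainers = ({ "script", "style" });

  multiset(string) keep = (multiset(string))keeptags;

  // called for every opening and closing tag that has no registered handler
  mixed filter_tag(Parser.HTML p, string data)
  {
    string name = lower_case(p->tag_name());
    if (has_prefix(name, "/"))
      name = name[1..];
    if (keep[name])
      return 0;        // zero: leave the tag untouched
    return ({ "" });   // array: replace the tag itself with nothing
  };

  Parser.HTML parser = Parser.HTML();
  parser->case_insensitive_tag(1);
  // registered containers take precedence over the tag callback
  // and are removed together with their content
  foreach (dropcontainers, string name)
    parser->add_container(name, "");
  parser->_set_tag_callback(filter_tag);
  return parser->finish(input)->read();
}
```

it would be called like clean_html, for example
strip_unlisted_tags(input, ({ "p", "b" })). a name listed in both keeptags and
dropcontainers is dropped, because the registered container wins.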
anyone got a better wheel?
greetings, martin.
--
eKita - the online platform for your entire academic life
--
chief engineer eKita.co
pike programmer pike.lysator.liu.se caudium.net societyserver.org
BLUG secretary beijinglug.org
foresight developer foresightlinux.org realss.com
unix sysadmin
Martin Bähr working in china http://societyserver.org/mbaehr/