cleaning html

Martin Bähr

2014-11-09 16:32:41 UTC

hi,

i am pretty sure i am reinventing the wheel here, but i could not find any
example of how to remove unwanted html tags.

the common suggestion is to use a whitelist, and that's exactly the problem.
Parser.HTML only processes tags it has been told about, but a whitelist needs
the opposite: every tag that is not on the list has to be removed.
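
to make the "only processes known tags" point concrete, here is a minimal sketch. the function name mark_bold and the sample markup are made up for illustration and are not part of the original post. only <b> is registered, so only <b> reaches the handler:

```pike
// hypothetical demo: register a handler for <b> only.
// every tag that is not registered is copied to the output unchanged.
string mark_bold(string input)
{
  return Parser.HTML()
    ->add_container("b",
                    lambda(Parser.HTML p, mapping args, string content)
                    {
                      // returning an array emits its elements without reparsing
                      return ({ "*" + content + "*" });
                    })
    ->finish(input)
    ->read();
}
```

with input like "<i>x</i> <b>y</b>" the <i> element should come out untouched, which is why a whitelist cannot be expressed by registering only the tags to keep.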

consequently i wrote this:

string clean_html(string input, array(string)|void keeptags)
{
  if (!keeptags)
    keeptags = ({ "p", "li", "b", "i", "ol", "sup", "a", "strong", "em",
                  "img", "blockquote" });
  mapping tags_found = ([ "tags":([]), "containers":([]) ]);

  // the values are "" because add_tags()/add_containers() below replace
  // each listed tag or container with the string it is mapped to
  void collect_tags(Parser.HTML parser, string raw_tag)
  {
    string name = parser->tag_name();
    if (name[0]=='/')
      tags_found->containers[name[1..]] = "";
    else
      tags_found->tags[name] = "";
  };

  // record the name of every opening and closing tag
  // closing tags indicate containers
  // finish() is used instead of feed() so that nothing stays buffered

  Parser.HTML()->_set_tag_callback(collect_tags)->finish(input)->read();

  // because collect_tags is called for opening and closing tags separately,
  // opening tags alone can't tell a plain tag from a container,
  // but "tags" should only contain tags that are not containers

  tags_found->tags -= tags_found->containers;
  tags_found->tags -= keeptags;
  tags_found->containers -= keeptags;

  // now we have a list of tags and containers that we want to clean out
  // couldn't find a way to reset _set_tag_callback, so use a new parser
  // object instead.

  return Parser.HTML()
    ->add_containers(tags_found->containers)
    ->add_tags(tags_found->tags)
    ->finish(input)
    ->read();

  // TODO: handle containers where we want to keep the content and only
  // remove the tags themselves.
  // handle unwanted tag-attributes
}
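
the two TODO items could be approached in a single pass with the tag callback alone. the following is only a rough sketch under assumptions: strip_tags, keep_tags and keep_attrs are hypothetical names, and it relies (like the code above) on tag_name() starting with "/" for closing tags. it also assumes that a callback returning an empty array emits nothing and that returning 0 leaves the tag as it is.

```pike
// hypothetical helper, not part of the original post: drop every tag that is
// not in keep_tags but leave the text inside it, and drop every attribute
// that is not in keep_attrs.
string strip_tags(string input, multiset(string) keep_tags,
                  multiset(string) keep_attrs)
{
  mixed filter_tag(Parser.HTML p, string raw_tag)
  {
    string name = lower_case(p->tag_name());
    int closing = has_prefix(name, "/");
    if (closing)
      name = name[1..];

    // empty array: emit nothing for this tag, surrounding text is kept
    if (!keep_tags[name])
      return ({});

    // zero: leave a whitelisted closing tag exactly as it was
    if (closing)
      return 0;

    // rebuild the opening tag from the whitelisted attributes only;
    // indices are sorted because mapping order is not defined
    mapping(string:string) args = p->tag_args();
    string out = "<" + name;
    foreach (sort(indices(args)), string attr)
      if (keep_attrs[lower_case(attr)])
        out += sprintf(" %s=\"%s\"", attr,
                       replace(args[attr], "\"", "&quot;"));
    return ({ out + ">" });
  };

  return Parser.HTML()->_set_tag_callback(filter_tag)->finish(input)->read();
}
```

called as strip_tags(html, (< "p", "a" >), (< "href" >)), this should keep <p> and <a> with only their href attributes and reduce every other element to its text content. unlike clean_html above it never removes content, so the text of something like <script> would survive. it also does not inspect attribute values, so it is not a complete sanitizer.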

anyone got a better wheel?

greetings, martin.
