cleaning html
Martin Bähr
2014-11-09 16:32:41 UTC

i am pretty sure i am reinventing the wheel here, but i could not find any
example of how to remove unwanted html tags.

the common suggestion is to use a whitelist, and that's exactly the problem.
Parser.HTML only processes tags that have been registered with it, but to use
a whitelist, all the unknown tags need to be removed.

consequently i wrote this:

string clean_html(string input, array(string)|void keeptags)
{
    if (!keeptags)
        keeptags = ({ "p", "li", "b", "i", "ol", "sup", "a", "strong", "em", "img", "blockquote" });
    mapping tags_found = ([ "tags":([]), "containers":([]) ]);

    // tag callback: "tag" is the parser object, "text" the raw tag
    void collect_tags(object tag, string text)
    {
        if (tag->tag_name()[0]=='/')
            tags_found->containers[tag->tag_name()[1..]] = "";
        else
            tags_found->tags[tag->tag_name()] = "";
    }

    // record the name of every opening and closing tag
    // closing tags indicate containers
    // (assumed call: the collecting pass is implied by the comments here
    // and by the _set_tag_callback note below)
    Parser.HTML()->_set_tag_callback(collect_tags)->finish(input);

    // because collect_tags is called for opening and closing tags separately,
    // the opening tags can't be distinguished between tags and containers,
    // but "tags" should only contain tags that are not containers
    tags_found->tags -= tags_found->containers;
    tags_found->tags -= keeptags;
    tags_found->containers -= keeptags;

    // now we have a list of tags and containers that we want to clean out
    // couldn't find a way to reset _set_tag_callback, so use a new parser object instead.
    return Parser.HTML()->add_containers(tags_found->containers)->add_tags(tags_found->tags)->feed(input)->read();
}

// TODO: handle containers where we want to keep the content and only
//       remove the tags themselves.
//       handle unwanted tag-attributes
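
to show what the final Parser.HTML call does on its own, here is a minimal sketch of just that removal pass. the input string and the choice of "script" and "br" as unwanted names are made up for illustration; in clean_html those mappings come from the collecting pass. mapping a name to "" relies on Parser.HTML treating a string as the callback's return value, so the element is replaced by nothing.

```pike
int main()
{
    // hypothetical sample input: one unwanted container, one unwanted tag
    string input = "<p>keep <b>this</b></p><script>alert(1)</script><br>";

    // each unwanted name maps to "", which stands in for a callback result:
    // containers are dropped together with their content, tags on their own
    string output = Parser.HTML()
        ->add_containers(([ "script": "" ]))
        ->add_tags(([ "br": "" ]))
        ->finish(input)
        ->read();

    write("%s\n", output);
    return 0;
}
```

tags that are not registered (here "p" and "b") pass through untouched, which is why the whitelist has to be turned into an explicit list of names to remove first.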

anyone got a better wheel?

greetings, martin.