Stripping HTML tags using TagSoup

When converting old blog posts to WordPress, I wanted to strip the attributes and nested highlighting markup from the pre blocks. For example, this:

<pre><code><span style="">&gt;</span> <span style="color: blue; font-weight: bold;">import</span> <span style="">Data</span><span style="">.</span><span style="">Char</span>

would turn into:

<pre>> import Data.Char

It turns out that this is really easy using TagSoup.

module Detag where

import Control.Monad
import Text.HTML.TagSoup

The function to strip tags works on a list of tags whose names and text are plain strings:

strip :: [Tag String] -> [Tag String]

strip [] = []
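To make the input of strip concrete, here is a small sketch of what parseTags produces. The sample string and the names sample and sampleTags are made up for illustration and are not part of Detag.

```haskell
import Text.HTML.TagSoup

-- Hypothetical input, similar in shape to the highlighted blog markup.
sample :: String
sample = "<pre class=\"x\"><span style=\"\">import</span></pre>"

-- parseTags flattens the markup into a list of open, text and close tags.
sampleTags :: [Tag String]
sampleTags = parseTags sample
-- [ TagOpen "pre" [("class","x")]
-- , TagOpen "span" [("style","")]
-- , TagText "import"
-- , TagClose "span"
-- , TagClose "pre" ]
```

Because the result is an ordinary list, strip can be written with plain pattern matching on the head of that list.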

If we hit an opening pre tag, discard its attributes (matched by the underscore) and continue recursively:

strip (TagOpen "pre" _ : rest) = TagOpen "pre" [] : strip rest

Drop code tags entirely, both opening and closing:

strip (TagOpen  "code" _ : rest) = strip rest
strip (TagClose "code"   : rest) = strip rest

If we hit a span, followed by some text, and a closing span, then keep the text tag and continue:

strip (TagOpen "span" _ : TagText t : TagClose "span" : rest)
  = TagText t : strip rest
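Note that this pattern only matches a span wrapping exactly one text node. An empty span, or a span containing other tags, falls through to the catch-all case and is left in place. If that matters for your input, one possible alternative (a sketch, not what Detag does; dropSpans is a hypothetical name) is to filter out every span tag and keep whatever sits between them:

```haskell
import Text.HTML.TagSoup

-- Hypothetical variant: remove all opening and closing span tags,
-- keeping any text or other tags that were nested inside them.
dropSpans :: [Tag String] -> [Tag String]
dropSpans = filter (not . isSpan)
  where
    isSpan (TagOpen "span" _) = True
    isSpan (TagClose "span")  = True
    isSpan _                  = False
```

The trade-off is that this removes spans everywhere in the document, not only the simple highlighting spans the pattern above targets.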

Don’t change other tags:

strip (t:ts) = t : strip ts

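Parsing input from stdin is straightforward. When rendering, we set optEscape to id so that text is written back without HTML escaping, and optRawTag to const True so that the contents of every tag are emitted raw. Together these avoid mangling other HTML in the input.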

main :: IO ()
main = do
    s <- getContents
    let tags = parseTags s
        ropts = renderOptions{optEscape = id, optRawTag = const True}
    putStrLn $ renderTagsOptions ropts $ strip tags
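To see what the render options change, the following sketch renders the same tag list twice, once with the defaults and once with the options used in main. The names demoTags, escaped and raw are made up for illustration.

```haskell
import Text.HTML.TagSoup

-- Same options as in main above.
ropts :: RenderOptions String
ropts = renderOptions { optEscape = id, optRawTag = const True }

-- Hypothetical tag list containing characters that HTML escaping touches.
demoTags :: [Tag String]
demoTags = [TagOpen "pre" [], TagText "> a && b"]

-- Default rendering escapes the text.
escaped :: String
escaped = renderTags demoTags
-- "<pre>&gt; a &amp;&amp; b"

-- With the options from main, the text is written back untouched.
raw :: String
raw = renderTagsOptions ropts demoTags
-- "<pre>> a && b"
```

This is why the literate-Haskell prompt comes out as a plain > in the example run rather than as an entity.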

Example run (the first line after the command is the input given on stdin, the second is the output):

$ runhaskell Detag.hs 
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span style="">&gt;</span> <span style="color: green;">{-# LANGUAGE RankNTypes          #-}</span>
<pre>> {-# LANGUAGE RankNTypes          #-}