One thing we don’t like about online HTML editors is that they produce some pretty lousy-looking HTML. To deal with this, we’ve created HTML “scrubbers” to rewrite the HTML coming from these widgets.
The first thing we always do is call TIDY to normalize the HTML. We then run a list of regular expressions to remove things we don’t like, such as class names and ids, as well as things like trailing empty paragraphs at the end of documents.
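As a rough illustration of the regular-expression pass, here is a minimal Python sketch. The rule list, the `SCRUB_RULES` name, and the `scrub` function are all hypothetical, not the actual scrubber code, and the sketch assumes it receives an HTML fragment (not a full document) that has already been normalized by TIDY.

```python
import re

# Hypothetical scrub rules, applied in order: (compiled pattern, replacement).
SCRUB_RULES = [
    # Drop class and id attributes (double- or single-quoted values).
    (re.compile(r'''\s+(?:class|id)\s*=\s*(?:"[^"]*"|'[^']*')''', re.IGNORECASE), ""),
    # Drop empty paragraphs trailing at the end of the fragment.
    (re.compile(r"(?:\s*<p>(?:\s|&nbsp;|<br\s*/?>)*</p>)+\s*$", re.IGNORECASE), ""),
]

def scrub(html):
    """Run every scrub rule over the HTML fragment, in order."""
    for pattern, replacement in SCRUB_RULES:
        html = pattern.sub(replacement, html)
    return html
```

Keeping the rules in a list makes it easy to add or reorder them. Note that the attribute rule is deliberately naive: it would also strip text that merely looks like an attribute outside of a tag.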
We just added another scrubber to convert double BRs within P paragraphs into paragraph splits. This makes the HTML more semantic: it says what it means, not what it looks like.
This can’t be done with just a regular expression, of course. Here’s our algorithm:
- find all <p>…</p> paragraph blocks, always looking for the shortest matches
- reverse this list, so that we can rewrite the document without having to worry about adjusting search indices
- look at each match: if it contains anything non-simple, leave it alone. Theoretically, since we’re coming out of TIDY, this should be well formed and only contain markup like B, STRONG, ABBR, etc., but I never take chances
- if the match is simple, convert all BR BR sequences to “</p><p>”
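The steps above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the actual scrubber: the `SIMPLE_TAGS` whitelist, the function names, and the decision to treat whitespace between the two BRs as part of the sequence are all guesses at what “simple” and “BR BR” mean here.

```python
import re

# Shortest-match <p>...</p> blocks; group 1 is the paragraph's inner HTML.
P_BLOCK = re.compile(r"<p(?:\s[^>]*)?>(.*?)</p>", re.IGNORECASE | re.DOTALL)
# Any opening or closing tag; group 1 is the element name.
TAG = re.compile(r"</?\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")
# Two BRs in a row, optionally separated by whitespace.
DOUBLE_BR = re.compile(r"<br\s*/?>\s*<br\s*/?>", re.IGNORECASE)
# Hypothetical whitelist of "simple" inline elements.
SIMPLE_TAGS = {"b", "strong", "i", "em", "abbr", "a", "br", "span", "code"}

def is_simple(inner):
    """True if every tag inside the paragraph is on the whitelist."""
    return all(m.group(1).lower() in SIMPLE_TAGS for m in TAG.finditer(inner))

def split_double_brs(html):
    """Turn BR BR sequences inside simple paragraphs into paragraph splits."""
    matches = list(P_BLOCK.finditer(html))
    # Work from the last match to the first so earlier offsets stay valid.
    for m in reversed(matches):
        inner = m.group(1)
        if not is_simple(inner):
            continue  # anything unexpected: leave the paragraph alone
        start, end = m.span(1)
        html = html[:start] + DOUBLE_BR.sub("</p><p>", inner) + html[end:]
    return html
```

For example, `split_double_brs("<p>one<br><br>two</p>")` gives `<p>one</p><p>two</p>`, while a paragraph containing a table or any other non-whitelisted tag comes back unchanged. One side effect of this sketch is that the new paragraphs do not inherit attributes from the original opening P tag.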
With the BlogMatrix Platform editor, every time you save a post, the platform scrubs it and sends the cleaned-up HTML back to the editor.




