BlogMatrix
 

HTML scrubbing with TinyMCE

David P. Janes 2008-03-10 08:56 UTC

One thing we don’t like about HTML online editors is that they make some pretty lousy looking HTML pages. To deal with this, we’ve created HTML “scrubbers” to rewrite HTML coming from these widgets.

The first thing we always do is call TIDY to normalize the HTML. We then run a list of regular expressions to remove things we don’t like, such as class names, ids, etc., as well as trailing empty paragraphs at the end of documents.

We just added another scrubber to convert double BRs within P paragraphs into paragraph splits – this makes the HTML more semantic, that is, it makes the markup say what it means, not what it looks like.

This can’t be done with just a regular expression, of course. Here’s our algorithm:

  • find all <p>…</p> paragraph blocks, always looking for the shortest matches
  • reverse this list, so that we can rewrite the document without having to worry about adjusting search indices
  • look at each match: if it contains anything non-simple, leave it alone. Theoretically, since we’re coming out of TIDY, this should be well formed and only contain markup like B, STRONG, ABBR, etc., but I never take chances
  • if the match is simple, convert all BR BR sequences to “</p><p>”
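
The steps above could be sketched in JavaScript roughly as follows. This is only an illustration, not our production scrubber: the function names, the list of "simple" tags, and the exact regular expressions are all assumptions, and it expects input that has already been through TIDY.

```javascript
// Inline tags treated as "simple". This whitelist is an assumption.
var SIMPLE_TAGS = ["b", "strong", "i", "em", "abbr", "a", "br", "span", "sub", "sup"];

// True if every tag inside the paragraph body is on the whitelist.
function isSimple(inner) {
    var tagRe = /<\/?([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>/g;
    var m;
    while ((m = tagRe.exec(inner)) !== null) {
        if (SIMPLE_TAGS.indexOf(m[1].toLowerCase()) === -1) {
            return false;
        }
    }
    return true;
}

function splitDoubleBreaks(html) {
    // Non-greedy body, so each match is the shortest <p>...</p> block.
    var paraRe = /<p(\s[^>]*)?>([\s\S]*?)<\/p>/gi;
    var matches = [];
    var m;
    while ((m = paraRe.exec(html)) !== null) {
        matches.push({
            start: m.index,
            end: m.index + m[0].length,
            attrs: m[1] || "",
            inner: m[2]
        });
    }
    // Rewrite from the last match backwards so earlier indices stay valid.
    matches.reverse();
    for (var i = 0; i < matches.length; i++) {
        var p = matches[i];
        if (!isSimple(p.inner)) {
            continue; // anything non-simple is left alone
        }
        var inner = p.inner.replace(/<br\s*\/?>\s*<br\s*\/?>/gi, "</p><p>");
        html = html.slice(0, p.start) + "<p" + p.attrs + ">" + inner + "</p>" + html.slice(p.end);
    }
    return html;
}
```

For example, splitDoubleBreaks("<p>one<br /><br />two</p>") gives "<p>one</p><p>two</p>", while a paragraph containing an IMG is returned unchanged.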

With the BlogMatrix Platform editor, every time you save a post it scrubs it and sends it back to the editor.

Rich text editing with TinyMCE

David P. Janes 2008-03-09 11:47 UTC

Well, I couldn’t believe how easy it was to make our new editor use TinyMCE – I just downloaded the new version (3.0.4.1 – http://tinymce.moxiecode.com/download.php), hooked it up to our code and it ran out of the box.

Since you may not have done this yourself, I’ll just run you through how we use TinyMCE:

  • make a TEXTAREA that you plan to work with; there are some complications if you want or have multiple TEXTAREAs but this is not an issue for us
  • include TinyMCE: <script type="text/javascript" src="=/jscripts/tiny_mce/tiny_mce.js"></script>
  • call the initializer function: tinyMCE.init(initd)

That’s it, you have an editor. “initd” is a dictionary that describes how to set up TinyMCE. This is our setup:

var initd = {
    onchange_callback : "tinymce_onchange_callback",
    theme_advanced_buttons1 : "bullist,numlist,outdent,indent,separator,justifyleft,justifycenter,separator,link,unlink,image,separator,bold,italic,strikethrough,separator,sub,sup,forecolor,backcolor,separator,code",
    theme_advanced_buttons2 : "",
    theme_advanced_buttons3 : "",
    dialog_type : "modal",
    theme_advanced_resize_horizontal : false,
    entity_encoding : "numeric",
    force_p_newlines : true,
    force_br_newlines : false,
    convert_newlines_to_brs : false,
    relative_urls : false,
    remove_script_host : false,
    verify_html : false,
    auto_reset_designmode : true,
    remove_linebreaks : false,
    theme_advanced_resizing : true,
    mode : "textareas",
    theme : "advanced",
    theme_advanced_toolbar_location : "top",
    theme_advanced_toolbar_align : "left",
    theme_advanced_path_location : "bottom",
    plugins : "inlinepopups",
    content_css : "/:root/include/common/tinymce.css"
};

You’ll have to modify the location of the CSS file and the callback (so we know whether the document has been edited!) to something you prefer to use, but you get the idea.
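
As a rough idea of what the callback named in onchange_callback might look like, here is a minimal sketch. As I understand the TinyMCE 3.x documentation, the callback is handed the editor instance that changed; the documentDirty flag and the markSaved helper are hypothetical names made up for this example.

```javascript
// Hypothetical flag recording whether the document has unsaved edits.
var documentDirty = false;

// Named by onchange_callback in initd. TinyMCE passes the editor
// instance that changed; this sketch ignores it and just sets the flag.
function tinymce_onchange_callback(inst) {
    documentDirty = true;
}

// Hypothetical helper: call this after a successful save.
function markSaved() {
    documentDirty = false;
}
```

A page could then check documentDirty before navigating away and warn about unsaved changes.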

Also note that you may have to do some magic to move the data between the TinyMCE window and the TEXTAREA:

  • To move the data into the TEXTAREA, do: tinyMCE.triggerSave()
  • To go the other way: in version 2.x, tinyMCE.updateContent(idName), where idName is the DOM ID of the TEXTAREA; in version 3.x, use tinyMCE.activeEditor.load() instead
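
The two directions could be wrapped up along these lines for version 3.x. The helper names are hypothetical, and the tinyMCE object is passed in as a parameter only to keep the sketch self-contained; in a page you would pass the global tinyMCE.

```javascript
// Copy the editor content into the TEXTAREA, e.g. just before submitting.
function editorToTextarea(mce) {
    mce.triggerSave();
}

// Put new HTML into the TEXTAREA, then have the active editor reload it.
// Assumes TinyMCE 3.x, where the active editor has a load() method.
function setEditorHtml(mce, textarea, html) {
    textarea.value = html;
    if (mce.activeEditor) {
        mce.activeEditor.load();
    }
}
```

Typical use would be editorToTextarea(tinyMCE) before reading the TEXTAREA's value, and setEditorHtml(tinyMCE, document.getElementById(idName), html) when the server sends back new content.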

Version 2 of TinyMCE misbehaved if you tried to create an editor in a hidden DIV (i.e. with display: none); I’m not sure whether this issue is gone, so try to avoid doing it.

 
 

The need for speed; and the solution

David P. Janes 2006-08-07 20:44 UTC

I've got page loading time on this site -- for constructed pages [1] -- down to about 1 second. Most of that second comes from network and rendering delays, which I'll have to sort out later (locally I can curl the page in 0.065 seconds!). As previously documented, I've already made a number of improvements.

After a lot of mulling today, I've made another big improvement. Formerly, we loaded information about the user's session from a URI called '/:admin/status/'. This returned three pieces of critical information: the IHOST, the USERID, and the HOME. The IHOST is the installation host (semantic.blogmatrix.com), the USERID is the user you are logged in as (or the empty string), and HOME is set only if you serve your pages from a different URI than the default [2].

This caused rendering to pause for 0.75 to 1.5 seconds, depending on how well the network was responding. Effectively, this made the site feel really sluggish.

We now do the following: IHOST is just built into the templates; USERID and HOME are loaded into cookies when the user is logged in. When these values are needed, instead of taking them out of JavaScript variables, we call functions that pull them out of the cookies.
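
The cookie-reading functions might look something like this. It is a sketch under assumptions: the cookie names "USERID" and "HOME" are guesses based on the variable names above, and the function names are made up. The cookie string is passed in as a parameter; in the browser you would pass document.cookie.

```javascript
// Return the value of the named cookie from a cookie string,
// or the empty string if it is not present.
function readCookie(name, cookieString) {
    var parts = cookieString ? cookieString.split(";") : [];
    for (var i = 0; i < parts.length; i++) {
        var part = parts[i].replace(/^\s+/, ""); // drop the space after ";"
        var eq = part.indexOf("=");
        if (eq === -1) {
            continue;
        }
        if (part.slice(0, eq) === name) {
            return decodeURIComponent(part.slice(eq + 1));
        }
    }
    return "";
}

// Assumed cookie names; not logged in means the empty string.
function getUserId(cookieString) {
    return readCookie("USERID", cookieString);
}

function getHome(cookieString) {
    return readCookie("HOME", cookieString);
}
```

Because the values come straight from the browser's cookie store, nothing has to be fetched before the page can finish rendering.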

Instant speed. 

1. We don't work under the same model as TypePad or Blogger. We only put a page together when we don't have it in cache. This could take several more seconds. Once a page is constructed, we'll always serve it from cache until the cache is invalidated (say, by a new post or comment being added).

2. For example, I serve my personal blog as http://blog.davidjanes.com even though deep down it's really http://davidjanes.semantic.blogmatrix.com!