BlogMatrix
 

HTML scrubbing with TinyMCE

David P. Janes 2008-03-10 08:56 UTC

One thing we don’t like about online HTML editors is that they produce some pretty lousy-looking HTML. To deal with this, we’ve created HTML “scrubbers” to rewrite the HTML coming from these widgets.

The first thing we always do is call TIDY to normalize the HTML. We then run a list of regular expressions to remove things we don’t like, such as class names and ids, as well as trailing empty paragraphs at the end of documents.

We just added another scrubber to convert double BRs within P paragraphs into paragraph splits. This makes the HTML more semantic: it says what it means, not what it looks like.

This can’t be done with just a regular expression, of course. Here’s our algorithm:

  • find all <p>…</p> paragraph blocks, always looking for the shortest matches
  • reverse this list, so that we can rewrite the document without having to worry about adjusting search indices
  • look at each match: if it contains anything non-simple, leave it alone. Theoretically, since we’re coming out of TIDY, this should be well formed and only contain markup like B, STRONG, ABBR, etc., but I never take chances
  • if the match is simple, convert all BR BR sequences to “</p><p>”
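The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the BlogMatrix scrubber itself: the function names and the whitelist of "simple" tags are assumptions, and it only looks at attribute-free <p> tags (which is what we expect after the earlier scrubbing passes).

```python
import re

# Shortest-match <p>...</p> blocks (non-greedy), allowed to span newlines.
P_BLOCK = re.compile(r"<p>(.*?)</p>", re.IGNORECASE | re.DOTALL)
# Two or more consecutive <br> tags, optionally separated by whitespace.
DOUBLE_BR = re.compile(r"(?:<br\s*/?>\s*){2,}", re.IGNORECASE)
# Any tag at all; used to inspect what a paragraph contains.
ANY_TAG = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>")
# Assumed whitelist of "simple" inline tags; the real list is not given here.
SIMPLE_TAGS = {"b", "strong", "i", "em", "abbr", "a", "br"}


def is_simple(inner):
    """True if every tag inside the paragraph is on the whitelist."""
    return all(m.group(1).lower() in SIMPLE_TAGS for m in ANY_TAG.finditer(inner))


def split_double_brs(html):
    """Turn BR BR runs inside simple paragraphs into paragraph splits."""
    matches = list(P_BLOCK.finditer(html))
    # Work from the end so earlier match offsets stay valid while we splice.
    for m in reversed(matches):
        inner = m.group(1)
        if not is_simple(inner):
            continue  # leave anything unexpected alone
        new_inner = DOUBLE_BR.sub("</p><p>", inner)
        html = html[:m.start(1)] + new_inner + html[m.end(1):]
    return html
```

For example, split_double_brs("<p>one<br><br>two</p>") gives "<p>one</p><p>two</p>", while a paragraph containing a tag outside the whitelist is returned untouched.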

With the BlogMatrix Platform editor, every time you save a post it scrubs it and sends it back to the editor.

Internals: pinging, "stealth saves"

David P. Janes 2006-09-03 23:38 UTC

We've changed bm_disk_entry.DiskEntry.Save to add a new parameter: "stealth".

  • by default, stealth = False
  • if stealth = True when saving:
    • no monitors are notified of changes
    • the mtime and atime of the "index.pyd" files are preserved
    • timestamps are not updated
  • the purpose is to allow writing scratch-pad or administrative information without triggering unnecessary workflow. For example:
    • when recording that a "ping" was sent
    • when recording that an email was sent
  • to support pings, we've added two more functions to DiskEntry
    • SetPing(ping_service, data)
    • GetPing(ping_service)
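As a rough illustration of the stealth idea (not the actual DiskEntry.Save code), here is a sketch that writes a file, restores its previous atime and mtime, and skips monitor notification when stealth is set. The function name save_entry and the monitors argument are hypothetical.

```python
import os


def save_entry(path, text, monitors=(), stealth=False):
    """Write text to path; a stealth save keeps timestamps and stays silent."""
    # Capture the timestamps before the write so they can be restored.
    before = os.stat(path) if (stealth and os.path.exists(path)) else None
    with open(path, "w") as handle:
        handle.write(text)
    if stealth:
        if before is not None:
            # Put back the atime and mtime the file had before the write.
            os.utime(path, ns=(before.st_atime_ns, before.st_mtime_ns))
        return
    # A normal save tells every monitor that the entry changed.
    for notify in monitors:
        notify(path)
```

A caller recording that a ping was sent would use save_entry(path, text, stealth=True), so nothing downstream sees the entry as modified.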

Internals: bash 3.1 chokes mod_python 3.2.8

David P. Janes 2006-08-25 20:05 UTC

I was rebuilding the BlogMatrix Platform on a Fedora Core 5 box this morning when I noticed mod_python wasn't being properly set up. The issue turns out to be that mod_python fails to build:

./configure: line 3427: syntax error near unexpected token `('
./configure: line 3427: `  as_lineno_3=`(expr $as_lineno_1 + 1) 2>/dev/null`'
+ exit 1

Through the magic of Google, we quickly found a fix:

A bug in bash 3.1 causes configure to fail. This has been reported on recent versions of Gentoo and discussed on the mod_python mailing list:
http://bugs.gentoo.org/show_bug.cgi?id=118948
http://www.modpython.org/pipermail/mod_python/2006-January/019965.html
http://www.modpython.org/pipermail/mod_python/2006-January/019969.html

According to the gentoo bug report, the problem in configure.in is the double backslash escape sequence in the line:
MP_VERSION=`echo $MP_VERSION | sed s/\\"//g`

Changing this to:
MP_VERSION=`echo $MP_VERSION | sed s/\"//g`
fixes it for bash 3.1.

I wonder why mod_python is using \\" since the gentoo fix seems to work ok with bash 3.0 (and GNU sed) just as well. Is it there to support other shells, other sed versions, older bash versions... ??

I suggest mod_python adopts the gentoo fix, or avoids the problem altogether by using tr. eg.

MP_VERSION=`echo $MP_VERSION | tr -d '"'` 

"Upcoming Events" Gadget

David P. Janes 2006-08-22 10:32 UTC

An important new feature that will be in the official V10 release is "gadgets" (following the Microsoft terminology), which are basically the little boxes on the sidebar, selectable by the user.

On the sidebar you can see an "Upcoming Events" Gadget. This one is quite simple: it lists entries that have "Event" extensions added, in date order. From a code perspective, it's quite simple to implement too, since everything is retrievable and sortable using tag queries:

self.events = list(bm_disk_entry.GeneratorDB(
    folker,
    userid = self.page.page_userid,
    tags = [
        ":bundle:event",
        ":is:published",
        "|:event:when:|sort",
        "|:event:when:|ge|%04d-%02d-%02d" % tuple(time.gmtime()[:3]),
    ],
))
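To make the last tag in that query concrete, here is a small sketch of how the date bound is built. The helper name event_from_tag is hypothetical; the format string is the one used above. The zero-padded YYYY-MM-DD form matters if the "ge" comparison is done on strings (an assumption about the tag engine), because such strings sort in date order.

```python
import time


def event_from_tag(now=None):
    """Build the ':event:when' lower-bound tag for a UTC time (default: now)."""
    year_month_day = time.gmtime(now)[:3]  # (year, month, day) in UTC
    return "|:event:when:|ge|%04d-%02d-%02d" % tuple(year_month_day)


# With the Unix epoch as input the tag is "|:event:when:|ge|1970-01-01".
print(event_from_tag(0))
```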

This is also hCalendar enabled, though some of the little details haven't been worked out yet.

The need for speed; and the solution

David P. Janes 2006-08-07 20:44 UTC

I've got page loading time on this site -- for constructed pages [1] -- down to about 1 second. Most of that second comes from network and rendering delays, which I'll have to sort out later (locally I can curl the page in 0.065 seconds!). As previously documented, I've already done the following:

After a lot of mulling today, I've made another big improvement. Formerly, we loaded information about the user's session from a URI called '/:admin/status/'. This returned three pieces of critical information: the IHOST, the USERID, and the HOME. The IHOST is the installation host (semantic.blogmatrix.com), the USERID is the user you are logged in as (or the empty string), and HOME is set only if you serve your pages from a different URI than the default [2].

This caused rendering to pause for 0.75 to 1.5 seconds, depending on how well the network was responding. Effectively, this made the site feel really sluggish.

We now do the following: IHOST is just built into the templates; USERID and HOME are loaded into Cookies when the user is logged in. When these values are needed, instead of taking them out of Javascript variables, we call functions that pull them out of Cookies.

Instant speed. 
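The cookie approach can be sketched with Python's standard http.cookies module. On the site the reading side is done by JavaScript functions in the browser; Python is used here only to show the shape of the data. The function names are hypothetical, and the cookie names simply reuse USERID and HOME from the description above.

```python
from http.cookies import SimpleCookie


def session_cookie_headers(userid, home=""):
    """Build Set-Cookie values carrying USERID and, if set, HOME."""
    cookie = SimpleCookie()
    cookie["USERID"] = userid
    cookie["USERID"]["path"] = "/"
    if home:
        cookie["HOME"] = home
        cookie["HOME"]["path"] = "/"
    return [morsel.OutputString() for morsel in cookie.values()]


def read_session(cookie_header):
    """Pull (userid, home) out of a Cookie header; empty strings if absent."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    userid = cookie["USERID"].value if "USERID" in cookie else ""
    home = cookie["HOME"].value if "HOME" in cookie else ""
    return userid, home
```

The point of the design is that no extra round trip is needed: the values travel with every request and are available as soon as the page starts rendering.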

1. We don't work under the same model as TypePad or Blogger. We only put a page together when we don't have it in cache. This could take several more seconds. Once a page is constructed, we'll always serve it from cache until the cache is invalidated (say, by a new post or comment being added).

2. For example, I serve my personal blog as http://blog.davidjanes.com even though deep down it's really http://davidjanes.semantic.blogmatrix.com!

Using Amazon S3 to serve static files

David P. Janes 2006-08-06 12:11 UTC

The V10 look and feel (i.e. what you're seeing here) uses a substantial number of GIF files to achieve the candy-like "web 2.0" look. Additionally, since we're using a fair number of JavaScript include libraries (MochiKit, TinyMCE, Yahoo UI), what we end up with is a lot of trips to our server the first time a user sees a page.

This means unnecessarily slow page response times. What I'd like to do is speed this up a little (or maybe a lot) by offloading the serving of these mostly static files. Right now I'm experimenting with Amazon S3 and I'll document what I'm doing.

I'm not breaking any ground here: Adrian Holovaty did this first to offload pages from Chicago Crime and has documented his experiences. I'm just going to expand and annotate what he wrote (the blockquoted italics text is his):

It was easy to get this working; took less than an hour total. Here's what I did:

First, I signed up for an Amazon S3 account. Do that by clicking "Sign Up For Web Service" on the main S3 page. When you sign up, you get two codes: an access key ID and secret access key.

You'll need to provide a credit card to pay for your (as-you-use-it) Amazon S3 services. You have to click on a provided link to get the keys. There's an X.509 certificate (rather than secret key) way of accessing your S3 account, but it only works with SOAP and I'd rather stick a fork in my eye and wiggle it around first. Moving right along...

Next, I created an S3 "bucket" for my chicagocrime.org media files. An account can have multiple buckets. As far as I can tell, it's just a way of keeping your S3 stuff in separate containers. I did this by using the free S3 Python bindings. Just download the file, unzip it and put the file S3.py somewhere on your Python path. To create a bucket named 'mybucketname', do this:

import S3
conn = S3.AWSAuthConnection('your access key', 'your secret key')
conn.create_bucket('mybucketname')

I found it easier just to distutil S3.py into my standard Python library:

from distutils.core import setup
setup(
        name='S3',
        version='20060805',
        py_modules=['S3'],
)

I created a bucket called 'semantic.blogmatrix.com'.

Next, I wrote a Python script that uploaded my media files to this bucket and made them publicly readable. S3 has a bunch of complex authentication stuff, but all I wanted to do was use S3, essentially, as a Web hosting service. Here's the script I used, and here's how to use it:

cd /directory/with/media/files/
find | python /path/to/update_s3.py

The script is kind of cool because it uses Python's mimetypes to determine the content type of each file in order to pass that to the S3 API. Otherwise it's pretty straightforward.

I've written my own little program (attached) to do this, which takes care of all the path searching, etc. I'll probably modify it some more to track what it has uploaded so we don't upload files multiple times. Here's the help:

blogmatrix.v10@s002. python S3Uploader.py --help
usage: S3Uploader.py [options]

options:
  -h, --help            show this help message and exit
  --debug              
  --bucket=BUCKET       Amazon S3 Bucket
  --access-key=ACCESS_KEY
                        Amazon S3 Access Key (required)
  --secret-key=SECRET_KEY
                        Amazon S3 Secret Access Key (command line prompt if
                        missing)
  --root=ROOT           All directories are made relative to this (optional)
                        root
  --directory=DIRECTORIES
                        Upload files from this directory (default: .)
  --extension=EXTENSIONS
                        Upload files matching this extension (default: all
                        files)

For example:

python S3Uploader.py \
--bucket semantic.blogmatrix.com \
--access-key 0ZB0XFMV5NE1KM15DKR2 \
--extension gif,jpg,png,css,js \
--directory v10/media \
--root ~/htdocs
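To give a feel for what such an uploader has to work out before it talks to S3, here is a sketch of the file-selection step only. It is not the attached S3Uploader.py: the function name plan_uploads is hypothetical, and the arguments merely mirror the --directory, --root and --extension options above. The content type comes from Python's mimetypes module, as in Adrian's script.

```python
import mimetypes
import os


def plan_uploads(directory, root=None, extensions=None):
    """Yield (key, path, content_type) for each file that would be uploaded."""
    root = root or directory
    for dirpath, _dirnames, filenames in os.walk(directory):
        for name in sorted(filenames):
            ext = os.path.splitext(name)[1].lstrip(".").lower()
            if extensions and ext not in extensions:
                continue  # not one of the requested extensions
            path = os.path.join(dirpath, name)
            # Keys are relative to root and always use forward slashes.
            key = os.path.relpath(path, root).replace(os.sep, "/")
            # Fall back to a generic type when the extension is unknown.
            content_type = mimetypes.guess_type(path)[0] or "application/octet-stream"
            yield key, path, content_type
```

Each tuple carries what a put request needs: the key within the bucket, the local file to read, and the Content-Type to send.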

Finally, it's a matter of plugging in the changed files. Adrian does it like this:

Finally, it was just a matter of changing my chicagocrime.org templates to point to S3's URLs rather than my own URLs. That was a snap, thanks to Django's template inheritance and includes.

We do it with Apache rules:

RewriteRule ^/:root/(silk_icons/.*\.png)$            http://s3.amazonaws.com/semantic.blogmatrix.com/$1  [R,L]
RewriteRule ^/:root/(v10/media/.*)$                  http://s3.amazonaws.com/semantic.blogmatrix.com/$1  [R,L]
RewriteRule ^/:root/((MochiKit|tinymce|yui)/.*)$     http://s3.amazonaws.com/semantic.blogmatrix.com/$1  [R,L]
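A quick way to check which request paths these rules catch is to try the same patterns in Python. This is only a test harness sketch: the function name s3_redirect is hypothetical, and the dot before "png" is escaped here so that it matches a literal dot.

```python
import re

S3_BASE = "http://s3.amazonaws.com/semantic.blogmatrix.com/"

# The three patterns from the rewrite rules, in the same order.
RULES = [
    re.compile(r"^/:root/(silk_icons/.*\.png)$"),
    re.compile(r"^/:root/(v10/media/.*)$"),
    re.compile(r"^/:root/((MochiKit|tinymce|yui)/.*)$"),
]


def s3_redirect(path):
    """Return the S3 URL a path would be redirected to, or None."""
    for rule in RULES:
        match = rule.match(path)
        if match:
            return S3_BASE + match.group(1)  # $1 in the Apache rule
    return None
```

Anything that returns None stays on our own server.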

 

You're seeing the result here. All the background graphics and external JavaScript libraries are coming from Amazon S3.


Internals: tracking all actions against a page view

David P. Janes 2006-08-02 12:40 UTC

One of the nice things -- that took me a while to realize -- is that you're working in a single-threaded environment in mod_python. It gets reused over and over and over, but while you've got it, you've got it.

I've added an extension to bm_log to allow tracking of all logging related to creating a page. You use it as follows:

  • call Log.PageStart
  • do stuff
  • call Log.PageEnd

In fact, you don't even need to do this, since PageStart and PageEnd are always called in bm_page in the right places, so all you're responsible for is the "do stuff" part.

Now what does this do? Simple:

  • PageStart assigns a unique (random) id to the logging function and records the start time
  • all subsequent calls to Log print the random id and the delta from the start time
  • PageEnd clears the unique id
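The behaviour described in that list can be sketched as follows. This is a hypothetical stand-in, not the real bm_log module: the class name PageLog, the output format and the use of uuid for the random id are all assumptions. It relies on the single-threaded point made above, since it keeps one page id at a time.

```python
import time
import uuid


class PageLog:
    """Prefix log lines with a per-page random id and elapsed time."""

    def __init__(self, write=print):
        self.write = write      # where finished log lines go
        self.page_id = None     # set only between PageStart and PageEnd
        self.started = None

    def PageStart(self):
        self.page_id = uuid.uuid4().hex[:8]
        self.started = time.monotonic()

    def PageEnd(self):
        self.page_id = None
        self.started = None

    def __call__(self, message):
        if self.page_id is None:
            self.write(message)  # outside a page: plain logging
        else:
            delta = time.monotonic() - self.started
            self.write("[%s +%.3fs] %s" % (self.page_id, delta, message))
```

Grepping the log for one id then gives every action taken for a single page view, with timings relative to the start of that page.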
 

Internals: Using mod_rewrite to let BlogMatrix serve hostnames

David P. Janes 2006-07-23 11:13 UTC

One thing I'm planning to do in the very near future is to move my weblog AND my del.icio.us links all over to (semantic.)blogmatrix.com. This will give me a weblog with 5000+ entries and lots of comments to test the edges of our code. I've already offered a bounty to help us write a MovableType converter, though it looks like I'll have to finish writing this myself.

Here are the changes I had to make to our primary server's httpd.conf to make this work:

<VirtualHost *:80>
 ServerAlias *.semantic.blogmatrix.com
 ServerAlias weblog.davidjanes.com

 # special rules for users with their own blog name
 RewriteCond %{HTTP_HOST}   ^weblog.davidjanes.com$
 RewriteRule ^.*$           @davidjanes.semantic.blogmatrix.com$0


 # host based rewriting
 RewriteRule ^[@](.+)       $1 [S=1]
 RewriteRule ^.+            %{HTTP_HOST}$0
</VirtualHost>

Notes:

  • "[S=1]" means skip the next line; if we're rewriting for a user host we don't need the standard rule that inserts the username.semantic.blogmatrix.com into the rules
  • We'd just repeat the "# special rules" section as many times as we needed. The RewriteCond only applies to the next line
  • We'll need some way of making this very dynamic so users can just set up the hostname without operator intervention
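To make the flow of those rules easier to follow, here is the same logic written out in Python. It is a sketch for reasoning about the rules, not something Apache runs: the function name rewrite and the CUSTOM_HOSTS table are hypothetical, and "alice" in the usage note is a made-up username.

```python
# Hypothetical table standing in for the repeated "# special rules" sections.
CUSTOM_HOSTS = {
    "weblog.davidjanes.com": "davidjanes.semantic.blogmatrix.com",
}


def rewrite(http_host, path):
    """Mimic the host-based rewriting for one request."""
    target = path
    # RewriteCond on the host, then prefix "@" plus the canonical host.
    if http_host in CUSTOM_HOSTS:
        target = "@" + CUSTOM_HOSTS[http_host] + target
    # RewriteRule ^[@](.+) $1 [S=1]: strip the "@" and skip the next rule.
    if target.startswith("@") and len(target) > 1:
        return target[1:]
    # RewriteRule ^.+ %{HTTP_HOST}$0: the standard rule prefixes the request host.
    return http_host + target
```

So rewrite("weblog.davidjanes.com", "/2006/") gives "davidjanes.semantic.blogmatrix.com/2006/", while a request for "/x" on alice.semantic.blogmatrix.com just gets its own host prefixed.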

There's going to need to be more recognition of the user's preferred host name in the code, but I'll work on this later.

Internals: Apache rewrite rules for Python attachments

David P. Janes 2006-07-18 10:55 UTC

We were having a problem with serving .py files as attachments this morning. It turns out the problem was in the primary rewrite rules so I've made this rule:

RewriteRule   ^([^.]+)\.[^/]+/+(\d\d\d\d/\d\d\.\d\d/\d\d\d\d+/+.*)  http://semantic.blogmatrix.com/users/$1/podcasts/$2  

the first in its group; that is, we'll attempt to serve attachments (recognized by all the \d's) before we look for .py files (whose URIs are normally passed straight through for mod_python handling by the secondary Apache).
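The pattern is easy to try out in Python. In this sketch the input is assumed to be the host-prefixed string produced by the host-based rewriting, and the sample attachment path is made up to fit the digit layout in the pattern; the function name attachment_url is hypothetical.

```python
import re

# Same pattern as the RewriteRule: user, rest of host, then a dated path.
ATTACHMENT = re.compile(r"^([^.]+)\.[^/]+/+(\d\d\d\d/\d\d\.\d\d/\d\d\d\d+/+.*)")


def attachment_url(rewritten):
    """Return the attachment URL for a host-prefixed path, or None."""
    match = ATTACHMENT.match(rewritten)
    if not match:
        return None  # not an attachment; later rules get their turn
    return "http://semantic.blogmatrix.com/users/%s/podcasts/%s" % (
        match.group(1),  # $1: the username part of the host
        match.group(2),  # $2: the dated attachment path
    )
```

A .py file under a dated attachment path now matches here first, while an ordinary .py URI without the digits falls through to the mod_python rules.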

Internals: configuring admin-post for a particular application

David P. Janes 2006-07-13 12:48 UTC

This post is about the internals of how the BlogMatrix Platform works.

One of the most complex pieces of code in the BlogMatrix Platform is the admin-post command, the command that lets you make entries (The most complicated piece is genpage, which builds and caches webpages). Much of the complexity is because it can be configured in many different ways, depending on what the client application is (we do a lot more than blogging).

Much of the configuration comes from a data element called self.sections, which tells what to add and what strings to use and so forth. When bundles are added to a post, they can selectively

Right now I'm working on a Sales Lead bundle, for demonstration purposes. To make this more clear, instead of "Edit Entry" we want to display "Edit Sales Lead"; and instead of displaying "Title" for the title of the entry, we want to use "Opportunity". Here's how we do it (in blogmatrix.cfg):

bundle_saleslead: {
    "title" : "Sales Lead Tracking",
    "bundle_id" : "saleslead",
    "is_edit_hidden" : True,

    "post_sections" : {
        "TITLE" : "Edit Sales Lead",
        "TITLE_TITLE" : "Opportunity",
    },

Also note that the is_edit_hidden flag means it won't display on the "Add Structured Data" sidebar and once added, it can't be deleted. How do we add it in the first place, you ask? Just add to the post URI:

 /:admin/create/post/?bundle=saleslead,person,phone
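One way to picture how the bundle list on the URI and the "post_sections" overrides could fit together is the sketch below. It is an assumption about the mechanism, not the admin-post code: the names DEFAULT_SECTIONS, BUNDLES and sections_for are hypothetical, and the person and phone bundles are shown with no overrides only because their configuration isn't given here.

```python
from urllib.parse import parse_qs, urlsplit

# Default strings, as displayed when no bundle overrides them.
DEFAULT_SECTIONS = {"TITLE": "Edit Entry", "TITLE_TITLE": "Title"}

# Hypothetical in-memory view of the bundle configuration.
BUNDLES = {
    "saleslead": {
        "post_sections": {"TITLE": "Edit Sales Lead", "TITLE_TITLE": "Opportunity"},
    },
    "person": {},
    "phone": {},
}


def sections_for(uri):
    """Merge the post_sections of every bundle named on the URI over the defaults."""
    query = parse_qs(urlsplit(uri).query)
    bundle_ids = query.get("bundle", [""])[0].split(",")
    sections = dict(DEFAULT_SECTIONS)
    for bundle_id in bundle_ids:
        sections.update(BUNDLES.get(bundle_id, {}).get("post_sections", {}))
    return sections
```

With the URI above, the page title becomes "Edit Sales Lead" and the title field is labelled "Opportunity"; without a bundle parameter the defaults apply.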

Apache's mod_deflate

David P. Janes 2006-07-13 12:21 UTC

We've enabled mod_deflate on our Apache2 installation. This means that we'll only be sending about 10-20% of the data over the wire for our big fat HTML, JS and CSS files as the data will be GZIP compressed.

If you're building Apache2 from source, you must explicitly enable mod_deflate at configure time, i.e.:

./configure --enable-mods-shared=most --enable-deflate

Right now, this is what I've added to our config file:

LoadModule deflate_module modules/mod_deflate.so
AddOutputFilterByType DEFLATE text/html text/plain text/xml application/x-javascript text/css

There are probably more changes coming to this yet, though.
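For a rough feel of what GZIP compression does to markup, Python's gzip module can be used to measure a sample. The helper name compression_ratio is hypothetical, and the sample in the usage note is artificially repetitive, so real pages will compress less dramatically than it does.

```python
import gzip


def compression_ratio(data):
    """Compressed size divided by original size for a bytes payload."""
    return len(gzip.compress(data)) / len(data)
```

For instance, compression_ratio(b"<div class='post'><p>Hello, world</p></div>\n" * 500) comes out far below 0.2, which is the kind of saving that makes compressing text types worthwhile.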