BlogMatrix
 

David Janes' Code Blog

David P. Janes 2008-11-24 12:31 UTC

I am now posting regularly on David Janes' Code Blog. Come join me over there for daily postings about anything and everything that interests me in the code world, or subscribe directly to my feed here.

Java/JSP vs. Python and the 21st Century

David Janes 2008-07-18 12:14 UTC

I'm just starting a project that's using JavaServer Pages (not Java Server Pages) aka JSP for doing "templates" -- i.e. making dynamic HTML content. The first thing I'm trying is really quite simple -- create an object and print out one of its values.

Here's the relevant class:

public class Entry {
  String hour;

  public String getHour () {
    return this.hour;
  }
}

Java/JSP is big on setters and getters, that is, functions named a certain way to give access to object attributes. There are pluses and minuses to this, but at a conceptual level I thought it was quite clear that we are thinking "Entry objects have an attribute 'hour'", even if we have to use setHour/getHour to access values.

Now, in Cheetah, Django (and probably Rails, though I haven't looked) you'd access the attribute 'hour' in a template language as follows:

e.hour

Simple -- the last thing you want to be doing in an HTML template is dumping tons of programming language code into it, making it hard to read and maintain. Cheetah is quite clever about this: it looks at 'e' and does all sorts of introspection to figure out how to get 'hour', so you don't have to do that yourself.
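As a rough illustration of the kind of introspection described above, here is a minimal Python sketch of how a template engine might resolve a name like 'hour' against an object. This is not Cheetah's actual lookup code; the function name `resolve`, the `Entry` stand-in and the lookup order are assumptions made for the example.

```python
def resolve(obj, name):
    """Find `name` on `obj` the way a forgiving template engine might."""
    # 1. Dictionary-style lookup: e["hour"]
    if isinstance(obj, dict) and name in obj:
        return obj[name]
    # 2. Plain attribute: e.hour (called if it turns out to be a method)
    if hasattr(obj, name):
        value = getattr(obj, name)
        return value() if callable(value) else value
    # 3. Java-style getter: e.getHour()
    getter = getattr(obj, "get" + name[0].upper() + name[1:], None)
    if callable(getter):
        return getter()
    raise LookupError("cannot resolve %r" % name)


class Entry:
    """Python counterpart of the Java class above, getter and all."""

    def __init__(self, hour):
        self._hour = hour

    def getHour(self):
        return self._hour
```

With this in place the template author only ever writes the short form: `resolve(Entry("12"), "hour")` finds the getter by itself.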

So how do we do this in JSP (like in 2008)?

e.getHour()

Give me a break. Maybe there's another way to do this? If there are performance reasons to be wedded to this format (though I can't see it since everything is compiled) why not invent a new syntax like

e::hour

Playing with Django

David P. Janes 2008-06-16 11:30 UTC

Here are a few notes I made while working my way through Django's tutorial:

Installation

Installation of Django is easy. Just follow the instructions on this page:

It can either be downloaded as a tarball or via SVN. We've chosen the tarball (0.92.2) option.

Nitpicking, the tarball URL doesn't end with the name of the tarball - it's a directory. This is only slightly annoying.

Our environment has a per-user Python installation, so we don't need to use "sudo" to install the Django packages.

Testing (Part 1)

We are following the instructions on this page:

Running a development webserver

Django has options for running under mod_python, WSGI or using a standalone built in webserver. The built-in webserver is documented as not-for-production, but it's good enough to get going so we're going to play with that for now. Eventually we expect to use both the mod_python (because we have it here) and WSGI options (because it's both the way to go and the most efficient).

We immediately ran into issues running the webserver because it's not on the "localhost" - the browser said it couldn't find the server. In our environment we ssh to a Linux server and access that from our desktop computer. After a little googling, the trick turns out to be adding an "IP:port" argument to the Python command:

python manage.py runserver 192.168.1.10:8000

Connecting to the DB

Next in the instructions is connecting to the DB. Again, we have issues here - no fault of Django - because of the uniqueness of the local environment. Like Python we run MySQL on a per-user basis so we need to be able to specify a fairly unique setup with a TCP/IP and UNIX domain socket. It's not clear initially how to specify the path for the UNIX domain socket but fortunately the stack traceback and clearly written code come to our rescue here: you use the DATABASE_HOST and the code looks to see if it's a path.
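For illustration, a settings fragment along these lines is roughly what we ended up with, followed by a small paraphrase of the check described above. The socket path, database name and user are placeholders for a per-user setup, and `connection_kwargs` is a hypothetical helper that only mimics the behaviour; it is not Django's actual backend code.

```python
# settings.py (fragment) - all values are placeholders
DATABASE_ENGINE = "mysql"
DATABASE_NAME = "mysite"
DATABASE_USER = "django"
DATABASE_PASSWORD = ""
DATABASE_HOST = "/home/me/mysql/mysql.sock"  # a path, so treated as a UNIX socket
DATABASE_PORT = ""


def connection_kwargs(host, port):
    """Paraphrase of the check: a host that looks like a path is a socket."""
    if host.startswith("/"):
        return {"unix_socket": host}
    kwargs = {"host": host}
    if port:
        kwargs["port"] = int(port)
    return kwargs
```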

Creating the DBs

The 'syncdb' command asked if I'd like to create a superuser for the DB. I said no and everything seems to be working - a bunch of tables show up in MySQL as promised.

Creating an App

An "app" is something that does something or something like that. The command works as promised.

Create the Model and DB tables

Works as promised. There appears to be some underlying cleverness happening as (for example) the "sqlclear" command knows whether the tables are there or not.

The "sqlinitialdata" refers you to another command; this is probably a mismatch between the code and the tutorial.

Interactive Shell

This is cute - running "python manage.py shell" will drop you in the environment that the webserver sees so you can do stuff on the command line to see how it will work.

Stylistically, I differ from Django in that I (almost) never do anything except 'import x', as I prefer to use the dotpath and not have to guess when I'm skimming code as to where functions are coming from.

As an aside, it would be cool if Python examples did the '>>>' as an image or as a CSS trick so they don't get copied when cut-and-pasting examples.

Testing (Part 2)

And we're on to page 2 of the tutorial:

Admin Interface

And we get our first "uh oh" - we didn't create a "superuser account" way back above, well, just because. Fortunately, Lord Google knows all and we quickly find out that you can run Python commands to do this and we're in business:

$ python manage.py shell
>>> from django.contrib.auth.create_superuser import createsuperuser
>>> createsuperuser()


Thanks to here for the tip (which also looks like interesting reading):

Admining the App

Nice - add an inner class declaration to App "Poll" and it now shows up in the Admin interface and we can do stuff with it. Neato.

I've also just discovered that I don't need to stop and restart the app - it's using reload().

Adding __str__ to the model makes the Admin app useful; if you do something wrong the HTML function is really quite clear.

You can change the way the Admin interface displays Poll by adding a 'fields' declaration. This is a little bizarre - it's a tuple of tuples, each with a dictionary inside. Why not a list of dictionaries?
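To make the shape concrete, here is a sketch in plain Python data, with no Django import needed. The field names `question` and `pub_date` are what I recall from the tutorial's Poll model, and the list-of-dictionaries form (including its `title` key) is purely my hypothetical alternative, not something Django accepts.

```python
# Shape of the tutorial-era 'fields' declaration: a tuple of
# (section title, options dictionary) tuples.
fields = (
    (None, {"fields": ("question",)}),
    ("Date information", {"fields": ("pub_date",), "classes": "collapse"}),
)

# The alternative I had in mind: a list of dictionaries (hypothetical).
fields_as_dicts = [
    {"title": None, "fields": ["question"]},
    {"title": "Date information", "fields": ["pub_date"], "classes": "collapse"},
]


def to_dicts(declaration):
    """Convert the tuple-of-tuples form into the list-of-dictionaries form."""
    result = []
    for title, options in declaration:
        entry = {"title": title}
        for key, value in options.items():
            # tuples become lists so both forms compare equal
            entry[key] = list(value) if isinstance(value, tuple) else value
        result.append(entry)
    return result
```

Both forms carry the same information; the second just reads more naturally to me.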

Bug: if you use the 'classes/collapse' option of the admin interface and there's a bug while saving an item while the item is collapsed, there's no indication of where the error is.

Changing Templates

The instructions are a little confusing - pay attention - but it works as advertised. There's a command called 'adminindex' which dumps out template code but I'm somewhat confused by it and I wish there was a little more detail here.

Testing (Part 3)

And we're on to page 3 of the tutorial.

Design your URLs

Django converts URLs coming from user requests into actions by:

  • looking at ROOT_URLCONF in settings.py
  • loading mysite/urls.py (as defined in the settings)
  • sequentially looking at regular expressions
  • loading a module - typically a view - and calling a function defined by the matching regular expression. This is expressed as a dotpath - e.g. 'x.y.z' will load module 'x.y' and call function 'z'
    • the function is called with a request object and all the named groups from the regular expression - clever
  • also note that:
    • you cannot filter on the hostname in the URL (problem?)
    • regular expressions are compiled (and thus are very fast)
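The steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not Django's resolver: `resolve_dotpath`, `dispatch` and the `demo_views.detail` dotpath are names invented for the example.

```python
import importlib
import re

# (regular expression, dotpath) pairs; the names are made up for the example
urlpatterns = [
    (re.compile(r"^polls/(?P<poll_id>\d+)/$"), "demo_views.detail"),
]


def resolve_dotpath(dotpath):
    """'x.y.z' -> load module 'x.y' and return its attribute 'z'."""
    module_name, function_name = dotpath.rsplit(".", 1)
    module = importlib.import_module(module_name)
    return getattr(module, function_name)


def dispatch(patterns, path, request):
    """Try each pattern in order; call the first matching view."""
    for regex, dotpath in patterns:
        match = regex.search(path)
        if match:
            view = resolve_dotpath(dotpath)
            # named groups become keyword arguments
            return view(request, **match.groupdict())
    raise LookupError("no pattern matches %r" % path)
```

Assuming a module `demo_views` with a function `detail(request, poll_id)` is importable, `dispatch(urlpatterns, "polls/23/", request)` ends up calling `demo_views.detail(request, poll_id="23")`.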

The code samples work as advertised. In a later section we'll learn that you can move all the URLs into the app so that (a) you don't have to put all the URL information at the project level and (b) the app is decoupled from the base path of the URL being used.

Write views that do something

This section explains how to return meaningful results, building on the knowledge in section 2.

  • the template loading system rocks
  • the shortcuts are very handy (i.e. lots of common actions are compressed into single function calls)

Playing with forms (Part 4)

This section is a mess. I'll probably retackle it.

Scaling Twitter, and similar - a few ideas

David P. Janes 2008-05-26 21:36 UTC

Introduction

This is a repeat of an architecture I worked on several years ago for a high-performance, real-time financial risk management system involving scaling problems not dissimilar to what Twitter is facing. It works, and it works well.

This post is inspired by the Dare Obasanjo post (http://www.25hoursaday.com/weblog/2008/05/23/SomeThoughtsOnTwittersAvailabilityProblems.aspx) about Twitter's scaling issue, and by noting the major saving grace - most calls are looking for the last 20 items, so it doesn't necessarily matter that Scoble has 21,000+ followers because he's never going to see the damned posts anyway.

This article is just a sketch of how this would work; I acknowledge the devil is in the details. Onaswarm does not use this architecture and will hit similar walls as Twitter if it is ever used at Twitter's level. And obviously, it's all hypothetical since I'm not actually doing this. If I was implementing this I would be doing it in Java or C++.

Overview of the problem

  • Twitter sucks, probably because of certain pathological edge cases requiring either large numbers of reads or large numbers of writes, or both

Overview of the solution

In case you don't want to read everything below, here's what we end up with:

  • one 64 GB machine that stores recent timelines for 30m+ users
  • one 64 GB machine that stores the last 200 million messages
  • a switched Gb Ethernet connection between machines
  • a number of smaller machines for easier tasks
  • uni-directionally pass messages from machine to machine
  • messages can be batched and pipelined, and are small (not exceeding 4K)
  • 10,000 requests per second (pending further analysis)

Overview of how it works

  • one-way message passing along always-open connections
  • one processor per major task
    • keep code in the instruction cache at all costs
  • the core is a few dedicated machines passing messages across the network
  • messages are very small - a few hundred bytes to 4K. Multiple messages can (and will) be sent all at once.

Overview of benefits

  • horizontally scalable - just keep adding machines for load, replicating the entire setup if need be
  • this should work to about 30 million users
  • after about 30 million users another level is needed, but this doesn't overly complicate
  • incredibly high throughput

Overview of assumptions

  • every message has a unique ID (MID) that fits in 8 bytes
  • every user has a unique ID (UID) that fits in 4 bytes
  • the average number of friends per-user is 64 (http://www.kottke.org/07/03/twitter#26563)
  • twitter messages are 140 bytes
  • there's a couple of big-assed machines with lots and lots of memory (64 GB) and fast, efficient (Gb) network connections; these machines can be purchased for about $20,000.
  • favor frequent queries, service infrequent queries, punish highly infrequent or improbable queries
  • there may be a throttling mechanism added to this
    • this could be requiring an ACK/NACK every so many messages, or
    • a separate channel between components
  • There's a lot more going on in the API which we do not address here for brevity - in particular:
    • friend list modifications
    • notification on keywords
    • account information has to be looked up!

Networking assumptions

  • Gb Ethernet
  • 4K max messages
  • 1024 * 1024 * 1024 / 8 / 4096 = 32768 messages/second - let's call it 10,000 because we probably won't get peak speed
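The arithmetic in the list above, spelled out as a small Python check. The constant names are mine; the numbers are the ones given above, with worst-case 4K messages.

```python
LINK_BITS_PER_SECOND = 1024 * 1024 * 1024  # Gb Ethernet, as counted above
MAX_MESSAGE_BYTES = 4096                   # 4K max message
PLANNING_RATE = 10000                      # the conservative figure we design for

link_bytes_per_second = LINK_BITS_PER_SECOND // 8
peak_messages_per_second = link_bytes_per_second // MAX_MESSAGE_BYTES

# How much slack the conservative figure leaves against the theoretical peak
headroom = peak_messages_per_second / PLANNING_RATE
```

Since most messages are far smaller than 4K, the real ceiling in messages per second is likely higher; the 10,000 figure is deliberately pessimistic.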

Servers / Components

Every server (except the FES) has three types of thread:

  • reader threads
  • one "task" thread that does stuff that the reader asks and places the result on the writer thread's queue (might use one per core)
  • writer threads
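A minimal sketch of that thread layout in Python (purely illustrative, since I'd really do this in Java or C++). It uses one thread of each type, an in-memory iterable standing in for the inbound connection and a list standing in for the outbound one; `run_server`, `STOP` and the other names are invented for the example.

```python
import queue
import threading

STOP = object()  # sentinel that tells the next stage to shut down


def reader(source, inbox):
    """Reader thread: pull incoming messages and queue them for the task."""
    for message in source:
        inbox.put(message)
    inbox.put(STOP)


def task(inbox, outbox, handle):
    """Task thread: do the actual work, one message at a time."""
    while True:
        message = inbox.get()
        if message is STOP:
            outbox.put(STOP)
            return
        outbox.put(handle(message))


def writer(outbox, sink):
    """Writer thread: pass results on to the next server in the pipeline."""
    while True:
        result = outbox.get()
        if result is STOP:
            return
        sink.append(result)


def run_server(source, handle):
    """Wire the three threads together and run until the source is drained."""
    inbox, outbox, sink = queue.Queue(), queue.Queue(), []
    threads = [
        threading.Thread(target=reader, args=(source, inbox)),
        threading.Thread(target=task, args=(inbox, outbox, handle)),
        threading.Thread(target=writer, args=(outbox, sink)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sink
```

The point of the split is that the task thread never blocks on the network: it only ever touches the two queues.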

FES

The "Front End Server" - i.e. probably an HTTP process that accepts web and API requests. There are lots and lots and lots of these, as needed.

CS

The "Concentration Server" - there is one of these per N FESs, where N is something like 1024. The CS:

  • accepts requests from the FES
  • pumps the requests into the start of the pipeline
  • gets results from the end of the pipeline
  • passes the results back to the appropriate FES

These can be simple machines with a couple of GB of memory.

DBWS

The Database Write Server:

  • writes new messages to the DB, returning the MID
  • passes on messages

The exact ratio for CS:DBWS or DBWS:TLS will have to be discovered by experiment, though I would probably base it on the first ratio.

These can be simple machines with a couple of GB of memory.

TLS

The "Time Line Server" - this stores:

  • the last N entries for every user, where N is something like 128
  • the friends list for every user

On a 64 GB machine:

  • 100 (average number of friends) * 4 (size of UID) = 400 bytes is needed per account to store friends
  • 128 (size of TL we are storing) * 12 (size of UID + MID) = 1536 bytes is needed per user

So let's say we need 2K per user - that's 33 million users. Scaling beyond this will require another level of merge sorting, but 33 million is a good start.

MCS

The Message Cache Server - this stores the last M messages via a dictionary/hash table, where M is a big number.

On a 64 GB machine, assuming 256 bytes is needed per message, we are looking at storing 268 million messages for fast retrieval!
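The TLS and MCS sizing can be checked with a few lines of Python. The constant names are mine; the figures are the ones used in the two sections above.

```python
MACHINE_BYTES = 64 * 1024 ** 3      # 64 GB of memory

# TLS: friends list plus stored timeline per user
FRIENDS_BYTES = 100 * 4             # 100 friends * 4-byte UID = 400
TIMELINE_BYTES = 128 * (4 + 8)      # 128 entries * (UID + MID) = 1536
PER_USER_BYTES = 2 * 1024           # rounded up to 2K per user

# The rounded figure must cover what we actually store per user
assert FRIENDS_BYTES + TIMELINE_BYTES <= PER_USER_BYTES

tls_users = MACHINE_BYTES // PER_USER_BYTES   # users one TLS can hold

# MCS: 256 bytes per cached message
mcs_messages = MACHINE_BYTES // 256           # messages one MCS can hold
```

This ignores hash table and allocator overhead, so treat both results as upper bounds.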

DBRS

The DB Read Server - this backfills information the MCS could not retrieve.

The exact ratio for CS:DBWS or DBWS:TLS will have to be discovered by experiment, though I would probably base it on the first ratio.

These can be simple machines with a couple of GB of memory.

Operations

Add Twit

There's probably no need for the complete round trip I've outlined in the flow below.

  • FES
    • accepts "add twit" message
    • gets the UID for the user
    • sends message to the CS
    • waits for a result on its two-way socket (this is the only two-way connection)
  • CS
    • accepts "add twit" message
    • sends message to the DBWS
  • DBWS
    • accepts "add twit" message
    • writes the DB
    • adds the MID to the message
    • calls the TLS
  • TLS
    • accepts "add twit" message
    • updates user's timeline with the MID and update time, tossing off the oldest entry in the TL
    • NOTE that the text of the twit is not stored!
    • calls the MCS
  • MCS
    • accepts the "add twit" message
    • adds the twit to the cache
    • randomly or cleverly removes an old twit from the cache if memory is full
    • calls the CS
  • CS
    • accepts the "add twit" message; it knows that this is a completed operation
    • sends the result (i.e. "OK") back to the FES

And we're finished.
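As a sketch of what the TLS and MCS steps in the flow above do with an "add twit" message, here is a toy in-memory version in Python. The class and method names are invented, and evicting the oldest cached twit is just one simple choice where the flow says "randomly or cleverly".

```python
from collections import OrderedDict, deque

TIMELINE_LENGTH = 128  # the N entries kept per user


class TimelineServer:
    """Keeps only (update time, MID) pairs per user; never the twit text."""

    def __init__(self):
        self.timelines = {}

    def add_twit(self, uid, mid, timestamp):
        timeline = self.timelines.setdefault(uid, deque(maxlen=TIMELINE_LENGTH))
        # a full deque silently drops its oldest entry on append
        timeline.append((timestamp, mid))


class MessageCache:
    """Maps MID -> twit text, bounded to `capacity` messages."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.messages = OrderedDict()

    def add_twit(self, mid, text):
        self.messages[mid] = text
        if len(self.messages) > self.capacity:
            self.messages.popitem(last=False)  # evict the oldest twit
```

Both operations touch a fixed amount of memory per twit, regardless of who posted it.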

Computations

Note the computations involved:

  • CS: hash table operations, in memory
  • DBWS: 1 DB operation, memory + disk
  • TLS: O(1) index table operations, in memory
  • MCS: ~O(1) hash table operations, in memory

And also note the size of the message is almost certainly under 256 bytes. Note that the amount of work is constant, no matter if you are JoeBlow-like or Scoble-like.

Get Timeline

  • FES
    • accepts "get timeline" message
    • gets the UID for the user
    • sends message to the CS
    • waits for a result on its two-way socket (this is the only two-way connection)
  • CS
    • accepts "get timeline" message
    • sends message to the TLS
  • TLS
    • accepts "get timeline" message
    • looks up all followers
    • start with the first follower, then loop through the remainder:
      • merge sort results
    • calls the MCS with the 20 MIDs to look up
  • MCS
    • accepts the "get timeline" message
    • looks up each MID
    • if all MIDs are found, calls the CS
    • if any MIDs are missing, calls the DBRS
  • DBRS (optional)
    • accepts the "get timeline" message
    • looks for all MIDs where the message could not be found
    • does a DB search for them and add them
    • calls the CS
  • CS
    • accepts the "get timeline" message; it knows that this is a completed operation
    • sends the result (i.e. all the necessary twitter messages) back to the FES

The DBRS could optionally also route the message through the MCS again to make the DBRS write them into cache!
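The merge step in the TLS and the hit/miss split in the MCS can be sketched as follows. This is a hedged illustration in Python: each stored timeline is assumed to be a list of (update time, MID) pairs sorted newest first, the accounts being merged are whichever ones the TLS looked up (the text calls them followers; the TLS section calls this the friends list), and the function names are invented.

```python
import heapq
from itertools import islice


def latest_mids(timelines, uids, limit=20):
    """Merge the per-user timelines and return the `limit` newest MIDs."""
    streams = (timelines.get(uid, []) for uid in uids)
    # heapq.merge is lazy, so we stop after `limit` entries
    merged = heapq.merge(*streams, reverse=True)
    return [mid for _, mid in islice(merged, limit)]


def split_hits(cache, mids):
    """MCS step: return the cached messages and the MIDs the DBRS must backfill."""
    found = {mid: cache[mid] for mid in mids if mid in cache}
    missing = [mid for mid in mids if mid not in cache]
    return found, missing
```

Because the merge is lazy, the work depends mostly on how many timelines are merged, not on how long each one is.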

Computations

We'll assume everything's in cache, but if it's not we're adding a single database read that will return 1 to 20 entries.

Normal User

Has:

  • 64 followers
  • assume average 10 integer date comparisons per timeline to do merge sort

So:

  • CS: hash table operations, in memory
  • TLS
    • O(1) index table operations, in memory - to find followers
    • 640 integer comparisons
  • MCS: 20 ~O(1) hash table operations, in memory

The message from the TLS to the MCS is less than 256 bytes: (4 bytes for the UID + 8 bytes for the MID) * 20 = 240 bytes.

The message from the MCS to the CS is about 3K (140 byte message * 20 plus overhead)

Scoble-like Reader

Has:

  • 10000 followers
  • assume average 4 integer date comparisons per timeline

So:

  • CS: hash table operations, in memory
  • TLS
    • O(1) index table operations, in memory - to find followers
    • 40000 integer comparisons
  • MCS: 20 ~O(1) hash table operations, in memory

I.e. the only change from the previous case is the 40,000 integer comparisons, as opposed to 640. These operate in processor cache due to the one-machine, one-task architecture and will be computed quickly - a multi-GHz machine can probably do hundreds of thousands of these per second.

Bottlenecks

Let's assume that we average 5000 operations in the TLS to merge timelines. With the network bottleneck at 10,000 requests per second, we need a CPU that can handle 50 million integer operations / second, which seems to be a small fraction of what a modern CPU can do.


Safari file upload fails

David P. Janes 2008-05-25 10:49 UTC

In certain circumstances, Safari will hang when uploading documents, eventually causing the server to throw a read timeout error. This has been tested against Safari 3.1.1 but apparently happens with earlier versions too.

You can tell if you are experiencing this bug if:

  • file uploads hang and eventually time out; this may happen intermittently
  • if you wait 1 (or 5 depending) minutes and then submit/upload, it works
  • you are likely working over a fast local intranet connection rather than a slower internet connection
  • the form works fine in other browsers such as Firefox

This bug is ably described here (http://lists.macosforge.org/pipermail/webkit-unassigned/2007-January/026203.html) and apparently is something happening at a very low level of the TCP/IP stream. Unfortunately, Apple doesn't really seem to believe this is a bug (https://bugs.webkit.org/show_bug.cgi?id=5760).

I do not have a workaround for this, except by disabling Keep-Alive on Safari.


The Amazon MP3 Clips Widget

David P. Janes 2008-05-15 12:25 UTC

Amazon (US) has just announced an "MP3 Clips Widget" (http://widgets.amazon.com/Amazon-MP3-Clips-Widget/):

Add music to your web site with the MP3 Clips widget. Search through Amazon's catalog of DRM-free MP3 music and add entire albums or select specific MP3 tracks to add to your widget. You can also showcase the latest Bestsellers from any music genre. If that isn't enough, your MP3 Clips widget can also automatically display the latest MP3 tracks you purchase on Amazon.com

How we'd like to use this

Once upon a time - and probably soon again - Onaswarm had an "Ads" widget in the sidebar which displayed (amongst other things) a list of recently played MP3 tracks, as shared with us via Last.fm. Clicking on the MP3 track would bring the reader to Amazon.com where they could play MP3 samples - and potentially purchase the track. A win-win-win for everyone involved. Unfortunately, because of the extra steps involved plus an off-site navigation, this feature was somewhat underutilized.

How it works

  • go to the Amazon MP3 Clips Widget page
  • enter a search for music, by album or song title (or anything). You can also select Best Sellers or Recently Purchased.
  • a list of matching items appears, which you can add to the widget

This is all AJAX-y, so you're just working on one page. When you've selected all the items you want to appear on the widget:

  • click Next Step
  • a widget size selector and interactive preview appear; the widget starts with the album cover displayed, clicking on this brings you down to the individual track level

If you're happy with your widget:

  • click "Add to my web page"
  • a popin appears with the appropriate OBJECT embed code, plus explanations of how to add it to many different blogging services

The Good

It's pretty cool, easy to use and works exactly as advertised. I could see this driving lots of MP3 sales to Amazon.

The Bad

You can't dynamically select what the widget is going to display, that is, you have to construct a widget for each set of music you want to share. This makes it somewhat - well, totally - useless for the purpose we're outlining above. This would be marginally tolerable if there was an API or something, but alas that option isn't there either.

The Ugly

The widget constructor doesn't work on Safari 3. It doesn't complain, it doesn't pop up errors, it just doesn't work.

If you're not logged in, it forgets all state after you log in. Come on guys.

Amazon.com won't sell MP3s to Canadians.

Summing up

Nice, but needs a few minor UI tweaks. Desperately needs a way of dynamically constructing a widget at render time.

OpenID and LDAP

David P. Janes 2008-05-14 20:43 UTC

Many enterprises - including most, I would guess, that are Microsoft-centric - use LDAP to establish user identity and profiles. In the Web 2.0 world, the emerging standard is OpenID. Is there a way to use OpenID to provide logins within the Enterprise but have it backed by LDAP, the obvious benefit being one could install off-the-shelf intranet tools inside one's organization but not have to LDAP-enable them or create a parallel account system?

The OpenID-LDAP Project (http://www.openid-ldap.org/) offers such a tool.
We're testing this on a Macintosh, but there seems to be no reason this won't work on any UNIX-y system.

Installation

First, download and unpack the code into the web server directory.

$ cd ~/Sites
$ curl --location 'http://www.openid-ldap.org/releases/openid-ldap-0.8.5-noarc.tar.gz' >openid-ldap-0.8.5-noarc.tar.gz
$ tar zxvf openid-ldap-0.8.5-noarc.tar.gz

This extracts the code into a non-versioned subdirectory called 'openid-ldap'. It would be much better form if the directory was called 'openid-0.8.5'.

Interlude: Enabling PHP on Mac OS X Leopard

Leopard has PHP but it has to be explicitly enabled by editing configuration files (if you haven't enabled Apache on your Mac, see the links below)

     $ su -
     # cd /etc/apache2
     # vi httpd.conf
     remove the hash sign on the ‘LoadModule php5_module' line
     # apachectl restart

Here are some helpful links if you need more information:

Running to Stand Still

Without configuring anything, let's see what happens when we visit the page:

  • http://localhost/~davidjanes/openid-ldap/

Note that URL is Leopard's way of referencing a user's (i.e. "davidjanes") local webpage.

A webpage appears with a field for entering a username - but not a password. Entering a username - e.g. dpjanes - redirects us to the 404 page:

  • http://localhost/~davidjanes/openid-ldap/dpjanes

... which definitely wasn't expected.

Reading through their documentation, it looks like they're mainly doing this using SSL/HTTPS and to do that one has to add some rewrite rules to the Apache configuration. Since we're not doing that - at least not yet - we're probably using an infrequently used code path, thus hitting a bug. Perusing the code, we see that the URL above should be internally rewritten to:

  • http://localhost/~davidjanes/openid-ldap/index.php?user=dpjanes

To fix this we have to modify the Apache configuration again. Changing ".htaccess" does not work because Apache on Leopard is configured "AllowOverride None", which means the rewrites will be ignored.

$ su -
# cd /etc/apache2/users
# vi davidjanes.conf

And then we add the following:

     RewriteEngine On
     RewriteBase /~davidjanes/openid-ldap/
     RewriteCond %{REQUEST_URI} ^/.*[/]([a-z][-a-z0-9_]*)$
     RewriteRule ([A-Za-z0-9]+)$  /~davidjanes/openid-ldap/index.php?user=$1 [P]

And then

     # apachectl restart

Note that these rules assume that we're going to be logging in using OpenID's "uid", which will consist of lowercase letters, numbers, dashes or underscores.

Configuring LDAP

This is obviously the part where we're going to part paths - everyone does LDAP their own way. We don't have an Active Directory setup here, but we do have VMWare Fusion (http://www.vmware.com/products/fusion/) and a JumpBox for OpenLDAP appliance (http://www.vmware.com/appliances/directory/1105) so it should be just a simple matter of figuring out the right combination of configuration settings.
The OpenLDAP appliance has the following configuration:

  • JumpBox Name: openldap 0.9
  • Application Page: http://192.168.1.120/
  • Management Page: https://192.168.1.120:3000/

I've already configured a few accounts on this, but for example we have a user:

  • o=Directory
  • ou=users
  • cn=David Janes

In LDAP terms this gives us a "Distinguished Name" which is the way LDAP (as I understand it) uniquely identifies a record. In this particular case our Distinguished Name is "cn=David Janes,ou=users,o=Directory".
This user has the following configuration:

  • cn: David Janes
  • gidNumber: 1000
  • givenName: David
  • homeDirectory: /home/users/default/dpjanes
  • objectClass:
    • inetOrgPerson
    • posixAccount
    • top
  • sn: Janes
  • uid: dpjanes
  • uidNumber: 1000

We're going to use "uid" as the login ID - note that this is by no means a universal choice nor is it universally available on all LDAP servers. I've seen LDAP servers use "name" to provide a unique identifier and it's possible - maybe even probable - that many LDAP servers don't provide short unique names at all.

Note then how LDAP logins should probably work:

  • one provides a part of the record we are looking for, for example "uid=dpjanes", where the user at login time provides the "dpjanes" part and the configured application prepends "uid="
  • given a starting point - the "searchdn" in the configuration below - we look for a matching record
  • when we have the matching record, we get the Distinguished Name which uniquely identifies a record, and then we ask LDAP to validate it with a password

Note that OpenID-LDAP doesn't actually work quite this way; we'll explain this further down.

Configuring OpenID-LDAP to contact LDAP

Following the instructions in openid-ldap/docs/README.txt, especially point (5), we get the key points of configuration - edit "ldap.php" and fill in the values.

The original connection settings look like this:

'primary'  => '10.0.0.111',
'fallback' => '10.0.0.222',
'protocol' => 3,
'binddn'   => 'cn=<name>,cn=users,dc=domain,dc=local',
'password' => '<pass>',
'searchdn' => 'cn=users,dc=domain,dc=local',
'filter'   => '(&(cn=%s)(mail=*))',
'testdn'   => 'cn=%s,cn=users,dc=domain,dc=local',
'nickname' => 'uid',
'email'    => 'mail',
'fullname' => array('givenName', 'sn'),
'country'  => 'c'

Our new connection settings look like this:

'primary'  => '192.168.1.120',
'fallback' => '',
'protocol' => 3,
'binddn'   => '',
'password' => '',
'searchdn' => 'ou=users,o=Directory',
'filter'   => 'uid=%s',
'testdn'   => 'uid=%s,ou=users,o=Directory',

Note the reasons for this:

  • primary: as per the VMWare notes above
  • fallback: we don't have a backup server
  • binddn & password: it works without this; but we assume there are LDAP configurations that require you to log in with a well-known Distinguished Name and password before you can do a search
  • searchdn & filter: the '%s' is replaced with the user's login name (i.e. from the login form) and then these items are put together to search for the user's record
  • testdn: when actually logging in, the '%s' is replaced as above; the page then tests the modified testdn with the password provided against the server

Note then the difference between OpenID-LDAP and our hypothetical login scenario in the previous section - OpenID-LDAP searches for the login but, after validating that it exists, ignores the Distinguished Name and just tries to log in using a simply constructed testdn and password. This works, but it strikes me that the search is either unnecessary or the login procedure is insufficient.

Failure

Alas, at this point we're going to have to stop, unless someone has a suggestion. When I attempt to log in with "dpjanes" we end up with the OpenID-LDAP bridge trying to log in with "uid=dpjanes,ou=users,o=Directory", which simply doesn't work. Whether this is specific to my LDAP implementation or not is unknown.

If I alter the rules so that I'm logging in with "David Janes" / "cn=David Janes,ou=users,o=Directory" the (slightly modified) Apache rewrite rules get confused because of the space. I could probably fix these but quite frankly I don't want to because I want "dpjanes" to be recognized as the login.

So, that's as far as I'm getting with this. If anyone has further suggestions, please let me know and I'll modify this document as necessary.

Onaswarm Social Network Explorer

David P. Janes 2008-05-07 14:21 UTC

Onaswarm now provides an interface for finding out the social network connectivity of webpages. Connections are discovered using XFN, hCard, FOAF, optionally Google's SGN services and in some instances custom APIs if account information is available to Onaswarm.

URI

Parameters

  • uri - the URI of a page you'd like to discover social network details for
  • wrapper - if "ajax", the results will be returned in JSON format
  • json_pretty - boolean; the results will be pretty printed
  • jsonp - if non-empty, the results will be placed in a JSONP wrapper
  • reverse - boolean; the results will reflect links to this page, as opposed to outbound from this page
  • google - boolean; add results from Google Social Graph API. In HTML mode, this defaults to True; in AJAX, False.
  • appkey - coming soon
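As an illustration of how these parameters combine, here is a small Python helper that builds a query string. The base URL is deliberately left as an argument (the placeholder used in the usage note is not the real endpoint), `build_query` is a name invented for the example, and sending booleans as "1"/"0" is an assumption about the encoding.

```python
from urllib.parse import urlencode


def build_query(base_url, uri, ajax=False, jsonp=None, reverse=False, google=None):
    """Assemble an explorer query from the parameters listed above."""
    params = [("uri", uri)]
    if ajax:
        params.append(("wrapper", "ajax"))      # ask for JSON output
    if jsonp:
        params.append(("jsonp", jsonp))         # JSONP callback name
    if reverse:
        params.append(("reverse", "1"))         # assumed boolean encoding
    if google is not None:
        # only sent when given, so the server-side default still applies
        params.append(("google", "1" if google else "0"))
    return base_url + "?" + urlencode(params)
```

For example, `build_query("http://example.invalid/explore", "http://twitter.com/bvl", ajax=True, google=False)` would ask for the outbound links of a page in JSON without Google SGN results.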

Example Queries

form interface

all links from Twitter user "bvl" in HTML, augmented with Google SGN results

all links outbound from Twitter user "bvl", without Google SGN results

Dan Brickley's FOAF file

Notes

  • if our server is experiencing unusual loads, this API will return 503 errors
  • there's a lot more we could do with the FOAF files - tell us what
  • if we did FOAF output, would you use it?
  • we will be using appkeys to control access to the API in the near future, mainly to stop robots from crawling the web through our API!

Onaswarm is migrating

David P. Janes 2008-04-08 09:56 UTC

We're migrating the data in Onaswarm to a high performance network attached disk ... we expect we'll be back by 8 AM EDT.

Onaswarm and Metronauts

David P. Janes 2008-04-04 10:45 UTC

Onaswarm is pleased to announce that we’ve set up a “swarm” especially for the Metronauts / Transit Camp community. Your swarm – http://metronauts.onaswarm.com – will create a lifestream to capture and share all community posts, twits and photos about this community.

Signing up

If you don’t have an Onaswarm account, signing up is very easy:

You’ll then be led through a set of simple steps to add your profile and social network information. You’ll automatically be added to the metronauts swarm.

If you already have an Onaswarm account:

Posting

You must use the “metronauts” tag when posting in order for your post to show up on the Metronauts group:

  • on del.icio.us, WordPress and Flickr, add “metronauts” to the tag field
  • on Twitter and Pownce, use the hash tag “#metronauts” in your twit – you can see examples of this on the Metronauts swarm

Generally your post/twit/photo will show up in Onaswarm about 15 – 30 minutes after posting, depending on load. Del.icio.us feeds are only checked every hour, due to terms of use restrictions so they may take a little bit longer to show up.

Widgets

If you’re interested in displaying the Metronauts swarm lifestream on your blog or webpage, try adding our widget:

It’ll only take a few seconds.

XFN best practices

David P. Janes 2008-03-21 20:30 UTC

I woke up this morning with the intention of writing a "best practices" guide to doing microformats only to find out that Glenn Jones had beaten me (handily) to the task. In my mind this should be converted into a wiki page.

Fire Eagle

David P. Janes 2008-03-19 11:26 UTC 1 comment

Via the magic of Twitter, Twhirl and @dangerday, I’ve finally found myself in possession of a Fire Eagle (http://fireeagle.yahoo.net/) invite. Clicking through the link that was e-mailed to me, I logged in with my Yahoo ID and there I was – finally – in Fire Eagle. In case you’re not familiar with FE, here’s their brief description:

Fire Eagle is a service that helps users share their location online with their friends and with other sites and services. Find out more about the service by exploring below...

I.e. “twitter for location”, sorta.

The site itself is visually appealing, with large buttons, and is fairly obviously styled using YUI (http://developer.yahoo.com/yui/). And there’s a pretty background, in pseudo-Miami Vice colors.

You get to select how often they’ll check back with you to make sure you’re comfortable with sharing your location. A strange thing, I’ll have to admit: if I stop sharing my location with FE, then I’m probably no longer interested in having you know where I am (or it’s Game Over man and I’m not worried about it). From other reviews I read I thought there was a way to fuzzy my location – i.e. just show what neighborhood, city, province or even country I’m in – but I can’t seem to find that option.

The first thing I tried in FE is “update your location”. Just for a laugh I entered “home”, but alas FE unsportingly offered a list of places called “Home” (and 奉免) no doubt populated by some very boring people. More seriously, it would be nice if I could enter “home”, “glenn’s office”, “doug’s house”, etc. as that more corresponds to my idea of location and is way more semantic. Perhaps this feature is coming.

Next I entered my (Canadian) postal code and bingo, there I am: a pin in a map. Then I entered “YYZ” to see if FE understands airports and yes it does. Then I tried to go back home, only to discover that it doesn’t seem to track previous locations. The INPUT field does respond to the down arrow, but it still shows “home” where I never was apparently and when I do select something, it doesn’t fill in the field. Sigh. Well, I know what it’s like to be in Beta (http://www.onaswarm.com).

Then I gave the “Application Gallery” a try. Alas, three applications (Fire Eagle Badge, Fire Eagle on Facebook and SMS Updates! [sic]) are listed, but none of them are there yet.

So where FE stands right now is it's a developer platform. If you're not a developer I wouldn't rush out of my way to get an invite. I’m going to play with this over the next few days and see how that works out. Here’s some brief notes:

  • there doesn’t appear to be any option for providing a non-protected update stream. Really, I don’t mind providing this information, if I can fuzzy it up
  • results are available in a custom XML format (boo) and in JSON. Why not GeoRSS or Atom?
  • authentication is done using OAuth (http://oauth.net/). Is Yahoo all OAuth now? Something to check out. Probably not
  • there’s an excellent selection of API kits: Javascript, PHP, Perl, Python and Ruby. No Java? Well it’s official: Java is the new COBOL – you’re on your own!
  • there’s an API for updating location, so it seems that you could, for example, have a Twitter client that updates your location on FE. Or something that looks at your Calendar or TripIt agenda (http://www.tripit.com) and makes the appropriate updates.

Two Onaswarm issues

David P. Janes 2008-03-17 01:56 UTC

There are two major/minor issues we're trying to solve with Onaswarm right now:

  • when new Swarms are created, users don't seem to be showing up in the results, at least for a while
  • when new Feeds are added, they're not being prioritized properly. Ideally we'd like to see new feeds show up within seconds.

We're calling these major/minor problems because although they're affecting functionality in a nasty way, we expect the fixes to be rather small. And implemented ASAP...

A typical day working with RDF and FOAF

David P. Janes 2008-03-16 09:52 UTC 3 comments

I’ve been trying to use FOAF to get profile and friendship/contact information across social networks. I’ve done the “friend” part, I just need to fill in the profile information.

Now, getting this information out of FOAF is problematic at best. Using Python, the rdflib library, and SPARQL I’ve managed to coax data out one painful step at a time. For example, here’s my “friend-getter” code:

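# Prefixes spelled out so the query stands on its own
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?bfoaf ?bname ?bnick ?bmbox_sha1sum ?bimage ?bweblog
WHERE {
    # ?b is anyone ?a knows who also points at their own FOAF file
    ?a foaf:knows ?b .
    ?b rdfs:seeAlso ?bfoaf .
    OPTIONAL { ?b foaf:name ?bname } .
    OPTIONAL { ?b foaf:nick ?bnick } .
    OPTIONAL { ?b foaf:mbox_sha1sum ?bmbox_sha1sum } .
    OPTIONAL { ?b foaf:image ?bimage } .
    OPTIONAL { ?b foaf:weblog ?bweblog } .
}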

Clear enough, I guess. Unfortunately, I can’t just go look at bnick and stuff it into my results, because bnick might be some sort of “resource” which then has to be programmatically traversed as well (see http://api.hi5.com/rest/profile/foaf/208329359). I admit that this might be – maybe even probably is – a problem with me; maybe I don’t understand SPARQL well enough.
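
To make the “literal or resource?” headache concrete, here's a rough sketch that sidesteps rdflib and SPARQL entirely and just treats the FOAF file as plain XML, using Python's standard library. That's a shortcut I'm swapping in for illustration, not the approach above, and it only copes with the common nested RDF/XML layout; property_value and friends are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"


def property_value(person, name):
    """Return a FOAF property as text, whether it is a literal or a resource."""
    element = person.find("{%s}%s" % (FOAF, name))
    if element is None:
        return None
    # Case 1: a resource reference, e.g. <foaf:weblog rdf:resource="..."/>
    resource = element.get("{%s}resource" % RDF)
    if resource is not None:
        return resource
    # Case 2: a plain literal, e.g. <foaf:nick>somebody</foaf:nick>
    if element.text and element.text.strip():
        return element.text.strip()
    # Case 3: a nested node; fall back to its rdf:about URI, if it has one
    for child in element:
        return child.get("{%s}about" % RDF)
    return None


def friends(foaf_xml):
    """Yield one dict per foaf:Person nested inside a foaf:knows element."""
    root = ET.fromstring(foaf_xml)
    path = ".//{%s}knows/{%s}Person" % (FOAF, FOAF)
    for person in root.iterfind(path):
        yield {name: property_value(person, name)
               for name in ("name", "nick", "weblog")}
```

The three cases in property_value are exactly the traversal that a bare SELECT doesn't do for you.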

But that’s old business. The way I’ve been doing this is CURLing down the FOAF file, manually inspecting it, writing some Python/rdflib/SPARQL code and seeing what happens.

This morning I decided to try a new approach: look for a SPARQL and/or RDF browser and figure out the correct queries online, then just write the code once, correctly. In my mind, this was all very sweet: an INPUT field for the FOAF/RDF URI, a TEXTAREA for the SPARQL query, a TABLE for the SPARQL results, and a TABLE showing all the RDF triples, since it’s triples “all the way down”.

Here’s what I did find:

  • Google rdf browser
  • Check out Brown Sauce; have to install a local massive development environment – remember now, I’m trying to save time, not lose it
  • Check out http://browserdf.org/: “Faceted Navigation for arbitrary Semantic Web data”. Very promising. Unfortunately, “arbitrary” seems to mean three different data sets
  • Check out Stefano’s Linotype -- a high quality information source usually; find out about Welkin
  • Try Welkin
  • Find out Welkin doesn’t browse the web
  • Download the FOAF file from http://kitschbitch.vox.com/profile/foaf.rdf into test.foaf.
  • Discover that Welkin doesn’t like “*.foaf”
  • Try again with “*.xml”
  • Try again with “*.rdf”
  • Success, except no results. Why? Oops … I was downloading the wrong URI
  • Try again with the correct URI
  • Verify that it’s a FOAF file
  • Stare at nothingness coming out of Welkin
  • Write a blog post about it; partially regret losing 50 minutes of my morning

The problem – a problem – with FOAF and RDF is quite simple. People don’t want formats that can do anything, they want formats that can do something. I got a Flickr API downloader going in about 30 minutes, taking my time. I’ve put hours into FOAF and still am unhappy.

9 of the best rich text editors reviewed

David P. Janes 2008-03-14 08:49 UTC 1 comment

Webdistortion has a review of 9 HTML rich text editors (via the YUI Blog). We’re happy with the new TinyMCE so far, but there may be something here that strikes your fancy if you’re looking for something smaller. Here’s the 9 plus my brief notes:

HTML scrubbing with TinyMCE

David P. Janes 2008-03-10 08:56 UTC

One thing we don’t like about HTML online editors is that they make some pretty lousy looking HTML pages. To deal with this, we’ve created HTML “scrubbers” to rewrite HTML coming from these widgets.

The first thing we always do is call TIDY to normalize the HTML. We then run a list of regular expressions to remove things we don’t like, such as class names, ids, etc. and also things such as trailing empty paragraphs at the end of documents.
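
To give a flavor of the regular-expression pass, here's a minimal sketch in Python. The two patterns and the names SCRUBBERS and scrub are illustrative assumptions rather than our actual scrubber list, and they lean on TIDY having already produced well-formed markup with double-quoted attributes.

```python
import re

# Each entry is (compiled pattern, replacement); they are applied in order.
SCRUBBERS = [
    # drop class="..." and id="..." attributes
    (re.compile(r'\s+(?:class|id)="[^"]*"', re.I), ""),
    # drop empty paragraphs trailing at the end of the document
    (re.compile(r'(?:\s*<p>(?:\s|&nbsp;|<br\s*/?>)*</p>)+\s*$', re.I), ""),
]


def scrub(html):
    """Run every scrubber over the HTML and return the cleaned text."""
    for pattern, replacement in SCRUBBERS:
        html = pattern.sub(replacement, html)
    return html
```

Keeping the scrubbers in a plain list means adding a new one is just another (pattern, replacement) pair.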

We just added another scrubber to convert double BRs within P paragraphs into paragraph splits – this makes the HTML more semantic, that is, it makes it say what it means, not what it looks like.

This can’t be done with just a regular expression, of course. Here’s our algorithm:

  • find all <p>…</p> paragraph blocks, always looking for the shortest matches
  • reverse this list, so that we can rewrite the document without having to worry about adjusting search indices
  • look at each match: if it contains anything non-simple, leave it alone. Theoretically, since we’re coming out of TIDY this should be well formed and only contain markup like B, STRONG, ABBR, etc. but I never take chances
  • if the match is simple, convert all BR BR sequences to “</p><p>”
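
Sketched in Python, the steps above come out roughly as follows. This is a reconstruction for illustration, not our production scrubber: the whitelist in SIMPLE_TAGS is an assumption, and it expects attribute-free <p> tags (i.e. input that has already been through the earlier scrubbers).

```python
import re

PARAGRAPH = re.compile(r"<p>(.*?)</p>", re.I | re.S)  # non-greedy: shortest match
DOUBLE_BR = re.compile(r"<br\s*/?>\s*<br\s*/?>", re.I)
TAG = re.compile(r"</?([a-z][a-z0-9]*)\b[^>]*>", re.I)
SIMPLE_TAGS = {"b", "strong", "i", "em", "abbr", "a", "br"}  # assumed whitelist


def is_simple(inner):
    """True when every tag inside the paragraph is on the whitelist."""
    return all(m.group(1).lower() in SIMPLE_TAGS for m in TAG.finditer(inner))


def split_double_brs(html):
    """Turn BR BR sequences inside simple paragraphs into paragraph splits."""
    matches = list(PARAGRAPH.finditer(html))
    # Work backwards so earlier match offsets stay valid as the text changes.
    for match in reversed(matches):
        inner = match.group(1)
        if not is_simple(inner):
            continue  # anything unexpected: leave the paragraph alone
        rewritten = "<p>" + DOUBLE_BR.sub("</p><p>", inner) + "</p>"
        html = html[:match.start()] + rewritten + html[match.end():]
    return html
```

Going through the matches in reverse is what lets us splice the string in place without recomputing any offsets.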

With the BlogMatrix Platform editor, every time you save a post it scrubs it and sends it back to the editor.

TinyMCE Version 3

David P. Janes 2008-03-09 20:37 UTC

I completed the upgrade to TinyMCE 3 (http://tinymce.moxiecode.com/download.php) this afternoon, tossing away 95% of my old code. I’m very very happy – they’ve very much modernized the code to what we’d expect in a modern JS framework.

For example, this is how we create an editor now:

editor = new tinymce.Editor('id_editor', initd);
editor.render();

And here’s how we listen for events:

editor.onKeyPress.add(function(ed, e) {
    // ed is the editor instance, e is the key event
    // … do stuff …
});

I haven’t tried to do stuff with the Toolbar yet, but given the apparent serious thought they’ve put into making a nice interface, I can’t imagine it’s going to be very difficult.

Rich text editing with TinyMCE

David P. Janes 2008-03-09 11:47 UTC 1 comment

Well, I couldn’t believe how easy it was to make our new editor use TinyMCE – I just downloaded the new version (3.0.4.1 – http://tinymce.moxiecode.com/download.php), hooked it up to our code and it ran out of the box.

Since you may not have done this yourself, I’ll just run you through how we use TinyMCE:

  • make a TEXTAREA that you plan to work with; there are some complications if you want or have multiple TEXTAREAs but this is not an issue for us
  • include TinyMCE: <script type="text/javascript" src="=/jscripts/tiny_mce/tiny_mce.js"></script>
  • call the initializer function: tinyMCE.init(initd)

That’s it, you have an editor. “initd” is a dictionary that describes how to set up TinyMCE. This is our setup:

initd = {
    onchange_callback : "tinymce_onchange_callback",
    theme_advanced_buttons1 : "bullist,numlist,outdent,indent,separator,justifyleft,justifycenter,separator,link,unlink,image,separator,bold,italic,strikethrough,separator,sub,sup,forecolor,backcolor,separator,code",
    theme_advanced_buttons2 : "",
    theme_advanced_buttons3 : "",
    dialog_type : "modal",
    theme_advanced_resize_horizontal : false,
    entity_encoding : "numeric",
    force_p_newlines : true,
    force_br_newlines : false,
    convert_newlines_to_brs : false,
    relative_urls : false,
    remove_script_host : false,
    verify_html : false,
    auto_reset_designmode : true,
    remove_linebreaks : false,
    theme_advanced_resizing : true,
    mode : "textareas",
    theme : "advanced",
    theme_advanced_toolbar_location : "top",
    theme_advanced_toolbar_align : "left",
    theme_advanced_path_location : "bottom",
    plugins : "inlinepopups",
    content_css : "/:root/include/common/tinymce.css"
};

You’ll have to modify the location of the CSS file and the callback (so we know whether the document has been edited!) to something you prefer to use, but you get the idea.

Also note that you may have to do some magic to move the data between the TinyMCE window and the TEXTAREA:

  • To move the data into the TEXTAREA, do: tinyMCE.triggerSave()
  • To go the other way: tinyMCE.updateContent(idName), where idName is the DOM ID of the TEXTAREA; whoops -- in version 3.x use tinyMCE.activeEditor.load();

Version 2 of TinyMCE misbehaved if you tried to create an editor in a hidden DIV (i.e. with display: none); I’m not sure if this issue is gone or not but try to avoid doing it.

Playing with Yahoo’s Rich Text Editor

David P. Janes 2008-03-09 10:53 UTC 2 comments

I spent a fair fraction of yesterday playing with Yahoo’s Rich Text Editor (http://developer.yahoo.com/yui/editor/), trying to integrate it into what we’re calling “V12” of the BlogMatrix Platform – the look and feel you’re seeing on Onaswarm (http://www.onaswarm.com) right now.

Currently, we’re using TinyMCE as a text editor. On the plus side, TinyMCE is standard and reliable; on the minus side, it’s difficult to work with, very difficult to extend, and fairly hefty. So as a background task I’ve been looking at various technologies and seeing what can be done. TinyMCE has recently revved from 2.x to 3.x (http://tinymce.moxiecode.com/punbb/viewtopic.php?id=9942), so we’ll be revisiting that soon.

One of the problems with all browser based editors is that they, well, make weird looking HTML. Especially on Safari (AKA webkit). We’ve partially solved this problem at BlogMatrix by running a number of “scrubbers” that look for well-known weirdnesses and transform them into something better. Generally this works as a pipeline from TIDY (http://tidy.sourceforge.net/), to a bunch of regular expressions, to TIDY again. The goal is to be able to process normal hand-entered input into good-looking HTML, but still preserve the formatting pasted in from another webpage or document. Believe it or not, we’ve had a lot of luck with this.

You’d think that all this could be done with a Flash component – this would nicely solve the multiple-browser problem, but alas I haven’t found a Flash editor that accepts pasted text and does something sane with it. If you’d like to look at this, I’ve posted notes here on del.icio.us (http://del.icio.us/dpjanes/text_editor).

So, back to the YUI RTE – here’s my positives and negatives in no particular order. I’m aware that this is not a finished product, so expect the negatives to disappear.

  • it’s super easy to configure, especially the toolbar
  • there’s no HTML view; like WTF, I have to write this myself?
  • block indent/undent is broken beyond repair
  • I had no problems mixing and matching with MochiKit (http://www.mochikit.com/), our JS weapon of choice.
  • The dialogs that pop up for editing links and images are worth a whole section to themselves:
      • here we see the limitations of CSS styling – even the slightest change to text (from my CSS) breaks the dialogs. In this particular case boys and girls, perhaps even go back to table layout; it’s an app, not a webpage
      • in fact, just isolate this altogether out of RTE. By the looks of it, this seems to be the plan.
  • for proper styling, YUI components need to be descended from an element with CSS class “yui-skin-sam”. Unfortunately, dialogs attach themselves to the BODY element and we can’t mark that as “yui-skin-sam” because that, well, breaks all our stuff. So we ended up having to copy out all the CSS, remove “.yui-skin-sam”, adjust all the background images and so on and so forth.

I may experiment next with Ext’s editor (http://extjs.com/deploy/ext/examples/form/dynamic.html) but I’m thinking the best bang for our buck is still TinyMCE. If only I could figure out how to add my own buttons and functions….

For what it’s worth, I’m composing this using MS Word 2008 on a Mac, which doesn't play nicely with anything. Sigh.