I woke up this morning with the intention of writing a "best practices" guide to doing microformats only to find out that Glenn Jones had beaten me (handily) to the task. In my mind this should be converted into a wiki page.
Brian Suda has an article on Dev.Opera about XFN - XFN encoding, extraction, and visualizations.
It's a great place to start reading about XFN:
XFN stands for XHTML Friends Network[...]. It grew out of the common publishing trend of linking to other sites you enjoyed
reading. On your blog, this is called a blogroll - it is common to think of people by their web sites. Their URL is a
representation of that person; part of their online identity. XFN is an attempt to codify these relationships using standard
HTML.
I was going to nitpick about Brian's point about tool support for XFN, but I shall get into that in upcoming posts about
the social graph.
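To make the "XFN extraction" part concrete, here is a minimal sketch in Python of pulling XFN relationships out of a blogroll. It is not Brian's code: the class and function names, the sample markup and the example.org URLs are all made up for illustration, and the value list is the XFN 1.1 set as I understand it.

```python
from html.parser import HTMLParser

# XFN 1.1 relationship values (assumed list; check the XFN profile).
XFN_VALUES = {
    "friend", "acquaintance", "contact", "met", "co-worker", "colleague",
    "co-resident", "neighbor", "child", "parent", "sibling", "spouse",
    "kin", "muse", "crush", "date", "sweetheart", "me",
}


class XFNExtractor(HTMLParser):
    """Collects (href, [xfn values]) for every <a> that carries XFN values."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").split()
        # Keep only the rel values that XFN defines; ignore e.g. "nofollow".
        xfn = sorted(v.lower() for v in rel if v.lower() in XFN_VALUES)
        if xfn and attrs.get("href"):
            self.links.append((attrs["href"], xfn))


def extract_xfn(html):
    parser = XFNExtractor()
    parser.feed(html)
    return parser.links


# Hypothetical blogroll used only to exercise the sketch.
SAMPLE = """
<ul class="blogroll">
  <li><a href="http://example.org/alice" rel="friend met">Alice</a></li>
  <li><a href="http://example.org/bob" rel="colleague">Bob</a></li>
  <li><a href="http://example.org/news" rel="nofollow">News</a></li>
</ul>
"""

xfn_links = extract_xfn(SAMPLE)
```

The point is that the relationship data rides along in the ordinary rel attribute, so a consumer needs nothing beyond an HTML parser.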
Kevin Marks reports:
I made an initial conversion to hAtom by hand in the meantime,
but a few weeks back Michał Cierniak and I checked in a
change to the underlying Blogger templates to make hAtom the default, which the Blogger team graciously accepted. This should
enable much simpler client-side parsing of the blog pages.
Kevin has more details on how to add this to your template if you're using new-style Blogger templates. Here's a note I sent
to the Blogger mailing list when someone raised the "what good is it" question:
Just to clarify, hAtom was never intended to be a "syndication format"
nor to compete with Atom or RSS. It's simply designed to describe the
microcontent on webpages, such as blog posts. We used Atom because it
provides a well-defined nomenclature for describing such elements.
What can you do with it? You could provide search results that narrow
into the exact content on a page, rather than keywords that were found
elsewhere on a page. You can write tools, such as entry pretty
printers, or "reblogging" tools for quoting posts that work
universally across hAtom blogs, rather than depending on the author/
publisher to provide this for you. Because it effectively
standardizes CSS elements for blog posts, you can write CSS that works
across all hAtom conforming blogs. You can combine with other
microformats or "POSH" HTML to associate data displayed on a page
(say, the geographic location of where a post was made) with the exact
post it belongs to.
In and of themselves not earth shattering perhaps, but not bad for
standardizing a half-dozen or so tags with minimal effort for
publishers.
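As a rough illustration of the "tools that work universally across hAtom blogs" idea, the following Python sketch pulls the title and permalink out of each entry using only the hAtom class names (hentry, entry-title) and rel="bookmark". The parser class, the sample page and its URLs are hypothetical, and the sketch assumes well-nested markup with explicit close tags; a real tool would need an HTML parser that tolerates tag soup.

```python
from html.parser import HTMLParser

# Elements that never have a close tag, so they must not go on the stack.
VOID = {"br", "img", "hr", "input", "meta", "link"}


class HAtomParser(HTMLParser):
    """Collects {'title', 'permalink'} for each element with class 'hentry'."""

    def __init__(self):
        super().__init__()
        self.stack = []       # class lists of the currently open elements
        self.entries = []     # finished entries
        self.current = None   # entry being filled in, if any
        self.entry_depth = 0  # stack depth at which the current hentry opened

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        self.stack.append(classes)
        if "hentry" in classes and self.current is None:
            self.current = {"title": "", "permalink": None}
            self.entry_depth = len(self.stack)
        if self.current is not None and tag == "a":
            if "bookmark" in (attrs.get("rel") or "").split():
                self.current["permalink"] = attrs.get("href")

    def handle_endtag(self, tag):
        if tag in VOID or not self.stack:
            return
        self.stack.pop()
        # Closing the hentry element itself finishes the entry.
        if self.current is not None and len(self.stack) < self.entry_depth:
            self.current["title"] = self.current["title"].strip()
            self.entries.append(self.current)
            self.current = None

    def handle_data(self, data):
        if self.current is None:
            return
        inside_entry = self.stack[self.entry_depth - 1:]
        if any("entry-title" in classes for classes in inside_entry):
            self.current["title"] += data


def parse_hatom(html):
    parser = HAtomParser()
    parser.feed(html)
    return parser.entries


# Hypothetical page fragment used only to exercise the sketch.
SAMPLE = """
<div class="hfeed">
  <div class="hentry">
    <h3 class="entry-title"><a rel="bookmark" href="/2006/09/first">First post</a></h3>
    <div class="entry-content"><p>Hello<br>world</p></div>
  </div>
  <div class="hentry">
    <h3 class="entry-title">Second post</h3>
  </div>
</div>
"""

hatom_entries = parse_hatom(SAMPLE)
```

The same few class names would drive a pretty printer, a reblogging tool or a shared stylesheet; none of them needs to know which blogging software produced the page.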
All templates in the BlogMatrix Platform are hAtom compliant.
Google Maps now supports hCards.
What does this mean? With a little magic (which will be universally available across all cool browsers in a year or so), if you look up an interesting
location you can add it directly to your address book ... or maybe your business intelligence app.
Operator 0.7 is available. Apparently it has
hAtom support, which I'll have to check out.
Bill de hÓra (on a Danny Ayers post):
In other words, the work of generating RDF will be placed on people who want to use RDF. I think this idea of extracting RDF from published markup instead of using RDF as the backing data to generate the published markup is a big deal. For one, it will mean less RDF tax on existing publishers, who seem to be happy to stay with HTML, RSS and microformats (uF). Second it distributes costs fairly - RDF proponents will be forced to derive value from what they extract instead of playing schedule chicken with publishers, and pushing costs back onto them to supply the data just so. Third, from a systems design viewpoint, extraction is a much cleaner design than trying to kludge RDF support on top of existing RDBMS storage and web frameworks. It's cheaper today to publish uF via web frameworks, databases and templates than retool internally with RDF based technology - uF by being HTML is a relatively low-impact upgrade on the templating tier, not a rip and replace of the data/object tiers. I've been saying for some time that the Semweb is missing a layer, the one that infers the useful information from syntactic markup. Maybe uF and GRDDL are that layer's ingredients.
Read/Write Web is reporting that Firefox 3 (don't forget that 2 just came out) is going to have deep microformats support:
Alex Faaborg explains that microformats will make the Web Browser into an "Information Broker" and suggests that this could happen in Firefox 3. He writes:
"Much in the same way that operating systems currently associate particular file types with specific applications, future Web browsers are likely going to associate semantically marked up data you encounter on the Web with specific applications, either on your system or online. This means the contact information you see on a Web site will be associated with your favorite contacts application, events will be associated with your favorite calendar application, locations will be associated with your favorite mapping application, phone numbers will be associated with your favorite VOIP application, etc."
[...] Mitchell Baker from Mozilla calls this "data-browsing" in another post. And Alex has links to more info on Mozilla's microformats project on this page. I particularly enjoyed this discussion of which microformats Firefox 3 might support. Alex noted in that post:
"Detecting information in Web pages and handing that information off to other applications changes the role of the Web browser from being solely a HTML renderer to being an information broker."
As of now, there is a Firefox addon called Operator, a microformat detection extension developed by Michael Kaply at IBM. So the seeds have started to be sowed.
I tried Operator but I had to uninstall it: it's too much work to trawl through the DOM looking for microformats every time one switches web pages. Perhaps Adobe's gift of an efficient JavaScript engine will improve the situation where we won't care how expensive (within reason) JS programs are.
Here's more from Faaborg: microformats introduction, structured data chaos.
I've started a new page on the microformats wiki to discuss an "item" microformat, to represent physical things. This comes from this conversation.
I've been down this road before; I'm the author of the hAtom microformat for representing microcontent.
Left Logic has produced a very pretty "Microformats Bookmarklet" that will extract hCard and hCalendar events from a webpage.
We have a really neat demo of importing microformats into BlogMatrix. However, we can't show you. Why?
We'll have to recompile Apache against Python's Expat and go from there. This will cause a short period of downtime, so we won't be doing this for a little while yet.
Zack Rosen has a post called "RDF Semantic web research isn't working". It's a very easy read and yet so packed full of interesting points that I won't quote any of it and will just say "go read it".
A few additional comments:
-
many SW people "don't get it". Sorry, we don't model the world in triples so starting your sales pitch with that just doesn't cut it; and I'm not an idiot for disagreeing with you
-
the SW missed a wonderful opportunity by not jumping on the "mashup" bandwagon, where it would have been a natural fit for arbitrary data passing between apps (rather than crud-o hand rolled XML formats)
-
every page produced by the BlogMatrix Platform has a corresponding XML/RDF page. Placing the structured data into the RDF shouldn't be too difficult except I'm really not going to make the effort if there isn't the demand
-
I'm working on articulating an alternative vision to the Semantic Web called the Datasphere, built on microformats (for data sharing), structured blogging (for ad hoc data creation), tagging (for fluid structure) and directories (for inherent structure). Stay tuned.
One of the projects we've been pursuing over the last year is the "Almost Universal Microformats Parser", a Python library that does a pretty good job of breaking apart microformats. You can run the AUMFP against any webpage here or download (currently an old version of) the source here.
We've made quite a few modifications in the last two weeks and we're getting ready to release a new version of the source and also a few tools based on this project. We'll wait till we have more documentation and testing in place before we do the release though. We do the include-pattern thingie now and the interface makes a lot more sense.
As part of the update, we've extended the parser to handle hResume and tested it against the samples pointed to on the Wiki. We also tried to identify places where documents don't conform to the proposed standard and document them as quirks (in general, we write our software to fail-as-last-resort). Here are the results of our testing:
Note that the "contact" quirk may be a misinterpretation on my part.
OpenID has a protocol extension that allows some simple identity information to be shared (i.e. so you don't have to type in the same info over and over again in every site you visit).
Here's the spec and here's what information is sharable:
-
nickname
-
email
-
full name
-
date of birth
-
gender
-
postal code
-
country
-
language
-
timezone
My personal opinion is that the "Simple Registration Extension" was a mistake as it confuses what OpenID is all about, is inherently incomplete and doesn't build upon what's already out there. OpenProfile has taken the extra step and hooked up with vCard but let me propose something else based on microformats and hCard.
One issue I have with the utility of hCards is that often, as they are seen in the wild, they are basically trivial. From here:
<address class="author vcard">
<a class="url fn" href="http://theryanking.com">Ryan</a>
</address>
Note that this is still good -- this marks real semantically useful data. However, I always get the feeling that this could be ... better, particularly since we know Ryan (King) has a real hCard right here with a mugshot and everything!
Why not build upon the concept of the identity URL? In particular, let's say that http://theryanking.com is Ryan's identity URL. In addition to the link and meta tags added to the header for OpenID, let's add one for "person identity":
<link rel="identity.hcard" title="Ryan" href="http://theryanking.com/blog/contact/#vcard" />
We now use this in two different ways.
Firstly, from an OpenID perspective, when Ryan creates a new account using http://theryanking.com, the consumer can automatically go get Ryan's hCard at http://theryanking.com/blog/contact/#vcard.
Secondly, we can indicate in Ryan's trivial hCard (seen on the microformats blog) that we can get Ryan's real hCard:
<address class="author vcard">
<a class="url fn x-identity.hcard" href="http://theryanking.com">Ryan</a>
</address>
This has several nice attributes:
-
We don't have to add verboten hidden data to indicate the location of the "best hCard"
-
It's totally backwards compatible -- if we just used "url" we'd have to check too many spurious links for non-existent information
-
It builds upon microformats and OpenID -- public identity parameters only have to be specified in one location
Other possibilities are also there, such as defining extensions for this type of link in RSS and Atom.
Note: x-identity.hcard and identity.hcard are possibly just placeholder strings for something better, so don't get bent out of shape by that.
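To show how little a consumer would have to do, here is a hedged Python sketch of the discovery step: given the HTML served at an identity URL, find the proposed identity.hcard link and resolve it to an absolute URL. The rel value is the placeholder from this post, not a standard; the class and function names, and the openid.example.com server URL in the sample, are invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IdentityLinkFinder(HTMLParser):
    """Remembers the href of the first <link rel="identity.hcard">."""

    def __init__(self):
        super().__init__()
        self.hcard_href = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.hcard_href is not None:
            return
        attrs = dict(attrs)
        if "identity.hcard" in (attrs.get("rel") or "").split():
            self.hcard_href = attrs.get("href")


def find_identity_hcard(identity_url, html):
    """Return the absolute URL of the 'real' hCard, or None if not advertised."""
    finder = IdentityLinkFinder()
    finder.feed(html)
    if finder.hcard_href is None:
        return None
    # The href may be relative, so resolve it against the identity URL.
    return urljoin(identity_url, finder.hcard_href)


# Hypothetical head section of the identity page.
SAMPLE = """
<html><head>
  <link rel="openid.server" href="http://openid.example.com/server" />
  <link rel="identity.hcard" title="Ryan" href="/blog/contact/#vcard" />
</head><body></body></html>
"""

hcard_url = find_identity_hcard("http://theryanking.com", SAMPLE)
```

An OpenID consumer could run this right after it fetches the identity page, which it has to do anyway to find the openid.server link, and then hand the resulting URL to whatever hCard parser it already uses.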
One of the inspirations behind this software and the concept of the datasphere is Adrian Holovaty's Chicago Crime, which I first saw at Mashup Camp 1. Heavily data driven, almost everything in CC is a link -- you can freely navigate through the data by clicking around. I'm not sure if it's strictly speaking a datasphere application, because it doesn't bubble up the data for reuse in the HTML, but one could fairly easily envision how it could in the future.
Adrian has a great post about where he thinks newspapers should be going, and there's direct applicability to the concept of a datasphere:
But it doesn't stop at those obvious examples. If you take some time to examine what sort of information newspaper journalists collect, the amount of structure will jump at you. If I may take the liberty of giving examples from Web sites I've worked for:
- An obituary is about a person, involves dates and funeral homes.
- A wedding announcement is about a couple, with a wedding date, engagement date, bride hometown, groom hometown and various other happy, flowery pieces of information.
- A birth has parents, a child (or children) and a date.
- A college graduate has a home state, a home town, a degree, a major and graduation year.
- An Onion-style "On the Street" feature has respondents, answers and a publication date.
- A drink special has a day of the week and is offered at a bar.
- The schedule of the U.S. Congress has a day and multiple agenda items.
- A political advertisement has a candidate, a state, a political party, multiple issues, characters, cues, music and more.
- Every Senate, House and Governor race in the U.S. has location, analysis, demographic information, previous election results, campaign-finance information and more.
- Every known detainee at Guantanamo Bay has an approximate age, birthplace, formal charges and more.
See the theme here? A lot of the information that newspaper organizations collect is relentlessly structured. It just takes somebody to realize the structure (the easy part), and it just takes somebody to start storing it in a structured format (the hard part).
Note that Adrian's mostly talking about recording the structure behind information so new applications can be developed. But the beautiful thing about bubbling up the information into the HTML is you can start cross linking data between different sources (i.e. mashups!).
There's a discussion on podcasting university lectures over at Slashdot. I had a discussion with a friend that works at Memorial about this very topic several weeks ago and it's probably worth looking into in a deeper way. The gist of the idea is:
-
set up microphone(s) in every lecture hall
-
record each lecture (obviously!)
-
students, instead of taking notes (or only notes), would record the time of a particular interesting or salient comment
-
students could then easily go back and re-hear a particular part of the lecture at their leisure
Tagging and microformats aspect:
-
tagging provides a natural way to classify podcasts. That is, instead of coming up with a set of silos to dump podcasts into, each podcast would be tagged with as many words as appropriate. For example: "physics P320 2006 william_smith 2006-09-05T10:00".
-
If something like the BlogMatrix Platform was used (ahem), faceted tags would provide an even more powerful classification system: "subject:physics course:P320 year:2006 by:william_smith date:2006-09-05 time:10:00". Using faceted tags allows one to do queries like "what podcasts are available in physics in 2006".
-
In either case, the whole recording and tagging process could pretty well be automated by tapping into class schedules, minimizing the manual work needed to be done. This brings a whole microformat (hCalendar) aspect to the project.
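The faceted-tag query above can be sketched in a few lines of Python. This is only an illustration of the idea, not how the BlogMatrix Platform implements it; the function names and the lecture file names are made up.

```python
def parse_faceted_tags(tag_string):
    """Turn 'subject:physics year:2006' into {'subject': {'physics'}, ...}."""
    facets = {}
    for token in tag_string.split():
        # Split on the first colon only, so 'time:10:00' keeps its value intact.
        facet, sep, value = token.partition(":")
        if not sep:
            continue  # a plain, unfaceted tag; ignored in this sketch
        facets.setdefault(facet, set()).add(value)
    return facets


def matches(facets, **wanted):
    """True if every requested facet has the requested value."""
    return all(value in facets.get(facet, set())
               for facet, value in wanted.items())


def query(podcasts, **wanted):
    """Return the names of all podcasts whose tags satisfy the query."""
    return sorted(name for name, tags in podcasts.items()
                  if matches(parse_faceted_tags(tags), **wanted))


# Hypothetical recordings; only the first tag string comes from the text above.
PODCASTS = {
    "lecture-1.mp3": "subject:physics course:P320 year:2006 "
                     "by:william_smith date:2006-09-05 time:10:00",
    "lecture-2.mp3": "subject:physics course:P320 year:2005 by:william_smith",
    "lecture-3.mp3": "subject:chemistry course:C210 year:2006",
}

physics_2006 = query(PODCASTS, subject="physics", year="2006")
```

"What podcasts are available in physics in 2006" becomes query(PODCASTS, subject="physics", year="2006"). With flat tags the same question would have to guess which word is the subject and which is the year.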
Social media and student blogs:
-
Students should be able to bookmark online and comment upon their favorite parts of podcasts
-
These bookmarks obviously need to be retrievable; my belief is that student blogs may be the best place to record this information (that is, collapse social bookmarking and tagging applications)
-
These bookmarks will need to be able to do deltas into a podcast
-
Collecting all these bookmarks across all students (and potentially across time) will provide collective intelligence/data mining/insight into what is really important in the lecture
Random-access media:
-
Bandwidth doesn't come for free: it isn't cost-effective for students to have to download the entire lecture to hear a three minute clip
-
Streaming media players could be a solution to this problem
-
Something like a Flash MP3 player could potentially do "random access" to the right place in a lecture
Security:
-
The discussion on Slashdot is centered around "security" -- that is, allowing only certain people to listen to podcasts
-
I don't believe it's the place of the vendor (i.e. me) to dictate requirements to a client; however, the later potential requirements -- barcode scanning for attendance, for example -- seem crazy
-
Restricting access to a subset of students (and professors) seems fairly straightforward, if a student profile can be retrieved from an LDAP directory
We at BlogMatrix find this project interesting because these are exactly the types of "organizational collaborative" applications that we'd like to see BlogMatrix being used to create.
Google is in the process of revamping the way Blogger.com works. Chris writes:
Hello,
Not sure to whom I should address this request, but I’m very excited about the Blogger Beta and that it represents an open opportunity to add support for microformatted content.
You can read more about microformats at microformats.org, but to summarize, microformats are community-developed standards for identifying certain kinds of information in webpages using your typical HTML tags and classes.
In particular, this is my wishlist of microformats that I would love to see Blogger support:
-
rel-tag: okay, you already took care of this one, so kudos!
-
XFN: WordPress already supports this, and it’s especially useful for representing lists of friends in blogrolls.
-
rel-me: from the XFN family, being able to link to other pages on the web using rel=”me” creates an informal means of “claiming” other places where I publish online. Read about Ma.gnolia’s addition of rel-me.
-
hCard: marking up personal profiles in hcard means that if I add personal contact details, people can click a link to add me to their address book without any extra typing. I’ve done this on my main blog. Clicking the “Add me to your address book” link will convert the HTML content in that page into a .vcf file that most address book programs can recognize.
-
hCalendar: In order to make it easy for my readers to add events that I’ve blogged about to their calendars (Google Calendar or others, like iCal), I can use hcalendar to mark up this information with a link to add the events to their calendar. Here’s an example.
-
hAtom: This one is fairly simple to implement since you’re already classing most of this information already. hAtom uses element names from Atom as class names. This allows people to subscribe to blogs directly, without the need to subscribe to RSS. You can read more about this.
Though the benefits may not seem immediately obvious to supporting microformats, the amount of effort required to add support is fairly minimal compared with other, more substantial features that you’re probably already working on. Furthermore, our community would be happy to help with the process of adding support to Blogger, validating your work and providing guidance along the way. This initiative is also not a commercial effort; rather, it represents the work of a large, distributed, worldwide community that wants to build out the value of the “lowercase semantic web” and to make data storage in web pages a reality.
In some respects, we are at a chicken-and-egg crossroads but the more support that we see for microformats in the wild, the more tool makers, publishers, browsers and other applications will reap the benefits of this effort to essentially modernize the web, incrementally building upon the existing infrastructure.
Thanks for your consideration and please let me know if there is any way that I can be of service.
Chris
We fully endorse this letter and in particular would like to make the case for getting hAtom into the template process ASAP:
-
hAtom identifies almost all the commonly used elements in blog posts.
-
standardizing around hAtom class names will make it easier for designers to modify and understand templates. Actions, such as "print this post" can use a shared printing CSS file, as it will inherently understand how all posts are structured.
-
standardizing around hAtom class names will make it easier for search companies -- such as Google! -- to understand that a page contains blog posts, what part of the page contains the blog posts and what isn't part of the blog post. This will make search results within blogs more accurate by giving the ability to ignore "incidental matches" where one blog post matches but (for example) some other unimportant term in the sidebar also matches
-
standardizing around hAtom class names will make it easy to do "reblogging" -- that is, quote part of the content of one blog post into another post -- thus making it easier to create blogging communities and blogging conversations
-
as new templates tend to be created from existing templates, the greatest benefit for adding hAtom to Blogger templates would be to do this as early as possible in the development/roll-out cycle
Updated (2006.09.05):
-
machine-readable weblog archives (tip: Danny Ayers); this would be especially powerful if combined with a directory of posts (OPML/XOXO) or a microformat for marking weblog archive structure
-
Winer-style "river of news" pages can be made with a simple one-stop XSLT transformation
John Allsopp - Add Microformats Magic To Your Site:
Heard of the semantic web? Using Microformats everyone can contribute to the richness of the web.
[...] Say you want to sell your car. How do we go about doing that? Like with IMDB for film reviews, or Amazon for book reviews we find a centralized place to post a classified listing - or maybe several. Each requiring our effort and time to fill in details, each possibly requiring some form of payment. Each a walled garden.
But what if we could somehow post this listing to our blog, and then easily let services which cared about classifieds listings know that there is a new or updated classified at my site. The missing piece that would enable this is a standard format (after all html doesn’t have a <classified> element).
[...] But where will these formats, for reviews, for classifieds listings, for citations, and as yet undreamt of uses come from?
[...] But if we think for a moment about how the web has developed so successfully - it’s been incremental, evolutionary, not revolutionary - we might find a way forward that doesn’t break our tools and browsers and force developers to learn a whole new set of skills and concepts. And this is the approach that microformats take to developing simple, open web data formats built on existing data formats and existing developer practices to enable and encourage decentralized development, content and services.
There's lots more there, with a number of vCard examples and the ever important . Allsopp doesn't mention hAtom, which is too bad because it's probably one of the easiest microformats to autogenerate.
He's also written an article called "The Big Picture On Microformats" for Digital Web Magazine. This is a higher level more conceptual piece that deals with the "why" of microformats and issues about deployment. If you're not interested in the tech guts of microformats, this is the one to read.
Another quick aside: Microformats get around the chicken/egg problem by being so light-weight to deploy that there's little risk for content producers; this incrementally incents consumers to do stuff with microformatted data; and so forth in a virtuous cycle.
Oh, and the Cork'd (wine rating) site looks pretty cool.
DeWitt Clinton has a good article about why you -- i.e. "we" -- should use Atom over RSS:
If you’re a human then you’ll probably have no problems spotting that the first one is plain text, the second one is XML-escaped HTML, and the third is HTML wrapped in an XML CDATA section. If presented in a web browser, in a HTML <div/> tag perhaps, then a human will have no trouble interpreting the content.
But if you’re a computer, it isn’t quite that easy. To a computer, the contents of a RSS <description/> element are opaque. The best a computer can do with it is hope to render it for a human to interpret.
...
What if you added semantic microformat markup to your HTML? If you’re using an opaque data format, then you may as well have spared yourself the effort, as no client will know it’s there.
Or what if you wanted to put some other structured data in your syndicated content feed? Geospacial data, perhaps. Product data. Or perhaps Google’s GData format. If it’s syndicated over RSS, no one will ever know.
So the problem is that the RSS syndication format is that it is lossy. Lossy insofar as information you had when writing the data is lost when it is passed over the wire.
...
My recommendation to application developers today is to use Atom 1.0, not RSS, as the basis for your content syndication.
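The difference DeWitt describes shows up directly in code. In Atom 1.0 each content element declares how its payload is encoded through a type attribute (text, html or xhtml, defaulting to text), so a client can branch on it instead of guessing. Here is a small Python sketch using the standard library; the feed is a hand-written fragment trimmed to the relevant elements (a real feed needs id, updated and so on), and the function name is made up.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hand-written fragment for illustration only; not a complete, valid feed.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Plain</title>
    <content type="text">Hello &amp; welcome</content>
  </entry>
  <entry>
    <title>Escaped</title>
    <content type="html">&lt;p&gt;Hello&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>Inline</title>
    <content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p class="vcard">Hello</p></div></content>
  </entry>
</feed>"""


def content_types(feed_xml):
    """Return (title, declared content type) for each entry in the feed."""
    root = ET.fromstring(feed_xml)
    result = []
    for entry in root.iter(ATOM + "entry"):
        content = entry.find(ATOM + "content")
        # Atom treats a missing type attribute as "text".
        result.append((entry.findtext(ATOM + "title"),
                       content.get("type", "text")))
    return result


declared = content_types(FEED)
```

In the xhtml case the markup arrives as real XML elements, so any microformat class names in it (the vcard class in the third entry, for instance) survive the trip and can be found by the client.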
Alas, commenter Kosso finds the key problem with using Atom:
Does Atom support enclosures. And multiple ones at that?
If so, I would look at creating a toolset to podcast in both formats.
However, that does not mean feeds won’t be broken. So many publishing tool are broken. RSS is ’simpler’ than atom.
...
I don’t want to fan a feed war, but I want to judge by trying to build a feed publishing tool which works.
BlogMatrix will always have a podcasting component (i.e. adding attachments to posts) and that means until Apple's iTunes accepts Atom feeds, well, RSS it is. Afterwards ...
I just want to make a brief post about this "web clipboard" using eRDF, acknowledging its existence. What I'm hoping to do with the BlogMatrix Platform is have every page available in alternate formats -- in particular, RDF in XML and N3 -- and then use some sort of microformat way of linking "important" objects between the HTML and the RDF. This would allow for clever things like copying full address book entries (even though we may only display a name), reblogging, copying events... Yes, all very vague. We'll see how it works out.