BlogMatrix
 

Yahoo Pipes

David P. Janes 2007-02-08 17:28 UTC

Yahoo introduced a very interesting service this morning: Yahoo Pipes. If you have a Yahoo account (and are somewhat geeky), go check it out. It's an RSS workflow system, where you can take RSS feed(s), do some processing and logic and produce

From a BlogMatrix technology point of view, it's quite interesting. It uses a similar data model to ours, what I've been calling the "Google Base" model of the semantic web: entry-centric, where each entry is extended by (attribute, value) pairs. In particular, there's no "deep" model of attribute values with hierarchy or graphs. The beauty of this system is not only that it's fairly straightforward to do, but it's also easy to mentally grasp and thus more likely that people will actually use it. Here's the cool thing: with the BlogMatrix Platform we should be able to very quickly demonstrate feeding data into this tool (real soon) by feeding our structured data input into our RSS output!

I wonder if there's a way to export the programs created by Yahoo Pipes? And I wonder what this means for Teqlo? And I wonder if we can create a general web programming model on this, in a Dabble DB plugin sort of way?

Here's what people are saying:

  • Niall Kennedy:
    Yahoo! Pipes opens up some interesting possibility for feed aggregators, letting users filter out unwanted content affecting their experience. Pipes opens up a few feeds that were not practical for a human to read in the past, either due to a high volume or possibly a foreign language. My favorite operator is the location extractor which analyzes an item's text attempting to identify addresses, locations, or the URLs of popular mapping services.
  • Anil Dash:
    Most importantly, and perhaps most key to the success or failure of Pipes, are the social functions that underpin the application. With Pipes, it's easy to make your own web services public, to clone web services that others have made, or to offer your own services for others to clone. That element of social sharing of code, first pioneered by platforms like Ning, makes the open source ethos much simpler to participate in. Instead of setting up complex version control systems and submitting patches to a central repository, application cloning works on a principal of infinite forking, taking the idea of embracing failure and building it into the platform. Code 'em all, and let blogs sort 'em out.
  • Tech Crunch:
    The beauty of the application is with its simplicity - a user can take any sources, user input requests or the above mentioned module and drag+drop them into place and then connect the pipes. Within minutes I had built an application (also known as a pipe, they should probably change the name as not everything can be a pipe) that would search for ‘Techcrunch’ in a variety of feeds, bring that data together, sort it and filter it for unique results. I saved the application and published it
  • Tim O'Reilly:
    Yahoo!'s new Pipes service is a milestone in the history of the internet. It's a service that generalizes the idea of the mashup, providing a drag and drop editor that allows you to connect internet data sources, process them, and redirect the output. Yahoo! describes it as "an interactive feed aggregator and manipulator" that allows you to "create feeds that are more powerful, useful and relevant." While it's still a bit rough around the edges, it has enormous promise in turning the web into a programmable environment for everyone.
  • Brady Forrest: Deconstructing a Pipe
  • Brady Forrest: The Modules For Building Pipes
  • Global Nerdy (update):
    There is one important difference between Yahoo! Pipes and those of the Unix variety: while Unix pipes were made with programmers, sysadmins and tech tinkerers in mind, Yahoo! Pipes are made to be more user friendly. While you’ll still need a tiny bit of tech savvy to use Pipes, the user interface, which allows you to visually hook up pieces of code that provide an API significantly lower the barrier to entry for creating applications — you no longer have to be coder!

Google and Structured Data

David P. Janes 2006-10-02 11:53 UTC

The PC Advisor reports that Google plans to extend its search results to give Google Base (read more) results also:

Google plans to extend the product search capabilities on its main Google.com search engine in the fourth quarter, in time for the holiday shopping season.

[...] When people search for products on Google.com, the system will present them with another search box so that they can refine their query, according to Bear Stearns & Co analysts.

After people refine their query, Google takes them to a second page populated with product results from the Google Base listings service.

That is, if your company places information about "widgets for sale" into Google Base (with price, picture and description information), people searching for "widgets" on Google will be given the option of seeing the "for sale" results.

The Read/Write web believes this will pose a problem for microformats; I'm not so sure -- microformats are a way of specifying structured data elements in a web page in a formal and open way. Google Base is a proprietary database run by Google Corp -- if you can create the data to go in the latter, you should have no trouble publishing it for the former.

Google Base is now accessible by the GData API

David P. Janes 2006-08-23 13:57 UTC

Michael Fagan has spotted that Google Base (which we have written extensively about) is now accessible by the GData API.

We'll have a longer post about this soon. Quick notes:

  • this means Google Base is now readable in a structured fashion -- this is good, very good
  • missed by most coverage so far, but almost as important: AuthSub lets third-party applications access Google Base on your behalf. BlogMatrix will be all over this!

Google Base now provides RSS feeds ... without Google Base data

David P. Janes 2006-07-23 11:23 UTC

Search Engine Watch is reporting (hat tip: Scoble) that Google Base is now providing RSS feeds. For example, this page has this feed (the RSS icon is in the upper right-hand corner). Alas, it isn't providing the "Google Base data" in these feeds. If you want to know why this would be good, read on.

Google Base -- summing up

David P. Janes 2006-07-17 11:53 UTC (4 comments)

I hope you enjoyed this series of posts about Google Base and found it at least a little bit useful. I'm sure there are a few mistakes, and I'll correct them as I -- or you, there's a comments section, you know -- find them.

Everything I've written about Google Base is here.

Broken links are now fixed. 

The Google Base data model as a Semantic Web language

David P. Janes 2006-07-17 11:39 UTC (2 comments)

What is the Semantic Web? Here's Wikipedia's definition, which is probably as good as any, but a good working definition is a layer of the World Wide Web that is meant to be read and understood by computer programs (as opposed to the traditional web, where humans are the end consumer).

I believe the Google Base data model provides an excellent addition to the tools and languages currently being used to bootstrap the Semantic Web. In particular:

  1. The GBase data model is easy to produce.
    This is a huge advantage. When I read articles like this (Danny Ayers comments) about the semantic web, I get the impression of ivory towers and massive queries taking weeks to write to query parts of protein databases. Maybe that's not fair, but my vision of the Semantic Web is something much more personal, something almost trivial to produce as a byproduct of day-to-day activities, such as blogging, wikiing, e-mailing and so forth.
  2. The GBase data model is easy to consume
  3. The GBase data model is easy to transform into RDF (or anything else)
  4. The GBase data model is easy to understand (RDF's biggest problem, ahem: "A triple can simply be described as three URIs. A language which utilises three URIs in such a way is called RDF" -- that explains a lot!)
  5. There is a lot of data being produced by a lot of different people for Google Base

There are a few things that would greatly improve the utility of Google Base and its data model:
  1. Google should export its database in XML
  2. Google should consider modifying (upgrading?) its data model as per the suggestions below
  3. We need to see "open" or at least non-Google consumers of GBase data
  4. We need more Google Base data producers. BlogMatrix is doing its part. (We also produce RDF and N3, thank you very much).

I know the word language isn't probably the right one, but it feels right to me.
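As a rough illustration of point 3 in the first list (easy to transform into RDF), here is a minimal Python sketch that turns a flat list of (attribute, value) pairs into N-Triples lines. The predicate namespace `http://example.org/gbase-attr/` and the item URI are made up for this example; they are not an official Google Base or BlogMatrix vocabulary, and every value is emitted as a plain literal.

```python
from urllib.parse import quote

# Hypothetical namespace for attribute predicates; not an official vocabulary.
ATTR_NS = "http://example.org/gbase-attr/"


def item_to_ntriples(item_uri, attributes):
    """Turn a flat list of (attribute, value) pairs into N-Triples lines."""
    lines = []
    for name, value in attributes:
        predicate = ATTR_NS + quote(name)
        # Escape backslashes, quotes and newlines so the literal stays valid.
        literal = (
            str(value)
            .replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )
        lines.append(f'<{item_uri}> <{predicate}> "{literal}" .')
    return lines


triples = item_to_ntriples(
    "http://example.org/items/1",  # made-up item URI
    [("product_type", "Penny"), ("percent_closed", 0)],
)
for line in triples:
    print(line)
```

Because the model has no nesting, the transformation is a single loop: one entry, one triple per attribute.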

Improving Google Base: populating the db

David P. Janes 2006-07-17 10:57 UTC

This idea is somewhat different from the previous couple we've posted, and relates more to the plumbing than to the db model.

We've produced a tool, based on structured blogging concepts, that can easily populate XML-driven databases such as Google Base. As mentioned in the first post of this series, you can go see this for yourself.

So what's the problem? This: how do we get it into Google Base?

Currently, Google Base has an "upload" model, where one logs in and uses browser upload to put the file into Google Base. This is great if you're some guy sitting at a computer, but not so great if you're a third-party service provider that the "some guy" has to give his Google password to!

I have two suggestions, neither difficult to implement:

  1. When specifying the upload file in Google Base, let the user be able to say that a different Google account can upload into this file. For example, allow BlogMatrix to do it.
  2. Google's good at crawling the net. Instead of specifying a file, let the user specify a URI where Google can go get the data. When the data's ready, the user (or a tool on their behalf) can ping Google that something's changed.

Improving Google Base: inter-record linking

David P. Janes 2006-07-17 10:27 UTC

This idea is a little more controversial and probably "out of scope" for Google Base. Nonetheless, I can see a few non-insignificant advantages. Allow Google Base records to link to other Google Base records. These links can be in one of two forms:

  • hierarchical and dependent
  • explicit, via URI

Let's deal with the second form first. Google Base could define a new Attribute Type which defines a link to another record in Google Base. Then instead of (or in addition to) creating (say) a Google Base record for a course that lists the professor and the university, it could explicitly link to the professor and university records. Now we can start doing all sorts of interesting things with our data. Obviously, link consistency is a problem, but given the fluid nature of Google Base's model, I suggest just letting the end user sort it out.

It would be nice, BTW, if these URIs could be written in such a way that they weren't dependent on having the record actually stored in Google Base. Perhaps this could be defined in terms of a URN?

The other type of link -- hierarchical and dependent -- would introduce a "container" attribute. Inside a container one could place a new set of Attribute values. When the outermost container is deleted, so would all dependent records. What does this get us? Well, it brings for example the Business Locations model back into the fold (especially if we implement simple structure also).
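To make the explicit, URI-based form of linking concrete, here is a small Python sketch. Everything in it is hypothetical: the `urn:example:` identifiers, the record fields and the `follow_links` helper are invented for illustration and are not Google Base syntax. It follows the course example above and simply reports dangling links instead of rejecting them, in the spirit of letting the end user sort out consistency.

```python
# Hypothetical records keyed by made-up URNs; none of this is Google Base syntax.
RECORDS = {
    "urn:example:person:1": {"item_type": "person", "name": "A. Professor"},
    "urn:example:org:1": {"item_type": "organization", "name": "Example University"},
    "urn:example:course:1": {
        "item_type": "course",
        "title": "Intro to Structured Data",
        "professor": "urn:example:person:1",
        "university": "urn:example:org:1",
        "teaching_assistant": "urn:example:person:99",  # dangling on purpose
    },
}


def follow_links(record, link_attributes):
    """Split link attributes into resolved records and dangling URIs."""
    resolved, dangling = {}, {}
    for name in link_attributes:
        target = record.get(name)
        if target is None:
            continue  # the record does not use this link attribute
        if target in RECORDS:
            resolved[name] = RECORDS[target]
        else:
            dangling[name] = target  # reported, not treated as an error
    return resolved, dangling


course = RECORDS["urn:example:course:1"]
resolved, dangling = follow_links(
    course, ["professor", "university", "teaching_assistant"]
)
print(resolved["professor"]["name"])
print(dangling)
```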

 

Improving Google Base: do something with GData

David P. Janes 2006-07-17 09:52 UTC

It's bad that GData and Google Base don't have a lot of overlap. Here's a few ideas:

I'm sure there's a lot more that could be done, though I doubt many of GData's "kinds" would fit nicely into Google Base's data model.

Improving Google Base: simple structure

David P. Janes 2006-07-17 09:43 UTC

Another useful feature for Google Base would be to allow "simple" structure to be added to Attribute Types. In this simple structure, readers (i.e. Google Base) are free to move the inner structure elements "up a level" with the net result that there would be no change needed for their DB model.

For example, here's a "location" (from here):

<g:location>
1 Bank Street
Ottawa, Ontario
Canada

</g:location>

I propose they also accept: 

<g:location>
    <g:street-address>1 Bank Street</g:street-address>
    <g:locality>Ottawa</g:locality>, <g:region>Ontario</g:region>
    <g:country-name>Canada</g:country-name>
</g:location> 

The 'location' attribute gets stored exactly the way it would be in the first case (we strip the inner markup) BUT we get the additional benefit of all the new attributes AND we don't have to throw away information we already know!
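A reader could handle this proposal roughly as follows. This is only a sketch using Python's standard `xml.etree.ElementTree`; the namespace URI bound to the `g` prefix is an assumption, the `flatten` helper is hypothetical, and whitespace (including line breaks) is normalized to single spaces, which a real implementation might prefer to preserve.

```python
import xml.etree.ElementTree as ET

G_NS = "http://base.google.com/ns/1.0"  # assumed URI for the "g" prefix


def flatten(xml_text):
    """Return (flat_text, lifted_attributes) for one structured attribute."""
    elem = ET.fromstring(xml_text)
    # Strip the inner markup: itertext() yields all text in document order.
    flat_text = " ".join("".join(elem.itertext()).split())
    # Move inner elements "up a level": each child becomes its own attribute.
    lifted = {
        child.tag.split("}")[1]: (child.text or "").strip() for child in elem
    }
    return flat_text, lifted


sample = f"""<g:location xmlns:g="{G_NS}">
    <g:street-address>1 Bank Street</g:street-address>
    <g:locality>Ottawa</g:locality>, <g:region>Ontario</g:region>
    <g:country-name>Canada</g:country-name>
</g:location>"""

flat, lifted = flatten(sample)
print(flat)
print(lifted)
```

The flat text is what an unstructured reader would store as the location, while the lifted dictionary carries the extra attributes for readers that want them.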

One could also see this being used in the proposed "person" attribute:

<g:person>
    <g:given-name>David</g:given-name> <g:family-name>Janes</g:family-name>
</g:person>

Note that the new attribute names I'm using are based on the vCard standard.

Improving Google Base: more attribute definitions

David P. Janes 2006-07-17 09:43 UTC

In the previous post, we mentioned the "person" attribute as if it were part of the Attribute Type definitions. Of course, it isn't. The definitions do mention all sorts of persons (strangely but no doubt coincidentally beginning with the letter "a"): actor, agent, artist, author.

It would be very useful if Google Base defined a number of very basic concepts as attributes. This would make reusing and understanding what new (and existing) definitions mean a lot easier. Here are a few suggestions for the basic types:

  • person -- yields actor, agent, and so forth
  • organization -- yields university, employer, ...
  • phone_number -- yields fax_number, home_number, ...

Improving Google Base: reusing attribute definitions

David P. Janes 2006-07-16 21:28 UTC

The next several posts will be about using the Google Base data model as a language for the Semantic Web.

As outlined in this series of posts, Google Base is very flexible in defining new attributes for the database. Unfortunately, new attributes have to be defined in terms only of base data types and the definition of the type is implied, not defined, by the tag name the user assigns. This is overly simplistic, an unnecessary restriction and inflexible.

For example, let's say you want to define a new attribute. Let's say the person to contact. Since there's no "contact" defined in the standard Attribute Types under the current model, this is what you'd add:

    <gc:contact_person type="string">Johnny Chase</gc:contact_person>

("gc" is the "http://base.google.com/cns/1.0" namespace, as defined here.) Let's face it: this is pretty thin gruel. The computer knows that this is a string and -- if you can read English -- humans can infer that this is a "Contact Person". Google Base is so close and can do so much better.

We propose that Google Base should allow new Attribute Types to be defined based on existing Attribute Types. For example:

    <gc:contact base="person">Johnny Chase</gc:contact>

That is, we've defined a new type called Contact that's based on an existing Attribute Type called "person"*. Ooooooo ... very nice, very simple, and we've already gained a lot of knowledge -- from a computer point of view -- about what "Johnny Chase" is all about. And Google Base hasn't lost anything either -- deep down, it knows it's just a string.
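Here is one way a consumer might interpret such a `base` declaration, sketched in Python. The `BASES` registry and both helper functions are hypothetical, and "person" is the proposed basic type rather than anything in Google's current definitions. The sketch assumes the registry contains no cycles.

```python
# Hypothetical registry: attribute type -> the type it is based on.
# "person" is the proposed basic type; it is not in Google's definitions.
BASES = {
    "person": "string",
    "contact": "person",
    "actor": "person",
}


def type_chain(name):
    """Follow base links from an attribute type down to its basic data type."""
    chain = [name]
    while chain[-1] in BASES:  # assumes the registry has no cycles
        chain.append(BASES[chain[-1]])
    return chain


def is_a(name, ancestor):
    """True if the attribute type is, or is derived from, the given type."""
    return ancestor in type_chain(name)


print(type_chain("contact"))
print(is_a("contact", "person"), is_a("contact", "string"))
```

A consumer that has never seen "contact" can still treat it as a person, and one that knows neither can fall back to a string.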

* we know. See the next message.

Google "GData" API

David P. Janes 2006-07-16 20:12 UTC

If you're researching Google's various APIs, you're bound to come across something called the "Google Data API" aka GData. It describes itself as:

The Google data APIs ("GData" for short) provide a simple standard protocol for reading and writing data on the web. GData combines common XML-based syndication formats (Atom and RSS) with a feed-publishing system based on the Atom publishing protocol, plus some extensions for handling queries.

It's a lot more than a protocol though. It also defines a data model ("kinds") for populating commonly used elements. Here's some of the types:

  • gd:comments
  • gd:contactSection
  • gd:email
  • gd:entryLink
  • gd:feedLink
  • gd:geoPt
  • gd:im
  • gd:originalEvent
  • gd:phoneNumber
  • gd:postalAddress
  • gd:rating
  • gd:recurrence
  • gd:recurrenceException
  • gd:reminder
  • gd:when
  • gd:where
  • gd:who

These elements have deep structure, attributes and other such things. What does it have to do with the Google Base model? Easy to answer: nothing. This is very, very unfortunate and is probably as good a sign as any that Google's becoming a pretty big company, like IBM or Microsoft.

What use does Google have for GData? Its main purpose at this time is to allow outside tools to populate Google Calendar. We can only hope this will somehow be merged or made consistent with the Google Base model and API.

Google Base "Business Locations"

David P. Janes 2006-07-16 19:22 UTC

One strange discontinuity of the Google Base data model is bulk uploading "business locations". Unlike all other Google Base items, these cannot be uploaded in an RSS/Atom XML file. Instead, they must be uploaded using a CSV spreadsheet file. For completeness, we shall outline the data model used here:

  • STORE_CODE (a unique user defined store code)
  • ADDRESS_LINE_1
  • CITY
  • STATE
  • POSTAL_CODE
  • COUNTRY_CODE
  • MAIN_PHONE

The meaning of these should be fairly self-explanatory. Strangely, bulk uploading store locations is only available in the US and the UK. What's wrong with the rest of us?

The reason for the discontinuity is that the complete set of all addresses for a store is considered a single Google Base item. As you can easily see, this wouldn't easily map back into the low-structure XML definitions all other Google Base items are using.
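For illustration, a file with the columns listed above could be produced with Python's standard `csv` module. The store rows are invented, and the presence of a header row, the column order and the comma delimiter are all assumptions; check Google's bulk upload documentation for the exact layout it expects.

```python
import csv
import io

# Column names as listed above; order and header row are assumptions.
COLUMNS = [
    "STORE_CODE", "ADDRESS_LINE_1", "CITY", "STATE",
    "POSTAL_CODE", "COUNTRY_CODE", "MAIN_PHONE",
]

# Made-up store locations, one row per address.
stores = [
    {"STORE_CODE": "S001", "ADDRESS_LINE_1": "1 Main Street",
     "CITY": "Springfield", "STATE": "IL", "POSTAL_CODE": "62701",
     "COUNTRY_CODE": "US", "MAIN_PHONE": "555-0100"},
    {"STORE_CODE": "S002", "ADDRESS_LINE_1": "2 High Street",
     "CITY": "Springfield", "STATE": "IL", "POSTAL_CODE": "62702",
     "COUNTRY_CODE": "US", "MAIN_PHONE": "555-0101"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
writer.writeheader()
writer.writerows(stores)
csv_text = buffer.getvalue()
print(csv_text)
```

Note how the whole file, with all its rows, stands for one Google Base item, which is exactly the mismatch with the one-item-per-entry XML formats.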

We'll have suggestions in an upcoming post how this part of the data model could (and should) be brought back into the fold.

The Google Base Data Model: putting it all together

David P. Janes 2006-07-16 18:55 UTC

In this series of posts on Google Base (read them all), we've been describing parts of the Google Base data model. In this post, we'll attempt to put it all together. It's important to note that you can try out just about everything we're discussing here using the Google web UI (if you have a Google login, which you probably do):

Google Base is fairly well documented. We've been using the "bulk upload" section to find most of the info we've been discussing in this series. Here are the important docs if you want to read through them:

Now, onto our summary. So far we've learned:

Additionally:

  • the Google Base data model is very simple -- there is virtually no structure except "this item has these attributes"
    • i.e.  not unlike a fairly standard non-nested struct definition in C, or a row in a CSV database
  • there is no nested structure, except what is defined in the basic Data Types
  • the Google Base data model is very flexible within the limits set by the previous points. You're free to invent anything or add anything to anything else, as long as it's built on the basic Data Types
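The "non-nested struct" comparison can be pictured with a few lines of Python. This is a sketch only: `make_item` is a hypothetical helper, not part of any Google Base API, and the "sales_lead" item type and its attributes are borrowed from the sales lead example elsewhere in this series.

```python
def make_item(item_type, **attributes):
    """Build a flat item: one level of attribute -> scalar value, nothing nested."""
    for name, value in attributes.items():
        if isinstance(value, (dict, list, tuple, set)):
            raise TypeError(f"attribute {name!r} is nested; the model is flat")
    return {"item_type": item_type, **attributes}


# Any attribute can be added to any item, as long as its value is a basic type.
sales_lead = make_item("sales_lead", product_type="Penny", percent_closed=0)
print(sales_lead)
```

Passing a nested value such as `location={"city": "Ottawa"}` raises `TypeError`, which mirrors the "no nested structure" rule in the list above.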

Next, we'll discuss a few outliers and where this data model could go in the future.

Google Base "Information Types" aka "Item Type"

David P. Janes 2006-07-16 16:11 UTC

The front page of Google Base gives you two basic choices to "Post an item": "Choose an existing item type" or "Create your own item type". The second option indicates the basic flexibility of Google Base, that there's little to a Google Base "information type" beyond being a collection of Attribute Types (attributes discussed here, the basic types that attributes are composed of are discussed here).

The predefined Information Types are:

Note that this list (from here) actually seems to be out of date with what Google is actually supporting. For example, there's a Podcast information type available on the main search page.

A description of which Attribute Types are expected in each Information Type is found here. Google encourages you to define as many applicable Attribute Types as possible when filling in items to "greatly increase your item’s chances of showing up in search results".

The main difference (as far as I can tell) between the predefined Information Types and your own is that Google actually is planning to do stuff with the predefined types: i.e. houses for sale, geographic searches, and so forth. However (and again, as far as I can tell) all items stored in Google Base are eligible to show up in Google Base search results.

And finally, you can add any predefined Attribute Type to any Information Type record, even if it isn't formally defined there.

Google Base Attribute Types

David P. Janes 2006-07-16 14:28 UTC

Google Base defines a standard set of "attributes", built upon the basic data types previously mentioned here. These attributes are used to define "information types", which is basically a complete logical record in the Google Base db (we'll explore these soon). This list is open ended, in that Google can and almost certainly will define more attributes to go into this list (as Google adds more information types). You are free to reuse these attributes in an information type, even 

These are all pretty self-explanatory; click on any of the links to get more information.

Google Base Data Types

David P. Janes 2006-07-16 14:10 UTC

Google Base (as far as I can tell) defines everything in terms of a few underlying basic data types. These are defined here* and are:

  • string
  • int
  • float
  • intUnit
  • floatUnit
  • date
  • dateTime
  • dateTimeRange
  • url
  • boolean
  • location

When you're editing a Google Base item the types you can dynamically use in the web UI are:

  • text
  • number-unit
  • number
  • date-range
  • large text
  • web url
  • checkbox
  • location

If you squint closely enough, you can see this list more or less maps back to the underlying types. In addition, Google Base provides for enumerations, which are strings restricted to a list. For example, the salary_type enumeration takes one of two values: “starting” or “negotiable”. I have not performed any experiments yet to see if Google Base actually enforces the enumeration.

* I know this page is defining something else, but it's all the same database, isn't it?
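Whether or not Google Base enforces enumerations, a producer can check them before uploading. A minimal Python sketch, using the salary_type example from this post; the `ENUMERATIONS` table and the `check_enumeration` helper are hypothetical names, not part of any Google API.

```python
# Enumerations are strings restricted to a list of allowed values.
ENUMERATIONS = {
    "salary_type": {"starting", "negotiable"},
}


def check_enumeration(attribute, value):
    """True if the attribute is not an enumeration or the value is allowed."""
    allowed = ENUMERATIONS.get(attribute)
    return allowed is None or value in allowed


print(check_enumeration("salary_type", "negotiable"))
print(check_enumeration("salary_type", "whatever you like"))
```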

Exporting Structured Data into Google Base

David P. Janes 2006-07-16 13:11 UTC

This is the first of several posts I'll be making about Google Base -- and in particular, the RSS/Atom "bulk upload" format, which extends those XML formats with additional information that allows Google Base population.

We're working on a project to demonstrate structured data for a "sales lead". In terms of standard "existing" structured elements, this has a contact person, company, phone number and address. In addition, we extend it with Product Name, Percentage Closed, Close Date and so forth. The title of the entry represents the Opportunity and the body is for other comments.

You can see an example of this here. If you're interested in what the blogmatrix.cfg for this looks like, I've attached a sample snippet.

We haven't done anything particularly clever yet. In particular, we'll be adding the ability to query against Percentage Closed using tags, items past the close date, and maybe a few other things for the demo.

What's really neat is that we can export this into our RSS feed also, using the Google Base definitions (a mix of predefined types and some we've made up on the spot). You can view the feed (for this one entry!) here or here's the important part (reformatted for readability):

 <rss version="2.0">
 <channel>
 ...
  <item>
   <title>
    The potential to sell 10000 shiny pennies
   </title>
   <link>
    http://home01.semantic.blogmatrix.com/:entry:home01-2006-07-13-0008/
   </link>
   <g:product_type>
    Penny
   </g:product_type>
   <gc:sales_status type="string">
    Still looking for a sucker
   </gc:sales_status>
   <gc:percent_closed type="int">
    0
   </gc:percent_closed>
   <gc:person hcard:type="fn" type="string">
    Johnny Q. Public
   </gc:person>
   <gc:organization hcard:type="org" type="string">
    Bank of Canada
   </gc:organization>
   <gc:job_position hcard:type="title" type="string">
    Secretary to the Undersecretary
   </gc:job_position>
   <g:location>
    1 Bank Street
    Ottawa, Ontario
    Canada
   </g:location>
   <gc:phone_work type="string">
    605-666-6666
   </gc:phone_work>
  </item>
 </channel>
</rss>
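To show how little work a consumer needs to do with such a feed, here is a Python sketch that pulls the Google Base attributes out of an item with the standard `xml.etree.ElementTree` module. The feed is a trimmed copy of the one above with namespace declarations added, since the excerpt elides them; the URI for the `g` prefix is an assumption, and `item_attributes` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

NS = {
    "g": "http://base.google.com/ns/1.0",    # assumed URI for predefined attributes
    "gc": "http://base.google.com/cns/1.0",  # custom attributes
}

# Trimmed version of the feed above, with namespace declarations added.
FEED = """<rss version="2.0"
     xmlns:g="http://base.google.com/ns/1.0"
     xmlns:gc="http://base.google.com/cns/1.0">
 <channel>
  <item>
   <title>The potential to sell 10000 shiny pennies</title>
   <g:product_type>Penny</g:product_type>
   <gc:percent_closed type="int">0</gc:percent_closed>
   <gc:phone_work type="string">605-666-6666</gc:phone_work>
  </item>
 </channel>
</rss>"""

# Converters for the declared "type"; anything else stays a string.
CASTS = {"int": int, "float": float}


def item_attributes(item):
    """Collect the g:/gc: children of an <item> as a flat dict."""
    attrs = {}
    for child in item:
        if not child.tag.startswith("{"):
            continue  # plain RSS elements such as <title>
        uri, name = child.tag[1:].split("}")
        if uri not in NS.values():
            continue  # some other extension namespace
        text = (child.text or "").strip()
        cast = CASTS.get(child.get("type"), str)
        attrs[name] = cast(text)
    return attrs


root = ET.fromstring(FEED)
attrs = item_attributes(root.find("channel/item"))
print(attrs)
```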

More to follow...

Attached Documents:

Bill Burnham on Google Base, Google Real Estate, Google Home, Google ...

David P. Janes 2006-07-12 12:02 UTC (9 comments)

Bill Burnham has a post about what Google is doing these days:

With the launch of these Google Base front-ends, Google is clearly putting into place the major pieces required to support its vertical search platform.  Broadly speaking, such a platform requires 4 major pieces:

  1. A big, highly scalable database that can handle lots of queries.  This, of course, is what Google Base was all about.
  2. Consumer friendly front ends to access these databases.  The auto and real estate front ends are obviously the first of such front ends.
  3. A large, robust, crawling farm.  This is obviously Google’s crown jewel.
  4. A set of intelligent algorithms to find, classify, and flag listings.  We have yet to see this from Google.

Most people remain unimpressed by Google Base because it doesn’t seem to contain a lot data.  That’s because what you are seeing is a work in progress that is being purposely hobbled to reduce load during the testing phase.   Google has now built beta versions of pieces #1 and #2.  We will undoubtedly soon see pieces #3 and #4.  Only when those pieces are in place will Google Base fulfill its potential.

The second half of the post is called "Losers and Bigger Losers":

There will be two sets of losers in all of this.  The first and most immediate set of losers will be the start-up vertical search players (indeed one can only imagine the long faces at Trulia (and their VC backers) when they got their first look at Google Real Estate).

...

The second set of losers in this are the well established listings-focused Walled Gardens of the Internet.  As I have outlined before in detail, these Walled Galled face a fundamental threat from search.  A fully functioning Google Base will make that threat more real than ever.

What Bill hasn't mentioned is who can be the winners in this situation (besides Google!). I'll answer the question: us. The structured data component of this blogging platform is designed to populate Google Base. We've already demoed populating events into Google Calendar (not live, because of password issues). We're going to expand this idea (and do it correctly) using Google Account Authentication and then use the GData APIs to populate pretty well anything you see in Google Base.