The Semantic Web is the name given to a technology that has been under development since 1998, designed in large part by Sir Tim Berners-Lee (inventor of the World Wide Web) and the World Wide Web Consortium (or W3C). It is also known by other names such as “Web 3.0” or the “Linked Open Data (LOD) Cloud”. At its core, the Semantic Web is based on the Resource Description Framework (RDF), which is a new way to describe and link information together. RDF is to the Semantic Web what HTML (the HyperText Markup Language) is to the World Wide Web.
Better Living Through Artificial Intelligence
I believe that the Semantic Web has the potential to radically transform our lives, as it ushers in a new age of artificial intelligence that we’ve so far only seen in science fiction movies. As its adoption grows and spreads, the information contained within will touch every aspect of human life, including education, medicine, popular culture, transportation, politics, history… and beyond.
In this series of blog posts, I will attempt to illustrate how it works, and give a glimpse into the bright future of information technology. This first post will focus on one of the deep problems that the Semantic Web solves.
[The Semantic Web will create] a Web in which machine reasoning will be ubiquitous and devastatingly powerful.
— Sir Tim Berners-Lee (1998; The Semantic Web Roadmap)
The Semantic Web proposes a new paradigm for organizing information known as Hyperdata, which will be key to the evolution of information technology. Hyperdata is the intersection of metadata (information about information) and hypermedia (information linked to other information). I’ve written about metadata and hypermedia in previous blog posts.
Metadata is excellent at expressing facts about existing information. Examples include the date that a blog post was published, and tags that describe the contents of a file. Metadata is useful in search contexts: the more information that is available about a data object, the more ways that exist for searching and finding exactly what you’re looking for.
At the same time, metadata has a distinct shortcoming: it is only descriptive in a very narrow sense. An example will illustrate this point.
The Record Store
Suppose you have a small collection of music that you want to sell online. In order to inventory and track your collection, you opt to create a catalogue listing everything you have for sale, and display it online so that visitors to your website can shop for and find exactly what they’re looking for, or browse through lists of what you have available.
As a first step, you create a spreadsheet with some basic columns such as “Title”, “Artist”, “Year” and “Format”. In the vast majority of cases, these four aspects of any given music album are enough to give potential customers a very good idea of what they will be buying. Most shoppers looking for a specific item in your store will be able to find exactly what they want by searching on one or more of these four criteria.
So you create a blank spreadsheet with those four column headings: Title, Artist, Year and Format.
Your first catalogue entry is R.E.M.’s seminal “Lifes Rich Pageant” (sic). You have one copy on vinyl and another on CD, and you want to sell both. So, you make the vinyl version the first entry in your catalogue, with “LP” in the Format column.
So far so good.
Next, you want to add the CD album, the 25th anniversary reissue of “Lifes Rich Pageant”. So, you add a second entry to your catalogue, titled “Lifes Rich Pageant (25th Anniversary Reissue)”, with “2CD” in the Format column.
Already, you’ve started to become aware of three problems with the spreadsheet you’ve designed:
- There’s no “Edition” column, so in order to distinguish the two albums, you have to cram more explanatory information into the title column. However, R.E.M. never released an album entitled “Lifes Rich Pageant (25th Anniversary Reissue)”. These are merely two editions of the same album, “Lifes Rich Pageant”. If a customer were to search the website for albums with the exact title “Lifes Rich Pageant”, he would only see one result, the LP. That customer might not have a turntable and might have been very happy to buy the 25th Anniversary Reissue on CD, but unfortunately his search never turned up the CD version. The metadata framework’s lack of expressiveness has cost you a sale.
- The same problem exists in the “Format” column. By writing “2CD”, you meant that the reissue is actually a double CD, even though, strictly speaking, there isn’t a recording format called “2CD”. As you grow your catalogue, how will you indicate a triple CD? How about a concert album that includes both a CD and a DVD of the performance? What about a box set? Further, is it a vinyl box set? A cassette box set? A CD box set?
- When creating your spreadsheet, you assumed that the concepts of “Recording Year” and “Release Year” were the same, and so you only created a single “Year” column. Again, your spreadsheet has proven to be lossy in the way it catalogues the music, and so you have had to make a choice that “Year” actually means “Release Year”.
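To make the first of these problems concrete, here is a minimal Python sketch of the two-row catalogue and an exact-title search. The `Album` class, the `search_by_title` function and the year values are my own illustrative assumptions, not part of any real store's software.

```python
from dataclasses import dataclass


@dataclass
class Album:
    """One spreadsheet row: the four columns and nothing else."""
    title: str
    artist: str
    year: int   # ambiguous: recording year or release year?
    fmt: str    # free text, so "2CD" is accepted without complaint


# Year values are assumed for illustration.
catalogue = [
    Album("Lifes Rich Pageant", "R.E.M.", 1986, "LP"),
    Album("Lifes Rich Pageant (25th Anniversary Reissue)", "R.E.M.", 2011, "2CD"),
]


def search_by_title(items, title):
    """Return every row whose Title cell equals the query exactly."""
    return [album for album in items if album.title == title]


hits = search_by_title(catalogue, "Lifes Rich Pageant")
for album in hits:
    print(album.title, "/", album.fmt)
```

Only the LP row comes back. Nothing in the grid records that both rows are editions of the same album, so the search has no way to surface the CD reissue.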
Discouraged but undeterred, you add a third entry to your catalogue, a compilation entitled “Atlantic Rhythm and Blues: 1947-1974”, with “Various Artists” in the Artist column and “1991” in the Year column.
Again, you notice problems with your spreadsheet:
- This third album, a collection of Atlantic Records’ greatest hits spanning the years 1947 to 1974, was released in 1991. But imagine a shopper who says: “I’d love to find soulful bluesy music from the 50s or thereabouts”. Although this third album in your catalogue would fit the bill perfectly, it would never turn up in a search for “1950”, since that number appears nowhere in either the Title or the Year columns.
- There is no artist named “Various Artists”; this is simply a way to express “There is no single artist for this album”. Sadly, the customer who comes to your website, searching for a Big Joe Turner song she heard on the radio, will be disappointed because none of her search queries for “Big Joe Turner”, “Midnight Special”, or “1957” will yield any results, despite the fact that they describe precisely something that you have and are selling!
(The first problem is not insurmountable. You could write some computer code to make the search engine a little smarter: “Assume any search for a number refers to a year, and include in the search results any item in my inventory that has a range of years (two numbers separated by a dash) which encompasses the sought-after year”.
With this code, a search for “1950” would successfully retrieve “Atlantic Rhythm and Blues: 1947-1974”.
This code adds a little intelligence to your search engine, but not much, because the search engine has only been instructed to recognize years in a very specific format (two numbers separated by a dash). This logic would not work on a title such as “Glittering Prize 81/92” (by Simple Minds), since the two numbers are not separated by a dash, and would lead to puzzlement when Rod Stewart’s box set “The Great American Songbook (Vol. 1-5)” appears in the results of a search for Violent Femmes’ album “3”.)
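The rule described in that aside might look something like the following Python sketch. The function names `matches_number` and `search` are hypothetical, and the rule is applied to titles only; it is one possible reading of the rule, not a recommended search implementation.

```python
import re

# Two numbers separated by a plain dash, e.g. "1947-1974" or "1-5".
RANGE = re.compile(r"(\d+)-(\d+)")

titles = [
    "Atlantic Rhythm and Blues: 1947-1974",
    "Glittering Prize 81/92",
    "The Great American Songbook (Vol. 1-5)",
]


def matches_number(title, wanted):
    """True if any dash-separated range in the title encompasses `wanted`."""
    for match in RANGE.finditer(title):
        low, high = int(match.group(1)), int(match.group(2))
        if low <= wanted <= high:
            return True
    return False


def search(wanted):
    """Return every title containing a numeric range that covers `wanted`."""
    return [title for title in titles if matches_number(title, wanted)]


print(search(1950))  # the Atlantic compilation is found
print(search(3))     # the Rod Stewart box set turns up, which is the puzzling result
print(search(1985))  # empty: "81/92" has no dash, so the rule never sees it
```

The rule cannot tell a span of years from a span of volume numbers, because the title is just a string. Every extra pattern you teach it (slashes, two-digit years) patches one case and opens the door to new false matches.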
As a result of your non-expressive framework, you lose yet another sale. Disenchanted with metadata, you decide to shut down your music store and start a bakery instead. (I have to say, you give up pretty easily!) Because of metadata’s lack of scope, many customers have been disappointed, and an entrepreneur’s dreams have been crushed.
The Shortcomings of Metadata
To summarize, metadata is excellent at giving information about information, as long as that information all fits neatly into a grid. This is acceptable in the world of computers where objects such as files and blog posts are strictly defined, but human life rarely fits neatly into a grid.
To fix the problems with metadata that are described above, you’d have had to add extra columns every time you encountered an exception (e.g. renaming the “Year” column to “Release Year” and adding an extra “Recording Year” column). This is very easy when your music catalogue contains 3 items, but it doesn’t scale well. As your catalogue grows, more and more exceptions will be encountered; each time a new column is added, or an existing one is renamed, every entry within has to be checked and edited, to make sure the information is accurate and error-free. Try doing this for a music catalogue the size of Amazon’s or GEMM’s!
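As a rough illustration of why this doesn't scale, here is a Python sketch of the “Year” column being split in two. The rows, the year values and the `split_year_column` function are assumptions made up for this example.

```python
# Each row is a dict keyed by column name; year values are assumed for illustration.
rows = [
    {"Title": "Lifes Rich Pageant", "Artist": "R.E.M.",
     "Year": 1986, "Format": "LP"},
    {"Title": "Lifes Rich Pageant (25th Anniversary Reissue)", "Artist": "R.E.M.",
     "Year": 2011, "Format": "2CD"},
    {"Title": "Atlantic Rhythm and Blues: 1947-1974", "Artist": "Various Artists",
     "Year": 1991, "Format": "CD"},
]


def split_year_column(old_rows):
    """Rename "Year" to "Release Year" and add an empty "Recording Year"."""
    migrated = []
    for row in old_rows:
        new_row = {key: value for key, value in row.items() if key != "Year"}
        new_row["Release Year"] = row["Year"]
        # The old grid never stored this, so a person has to look it up.
        new_row["Recording Year"] = None
        migrated.append(new_row)
    return migrated


migrated = split_year_column(rows)
needs_review = [row["Title"] for row in migrated if row["Recording Year"] is None]
print(len(needs_review), "of", len(migrated), "entries need manual review")
```

The renaming itself is mechanical, but every single row comes out with a blank that a human must fill in and verify. With three albums that is an afternoon's work; with millions it is a project.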
This is not a purely academic problem, either. Metadata’s lack of expressivity is found in everyday life. Anytime you fill out a form, you are entering information into a grid. If you live your life in a normalized, structured fashion, this is probably fine. If, on the other hand, you have stories to tell, you’ll likely spend some time explaining why your life doesn’t fit into the grid.
As someone who recently relocated to the US, I can testify to this phenomenon first-hand. It is difficult to sign up for something like cable internet service, for instance, when you don’t have a credit history or a Social Security Number. You usually end up leaving those parts of a form blank and/or calling a helpdesk. If you’re lucky enough to get through to a customer sales rep or government official who can help you, they still usually need to ask around to find out how to handle this exception (you) in their system, and a lot of confusion and difficulties may arise as a result.
In the next post, I will examine hypermedia, which proposes a different, much more expressive way of codifying information. It also has non-trivial shortcomings, which is why I’ll conclude by looking at hyperdata, as the intersection of metadata and hypermedia, and describe how it proposes a solution to both problems.