< January 2003 >

31st: Just FOAFing around. I've managed to get my FoafHarvester up and running, and it is currently gathering data for a FOAF exploration application I am building. All told I picked up about 350Kb of FOAF data during the crawl, and other than a couple of minor glitches it went quite well. The data is currently in an XML file; at the moment I'm writing a program to load it all into a SQL Server database.

Some useful FOAF stuff. If you're interested in learning more about FOAF then the best bet is the RDFWeb FOAF page, which links to all kinds of good stuff, so I won't replicate it here. Well, apart from FOAFNaut, and a presentation I found, "Photo RDF, Metadata and Pictures", that talks a little about FOAF.

29th: Setting up a serious development system. The time has finally come: my programming projects are getting a little unwieldy to manage, with version control consisting of backing up all the files now and again to a different directory. So I've finally got around to installing CVS on my computer. One of the advantages is that my project folders have suddenly become a lot more organised, as I can keep old projects in CVS and out of my frequently used folders.

After setting all this up I'm beginning to feel more like a serious programmer; my development tools are taking shape. As I am a bit of a geek I'll give you a rundown of my current programming setup.

Add a few batch scripts I've written, a few custom macros and commands to integrate my text editor with the .NET command line tools and voila, a nice little development system.

27th: RSS Tidy Anybody? I've been thinking about how to deal with the malformed XML feed problem that I was talking about a few days ago. After reviewing some of the information in the thesis on invalid HTML parsing, I shifted my emphasis to solving the problem before the document is submitted to the XML processor.

Where's the Focus? Most solutions for delivering valid RSS to an aggregator concentrate on the publishing tool that creates the document. This is undoubtedly the ideal solution, but how then are we going to get the Competitive Advantage that Mark discusses in his weblog? Simple: we need a tool that rescues the RSS file, nurses it back to health (or well-formedness) and then passes it to the XML processor.

Let's get past the First Aid, shall we? What I want is a solution that doesn't just put a band aid on the problem but prescribes the medicine and puts it in a sling. Rather than solving each error as it appears, I think a better approach would be to use something analogous to HTMLTidy to sort out all the illegal characters, entities and missing end tags. Kevin talked about his take on "smart" parsing, which I mostly agree with, especially when he says this functionality is needed for all XML aggregation formats, not just RSS. If we resort to regular expressions for this task then we have to write a new set of regexes for every new type of XML format we want to process. If we resort to fixing errors in an ad hoc manner then we miss out on the elegance and formalism that come with a well structured library that deals with the problem.

In summary: let's not break the XML parsers, let's fix the XML. If we can do that on the server, so much the better; if we have to do it on the client then let's do it in a structured way that scales across formats, and let's handle the errors gracefully. Let's get well formed first, and then worry about validity.
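A minimal sketch of the "repair first, then parse" idea, written in Python purely for illustration. The function names `tidy_xml` and `parse_leniently` are made up for this example, and the repairs cover only two of the problems mentioned above (illegal control characters and bare ampersands). Missing end tags are not handled, and an ampersand inside a CDATA section would be escaped wrongly, so treat this as a starting point rather than a finished tidy tool. Note that nothing in it knows about RSS element names, which is the part that lets it carry over to other XML formats.

```python
import re
import xml.etree.ElementTree as ET

# An ampersand that does not start a predefined entity or a numeric
# character reference (so "&amp;" is left alone, "Tips & Tricks" is not).
_BARE_AMP = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)")

# Control characters that XML 1.0 forbids (tab, LF and CR are allowed).
_ILLEGAL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def tidy_xml(text):
    """Apply format-agnostic repairs aimed at well-formedness only."""
    text = _ILLEGAL_CHARS.sub("", text)
    return _BARE_AMP.sub("&amp;", text)


def parse_leniently(text):
    """Try a strict parse first; repair and retry only if that fails."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return ET.fromstring(tidy_xml(text))


# A feed fragment that a strict parser rejects because of the bare "&".
broken = "<rss><channel><title>Tips & Tricks</title></channel></rss>"
root = parse_leniently(broken)
print(root.findtext("channel/title"))
```

The strict parse is attempted first so that well-formed feeds are never touched; the XML parser itself stays strict throughout.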

26th: Just some old school weblogging. The W3C released a final call on several RDF working drafts. I've been reading over the RDF primer and getting more of a feel for how RDF can be used. One of the most interesting parts is section six, which introduces some existing RDF applications; it's nice to see it being used for something other than RSS!

I really like this. One of my favourite blogs has just posted a really interesting entry; anyway, it helps to break up the endless technical stuff I end up reading most of the time.

Ten fold strong with the Tech. Let's get back on topic, shall we? I've been having a couple of issues implementing the functionality I want in my RSS feed. Basically I was questioning whether I could use xml:lang to identify the language of my post descriptions. I found a well thought out answer over on RSS-Dev, along with some feedback on my idea of including namespaced XHTML in my RSS feed.
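For anyone wondering what the xml:lang idea looks like in practice, here is a small Python sketch. It only shows the mechanics of attaching the attribute to a description element; the element names, the "en-GB" value and the sample text are assumptions made for the example, and whether a given aggregator honours the attribute is a separate question.

```python
import xml.etree.ElementTree as ET

# The namespace that the reserved "xml" prefix is bound to.
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical item; the text and language tag are illustrative only.
item = ET.Element("item")
description = ET.SubElement(item, "description")
description.set("{%s}lang" % XML_NS, "en-GB")
description.text = "Just some old school weblogging."

serialized = ET.tostring(item, encoding="unicode")
print(serialized)
```

Because the xml prefix is predefined, the serialised element carries xml:lang without needing an extra namespace declaration.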

25th: Liberal Parsing. There has been some activity recently regarding liberal parsing of incorrect data. Mark Pilgrim released an article on O'Reilly that talked about liberal parsing of RSS [additional comments]. At the same time I came across a pointer on Usenet to a master's thesis on the parsing of incorrect HTML [PDF version, also in PS format].

So now you've read all that you can see what my point is.

You can't? OK, I'll explain then. The master's thesis presents an analysis of the number of incorrectly written HTML sites out there: from a representative sample of 2.4 million URIs [harvested from DMOZ], 14.5 thousand were valid, which is about 0.7% of the documents that could actually be validated. I've included a data table based on the results of the analysis below. Luckily RSS isn't in as bad a state as HTML, is it? Will the trend towards more liberal parsers lead to more authors not learning about RSS and just crudely hacking it together, as happens with HTML at present? Does RSS need that ability to be hacked, as HTML can be, in order to gain wider acceptance?

Categories                Number of Documents   % Attempted Validations (2 dp)   % Total Requests (2 dp)
Invalid HTML documents    2034788               99.29                            84.85
Not Downloaded            225516                NA                               9.40
Unknown DTD               123359                NA                               5.14
Valid HTML documents      14563                 0.71                             0.61
(All) Grand Total         2398226               100.00                           100.00
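The percentages in the table can be re-derived from the raw counts. The short Python check below assumes that "attempted validations" means the invalid plus the valid documents (the two rows that are not marked NA); that reading is an inference from the table rather than something the thesis is quoted as saying here.

```python
# Raw counts from the table above.
counts = {
    "Invalid HTML documents": 2034788,
    "Not Downloaded": 225516,
    "Unknown DTD": 123359,
    "Valid HTML documents": 14563,
}

total_requests = sum(counts.values())

# Assumption: a validation was attempted only for documents that were
# downloaded and had a known DTD, i.e. the invalid and valid rows.
attempted = counts["Invalid HTML documents"] + counts["Valid HTML documents"]


def pct(part, whole):
    """Percentage of whole, rounded to two decimal places."""
    return round(100.0 * part / whole, 2)


valid_of_attempted = pct(counts["Valid HTML documents"], attempted)
valid_of_total = pct(counts["Valid HTML documents"], total_requests)

print(total_requests, valid_of_attempted, valid_of_total)
```

Under that assumption the figures match the table: roughly 0.7% of validated documents were valid, which is about 0.6% of everything requested.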

PS I've just worked out a few bugs in my weblog RSS feed, enjoy.

23rd: Building a Robots.txt parser. I'm currently building a robots.txt parser in C# for a project I'm running. It is quite interesting, as I have been meaning to get more into C# but have had difficulty finding the time. Luckily the grammar behind robots.txt files isn't too difficult to parse, and I'm currently running a small scale test.

After I'm satisfied with how the parser works I'll release the source code as a class for C#. I am basing it broadly on the functionality offered by Python's robotparser.
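For reference, this is roughly the behaviour of Python's robotparser that a C# class would be modelled on. The snippet uses the module as it ships in current Python (`urllib.robotparser`); the rules, the crawler names and the example.org URLs are invented for the illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: everyone is kept out of /private/, and one
# hypothetical crawler ("BadBot") is kept out of everything.
rules = """\
User-agent: *
Disallow: /private/
Allow: /

User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())  # parse from memory, no network fetch

# "ExampleCrawler" is a placeholder user agent name.
print(parser.can_fetch("ExampleCrawler", "http://example.org/index.html"))
print(parser.can_fetch("ExampleCrawler", "http://example.org/private/data.rdf"))
print(parser.can_fetch("BadBot", "http://example.org/index.html"))
```

The interface worth copying is small: feed in the lines of the file, then ask whether a given user agent may fetch a given URL.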

Picture of a Desklamp balancing on a stack of books.
Current Reading: The two books upon which the light is balancing are Professional C# 2nd Edition and Professional ASP.NET 1.0, both from Wrox Press and both recommended by me.

22nd: RSS Enabled. In between my hectic revision schedule I've knocked up a fairly basic RSS scraper for this site. My CMS hasn't got off the ground yet, I really wanted to begin providing an RSS feed for my weblog, and I couldn't wait any longer. I wrote it in C#. It isn't tied into anything on the live site, as I run it on my local machine to generate the RSS file.

A few little bugs. There do seem to be a few teething problems with some RSS aggregators due to my use of content encoded data (as per the RSS 1.0 content module spec). Anyway, I'll iron out some of the issues as time goes on.

20th: Welcome to my "Secret Evil Website". Browsing my logs yesterday I found out that someone searched for secret evil website on Google and my site was the top result. I didn't think my site was all that evil.

A touch of real life. This website isn't really about my life as such; the subject matter usually stays rooted in the technical stuff I'm interested in. In a break from the normal run of conversation: I'm getting married soon! Things are going well and I'm looking forward to a nice wedding in a nice little Spanish town, followed by a religious ceremony in Madrid.

Glass chess pieces, the King and the Queen.

13th: Semantic Vs Presentation. A recent post to a newsgroup I visit, CIWAH, sparked a small yet interesting debate concerning the differences and similarities between semantic and presentational markup in HTML. Daniel Tobias started the ball rolling with a short piece that covers a few of the well worn points in this discussion. The article however does reinforce the need for people to understand what they are writing and not to abuse tags in a meaningless fashion.

In response to Dan's post a more interesting point was raised by Jukka K. Korpela when he stated that:

Physical markup does not define the semantic meaning (except in the trivial sense where we might say that the visual presentation is the meaning), but it may carry a connotation, at least when taken in context.

In the discussion of logical vs presentational markup I think the point Jukka makes is important, and one that is often glossed over by proponents of semantic markup. Recognising that web based content may be delivered in varied contexts, and demonstrating that semantic markup is the best way of supporting that delivery, is key to convincing web developers to use semantic markup.

An example of why semantic markup is important may be found in this post itself: when quoting Jukka's post I added structural HTML constructs to give emphasis to certain words. In many browsers these emphasised words will appear in boldface or italics, but how can non-visual browsers pass this information on to their users? Well, first of all I have used <strong> and <em> tags to mark up the emphasis rather than <b> or <i> tags. Because the tags I used are defined semantically, the relationship between the differently emphasised words is clear, and an aural browser will easily be able to pass on this information. In contrast, the tags that merely make the text boldfaced or italicised have no defined relationship with each other and are open to differing interpretation. By using semantic tags the relative importance of the words can be gauged and taken into account during the presentation of the content.

The example I have given demonstrates one advantage semantic (logical) markup has over presentational markup. An enumeration of the arguments supporting semantic markup has been proposed as:

  1. Logical markup can be mapped to varying physical presentations depending on presentation medium.
  2. Logical markup can be automatically processed in a manner that is based on the defined logical meanings of elements.
  3. Logical markup leads to more flexible ways of affecting the visual presentation and creating alternative presentations.
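The first point in the list can be made concrete with a toy renderer. The Python sketch below is an illustration only: the two style tables, the bracketed "aural" cues and the sample sentence are all invented for the example and do not describe how any real browser behaves. The point is simply that the same logically marked-up content yields a sensible result in either medium, because em and strong say what the emphasis means rather than how it looks.

```python
# Each style maps a logical element to the text placed before and after it.
VISUAL = {
    "em": ("*", "*"),
    "strong": ("**", "**"),
}
AURAL = {
    "em": ("[stress] ", " [end stress]"),
    "strong": ("[strong stress] ", " [end strong stress]"),
}


def render(fragments, style):
    """Render (element, text) pairs using the given style table.

    An element of None means plain, unmarked text.
    """
    parts = []
    for element, text in fragments:
        before, after = style.get(element, ("", ""))
        parts.append(before + text + after)
    return "".join(parts)


# Hypothetical sentence with two levels of emphasis.
sentence = [
    (None, "This is "),
    ("em", "quite"),
    (None, " "),
    ("strong", "important"),
    (None, "."),
]

print(render(sentence, VISUAL))
print(render(sentence, AURAL))
```

A purely presentational element such as b or i would have no obvious entry in the aural table, which is the gap described above.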

Abbreviations versus acronyms. This issue raises its head again. It is a somewhat confusing topic, but my stance on the issue remains the same.

12th: Google Vs SearchKing. One of the big stories that has been circulating recently is the legal wrangling between Google and SearchKing. In reading some of the commentary on the case there were several aspects that interested me, especially people's seeming willingness to turn search engines into regulated utility companies. This subject has already been covered elsewhere:

Google is so important to the web these days, that it probably ought to be a public utility. Regulatory interest from agencies such as the FTC is entirely appropriate, but we feel that the FTC addressed only the most blatant abuses among search engines. Google, which only recently began using sponsored links and ad boxes, was not even an object of concern to the Ralph Nader group, Commercial Alert, that complained to the FTC.

In my opinion, however, such regulation should not be imposed upon these companies. What often goes unacknowledged is that many of these internet "giants" are not just used by US citizens: I live in the UK and I do not feel that restrictions should be placed on Google by the US system that would adversely affect my search experience. In any case the following quote echoes many of my own views.

It's possible to read this case as a case about media regulation. Maybe Google is a common carrier; in agreeing to rank pages and index the Internet, it has (implicitly) agreed to abide by a guarantee of equal and non-discriminatory treatment. On this view, it would be immensely important whether Google devalued SearchKing specifically, or as part of a general algorithm tweak. A great deal may also hinge on whether you think that Google provides access to information or merely comments on it. SearchKing alleges the latter, and Google agrees, but maybe SearchKing should have brought its case by arguing that Google has become, in effect, a gatekeeper to Internet content. On that view, a low PageRank isn't just an opinion, it's also partly a factual statement that you don't exist in answer to certain questions, on the basis that low search results are never seen. When was the last time you looked for results beyond 200 on a search request returning 20,000 pages?

These are very messy questions, but also very important ones. They're also very unlikely to be addressed directly in the courtroom, in this case or in other cases. Existing law just comes down too squarely on Google's side (I think) for courts to take these broader questions without mutilating our existing rules. Nor should they. Not everything should be settled in the courtroom, and the discussion about the proper role of search engines is one that needs to take place in the same place this case began, back before it was a lawsuit: out on the Internet, where people read and appreciate others' thoughts, and then contribute their own by adding links. Among other things, Google is a device for determining the consensus of the Web; and it's just not right to fix the process by which we determine consensus by any means other than honestly arriving at one.

Perhaps as the internet, and the information contained on it, becomes more important to us as a society the answers we think we already have will have to be re-evaluated.

7th: XHTML as CMS Language? There has been some discussion recently concerning content management and the role of HTML in that process. First of all, Brian Donovan states that you need to avoid poisoning your content with HTML, and the points made make quite a lot of sense in certain contexts. In fact, while reading "The Content Management Bible" I came across some similar thinking. The basic proposition is that by keeping HTML out of the content you can reuse the content in many other areas. This is an interesting viewpoint and one I tend to agree with; what matters, however, is the context.

What do I mean when I talk about all this context malarkey? Well, as observed elsewhere, it's a lot of effort to organise your content using databases and content management systems. They can help to automate a lot of work, but they are not a trivial enterprise. The context of a situation helps to determine the solution employed in that situation: a weblogger is not generally in need of a fully blown CMS, whereas a news organisation almost certainly is.

The context I am working in, with respect to this weblog, is strictly smaller scale. Can XHTML be used as the CMS language? What benefits, for the small scale website, does a CMS provide that cannot be provided using XHTML? The separation of content from presentation, an often heard mantra among netheads, can be achieved using XHTML. How can this be done? Well, drop all the deprecated, presentational aspects of HTML and embrace CSS and strict DOCTYPEs. Yaddayaddayadda...

What's that, dozed off? Let's get back on track then: I said I was talking benefits, not features. The benefit is that your content doesn't have to change every time you change your design. This isn't always as simple as it sounds for large scale changes, but simple site wide changes can be made very easily. Proper use of the semantic elements of XHTML can help to tie your site together; you just need to know what the semantics are. [abbreviations] [definitions]

Thinking about all this has reminded me of an experience at a company I worked for, where we were changing over from Lotus applications, such as 1-2-3 and AmiPro, to the Microsoft Office suite. At the time I worked in accounts, and I am sure you can imagine the number of spreadsheets and documents that needed to be converted within the department; if you can't, it was around 20,000. I was given the job. What it taught me was that if you need to make a wholesale conversion from one format to another, use good tools to do it. How does this relate to website design? Well, content has been passed on from one generation of program to another for a while now. Don't worry too much about it, but when you need to do it, organise yourself and use good tools.

1st: Happy New Year! I'm not going to write much today [not that I write much anyway], I am still recovering from partying/dancing to 7am this morning in Madrid and then flying back home to good old England this afternoon. May you have a prosperous New Year!
