More XML

(Apologies to those of my readers who aren’t interested in this stuff. I’ve been giving more time & attention to my work of late, and the results are less blogging, and technical matters, rather than current affairs, being at the top of my mind)

Very good piece by Jim Waldo of Sun that chimes (in my mind at least) with my piece below. He emphasises the limited scope of what XML is. He doesn’t echo my discussion of whether XML is good; rather, he shoves that aside as irrelevant – the comparison is with ASCII. We don’t spend much time arguing over whether ASCII is a good character set – is 32 really the best place to put a space? Do we really need the “at” sign more than the line-and-two-dots “divide-by” sign? Who cares? The goodness or badness of ASCII isn’t the point, and the badness of XML isn’t really the point either.

The comparison with ASCII is very interesting – Waldo talks about using the classic Unix command-line tools like tr, sort, cut, head and so on, which can be combined to do all sorts of powerful things with line-oriented ASCII data files. XML, apparently, is like that.

Well, yes, I agree with all that. But, just a sec, where are those tools? Where are the tools that will do transforms on arbitrary XML data, and that can be combined to do powerful things? It all seems perfectly logical that they should exist and would be useful, but I’ve never seen any! If I want to perform exactly Waldo’s example: producing a unique list of words from an English document, on a file in XML (say OOWriter‘s output), how do I do it? If I want to list all the font sizes used, how do I do that? I can write a 20-30 line program in XSLT or perl to do what I want, just as Waldo could have written a 20-30 line program in Awk or C to do his job, but I can’t just plug together pre-existing tools as Waldo did on his ascii file.
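For concreteness, here is roughly what that 20-30 line program looks like in Python, using the standard library’s ElementTree parser (the document structure is invented for illustration – a real OOWriter file would need namespace handling on top):

```python
import re
import xml.etree.ElementTree as ET

def unique_words(xml_text):
    """Return the sorted unique words from all text content of an XML document."""
    root = ET.fromstring(xml_text)
    words = set()
    for chunk in root.itertext():  # walks every text node in document order
        words.update(w.lower() for w in re.findall(r"[A-Za-z]+", chunk))
    return sorted(words)

# On a plain ASCII file this job is a one-line tr/sort/uniq pipeline;
# on XML it needs a parser before anything else can happen.
print(unique_words("<doc><p>Hello world</p><p>hello again</p></doc>"))
# prints ['again', 'hello', 'world']
```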

There are tools like IE or XMLSpy that can interactively view, navigate, or edit XML data, and there is XSLT, in which you can write programs to do specific transformations for specific XML dialects, but that’s like saying, with Unix ASCII data, you’ve got Emacs and Perl – get on with it! The equivalents of sort, join, head and so on, either as command-line tools for scripting or a standard library for compiling against, are conspicuous by their absence.
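To make the absence concrete, here is what one of the missing tools might look like – an XML equivalent of head, keeping the first n records of a document. The name and behaviour are hypothetical; no such standard tool exists, which is the point:

```python
import xml.etree.ElementTree as ET

def xml_head(xml_text, n):
    """A hypothetical 'head' for XML: keep only the first n children
    of the root element and drop the rest."""
    root = ET.fromstring(xml_text)
    for child in list(root)[n:]:
        root.remove(child)
    return ET.tostring(root, encoding="unicode")

print(xml_head("<log><e>1</e><e>2</e><e>3</e></log>", 2))
# prints <log><e>1</e><e>2</e></log>
```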

The nearest thing I can think of is something called XMLStarlet, but even that looks more like awk than like a collection of simple tools, and in any case it is not widely used. Significantly, one of its more useful features is the ability to convert between XML and the PYX format, a data format that is equivalent to XML but easier to read, edit, and process with software (in other words – superior in every way).
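PYX itself is simple enough to illustrate: each parse event becomes one line, prefixed “(” for a start tag, “A” for an attribute, “-” for character data and “)” for an end tag. A rough converter using Python’s bundled expat parser (a sketch for illustration, not XMLStarlet’s implementation):

```python
import xml.parsers.expat

def xml_to_pyx(xml_text):
    """Convert an XML string to PYX: one parse event per output line."""
    lines = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: lines.extend(
        ["(" + name] + ["A%s %s" % (k, v) for k, v in attrs.items()])
    parser.EndElementHandler = lambda name: lines.append(")" + name)
    # newlines in text are escaped so every event stays on a single line
    parser.CharacterDataHandler = lambda data: lines.append(
        "-" + data.replace("\n", "\\n"))
    parser.Parse(xml_text, True)
    return "\n".join(lines)

print(xml_to_pyx('<p class="x">hi</p>'))
```

Each output line is then greppable and sortable with the ordinary Unix tools – exactly the property XML’s own syntax gives up.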

As a complete aside – note that PYX would be slightly horrible for marked-up text: it would look a bit like nroff or something. XML is optimised for web pages at the expense of every other function. That is why it is so bad.

Maybe I’m impatient. XML 1.0 has been around since 1998, and while that seems like a long time, it may not be long enough. Any process that involves forming new ways for people to do things actually takes a period of time that is independent of Moore’s law, or “internet time”, or whatever. The general-purpose tools for manipulating arbitrary XML data in useful ways may yet arrive.

But I think the tools have been prevented, or at least held up, by the problems of the XML syntax itself. You could write rough-and-ready implementations of most of the Unix text utilities in a few lines of C, and program size and speed are excellent. To write any kind of tool for processing XML, you’ve got to link in a parser. Until recently, that itself would make your program large and slow. The complete source for the GNU textutils is a 2.7MB tgz file, while the source for xerces-c alone is 7.4MB. The libc library containing C’s basic string-handling functions (and much more) is a 1.3MB library; xerces-c is 4.5MB.

If you have to perform several operations on the data, it is much more efficient to parse the file into a data structure once, apply all the transformations to it, and then stream it back to the file. That efficiency probably doesn’t matter, but efficiency matters to many programmers much more than it should. It takes a serious effort of will to build something that uses such an inefficient method. Most programmers will have been drawn irresistibly to bundling a series of transformations into a single process, using XSLT or a conventional language, rather than making them independent subprocesses. The thought that 99% of their program’s activity is going to be building a data structure from the XML, then throwing it away so it has to be built up again by the next tool, just “feels” wrong, even if you don’t actually know or care whether the whole run will take 5ms or 500.
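The shape of the problem can be sketched (the names here are invented for illustration): every hypothetical “xmlutils” filter must parse the whole document, do its one small job, and serialise the document again, so chaining two of them costs two full parse/serialise round trips:

```python
import xml.etree.ElementTree as ET

def xml_filter(xml_text, transform):
    """The shape of every hypothetical 'xmlutils' tool."""
    root = ET.fromstring(xml_text)                # build the whole tree
    transform(root)                               # the tool's actual job
    return ET.tostring(root, encoding="unicode")  # serialise; tree discarded

def strip_attribute(name):
    """One example 'tool': remove an attribute from every element."""
    def transform(root):
        for elem in root.iter():
            elem.attrib.pop(name, None)
    return transform

# Two "tools" in a pipeline: the second re-parses the first one's output.
doc = '<doc id="1"><p id="2" lang="en">hi</p></doc>'
out = xml_filter(xml_filter(doc, strip_attribute("id")), strip_attribute("lang"))
print(out)  # prints <doc><p>hi</p></doc>
```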

In case I haven’t been clear – I think the “xmlutils” tools are needed, I don’t think the efficiency considerations above are good reasons not to make or use them, but I think they might be the cause of the tools’ unfortunate non-existence.

I also don’t see how they can be used as an argument in favour of XML when they don’t exist.

See also: Terence Parr – when not to use XML

XML Sucks

Pain. Once again, I have had to put structured data in a text file. Once again, I have had to decide whether to use a sane, simple format for the data, knocking up a parser for it in half an hour, or whether to use XML, sacrificing simplicity of code and easy editability of data on the altar of standardisation. Once again, I’ve had to accept that sanity is out and XML is in.

The objections to XML seem trivial. It’s verbose – big deal. It has a pointless distinction between “element content” and “attributes” which adds unnecessary complexity, but not that much unnecessary complexity. It is hideously hard to write a parser for, but who cares? The parsers are written; you just link to one.

The triviality of the objections is put in better context alongside the triviality of the problem which XML solves. XML is a text format for arbitrary hierarchically-structured data. That’s not a difficult problem. I firmly believe that I could invent one in 15 minutes, and implement a parser for it in 30, and that it would be superior in every way to XML. If a solution to a difficult problem has trivial flaws, that’s acceptable. If a solution to a trivial problem has trivial flaws, that’s unjustifiable.

And yet XML proliferates. Why? Since the only distinctive thing about it is its sheer badness, that is probably the reason. Here’s the mechanism: there was a clear need for a widely-adopted standard format for arbitrary hierarchically-structured data in text files, and yet, prior to XML, none existed. Plenty of formats did exist, most of them clearly superior to XML, but none had the status of a standard.

Why not? Well, because the problem is so easy. It’s easier to design and implement a suitable format than to find, download and learn the interface to someone else’s. Why use someone else’s library for working with, say, Lisp S-expressions when you could write your own just as easily, and have it customised precisely to your immediate needs? So no widely-used standard emerged.

On the other hand, if you want something like XML, but with a slight variation, you’d have to spend weeks implementing its insanities. It’s not worth it – you’d be better off using Xerces and living with it. Therefore XML is a standard, when nothing else has been.

This is not the “Worse is Better” argument – it’s almost the opposite. The original Richard Gabriel argument is that a simple half-solution will spread widely because of its simplicity, while a full solution will be held back by its complexity. But that only applies to complex problems. In hierarchical data formats, there is no complex “full solution” – the simple solutions are also full. That is why we went so long without one standard. “Worse is Better” is driven by practical functionality over correctness. “Insane is Better” is driven by the (real) need for standardisation over practical functionality, and therefore the baroque drives out the straightforward. Poor design is XML’s unique selling point.
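The “15 minutes to invent, 30 to parse” claim is easy to make concrete. A complete parser for one such format – Lisp-style S-expressions, used here purely as an example – fits in a dozen lines of Python:

```python
def parse_sexpr(text):
    """Parse a Lisp-style s-expression into nested Python lists of tokens:
    '(a (b c) d)' -> ['a', ['b', 'c'], 'd']."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    def read(pos):
        if tokens[pos] == "(":
            node, pos = [], pos + 1
            while tokens[pos] != ")":
                child, pos = read(pos)
                node.append(child)
            return node, pos + 1  # skip the closing ')'
        return tokens[pos], pos + 1
    tree, _ = read(0)
    return tree

print(parse_sexpr("(doc (p bold) plain)"))  # prints ['doc', ['p', 'bold'], 'plain']
```

This sketch handles arbitrary hierarchical data; what it omits (attributes, quoted strings with spaces) could be layered on just as simply. The point is only that the problem is small.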

Large and Small Organisations

Very good piece by Arnold Kling on the differences between large and small organisations.

If large organizations are dehumanizing, then why do they exist? Brad DeLong says that my assessment of large organizations must be incorrect, or else we would not have Wal-Mart.

A point Kling doesn’t make about Wal-Mart is that it is a fairly young organisation. It was in the 1970s that it became a really large organisation, and in the 1980s that it became spectacularly huge. As I have pointed out previously, it is over time that the bad effects of states and other large organisations accumulate. After thirty years, Wal-Mart is a very effective organisation, but one would expect the problems to start soon. The massive state-managed economy Britain instituted in the 1940s started falling apart in the 1970s, and the Soviet organisation set up through the 1920s and 30s probably peaked in effectiveness in the early 60s. Small organisations can stay effective indefinitely.

This piece by Paul Graham is also relevant – describing the Venture Capital / takeover cycle as a way of getting more of the best of both worlds.

Death toll

OK, so the death toll from the Great North Run matched that of the Hatfield rail crash.

I wonder how long the court case will last?

Dr Andrew Vallance-Owen of BUPA said, “At BUPA we encourage everyone to take an active interest in their health and running is a great way to keep fit. This year BUPA is sponsoring six runs including the BUPA Great North and BUPA Great South Runs.”

Oops, wrong page. That was last year. Actually, he said that fun-runners who failed to prepare properly for such gruelling events could suffer heart attacks. (Metro, Monday 19 Sep).

Not that there is any evidence that the victims did fail to prepare properly. The brother of 28-year-old Reuben Wilson said that Wilson had trained for the race. The immediate assertion that “if it didn’t work, you weren’t doing it properly” is one of those things that I generally find very annoying. Facts first, please, then conclusions.

Seriously, I don’t think that the organisers of the race should be considered liable for the deaths that occurred. But there is at least as much justification as in many other cases of accidental death, including Hatfield.

Politeness

There are two views of politeness. One is that it’s a kind of magical fairy-dust that you can add to whatever you do by using meaningless words like “please”.

That might be OK for teaching toddlers, but it’s rubbish.

Real politeness is caring about other people. “please” isn’t meaningless, it’s a contraction of “if you please”, and it means that you’re recognising that the person you’re talking to might not want to do what you’re asking, and that you’re accepting that they might choose not to do it.

Giving an order including the word “please” isn’t polite, it’s gibberish. Saying “please” isn’t polite, unless you mean it.

Now consider the message you get if you go to http://www.legos.com/:

“… We would sincerely like your help … Please always refer to our products as LEGO bricks …”

That is, as far as I can see, genuinely polite. They’re not giving orders or making threats. They’re pointing out what they call the stuff they make, and saying that they’d prefer it if their customers called it the same. There’s nothing to suggest that they are unaware that Cory Doctorow or anybody else can call it whatever they like, but like other global companies these days, they prefer to call their product by the same name everywhere (Snickers, anyone?). Unlike Mars, they can’t rename their product from “Legos” to “LEGO”, because it was never Legos in the first place; it’s just that Americans seem to be a bit confused. So they’ve made this polite request. Complaining about it seems ridiculously touchy.

The problem here is not BoingBoing, it is the people who never got beyond toddler level, who don’t know the difference between speaking politely and being polite, who say “please do not smoke here” when they mean “if you smoke here we’ll send security guards to throw you out”, who say “please do not copy this CD” when they mean “if you copy this CD we’ll sue you for $100,000”. They leave us in the position where we’re not quite sure whether the Lego message is insufferable bossiness or a mild request.

On reflection, the motive might not even be marketing. It might just make their skin crawl to hear the word “legos”. Mine does, a little, and I’m nothing to do with the company at all.

For my to-do list

I don’t like hardback books.

They’re too bulky, too heavy, and the dustjackets rip easily. I want a book I can shove in the already-crammed pocket of my laptop bag and read on the train. Paperbacks meet the need, and they’re cheaper too!

The only thing the hardbacks have going for them is that they’re available first, when the publicity hits. So what I want is some kind of application that I can notify when I hear about an interesting book, and which will let me know when the paperback comes out.

This one, a review of which Tim Worstall has picked up, would be a first choice: “Why Most Things Fail” by Paul Ormerod.

On a related note, “Freakonomics” is just out in paperback.

CPRE propaganda

In the news today: some utter, utter drivel from the Campaign to Protect Rural England.

If it had happened all at once, there would have been a huge outcry; determined, concerted action. But it didn’t; it happened over several decades – gradually, incrementally, without anyone really knowing who was responsible, or whether it was anything to do with them. And so those who can remember how things used to be look back uneasily. They find it hard to believe that it happened. But it did. It’s 2035, and the countryside is all but over.

The report itself is 48 pages. It will take me a while to give it the fisking it deserves, though I hope to get round to it. My first pass was to look through it for any evidence at all that would seem to contradict the key relevant fact, that Britain is mostly empty.

The report does state that the developed area of Britain is increasing (by 21 square miles per year, apparently), but nowhere does it put this in context of the area which is undeveloped. The nearest it gets to such a claim is the last bullet point on page 15:

“the total area of ‘tranquil countryside’ declined by 20% between 1960s and 1994, and continues to do so”.

The source for this claim is a 10-year-old publication by the same organisation.

They do make some accurate points: farming is declining (good!). Light pollution is an aesthetic problem (can be fixed by, er, pointing the lights downwards, and should be). Some bird species are in dangerous decline (but how much of that is caused by changing farming methods rather than encroaching development?). But the central claim is that we are running out of countryside, and that claim is utterly false – and indeed is made dishonestly, since they surely must have noticed, when they looked for evidence to put in their report and couldn’t find any, that there was none to support it.

Related posts:
Crowded Island
The War on Housing

Why Clarke must lose

I don’t much like this government. I don’t like Blair, and I don’t like Brown. Their centralising, high-spending, high-taxing, interventionist policies are damaging the economy and the nation.

But set against the whole context of nasty statist politicians, they’re not exceptionally bad. They get some stuff right – like this from Gordon Brown.

Somehow I can’t see Ken Clarke making that speech. And that’s his problem. For all his cuddly image and centrist appeal, if it comes down to an election between Gordon Brown and Ken Clarke, I think I’d prefer Brown. It’s kind of a “hanging or electrocution” type question, but there it is.

The trouble with ambiguity is that people suspect the worst. If the Tory party campaigns on a platform of “we want to cut spending but we’re not going to”, voters who don’t want cuts will expect to get them, and voters who do, won’t. Everybody will be put off.

Trade and Peace

Various observers have picked up on the new report from The Fraser Institute:

Economic freedom is almost 50 times more effective than democracy in diminishing violent conflict between nations, according to the Economic Freedom of the World: 2005 Annual Report.

Well, if trade is such an effective way of preventing wars, how did we get to be in this one?

Perhaps it’s something to do with the fact we refused to trade with Iraq for 12 years? As I argued previously, deliberately antagonising the government of a foreign country, without taking any effective steps to remove it, is a very bad policy. Is there any example in history of sanctions achieving any political goal, other than the goal of provoking a war? (and other interesting by-products)

I think it should be a rule of thumb: don’t introduce sanctions against a country unless you’re willing to fight it.