Monday, February 08, 2010

24 years of email

I first got an email address with an Internet @ in it in 1986. It was jgc@prg.ox.ac.uk, or for those of you on JANET it was JGC@UK.OX.AC.PRG (happily I only briefly used bang paths). In 24 years I think there have been three major end-user innovations: address books, MIME and email searching.

Address Books

Initially, I didn't need an email address book. Most of the people I was emailing were on the same domain (often the same machine) and so everything after the @ was irrelevant. And the number of people on email world-wide was so small that remembering their email addresses was easy (I don't mean remembering them all, just remembering the ones I needed to talk to).

And most people's domains hadn't reached the point where just using initials was unworkable. So most email addresses consisted of their initials. That made them short and memorable. I don't recall anyone with a ridiculous address like john.graham-cumming@prg.ox.ac.uk.

But things changed: the Internet got bigger, people's addresses got more complex, and I was communicating with more and more people. Hence address books.

MIME

The ability to send more than just plain text inside an email (even if it is actually being transmitted as 7-bit ASCII) was big. Prior to the introduction of MIME in 1992 there were some limited ways to send binary content in email (mostly using uuencode), but it was an ugly mess: mail clients often didn't know what to do with the contents, and you were forced to save the mail to a file and manually unpack it.

Happily, MIME made that problem go away.

Email Searching

As email got considerably more widespread it became necessary to put it into folders to try and keep a handle on the volume. This led to the sort of trees of folders that are seen in programs like Microsoft Outlook. This is, IMHO, a less than optimal solution. The right solution is the sort of high-speed email searching offered by Google Mail. With it folders are completely irrelevant.

In fact foldering was such a pain that it was part of the reason I invented POPFile.

The Bad

Two bad things have happened since I started using email: spam (the first spam was in 1978 on ARPANET, but I don't recall any unwanted messages during the late 1980s at all) and HTML email. HTML email has been a spammer's playground and for messages I want to receive (i.e. everything other than marketing) it's almost useless.

Minor irritations are vacation responders and people who don't edit their replies, sending me gigantic threads embedded in a message.

8 years

That's one major innovation every 8 years. With Google Mail being released in 2004 we've got another 2 years to wait for the next one. What do you think it will be? For me it has to be something to do with threading. That's still pretty messy, and Google Wave doesn't seem to have improved it. I don't think the little > is cutting it anymore.


Sunday, February 07, 2010

Something odd in the CRUTEM3 station errors

Out of the blue I got a comment on my blog about CRUTEM3 station errors. The commenter wanted to know if I'd tried to verify them: I said I hadn't since not all the underlying data for CRUTEM3 had been released. The commenter (who I now know to be someone called Ilya Goz) correctly pointed out that although a subset had been released, for some years and some locations on the globe that subset was in fact the entire set of data and so the errors could be checked.

Ilya went on to say that he was having a hard time reproducing the Met Office's numbers. I encouraged him to write a blog post with an example. He did that (and it looks like he had to create a blog to do it). Sitting in the departures lounge at SFO I read through his blog post and Brohan et al. Ilya's reasoning seemed sound, his example was clear and I checked his underlying data against that given by the Met Office.

The trouble was Ilya's numbers didn't match the Met Office's. And his numbers weren't off by a constant factor or constant difference. They followed a similar pattern to the Met Office's, but they were not correct. At first I assumed Ilya was wrong and so I checked and double checked his calculations. His calculations looked right; the Met Office numbers looked wrong.

Then I wrote out the mathematics from the Brohan et al. paper and looked for where the error could be. And I found the source. I quickly emailed Ilya and boarded the plane to dream of CRUTEM and HadCRUT as I tried to sleep upright.

Mathematical Interlude

The station error consists of three components: the measurement error, the homogenisation error and the normal error. The first two are estimated in the paper as 0.03°C and 0.4°C respectively. The normal error is calculated from the standard deviation information in the station files.

The formula for the normal error for a single month, i, is as follows:

$$\epsilon_N = \frac{\sigma_i}{\sqrt{N}}$$

Unfortunately, the paper uses rather sloppy mathematical language: the N on the left is not the N on the right and the subscript i isn't defined. So I am going to express this a bit more clearly as follows:

$$\epsilon_{\mathrm{normal},i} = \frac{\sigma_i}{\sqrt{m_i}}$$

This means that the normal error for month i is the standard deviation for month i (that's $\sigma_i$) divided by the square root of the number of years used to generate the normal values in the station files (which I call $m_i$). Typically we have:

$$m_i = 30$$

because 30 years of data from 1961 to 1990 are used. In cases where less than 30 years are available (because of missing data) then a number less than 30 is used.

Now to get the station error, $\epsilon_i$, the three error components are joined together by quadrature as follows:

$$\epsilon_i = \sqrt{0.03^2 + 0.4^2 + \frac{\sigma_i^2}{m_i}}$$

That works for any grid square where for any month there's just a single station reporting a temperature, but in general there are more. So when there are many station errors they are averaged using a root mean square and then divided by the square root of the number of stations.

Suppose there are n stations each with a station error $\epsilon_{i,j}$ (to which I've added the subscript j to differentiate them) then the final station error for a month i is as follows:

$$\epsilon_i = \frac{1}{\sqrt{n}} \sqrt{\frac{1}{n} \sum_{j=1}^{n} \epsilon_{i,j}^2}$$


Return to narrative

What Ilya had discovered was that the formula above (from the paper) works only when there is a single station in a grid square. When there were two or more it failed; that's when he approached me asking for help.

What I discovered at the airport was that if you replaced the number 30 with 15 the formula worked and the values for station errors for grid squares containing exactly two stations were now correct.

Both Ilya and I came to the same conclusion: the number 15 wasn't picked from thin air, but was in fact 30 divided by 2 (the number of stations in the grid square). We both tested this hypothesis on squares with more than two stations and found that it worked.

So it appears that the normal error used as part of the calculation of the station error is being scaled by the number of stations in the grid square (the 30 is divided by the number of stations, which inflates the normal error by the square root of that number). This leads to an odd situation that Ilya noted: the more stations in a square the worse the error range. That's counterintuitive: you'd expect that the more observations you have, the better your estimate would be.
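To see what this scaling does, here is a short derivation under a simplifying assumption that is mine and not the paper's: all n stations in a square share the same standard deviation. The symbol a is just shorthand I've introduced for the fixed measurement and homogenisation part.

```latex
% Assumption (for illustration only): all n stations share the same \sigma_i.
% a^2 collects the measurement and homogenisation errors.
a^2 = 0.03^2 + 0.4^2 = 0.1609

% Formula as described in the paper: every term shrinks as n grows.
\epsilon_i^{\text{paper}}
  = \frac{1}{\sqrt{n}} \sqrt{a^2 + \frac{\sigma_i^2}{30}}
  = \sqrt{\frac{a^2}{n} + \frac{\sigma_i^2}{30\,n}}

% With 30 replaced by 30/n: the normal-error term no longer shrinks with n.
\epsilon_i^{\text{observed}}
  = \frac{1}{\sqrt{n}} \sqrt{a^2 + \frac{n\,\sigma_i^2}{30}}
  = \sqrt{\frac{a^2}{n} + \frac{\sigma_i^2}{30}}
```

Under that assumption the second expression is larger than the first whenever n is greater than 1, and the ratio between the two grows as stations are added.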

Examples

Ilya had shown me an example in 1947, but I didn't want to take his word for it (although he later showed me a program to check all the station errors so I should have believed him), and so I took a look at three locations in January 1850. For these three locations all the data underlying CRUTEM3 had been released:

1. The grid square which consists of the single station 723660: this corresponds to the grid square with corner 35N, 105W. Here the Met Office data gives station errors of: 0.5072 0.5424 0.4857 0.4962 -1e+30 -1e+30 0.4407 0.4407 -1e+30 0.4756 -1e+30 0.5186. The strange negative numbers are missing data (they're missing because in the underlying file there are no normals for 1850 in those months, although the actual normals aren't needed for the station error calculation so it doesn't matter). Using the formula from the paper gives the correct answer: 0.5072 0.5424 0.4857 0.4962 0.4486 0.4756 0.4407 0.4407 0.4661 0.4756 0.5072 0.5186. This makes sense since our correction value of 1 for 1 station in the square doesn't change the formula.

There is, however, something else wrong with this. The paper says that if less than 30 years of data are available the number mi should be set to the number of years. In 723660 there are only 17 years of data, so this station error appears to have been incorrectly calculated based on 30 years.

2. The grid square which consists of the two stations 753041, 756439: this corresponds to the grid square with corner 35N, 80W. Here the Met Office data gives station errors of: 0.6168 0.569 0.5452 0.4008 0.4345 0.3642 0.3373 0.353 0.3881 0.4624 0.4076 0.5767 and using the formula from the paper (without our correction): 0.4801 0.4496 0.4346 0.3472 0.3669 0.3264 0.3116 0.3202 0.3399 0.3836 0.3511 0.4545. If a correction of 2 is used so that each σi is divided by the square root of 15 instead of 30 the correct values are generated.

3. The grid square which consists of the four stations 720388, 724080, 756192, 756490: this corresponds to the grid square with corner 35N, 75W. Here the Met Office data gives station errors of: 0.5073 0.4409 0.4329 0.3361 0.3286 0.2905 0.2712 0.2807 0.2973 0.3739 0.3325 0.4613 and using the formula from the paper (without our correction): 0.3074 0.2807 0.2775 0.2417 0.2391 0.2264 0.2204 0.2233 0.2286 0.2552 0.2404 0.2887. If a correction of 4 is used so that each σi is divided by the square root of 7.5 instead of 30 the correct values are generated.
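The pattern in the three examples above can be sketched in a few lines of Python. This is only an illustration, not the Met Office's code: the function names are my own, and the standard deviation of 1.0 for every station is a hypothetical value (the real ones are in the station files). It compares the formula as described in the paper with the variant where 30 is divided by the number of stations.

```python
import math

MEASUREMENT_ERROR = 0.03      # degrees C, as estimated in the paper
HOMOGENISATION_ERROR = 0.4    # degrees C, as estimated in the paper
NORMAL_YEARS = 30             # 1961 to 1990


def station_error(sigma, years):
    """Join the three error components by quadrature for one station."""
    normal_error = sigma / math.sqrt(years)
    return math.sqrt(MEASUREMENT_ERROR ** 2
                     + HOMOGENISATION_ERROR ** 2
                     + normal_error ** 2)


def grid_square_error(sigmas, scale_years_by_station_count=False):
    """Root mean square of the station errors, divided by sqrt(n).

    With scale_years_by_station_count=True the 30 years is divided by the
    number of stations, which is the correction described in this post.
    """
    n = len(sigmas)
    years = NORMAL_YEARS / n if scale_years_by_station_count else NORMAL_YEARS
    errors = [station_error(s, years) for s in sigmas]
    rms = math.sqrt(sum(e * e for e in errors) / n)
    return rms / math.sqrt(n)


# Hypothetical input: every station has a standard deviation of 1.0.
for n in (1, 2, 4):
    sigmas = [1.0] * n
    print(n,
          round(grid_square_error(sigmas), 4),
          round(grid_square_error(sigmas, scale_years_by_station_count=True), 4))
```

With that hypothetical standard deviation the two columns come out at 0.4407 and 0.4407 for one station, 0.3116 and 0.3373 for two, and 0.2204 and 0.2712 for four, which happens to line up with the seventh value in each of the lists quoted above.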

Conclusion

I have no idea why the correction given in this blog post by Ilya and me works: perhaps it indicates a genuine bug in the software used to generate CRUTEM3, perhaps it means Ilya and I have failed to understand something, or perhaps it indicates a missing explanation from Brohan et al. I also don't understand why when there are less than 30 years of data the number 30 appears to still be used.

If these are bugs then it indicates that CRUTEM3 will need to be reissued because the error ranges will be all wrong.

I've emailed the Met Office asking them to help. If you see an error in our working please let us know!


Wednesday, February 03, 2010

A compliment from The Times

The Times has kindly mentioned this blog as one of its Top 30 Science Blogs saying:

John Graham-Cumming is one of the few people out there who makes the nuts and bolts of computer programming actually sound interesting. Expect anything from an analysis of the statistical likelihood of election fraud in the last Iranian election to the unveiling of flaws in the Met Office’s global climate models.

Thanks!


Friday, January 29, 2010

New version of CRUTEM3 and HADCRUT3

There's a new version of the Met Office land surface temperature record out with lots and lots more stations.

Plus it includes corrections for all the problems I found with the data (they didn't make good on their promise to acknowledge me, sadly).

But my handiwork is shown by the points in green:


My two corrections: A and B.

I'll run these through my own programs to see what they produce.


Thursday, January 28, 2010

John's Amazing Diet Secrets Revealed!

Now, at last, I can reveal the top diet secrets that doctors have been keeping from you! Yes, this is how I lost an AMAZING 9.9kg (21.8 pounds) in just 6 months doing absolutely no exercise at all.

Put down those weights, step off the StairMaster and follow these amazingly simple steps to a better figure:

EAT LESS, EAT WELL

In April 2008 I decided that my 82.5kg (181.9 pounds) was too much for my height. Ideally I should have been 72kg (158.7 pounds), so I needed to lose 10.5kg (23.1 pounds) to be my ideal weight.

It's this image of me that really decided me to lose weight. Nothing like the shape of that stomach in public.

So, I read up on dieting and digested The Hacker's Diet. I ignored almost all the advice except for one thing: you can stuff your face with calories far faster than you can burn them off. That revelation led me to the Colarie.

A single can of Coke is about 147 calories. How much exercise does it take to burn that off? Well just running it would take something like 15 minutes. And I thought why run? Just don't stick the stuff in your face in the first place.

So, I eliminated all soda and drank cold water instead. I also stopped ever eating sweets (candies). But that wasn't enough: I decided that I needed to eat less.

So, I simply looked at whatever food I was eating and ate about 50% of it and left the rest. I didn't order desserts, I stopped having sugar in coffee.

And, boy, was I hungry at first. After a while I wasn't hungry any more and I think I enjoyed the taste of food more. After a while the amount of food I was eating was simply less than before and I was no longer forcing myself to stop.

That was helped by eating slowly. By doing that I discovered when I was full and stopped eating. I didn't stop eating particular foods (other than the particular nasties like Coke, desserts and sugary snacks), I ate a wide variety of things (including foie gras and other delicacies). I just ate less of it.

The weight fell off me. Here's the magic chart:


I never quite made it to my goal; my weight stabilized at around 72.5kg (159.8 pounds) and remained there. Now whenever it goes up a bit I know what to do.

(BTW I'm not saying don't exercise, there are lots of good reasons to do exercise, but I don't think weight loss is one of them. Do it because you enjoy it, do it because it clears your head, etc.)

WEIGH YOURSELF

Weighing yourself is vital. I made it a ritual so that I got consistent results. Your weight will vary throughout the day so I had a simple technique:

1. Same time, same day each week: I weighed myself on Saturday mornings, immediately after getting up (after visiting the bathroom) and before eating.

2. Same situation: I always weighed myself naked so that there were no inconsistencies.

3. Once: I weighed myself once on that Saturday morning and recorded the result. I didn't fret about the result. I didn't reweigh myself.

KEEP TRACK

I kept all the weight data in a spreadsheet. From that I could draw an encouraging chart, and calculate how many more weeks were needed for me to reach my goal. And I could calculate my BMI (which seems like a bogus figure to me).
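For anyone who'd rather not use a spreadsheet, the same two calculations can be sketched in Python. This is a rough illustration rather than what my spreadsheet actually did: the function names, the sample weights and the simple average-loss-per-week estimate are all assumptions made for the example.

```python
import math


def bmi(weight_kg, height_m):
    """Body mass index: weight in kg divided by height in metres squared."""
    return weight_kg / height_m ** 2


def weeks_to_goal(weekly_weights_kg, goal_kg):
    """Estimate the weeks left, using the average weekly loss so far.

    Returns None if there are fewer than two weigh-ins or no loss yet.
    """
    if len(weekly_weights_kg) < 2:
        return None
    weeks_elapsed = len(weekly_weights_kg) - 1
    average_loss = (weekly_weights_kg[0] - weekly_weights_kg[-1]) / weeks_elapsed
    if average_loss <= 0:
        return None
    remaining = weekly_weights_kg[-1] - goal_kg
    return max(0, math.ceil(remaining / average_loss))


# Hypothetical Saturday-morning weigh-ins, one per week.
weights = [82.5, 82.0, 81.5]
print(weeks_to_goal(weights, goal_kg=72.0))
```

A straight-line estimate like this is crude (weight loss tends to slow down), but it is enough to keep the goal in sight from week to week.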

That's it: eat less, eat well, keep track.

PS I did do some exercise. During this period I didn't own a car and walked everywhere or took public transport. I was working at home and frequently would walk outside to get some sun and clear my mind.


Tuesday, January 26, 2010

£1,000 for Bletchley Park thanks to The Geek Atlas

When The Geek Atlas was published in June 2009, O'Reilly's UK arm decided to pledge to donate 50p per copy sold in the UK to help fund Bletchley Park.

O'Reilly has now made good on that pledge and with almost 2,000 copies of the book sold in the UK it has donated £1,000 to Bletchley Park.



And the 50p per copy pledge continues. All copies of The Geek Atlas sold in the UK result in a 50p donation to keep this wonderful place alive.


The squawking Squaw King was stabbed in a stab bed

Yesterday I tweeted: "Realized that 'assisted' is 'ass is ted'. Are there other non-compound words in English which consist entirely of other words?" and people replied with "is land" and "cut lass".

Naturally, I couldn't resist writing a small amount of code to figure out other word sequences within words. Using a short program and a 57,000 word English dictionary of common words I had the answer: 12,870 words. That means 23% of English words have this property.

Of course, many are rather boring because they are just compound words. But others are more fun:

I have secretions of secret ions and a seepage (see page 21), but sematically my sematic ally says I am fatalistic and fatal is tic bite. But the fellow fell, ow! And asked, do we seal ants with sealants? I went to the palace to see my pal (ace) and said "Serge! Ants". He called for "Sergeants!". But with an antelope the ant elope.

I smelt an aroma: the rapist! Yet, it was just the aromatherapist.

You can get the full list here.

Update: Here's the code

# ----------------------------------------------------------------------------
# Small program to find words that consist entirely of other words
# concatenated. An example is 'fatalistic' which is 'fatal is tic'
#
# Written by John Graham-Cumming
# ----------------------------------------------------------------------------

use strict;
use warnings;

# The first argument to the program is the filename of a dictionary of
# words, this dictionary will be searched for words consisting of word
# sequences. It should be simply one word per line.
#
# It is loaded into the %words hash.

my $dict = $ARGV[0];
my %words;

if ( open my $fh, '<', $dict ) {
    while (<$fh>) {
        chomp;
        $words{$_} = 1;
    }
    close $fh;
} else {
    die "Cannot open dictionary file $dict\n";
}

# Check every word in the dictionary using the recursive function
# check_word. Note that I don't sort the words here since that might
# take a long time. Sorting can be done on the output.

foreach my $w (keys %words) {
    my $sub = check_word($w);

    if ( $sub ne '' ) {
        print "$w ($sub)\n";
    }
}

# check_word extracts ever longer subsequences of the word to be
# checked and sees if they are themselves words (by checking in
# %words). If a word is found then the remainder of the word is sent
# to a recursive call to check_word.
#
# For example, suppose we do check_word( fatalistic ), the code will
# check the following:
#
# check_word: fatalistic; found so far:
#   f?
#   fa?
#   fat?
#     check_word: alistic; found so far: fat
#       a?
#         check_word: listic; found so far: fat a
#           l?
#           li?
#           lis?
#           list?
#             check_word: ic; found so far: fat a list
#               i?
#           listi?
#       al?
#       ali?
#       alis?
#       alist?
#       alisti?
#   fata?
#   fatal?
#     check_word: istic; found so far: fatal
#       i?
#       is?
#         check_word: tic; found so far: fatal is
#
# This function returns an empty string if the word does not consist
# of other words, or a string containing the word broken down into
# space separated words
#
# e.g. check_word('fatalistic') returns ' fatal is tic'
#      check_word('potato') returns ''

sub check_word
{
    my ( $w,            # The word to check
         $depth ) = @_; # Contains the words found so far, or
                        # undefined when first called

    if ( !defined( $depth ) ) {
        $depth = '';
    } else {
        if ( defined( $words{$w} ) ) {
            return "$depth $w";
        }
    }

    for my $i (1..length($w)-1) {
        my $fragment = substr($w,0,$i);
        if ( defined( $words{$fragment} ) ) {
            my $sub = check_word(substr($w,$i), "$depth $fragment");
            if ( $sub ne '' ) {
                return $sub;
            }
        }
    }

    return '';
}


Saturday, January 23, 2010

Price drop on GNU Make Unleashed

I've dropped the price on GNU Make Unleashed to €15.00 (for the printed book) and €10.00 (for the PDF).



And I'm working on a version for the Kindle.


A not very illuminating reply from the Met Office

On the 15th I posted about six additional stations that can be used with the Met Office land surface temperature record. The Met Office kindly replied to my query about the six saying that they could be used despite the missing standard deviations.

I followed up with this query:

Thank you. I'll post on note on my blog with your reply.
Was there a particular rationale for using 16 instead of 15?

They have now replied. Unfortunately, the reply doesn't really shed any light on the situation because they don't say why 16 vs. 15, just that they were calculated separately:

Thank you for your email.

The normals and standard deviations were calculated separately and the limits (15 years and 16 years) were set independent of one another.

I wonder why? Brohan et al. 2006 clearly says 15 is the limit:

(the requirement now is simply to have at least 15 years of data in this period)

It just might be possible that the Met Office isn't telling me why because the why could be a bug. Since there are two programs this could be one of those classic off-by-one errors that crop up in programming all the time.

It's not hard to imagine the normals program doing

if ( number_of_years >= 15 )
...

and the standard deviation program doing

if ( number_of_years > 15 )
...

Equally that's almost groundless speculation on my part and perhaps there's some other good reason that the Met Office decided not, or hasn't taken the time, to tell me about.
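To make that speculation concrete, here's a small Python sketch of how two programs sharing the same constant could end up with different effective limits. The function names and the constant are invented for the illustration; I have no idea what the Met Office's code actually looks like.

```python
MIN_YEARS = 15  # the limit stated in Brohan et al. 2006


def enough_years_for_normals(number_of_years):
    """Hypothetical normals check: 15 years is accepted."""
    return number_of_years >= MIN_YEARS


def enough_years_for_standard_deviations(number_of_years):
    """Hypothetical standard deviation check: 15 is rejected, 16 accepted."""
    return number_of_years > MIN_YEARS


for years in (14, 15, 16):
    print(years,
          enough_years_for_normals(years),
          enough_years_for_standard_deviations(years))
```

A station with exactly 15 years of data would pass the first check and fail the second, which would leave it with normals but no standard deviations.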

Either way the six stations can be used.
