Tuesday 20 May 2014

I've found it! (again)


Data scientists like me spend a lot of time worrying about which type of model to fit and the best way of fitting it. 
As one example, we think about regression, its many pitfalls, and its arguable overuse in market research, and we look for alternatives that deal with issues such as collinearity, non-linear relationships, non-numeric dependent variables, non-‘normal’ predictors, and missing values, not to mention data files that are invariably imperfect (e.g. when a value of zero can mean ‘0’, ‘Don’t know’ or ‘the question wasn’t asked of that respondent’).
In fact, the frontier for modelling in market research is now very much associated with approaches from the data mining field, such as (you know I am going to say this) Random Forests and even Conditional Random Forests. 
Random Forests and similar approaches (which together fall into the category of ‘ensemble modelling’, which is more or less the quantitative version of Wisdom of Crowds) form the backbone of competitive attempts to predict the near-impossible, in the various competitions on Kaggle (http://www.kaggle.com/).
In my search for alternatives, a few years back I came across the unfortunately spelt ‘Eureqa’ software, and have written about it previously.
Eureqa Desktop uses something called ‘Symbolic Regression’, and its mode of operation is essentially to conduct a search for the best model amongst all possible models.  That is, it doesn’t just say “Here is the model you’ve asked for and I will now calibrate it”.  It says “Here are a zillion possible different models, and I will calibrate them all and let you know which is the best one.” 
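To make the “search amongst models” idea concrete, here is a minimal sketch in Python. It is not Eureqa’s algorithm (as I understand it, Eureqa evolves free-form expressions rather than working through a fixed list); it simply tries a handful of functional forms that I have made up for illustration, calibrates each one by least squares, and ranks them by error. The candidate list, the function names and the data are all assumptions.

```python
import math

# Hypothetical candidate forms: each maps x to one transformed feature,
# and the model fitted is y = a * feature(x) + b.
CANDIDATES = {
    "a*x + b": lambda x: x,
    "a*x^2 + b": lambda x: x * x,
    "a*sqrt(x) + b": math.sqrt,
    "a*log(x) + b": math.log,
    "a/x + b": lambda x: 1.0 / x,
}

def fit_line(features, ys):
    """Ordinary least squares for y = a*f + b; returns (a, b, sse)."""
    n = len(features)
    mean_f = sum(features) / n
    mean_y = sum(ys) / n
    sxx = sum((f - mean_f) ** 2 for f in features)
    sxy = sum((f - mean_f) * (y - mean_y) for f, y in zip(features, ys))
    a = sxy / sxx
    b = mean_y - a * mean_f
    sse = sum((y - (a * f + b)) ** 2 for f, y in zip(features, ys))
    return a, b, sse

def search_models(xs, ys):
    """Calibrate every candidate form; return results sorted by error, best first."""
    results = []
    for name, transform in CANDIDATES.items():
        features = [transform(x) for x in xs]
        a, b, sse = fit_line(features, ys)
        results.append((sse, name, a, b))
    return sorted(results)

# Made-up data generated from y = 3*sqrt(x) + 2 (positive x only,
# so that the sqrt, log and 1/x forms are all defined)
xs = [1.0, 2.0, 4.0, 9.0, 16.0, 25.0]
ys = [3 * math.sqrt(x) + 2 for x in xs]

best_sse, best_name, best_a, best_b = search_models(xs, ys)[0]
print(best_name, round(best_a, 3), round(best_b, 3))
```

With noise-free data like this, the form that generated the data wins easily. With real survey data the ranking is far less clear-cut, and a more flexible form will often win simply because it is more flexible.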
Eureqa Desktop is available from http://www.nutonian.com/ and is, unfortunately, no longer free. 
However, the Excel version is (at the time of writing) in beta testing, and is free.  And unlike Eureqa Desktop, Eureqa for Excel pretty much makes all the decisions for you (such as what functional forms to allow in the models that it examines).
I’m still to be convinced about Eureqa … but I do understand, for example, that if you input the co-ordinates of a swinging pendulum, Eureqa will eventually come up with Newton’s second law of motion as the best model for the data!
However, in market research, we rarely have data sets that display the same degree of precision as a swinging pendulum, and finding a model that predicts satisfactorily, however good the software used, is probably always going to be at least a little problematic.
Ultimately it comes down to finding an approach that (a) may not be perfect but works well enough and (b) the outputs of which make sense, in the light of what else we know about the particular situation being analysed.
As the statistician G.E.P. Box wrote in 1979 … “All models are wrong, but some are useful.” 

Monday 5 May 2014

Something in the air?


Like very many people in this industry, I have loads of stuff that I can’t afford to lose.

So the only option is to back it up, both frequently and in full.

What’s the best way to do that?  Well, you’ll all have your own strategies.

For example, one of my clients was telling me that she backs up all her computer-based records on to a hard drive at the end of every day, then one of the people in her office takes the drive home and looks after it overnight.  That way, by alternating between two or more hard drives, she is pretty sure of never being more than a day or two behind, if disaster strikes.

I do something similar, but using a NAS (network attached storage) hard drive (i.e. a 4TB drive that is accessed through my office network), backing up to it automatically every night at 1am.  The NAS drive is in a “secret” place that no-one (I hope) will ever find.
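For anyone wanting to roll their own, a nightly copy of this kind can be sketched in a few lines of Python. This is not the software I actually use; it is a minimal illustration that assumes the NAS is reachable as an ordinary folder, and the function name is made up. It copies only files that are missing from the destination or have changed since the last run, and it never deletes anything.

```python
import os
import shutil

def mirror_newer_files(source, destination):
    """Copy files that are missing from destination, or newer in source."""
    copied = []
    for folder, _subdirs, files in os.walk(source):
        # Rebuild the same sub-folder structure under the destination
        relative = os.path.relpath(folder, source)
        target_folder = os.path.normpath(os.path.join(destination, relative))
        os.makedirs(target_folder, exist_ok=True)
        for name in files:
            src = os.path.join(folder, name)
            dst = os.path.join(target_folder, name)
            if not os.path.exists(dst) or os.path.getmtime(src) > os.path.getmtime(dst):
                shutil.copy2(src, dst)  # copy2 also preserves the modification time
                copied.append(dst)
    return copied
```

The destination must not sit inside the source folder, and the 1am scheduling would be left to the operating system (Task Scheduler on Windows, cron elsewhere).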

But that’s not good enough.  So I’ve also been a user of cloud storage for a couple of years now, starting with Google’s service.  But ultimately, I opted for Microsoft’s ‘OneDrive’ service (the product formerly known as ‘SkyDrive’, which was apparently the subject of an infringement action in the UK by Sky TV, so the name had to change).   I pay US$100 per year for 200GB of cloud storage capacity, which is a pretty fair charge, particularly when compared with Google’s fees for a similar, but lower capacity, cloud storage service.

Now, you’d think that Microsoft would have this pretty well sorted.

Unfortunately not.

For example:
  • Some OneDrive users (particularly Windows 8.1 users) have had massive problems in just trying to access a local copy of a file they have put onto the cloud service via syncing from another PC.
  • It seems also that a move to Windows 8.1 can mean that the service completely stops.  One of the solutions to this suggested by the Microsoft help desk was to back up your entire OneDrive directory, then delete it, then re-create it and start again, whilst at the same time saying that Microsoft took ‘no responsibility’ for any loss of data.  [Really?  I thought the whole purpose of cloud storage was so that you wouldn’t lose data.]
  • In my case, I have around 130GB of data in my OneDrive storage, and re-syncing that to the cloud facility literally takes days.
  • Whilst you can upload files with path lengths greater than 255 characters, once you have done so, the syncing simply stops at the point where it encounters the ‘excessive’ path length, and the only fix is to log onto the service on-line, locate the offending file and path, shorten it, then re-start the syncing.  But if there is another path with the same problem, the procedure starts all over again.  Some users have reported having literally hundreds of long-path-name files which they have had to rectify one by one.
  • When you are on a 3G/4G mobile connection rather than Wi-Fi, it seems that OneDrive becomes extremely reluctant to be helpful, on the grounds that it might use too much of your bandwidth.
  • Lastly, if you want further indication of difficulties encountered by OneDrive users, just search “skydrive-onedrive-error-files-cant-be-uploaded”, or similar.
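On the path-length problem in particular, rather than waiting for the sync to trip over each offending file in turn, it is possible to look for them all up front. The following is a minimal sketch in Python, not an official tool. It assumes that the length that matters is that of the full local path and that the limit is the 255 characters mentioned above; the function name and folder location are hypothetical, and the service may well count the path differently.

```python
import os

def find_long_paths(root, limit=255):
    """Return (length, path) pairs for files under root whose full path exceeds limit."""
    offenders = []
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            full_path = os.path.join(folder, name)
            if len(full_path) > limit:
                offenders.append((len(full_path), full_path))
    # Longest first, so the worst cases are dealt with first
    return sorted(offenders, reverse=True)

if __name__ == "__main__":
    # Hypothetical location of the local sync folder; change to suit
    sync_folder = os.path.expanduser("~/OneDrive")
    for length, path in find_long_paths(sync_folder):
        print(length, path)
```

Running this before a big re-sync gives a single list of files to shorten in one pass, instead of discovering them one at a time.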

Having said all that (sorry for the rant), when OneDrive is working, it works really very well, and pretty much seamlessly.

But also bear in mind that Microsoft has some pretty severe conditions for using their service … http://wmpoweruser.com/watch-what-you-store-on-skydriveyou-may-lose-your-microsoft-life/

I haven’t anywhere near the same experience with other cloud storage facilities, so can’t comment on those.

Regardless, and in summary, I think the moral is never to rely on just one form of back-up.  If I had to re-create even just this article, it would mean at least another hour out of my day.  Imagine if I had to re-create all of yesterday’s work?