From Lego to Play-Doh: I plead guilty at the altar of Big Data

Or: Why the geospatial community needs to shed old habits and learn new skills to take advantage of Big Data.

I just had a realisation. Sometime in the recent past I must have crossed a threshold. I don’t know exactly when it happened, or how. Through a kind of intuition or common sense I took small steps and made gradual changes, and suddenly I’ve woken up and seen what I have done: I no longer believe in the classical doctrine of old-fashioned GIS and data management.

Looking around my department, staffed entirely by data experts and other tefal heads, it suddenly dawned on me that I was no longer engaged in what I might have called ‘data’ a few years ago. Take our placement student who is doing a project on artificial intelligence. Analysing our real estate data, he has shown that if you mention the word ‘radiator’ in a property listing, the house won’t sell. Figure that! Then there is the timeline of the British economy, brilliantly visualised by my in-house guru: 10 years’ worth of housing market data, condensed into dots and bubbles performing a beautiful choreography on a five-dimensional stage. Millions of data points made entirely comprehensible in one clip. Amazing. The list goes on.

I hate to admit it but I’ve joined the latest bandwagon: Big Data. But it’s not just the ‘big’ that is different.

Like many others in the geospatial industry, I had grown up with the notion that the world was there to be abstracted, structured, ordered, and modelled with great accuracy. When I entered the industry in the late 1990s, GIS and relational databases were state of the art. People talked about how spatial data infrastructures would create virtual representations of everything that exists in the world. The digital nirvana was near.

Lego. Picture by Dunechaser (Flickr CC)

When the nirvana finally arrived it didn’t quite look like some people had imagined. Instead of the Legoland which some had expected – a stack of bricks, neatly built from the ground up – it looked more like a pile of Play-Doh balls: amorphous, gooey, messy. Google Earth, for example, was great fun but it didn’t take long for many of us ‘serious’ professionals to dismiss it as eye candy. Not accurate enough. Not the right projection. Bad rubbersheeting. Poor attributes.

Play-Doh. Picture by daniella_caterina (Flickr CC)

Critically, creating spatial data had suddenly become as easy as composing an email, and so KML files began to rain down like a hail storm announcing the arrival of a tornado. People started annotating the maps with random snippets of data without any concern for relevance or quality. Purists who had dedicated their lives to structured order were feeling exasperated: Where are the standards? Where is the metadata? To which most non-initiated people said, meta-what?

The truth is that mapping had simply become a more realistic, a less abstract representation of the world. But hey, we said, we still do the heavy lifting, and we have the degrees to prove it. Who else knows about geodetic datums or spatial intersections? Of course. Not like those lightweights at Google who manage petabytes of data requiring so much energy, they need their own power stations.

The trouble with the old data world is that the only perfect database is an empty one. This is because the world is not perfect, not regular, not linear. It’s a kind of chaos so vast that even its randomness creates patterns – a bit like those Mandelbrot fractals that were so popular with the first PCs in the 1980s.
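The fractal analogy can be made concrete. A minimal sketch of the Mandelbrot escape-time iteration (plain Python, no plotting libraries assumed) shows how a trivially simple rule produces an endlessly intricate boundary:

```python
# Minimal Mandelbrot sketch: one deterministic rule, z -> z^2 + c,
# yet the shape it traces has detail at every scale.

def escape_time(c: complex, max_iter: int = 100) -> int:
    """Iterations before |z| exceeds 2 (max_iter means 'likely in the set')."""
    z = 0j
    for n in range(max_iter):
        z = z * z + c
        if abs(z) > 2:
            return n
    return max_iter

# Points inside the set never escape; points far outside escape at once.
print(escape_time(0 + 0j))    # never escapes -> 100
print(escape_time(2 + 2j))    # escapes on the first iteration -> 0
```

Zoom anywhere near the set’s edge and the same rule keeps generating new pattern out of apparent randomness, which is roughly the point being made about the world’s data.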

If you are a classically-trained spatial data professional like me, don’t let your well-honed perfectionism get in the way of your next ride. Big Data is here now. Take a deep breath and accept that quantity will eventually trump quality. And when the quantities are huge, the insights can be many. In the world of Big Data, your job is not about structuring or managing data. It’s about telling stories.
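One way to see why quantity can trump quality is the law of large numbers: many cheap, noisy observations can average out to a better estimate than a handful of careful ones. A hedged sketch, with made-up numbers standing in for crowd-sourced position fixes of a single landmark:

```python
import random

# Hypothetical illustration: each 'fix' is a low-quality latitude reading,
# the assumed truth plus Gaussian noise. More fixes -> smaller mean error.

random.seed(42)
TRUE_LAT = 51.5074  # assumed 'true' latitude of the landmark

def noisy_fix(sigma: float = 0.01) -> float:
    """One cheap, noisy fix (roughly 1 km of scatter at this scale)."""
    return TRUE_LAT + random.gauss(0, sigma)

few = [noisy_fix() for _ in range(5)]
many = [noisy_fix() for _ in range(5000)]

err_few = abs(sum(few) / len(few) - TRUE_LAT)
err_many = abs(sum(many) / len(many) - TRUE_LAT)

print(f"error with 5 fixes:    {err_few:.5f}")
print(f"error with 5000 fixes: {err_many:.5f}")  # typically far smaller
```

The individual readings are worse than anything a surveyor would accept, but in bulk they converge on the answer, which is the bargain Big Data offers.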


9 thoughts on “From Lego to Play-Doh: I plead guilty at the altar of Big Data”

  1. Great post… I had this reality smash me in the face recently whilst working on a large modelling project, but was too dazed to realise it. I was focused on geospatial accuracy whilst what was required were assumptions and smart data analytics. Got there in the end, but it hurt!

  2. Great post Thierry. Interestingly, Big Data is really giving some momentum to some ole skool data storage concepts (which never really took off), like columnar storage and object stores. We’ve been looking at Vertica, which is just starting to support spatial types. I get the feeling there’s going to be an almighty bun fight in the Big Data space.

  3. Great stuff. Can’t tell you how many mid-career GIS types are wondering “is this as good as it gets?” with the traditional geospatial career trajectory. Oddly enough, it may be Big Data and Information Visualization that gets us excited again by spatial analysis and cartography.

    On a related note, I firmly believe R is the next great GIS software behemoth.


    • R as a solid spatial analysis platform? Right on! R is robust, adaptable, scalable and all those other IT-necessary words. Also, MapReduce, Hadoop, Node.js, Apache Pig, CouchDB and a bunch of other open-source platforms do a wonderful job of managing and analysing spatial data, and, here’s the important part, as part of the rest of the platform.

  4. The terminologies and buzzwords are a bit much in reality… though they are probably good marketing tools. Big data has always been there; people have been analysing it for decades. Perhaps it hadn’t been centralised so well, but in many cases it has, e.g. a census. We used to call them big datasets, or for some, VLDBs; we used SQL, C code, even just basic statistics, and the adventurous used machine learning, neural networks, multi-level modelling. These people don’t see it as spatial this, or GIS that, or geomatics the other… to them it’s geography, and they’ve been doing it forever! And what’s big today won’t be big tomorrow, it’ll be small – so follow that and big data has always been there. So I’d say it’s not so new, but there are now wider opportunities dawning on a greater variety of big datasets – eyes are opening in a wider group of industries to the latent value, or the value that is emerging.

    Usefully, the bigger the dataset, the more relevant statistical significance and trends become… probability comes into play over anal-retentive inspection of data accuracy. In many cases these data have sat fallow because the skills and the imagination in the workforce are lacking to exploit them. The need to see through the noise and create justifiable, understandable, well-communicated facts of life, of the earth, of geography, of relevance to people and business – is there! The recognition that the skill set to achieve this is required is sadly lacking; it should be fostered over dishing out yet another software training session, or hours trying to achieve 100% data excellence. Most posts I see anywhere on spatial stuff… are about databases, software, web apps and technologies which you know will change before the important bugs are ironed out, and the chat is transient and old in a year. Most manifestations of these technologies don’t really portray geography… just where stuff is.

    Fundamental appreciation and understanding of statistics, visualisation techniques and information management will persist and pay dividends for the individual and company time and time again. They need to be on the job description. Give me someone who can truly analyse a dataset, who knows the world is chaotic (fact), understands statistical significance and can learn to use tools… and I’ll be happy. Slap on top an understanding of the importance of good librarianship (another old discipline)… and you have yourself a good platform that provides value upon value. I tell you – the lack of this appreciation of geography over coordinates seriously affects my mojo.

    In short, I agree with this wholeheartedly; hindsight and the future will prove it. The effort pyramid is flipping: the fat end is analysis.

  5. At IDC, we’re seeing a lot of confusion about what “big data” is. Right now, the term is in the early hype cycle phase, so it means too many things to mean anything in particular. But, there are three approaches for making sense of high-volume data that are easier to quantify: in-stream analysis, complex event processing (CEP), and large-volume data analysis.

    In-stream analysis generally happens in low-latency data appliances or down in routers. In-stream analysis seems to work best for more or less stateless event data. CEP can happen in low-latency data appliances or beefed-up databases. CEP apps are better at managing stateful data. But, if you need to munge together deep history with event data, or just analyze very large mixed data sets, then tools/platforms like MapReduce, Hadoop, CouchDB and R are what you need.

    All those tools/platforms handle spatial data just fine, but they ain’t your father’s GIS. If you want to bend your mind a bit further, take a look at TomTom’s OpenLR initiative. That standard lets you generate a location reference plane from high-volume, near real-time data — without any of the hassles of a “base map”.

    This is all pretty new stuff, but as William Gibson said, “The future is here. It’s just not evenly distributed.”

  6. Thank you for all your comments, plus the many reactions & RTs on Twitter. It’s a fascinating subject. I agree it’s still early days and ‘big’ data can mean almost anything, but there is clearly a fundamental shift occurring in the (data) industry. Watch this space! Thanks again, Thierry.

  7. Pingback: When does more geospatial data beat accurate data? | Spatial Sustain
