There's something about data - Why data gets messy

There's something about data - Why data gets messy

By Steven Orpwood, Business Analyst and DPO

March 2019


I’ve touched on address books before as they’re a nice easy way into concepts surrounding data management. When you first get one, all shiny and new, you add in your data, usually names, addresses, telephone number(s) and the occasional email address. Each contact appears in its own section, is separated from other entries by a ruled line, and contacts are grouped alphabetically by surname. These are the halcyon days, things are static and friends don’t change their address, or have children.
 
Then it happens; people move, change phone numbers, get mobiles, change mobiles, marry and have children. Since an address book is a primary source of all information, it makes sense to record all the changes, spouses and children’s names, in the book. Then it’s necessary to add birthdays, since friends seem to find a lack of knowledge that it is little Roxy or Kefalonia’s birthday, poor form. This is bad enough, but then you need to make sure you don’t repeat birthday presents (irritating, as I’ve rarely met anyone who’s complained about getting two bottles of the same wine off anyone).
 
So now, the pristine, ordered and accessible address book, has multiple entries, crossings out, some double addresses, since it’s easy to forget to do the crossing out bit, arrows pointing to information that wouldn’t fit near the original entry, and even extra bits of paper stuck in to accommodate the increasingly complex mass of data. On the bright side, everyone is getting older, so death will have a cleansing effect…
 
The situation is no different in databases, additional data is added, things change, new datasets are incorporated, and suddenly, your once pristine data is looking a mess. It gets worse, my company merges with yours and we have two sets of employee data, contacts, accounts, invoices etc... Nightmare, what do we do? Well, we can keep them separate, or we can merge them. How tricky can that be? The answer, you’ll be surprised to hear, is that it can be like a polar bear, i.e. looks wonderful, you want to cuddle it and ask why don’t people have them as pets, but one wrong move and it’ll rip you to shreds. Merging data can do this, not literally, but you might suffer a few sleepless nights.
 
The question is ‘why is this so difficult?’. At a simple level, if I ask two people to create contact databases, the equivalent of my address book, the likelihood is that whilst the data to be stored will be similar, name, address, telephone etc., the way it’s stored will be slightly different. For example, Organisation A might have forename, middle name, surname, address lines 1, 2, and 3, postcode, home telephone and mobile. On the other hand Organisation B might have the same name columns, but have four address fields, plus postcode, and home, mobile and work numbers. That’s not a problem is it?  Well it might be, first they may use a different forename, so one might have Katherine, and the other Katie, and this could be an issue when merging the data. Secondly, my address 3 might always contain the town, but they might have some towns in address 3 and some in address 4, which makes getting one column with just towns in it a trifle taxing. Thirdly, A’s interpretation of home and mobile numbers may be different to B’s, so A might class all mobile numbers as ‘mobile’ whereas B might class them as home numbers if there’s no landline; again, a nightmare to pick apart.
 
This is a simple example, with only a few fields to merge, but what if you have two databases with hundreds of fields, what if you classify rows with IDs and each database has a different ID, what if the format of similar fields is different, for example, one telephone number is formatted with an international dialling code, and another with only area code? Perhaps you can manually sort it out and decide how to proceed, but it would be very time consuming. Now what happens if you have huge datasets from multiple sources, each with millions of records and you’re unable to complete a manual review, how do you move forwards? This is a question that’s been on our minds for some time; in fact, it’s not just merging, since this is only one aspect of data management, but cleansing that data, i.e. removing duplicates, and transforming it, all in one go. It took two years of thought and development but we came up with a solution, which works on many levels, in fact, it’s surpassed even our expectations and major organisations in the card payment and pensions industries are using our tool. It’s called dataBelt, and we’re happy to show everyone what it can do.  If you would like to know more, contact us at here.

                             

Merging not an option 2.png