By Sunlight Foundation Blog (Reporter)
Exploring strategies for cleaning messy data

Monday, May 18, 2015 22:11

(Before It's News)

By The Sunlight Foundation

A lighthouse watches over dark water at night.

Data, data, everywhere, and all the nerds did think; data, data, everywhere, yet nothing with which to link. (Photo via John Crowe/Flickr)

Thanks to the open government community’s efforts over the past few decades, a
large and hard-won collection of government datasets has been gathered and
made publicly available. It’s inspiring to
look up from the day-to-day grind of opening up government data to see how much
progress has already been made. Now, though, we must bear the burden of our
collective success, and recognize that we’ve created an unruly menagerie of data
sources with many related, but unrelatable, datasets.

At Sunlight, this means that we’re consolidating many of the related but
separate projects that have sprung up over the years. We’re applying all that
we’ve learned from the dozens of projects we’ve done to provide a unified
experience. The public should not have to search a dozen different databases
to find the information they seek. Just as no
man is an island, information cannot have meaning outside the context of its
collection and environment. We aim to provide fast, easy and meaningful context
to government affairs.

Over the past year, we’ve been working on taming these messy data by testing and
validating new ways of moving and
representing data. As we’ve been figuring out how
to effectively consolidate our data, we find ourselves facing the same problem
time and time again. It’s a basic issue that runs deep, seemingly without any
easy fix: The datasets we collect don’t have reliable identifiers associated
with each person or organization mentioned in the data. There is nothing
equivalent to a social security number that allows data collectors to reference
the same entity across datasets (or even consistently within
the same dataset).

Bootstrapping authority

We must act as curators, creating reliable identifiers ourselves, making
decisions about which identifiers each piece should get and managing those
identifiers in the face of changes to the content and format of incoming
data. We’re forced to move beyond finding, liberating and publishing data. We
must use all the data we have to provide context for every piece of data we
have. There is no authority on the data as a whole, so we’re
forced to rely on ourselves and start the process from scratch.
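
As a rough illustration of what creating identifiers ourselves can involve, here is a minimal Python sketch. It is not Sunlight's actual approach: the `normalize` function, the `IdentifierRegistry` class, the `ent-` identifier format and the sample names are all hypothetical, chosen only to show one way a curator might give spelling variants of a name the same identifier.

```python
import itertools
import re


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(cleaned.split())


class IdentifierRegistry:
    """Hypothetical registry that mints one identifier per normalized name."""

    def __init__(self):
        self._ids = {}
        self._counter = itertools.count(1)

    def identify(self, name: str) -> str:
        key = normalize(name)
        if key not in self._ids:
            # First time this normalized name is seen: mint a new identifier.
            self._ids[key] = f"ent-{next(self._counter):06d}"
        return self._ids[key]


registry = IdentifierRegistry()
records = ["Acme Corp.", "ACME CORP", "acme  corp", "Acme Corporation"]
assigned = [registry.identify(name) for name in records]
```

The first three spellings share an identifier, while "Acme Corporation" gets a new one. That gap is exactly where simple normalization stops helping and a curator has to make a decision.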

Thankfully, we are not the only ones who’ve had problems such as these. As long
as there have been databases, there have been database integrity problems. As we
started Googling around, we ran across field after field, specialization after
specialization, tool after tool that seek to redress every variation of the
above problem we could imagine. Entity resolution, record linkage,
householding and many other academic fields were all created to address this
issue. Background checks,
counterterrorism efforts and fraud analysis all depend on these techniques to
find the important data hiding in the mountains of messy data. The U.S. Census
Bureau has been using advanced statistical techniques for decades to make
sense of the data it collects. In short, as we researched these issues, we found
ourselves in interesting, varied and, frankly, unexpected company.
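
To give a flavor of what record linkage looks like in its simplest form, the sketch below groups names whose spellings are nearly identical. It is a toy under stated assumptions, not the statistical machinery used by the Census Bureau or anyone else mentioned here: the `similarity` and `link_records` functions, the 0.85 threshold and the sample names are all hypothetical, and the comparison relies only on Python's standard `difflib` module.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a case-insensitive similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_records(names, threshold=0.85):
    """Greedily group names that resemble the first member of a cluster."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([name])
    return clusters


names = ["Jonathan Smith", "Jonathon Smith", "Maria Garcia", "JONATHAN SMITH"]
clusters = link_records(names)
```

With these inputs the three Smith variants land in one cluster and "Maria Garcia" in another. A greedy pass like this is sensitive to input order and to the chosen threshold, which is part of why the fields above developed more careful techniques.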

What’s next

Although it will still be several months before we can point to projects where
we use these techniques, we’ve run across enough interesting ideas,
projects and efforts that we feel compelled to share some of the things we’ve
found. From talking with others in the open government community, we know that
others have felt our pain and are looking for their own solutions. Our solution
surely won’t be the same as everyone else’s, but the solutions will likely
share some common traits.

Over the summer, we’ll be blogging about research, companies and problems we’ve
come across in our work in entity resolution that we’ve found especially
interesting. The issues are necessarily technical, but we aim to keep the
explanations from being overly technical. We aim to build a lighthouse of ideas
for others trapped in the confusing fog of messy data. No one should have to
navigate the stormy seas of government data alone — and we hope that these posts
will help you find your way to wherever you are headed.

The Sunlight Foundation is a non-profit, nonpartisan organization that uses the power of the Internet to catalyze greater government openness and transparency, and provides new tools and resources for media and citizens, alike.



Source: http://sunlightfoundation.com/blog/2015/05/19/exploring-strategies-for-cleaning-messy-data/
