Data Science: Finding the Petoskey Stone on the Beach
On a recent trip to Northern Michigan to visit my parents, I was walking along the beach looking at the many different types of rocks on the beach and I started thinking about data science. There are typically three stones I look for:
- a petoskey stone — a stone shaped fossil made up of hexagon patterns
- beach glass — not technically a stone but instead worn down pieces of glass from bottles thrown into Lake Michigan a long time ago
- Leland blue stone — again not technically a naturally occurring stone, but instead a byproduct from iron smelting processes that took place more than a century ago
Finding these three stones are somewhat rare, but with enough concentration and time, I can normally grab a couple during a walk. I’m able to find them because I’ve studied what they look like — they all have their own special features (patterns, colors, reflection) and I’ve built a model in my head that helps me quickly scan which rocks fit my model and which do not. Despite the tens of thousands of other rocks on the beach, I was able to use my model to find a couple of each of these stones.
If I only was looking for these three rocks, though, I would miss the many other rocks that also are special on their own. Using some of the same filters for color or reflection, I also notice other stones, even if they are not my favorite three. On my recent trip, I found a couple of almost transparent white rocks and a very solid, smooth black rock. None of them were even remotely like my favorite three rocks, but I picked them up anyway.
So what does finding rocks have to do with data science? Often times, we try to tackle problems where we have huge dataset, but are only looking for a couple of specific results — which property might be at risk of having a fire? Where might a water main break? To accurately categorize and find these targets, we need to have a good sense for the history of the data and what the data means— what attributes caused previous water mains to break or properties to catch on fire? The model we develop will not always be perfect — I pick up a lot of stones that I think might be petoskey stones, but with a closer risk I see that I was wrong — but it will be useful and better than picking at random.
When trying to find a specific target, in the exploratory data analysis phase of building a model, we may find some attributes that we didn’t expect would be valuable. We might incorporate those values into our model — maybe the makeup of the soil impacts water main quality. In the stone example, I ended up finding interesting looking rocks that I wanted to keep even though they weren’t my favorite kinds of rocks. In a data science example, we might be able to turn up additional targets or even answer broader sets of questions in the future.
Calm walks on the beach also serve as a good reminder that when your mind is relaxed and not focused on many different priorities, it is often easiest to find something special. This likely translates to data science as well.