One of our data scientists discusses his experience finding data for different projects!
posted by Mosaic Data Science
In the course of several recent data science projects, I’ve been examining data providers external to Mosaic. It’s certainly not the most exciting topic, but questions often seem to arise that are structured something like “If only we knew [X], then we could do [something awesome].” Trying to make progress on these projects has led me to chase down some data. Here are a few notes on various lessons and providers that may be useful for others.
There is a difference between a data feed and a data archive. A feed of live data can be tremendously useful for some purposes, but if no archive is being created (or is available elsewhere), then the feed provides little for analysis work that needs to be done now. It is difficult to iterate (provide immediate value, fail fast, etc.) if you need to wait a few days, a week, or more to collect enough data to draw reliable conclusions. So, as this article continues, when I refer to a data source, I mean an archive.
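As a rough illustration of the feed-versus-archive distinction, here is a minimal Python sketch that turns records from a live feed into an archive that can be analyzed later. It assumes each feed record arrives as a JSON-serializable dictionary; the function names, the JSON Lines file layout, and the example field names are all hypothetical and are not taken from any particular provider.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def archive_record(record, archive_path, now=None):
    """Append one feed record to a JSON Lines archive with its capture time.

    `record` is assumed to be a JSON-serializable dict (hypothetical layout).
    """
    captured_at = (now or datetime.now(timezone.utc)).isoformat()
    line = json.dumps({"captured_at": captured_at, "record": record})
    # Append mode keeps everything captured so far, which is what makes
    # this an archive rather than a snapshot of the latest feed message.
    with Path(archive_path).open("a", encoding="utf-8") as f:
        f.write(line + "\n")


def load_archive(archive_path):
    """Read every archived record back for offline analysis."""
    with Path(archive_path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Calling `archive_record` on each message as it arrives, and `load_archive` when the analysis starts, is one simple way to avoid waiting on the feed again. A real archive would likely also need rotation, deduplication, and attention to the provider's usage rights.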
Data costs money (to collect, to archive, to access), and there are rights and limitations associated with it. It is important to understand each of those points before trying to acquire or use external data.
For example, many in the Air Traffic Management business are accustomed to the various data sources available from the Federal Aviation Administration (FAA) and (perhaps to a lesser extent) the Department of Transportation (e.g., Transtats), including what data are provided and the challenges associated with them. However, data archives are also available online from most other Federal and state agencies. The interfaces to access these data vary from very confusing to almost intuitive. Sometimes these archives can be difficult to find on an agency’s website. A few that we’ve been using on recent projects include:
A number of data aggregation firms exist, providing access to a wide variety of data series aggregated from many sources.
A few of these are:
Another external data provider for air traffic data would be masFlight. While we at Mosaic also archive some of what they collect, they provide access to a lot of other data. Many of you have probably heard of them, but on a recent Volpe effort, we actually purchased some data about gate operations that proved invaluable. The data allowed us to conduct new analysis for the FAA customer, characterizing gate utilization in ways that were not previously available to them.
It is important to focus on identifying what data would actually be useful. It is very easy to begin searching for something general and end up spending far more time than necessary looking without focus. One way to stay focused is to contact state and federal employees who are experts in the data. There is typically a contact list for each data archive, and we have had good success on data science projects contacting these people for information.