Did you know that NOAA (the National Oceanic and Atmospheric Administration) makes ALL of their local weather data available for download? It’s an astonishingly thorough dataset going back years and showing hundreds of stations with multiple sample intervals.
Through the National Centers for Environmental Information, you can download datasets for specific locations or the entire country with monthly, daily, and even hourly readings. While reading up on the data online, I came across this question on Stack Exchange. The issue is that the data has a field for WeatherType and one for HourlyPrecip, but sometimes the former shows rain while the latter shows no precipitation. You would think that this would be impossible. As it turns out, this is not actually an error, but it illustrates a fairly general type of problem you may want to filter out of your data: apparent inconsistencies between variables.
I went with the hourly file for the entire country in November of 2016. It weighs in at a modest 3,978,056 records, each with 44 fields. If you’re following along, I’d recommend turning on caching.
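If you'd like to sanity-check the record and field counts outside the modeling tool, a minimal Python sketch along these lines could do it while streaming the file. It assumes a comma-delimited file with a header row, which may not match the actual download, and the three-column sample text is invented purely for illustration.

```python
import csv
import io

def count_records_and_fields(text_stream):
    """Stream a delimited text file and return (record_count, field_count)."""
    reader = csv.DictReader(text_stream)      # assumes commas and a header row
    record_count = sum(1 for _ in reader)     # one row at a time, never the whole file
    field_count = len(reader.fieldnames or [])
    return record_count, field_count

# Made-up sample standing in for the real download.
sample_text = io.StringIO(
    "WBAN,WeatherType,HourlyPrecip\n"
    "00001,RA,0.05\n"
    "00002,,\n"
)
record_count, field_count = count_records_and_fields(sample_text)
print(record_count, field_count)
```

With the real file, you would pass in a handle from `open(path, newline="")`, where `path` is wherever you saved the download.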
In addition to the problem mentioned above, I wanted to check if we might be missing hourly reports from some of the stations. If we are, there will be different numbers of records for each station’s unique ID (the WBAN field).
Without further ado, here’s the model:
The four output data nodes correspond to the answers we are seeking:
- Output_1 – The number of reports per station
- Output_2 – The station with the highest number of reports with location data merged onto the record
- Output_3 – The average number of reports per station
- Output_4 – Those reports which show rain but no hourly precipitation
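For readers without the modeling tool, the first three outputs can be approximated in a few lines of Python. This is a rough sketch rather than the model itself: it assumes each record is a dict with a WBAN key, it skips the location merge in Output_2, and the toy rows are made up.

```python
from collections import Counter
from statistics import mean

def reports_per_station(rows):
    """Count how many records carry each station ID (the WBAN field)."""
    return Counter(row["WBAN"] for row in rows)

# Toy rows invented for illustration; these are not real NOAA records.
sample_rows = [
    {"WBAN": wban}
    for wban in ["00001", "00001", "00001", "00002", "00003", "00003"]
]

counts = reports_per_station(sample_rows)                  # like Output_1
busiest_station, busiest_count = counts.most_common(1)[0]  # like Output_2, minus the merge
average_reports = mean(counts.values())                    # like Output_3
```

If every station reported exactly once per hour, every value in `counts` would be identical, so any spread in these numbers is the first hint that something is off.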
Here is the data in Output_1, and right away we see there is an issue:
Every station has a different number of data points. As it turns out, this is because “hourly” actually means “at least once per hour, maybe more”. If we take a look at Output_3, we see that on average each station reports 1727 hourly readings (for November anyway). Note that there are 720 hours in November. Why all the extras? A lot of the stations actually report every 15 minutes or more often. That wasn’t immediately obvious to me when I found the data. It’s easy to assume one reading per hour and then build further analysis on that incorrect assumption.
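To put those two numbers side by side, here is the arithmetic as a tiny Python sketch. The only inputs are the figures quoted above. The result is an average across all stations, so it blends strictly hourly stations with the ones that report far more often.

```python
HOURS_IN_NOVEMBER = 30 * 24          # 30 days of 24 hours = 720
AVERAGE_REPORTS_PER_STATION = 1727   # figure quoted from Output_3

reports_per_hour = AVERAGE_REPORTS_PER_STATION / HOURS_IN_NOVEMBER
minutes_between_reports = 60 / reports_per_hour

print(f"{reports_per_hour:.1f} reports per hour")                        # 2.4 reports per hour
print(f"about one report every {minutes_between_reports:.0f} minutes")  # about one report every 25 minutes
```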
Output_2 shows us the station with the most reports (8053 of them):
By adding a simple filter to the original model, we can see that the station in Elkins reports every 5 minutes:
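One way to check a station's reporting interval in code is to sort its timestamps and look at the most common gap between neighbors. The sketch below assumes you have already parsed each report's date and time into a `datetime`. The timestamp format and the sample values are assumptions for illustration, not the actual layout of the NOAA file.

```python
from collections import Counter
from datetime import datetime

def most_common_gap_minutes(timestamps):
    """Return the most frequent gap, in whole minutes, between consecutive reports."""
    ordered = sorted(timestamps)
    if len(ordered) < 2:
        raise ValueError("need at least two timestamps to measure a gap")
    gaps = Counter(
        int((later - earlier).total_seconds() // 60)
        for earlier, later in zip(ordered, ordered[1:])
    )
    return gaps.most_common(1)[0][0]

# Invented timestamps for a single station, with one missed report.
sample_times = [
    datetime.strptime(text, "%Y%m%d %H%M")
    for text in ["20161101 0000", "20161101 0005", "20161101 0010", "20161101 0020"]
]
gap = most_common_gap_minutes(sample_times)
print(gap)  # 5
```

Using the most common gap rather than the average keeps an occasional missed report from skewing the answer.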
Now back to our original question: which data points show rain but no hourly precipitation? Let’s take a look at Output_4:
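A filter in the spirit of Output_4 could be sketched in Python as follows. Two assumptions are baked in and may not match the real file: that rain shows up as the code "RA" somewhere in WeatherType, and that "no precipitation" means HourlyPrecip is blank or numerically zero. The sample rows are invented.

```python
def shows_rain(weather_type):
    """Assumption: any WeatherType containing the code 'RA' counts as rain."""
    return "RA" in weather_type.upper()

def shows_no_precip(hourly_precip):
    """Assumption: a blank or numerically zero HourlyPrecip means no precipitation."""
    text = hourly_precip.strip()
    if not text:
        return True
    try:
        return float(text) == 0.0
    except ValueError:
        return False  # non-numeric markers are not treated as zero

def rain_without_precip(rows):
    """Keep only the records where the two fields appear to disagree."""
    return [
        row for row in rows
        if shows_rain(row["WeatherType"]) and shows_no_precip(row["HourlyPrecip"])
    ]

# Made-up rows for illustration only.
sample_rows = [
    {"WBAN": "00001", "WeatherType": "-RA BR", "HourlyPrecip": ""},
    {"WBAN": "00001", "WeatherType": "RA", "HourlyPrecip": "0.05"},
    {"WBAN": "00002", "WeatherType": "", "HourlyPrecip": ""},
    {"WBAN": "00002", "WeatherType": "RA", "HourlyPrecip": "0.00"},
    {"WBAN": "00003", "WeatherType": "RA", "HourlyPrecip": "T"},
]
flagged = rain_without_precip(sample_rows)
print(len(flagged))  # 2
```

Keeping the two checks as separate functions makes it easy to swap in whatever definitions of "rain" and "no precipitation" the dataset documentation actually supports.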
We end up with 64,468 records that fit this description. While we do know that in this case it’s not actually an error, this is the sort of inconsistency that is easy to overlook in data. Maybe instead of rain information, the fields pertain to whether a particular object has been shipped and whether it arrived. It’s easy to assume that these always agree, but if that isn’t the case, incorrect and very costly business decisions can follow.
-Andy