COVID-19 Data Management Challenges

Nov 23, 2020

COVID-19 has pushed governments to quickly adopt data-driven approaches to policy making and messaging to community members. In New York State, it has been exciting to watch the Governor, County Executive, and Mayor all regularly reference dashboards in press conferences and discuss using data to direct regulations and guidance.

The pandemic has affected everyone across the world, and there are many public places to find data or dashboards. If you live in Onondaga County, NY like I do, you could go to:

The County’s dashboard: https://covid19.ongov.net/data/
The state’s dashboard: https://covid19tracker.health.ny.gov/views/NYS-COVID19-Tracker/NYSDOHCOVID-19Tracker-Map?%3Aembed=yes&%3Atoolbar=no&%3Atabs=n and Open Data Portal: https://data.ny.gov/browse?tags=covid-19
The CDC’s dashboard: https://covid.cdc.gov/covid-data-tracker/#cases_casesper100klast7days and Open Data Portal: https://data.cdc.gov/Case-Surveillance/United-States-COVID-19-Cases-and-Deaths-by-State-o/9mfq-cb36/data
NY Times dashboard: https://www.nytimes.com/interactive/2020/world/coronavirus-maps.html and Data repository: https://github.com/nytimes/covid-19-data
Johns Hopkins dashboard: https://coronavirus.jhu.edu/map.html and Data repository: https://github.com/CSSEGISandData/COVID-19

With this many data sources, there are a lot of different ways to find information. It also means there is potential for confusion over which data source is the truth, what the truth actually means, how governments make policy, and how those impacted by policies like lockdowns should understand the underlying data that informed the decisions.

Challenges

When looking through each data source above, the numbers vary slightly. All of the sources rise together when cases increase across the community, but the exact counts differ. There are a number of reasons for this:

Reporting times: positive results may be reported to one agency before they are reported to another, causing a lag
What a positive test means: if a person tests positive for COVID-19 and then gets re-tested a week later and is still positive, is that one positive case or two? How does this impact percent positive cases? Are we counting rapid tests or less accurate tests?
Reported tests: Are all tests being reported everywhere? Are some test sites only reporting to certain levels of government?
Are numbers updated retroactively as new information comes in?
Are positive tests reported based on the county in which the person lives or where they took the test (if different)?
Many other reasons…
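To make the "one positive case or two?" question concrete, here is a minimal sketch with made-up test records. The record layout and both function names are assumptions for illustration, not how any of the agencies above actually calculate their metrics. It shows how the same five tests can produce two different positivity figures depending on whether re-tests are counted again.

```python
from datetime import date

# Hypothetical test records: (person_id, test_date, is_positive).
tests = [
    ("a", date(2020, 11, 1), True),
    ("a", date(2020, 11, 8), True),   # same person re-tested, still positive
    ("b", date(2020, 11, 2), False),
    ("c", date(2020, 11, 3), False),
    ("d", date(2020, 11, 3), True),
]

def positivity_per_test(records):
    """Share of all tests that came back positive (re-tests count again)."""
    positives = sum(1 for _, _, positive in records if positive)
    return positives / len(records)

def positivity_per_person(records):
    """Share of distinct people with at least one positive test."""
    people = {}
    for person, _, positive in records:
        people[person] = people.get(person, False) or positive
    return sum(people.values()) / len(people)

per_test = positivity_per_test(tests)      # 3 positive tests out of 5 tests
per_person = positivity_per_person(tests)  # 2 positive people out of 4 people
print(f"per test: {per_test:.0%}, per person: {per_person:.0%}")
```

Neither figure is wrong. They answer different questions, which is why two agencies can publish different percentages from the same underlying tests.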

For example, the New York Times has a story today about the different calculations that New York State and New York City use, and the impact the resulting metrics have had on policy decisions like the shutdown of schools: https://www.nytimes.com/2020/11/22/nyregion/Coronavirus-cases-numbers-nyc.html#click=https://t.co/732pPnuj0i.

Locally, as of this week, Governor Cuomo has announced that parts of the County will be placed under “Orange Zone” restrictions, up from “Yellow Zone”. The publicly available data is not granular enough to show where and why these increased restrictions will occur: https://cbs6albany.com/news/local/watch-gov-cuomo-makes-an-announcement-holds-briefing-1122.

Governor Cuomo himself discusses the challenges of data management and metric creation during a recent press conference: https://twitter.com/NYGovCuomo/status/1330550950201876485.

If you look through the data sources and dashboards above, you’ll notice that the way data is shared differs across governments. The County has a dashboard but no underlying spreadsheets, though the dashboard does include information about Syracuse and even some zip codes. The State has a dashboard and spreadsheets, though it is limited to showing data at the County level. The CDC has a dashboard and spreadsheets, but is limited to county-level or state-level data. Within the spreadsheets, the column names suggest there might be subtle differences in the data we look at:

“Test Date” vs “Date”: does one dataset report the date tested while the other reports the date the information was reported?
“Confirmed Cases” vs “Positive Tests”: is confirmed cases a more scrutinized measure?

Many of these data sources have some metadata that explains the methodology, though the answers to these questions remain unclear, and most people likely won’t read the metadata anyway.
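One way to keep such column-name differences visible when combining spreadsheets is to map each source onto a shared schema while recording the original column name next to the value. The sketch below is only an illustration: "source_a" and "source_b" are placeholders, the shared names "date" and "positives" are hypothetical, and the values are made up.

```python
# Hypothetical mapping from each source's column names to one shared schema.
# "source_a" and "source_b" are placeholders, not the real datasets.
COLUMN_MAP = {
    "source_a": {"Test Date": "date", "Positive Tests": "positives"},
    "source_b": {"Date": "date", "Confirmed Cases": "positives"},
}

def harmonize(row, source):
    """Rename known columns and keep the original names as provenance."""
    mapping = COLUMN_MAP[source]
    out = {"source": source}
    for original, value in row.items():
        if original in mapping:
            shared = mapping[original]
            out[shared] = value
            out[shared + "_original_column"] = original
    return out

# Made-up rows for illustration only.
row_a = harmonize({"Test Date": "2020-11-20", "Positive Tests": 42}, "source_a")
row_b = harmonize({"Date": "2020-11-20", "Confirmed Cases": 40}, "source_b")
print(row_a)
print(row_b)
```

Keeping the original column name alongside each value means a reader of the combined table can still see that one "positives" figure started life as "Positive Tests" and the other as "Confirmed Cases".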

For people trying to live their lives within this pandemic, seeing different numbers reported, and messages crafted based on those numbers, is confusing. For policymakers, not having access to the same information across governments, and having different processes for understanding the data without higher-level federal guidance on best practices, creates the opportunity for vastly different approaches to regulation.

Though COVID-19 is new, the data problem is not. I’ve often said the first challenge for using data in government is knowing how to count correctly. After leaving my job as Chief Data Officer over the summer, and now working with clients on data projects across industries, I’ve found that these challenges exist at any organization. For governments at a global scale, trying to respond, build up data infrastructure immediately, and make quick and responsible decisions makes this challenge far more difficult.

Examples

The visualization above shows positive COVID-19 cases per day on a rolling 7-day basis from three different sources: Onondaga County, New York State, and the CDC. The three sources closely mirror each other: when a surge happens, all three show it. But there are slight differences in the underlying numbers. New York State (orange line above) and the CDC (green line above) show the same numbers, but New York State reports the data a day before the CDC. Onondaga County’s version of the data (the purple line) follows a similar trajectory as positive cases rise and fall, but its numbers differ day by day.
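As a rough illustration of the two ideas in that paragraph, the sketch below computes a trailing 7-day rolling mean and shows how a one-day reporting lag makes two otherwise identical series look different until they are realigned. The daily counts are made up, and the helper names `rolling_mean` and `align_lagged` are my own; the analysis actually used for this post is in the R script linked at the end.

```python
def rolling_mean(values, window=7):
    """Trailing rolling mean; positions without a full window are None."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window : i + 1]) / window)
    return out

def align_lagged(leader, follower, lag_days=1):
    """Pair each leader value with the follower value published lag_days later."""
    return list(zip(leader[: len(leader) - lag_days], follower[lag_days:]))

# Made-up daily positive counts, not real figures from any source.
state_daily = [10, 12, 9, 15, 20, 18, 22, 25, 30, 28]
# Assumption: the second source repeats the same counts one day later.
cdc_daily = [8] + state_daily[:-1]

# Compared on the same calendar day, the two series disagree every day.
same_day_gaps = [abs(s - c) for s, c in zip(state_daily, cdc_daily)]
# Once the one-day lag is removed, the gaps disappear.
aligned_gaps = [abs(s - c) for s, c in align_lagged(state_daily, cdc_daily)]

state_rolling = rolling_mean(state_daily)
print(same_day_gaps)
print(aligned_gaps)
```

In this toy data the same-day comparison shows a gap on every date even though the underlying counts are identical, which is the kind of difference a reporting lag alone can create.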

Over the course of eight months, these differences have resulted in:

CDC and NYS’s data differing by about 0.7 positive cases reported per day. The visualization below shows where reporting by the two agencies has differed over the past eight months, on a rolling 7-day basis. Here, the green bars show the CDC reporting greater numbers on that day and the orange bars show NYS reporting greater numbers that day.

CDC and County’s data differing by about 1.4 positive cases reported per day. For the visualization below, the green bars show when the CDC has reported more cases per day. The purple bars show when the County has reported more cases per day.

NYS and County’s data differing by about 2.2 positive cases per day. In the visualization below, the orange bars show the days when NYS reported more cases. The purple bars show the days when the County reported more cases.
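For readers who want to reproduce this kind of comparison, here is one plausible way to summarize the gap between two sources. The numbers are invented, and whether the per-day figures above are average absolute differences or average signed differences is an assumption on my part; the linked R script is the authoritative version.

```python
def mean_abs_difference(a, b):
    """Average size of the daily gap, ignoring which source was higher."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def mean_signed_difference(a, b):
    """Average of (a - b); positive means source a tends to report more."""
    return sum(x - y for x, y in zip(a, b)) / len(a)

def higher_source_by_day(a, b, name_a, name_b):
    """Label each day with the source that reported more, or 'tie'."""
    labels = []
    for x, y in zip(a, b):
        if x > y:
            labels.append(name_a)
        elif y > x:
            labels.append(name_b)
        else:
            labels.append("tie")
    return labels

# Made-up rolling 7-day averages for two sources over four days.
county = [14.0, 16.5, 18.0, 21.0]
state = [15.0, 16.5, 16.0, 22.5]

abs_gap = mean_abs_difference(county, state)
signed_gap = mean_signed_difference(county, state)
bar_colors = higher_source_by_day(county, state, "County", "NYS")
print(abs_gap, signed_gap, bar_colors)
```

The signed average can sit close to zero even when the daily gaps are sizable, because days when one source is higher cancel out days when the other is. That is why the choice of summary matters when describing how far apart two sources are.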

While overall the numbers remain close (only off by a few cases per day), the NY Times article above shows that slight differences in COVID-positive test percentages can have big impacts on the policies that are recommended and implemented.

Solutions (maybe?)

There are no easy solutions here. When nations across the globe are collecting data and making policy, and our own federal, state, and local governments are trying to understand and protect their community members and constituents while also supporting the economy and education, differing perspectives and ideas about the best approaches are naturally going to occur.

The following recommendations are just ideas, and as I know all too well, none of this is easy or straightforward, but maybe they’d help:

A federal data governance strategy for COVID-19 data — how is the data collected? Where is it stored? Who gets to see the data? How is it shared publicly? Some data will specifically not be collected because at a federal level it may not be seen as important, but these decisions would hopefully be informed by experts from all levels of government and different industries. It would also set expectations on how the data flows to a federal central repository, but also how it is pushed back to States and localities.
An open data standard that makes data available to the public in a machine-readable way, that protects health data privacy, but also provides enough information so that the public can understand how decisions are made, and can potentially even check the work of policy makers. The federal government does this across its agencies on a daily basis. While other entities might collect their own data, this would create a single source of truth.
Federal government creates guidelines and some regulations based on publicly available metrics that are being tracked (case numbers, percent positive cases, etc).
State governments accept the federal guidelines, suggestions, and regulations, but can implement stricter guidelines, or with the support of the single source of data, could argue that for their state specifically, the federal guidelines are not necessary.
Local governments accept the state and federal guidelines and regulations, but are able to get more granular in their enforcement. In Syracuse, for instance, the city overall might have a high positivity rate, but through further analysis (that only a municipality would do because it is too detailed for a state or federal government), they might find that cases at Syracuse University are causing a rise in cases. They can show that analysis clearly, with supporting data that is available to all, and then could place regulations on certain parts of the city instead of others.
Local governments can also innovate more quickly than the state and federal government. Where creative programs are put into place that keep people safe or keep businesses or schools open, the data can be shown to justify the program, and then other cities across the country could replicate that program and track its success based on the single data source as well. This is only doable with agreed-upon data and metrics.

So, what do you think?

The code used for this post is all here: https://github.com/samedelstein/Onondaga_COVID/blob/master/Case_Differences.R


Written by Sam Edelstein
