by Jakob Schröder and Dr. Philipp Bongartz
This blog post is a data exploration in the context of an internship. We will apply data science methods to Covid-19 data to give a little insight into both data science and the Covid-epidemic. We will present some simple methods and encounter some typical pitfalls – and hopefully unearth some interesting facts.
The Covid-19 data we use are aggregated by Johns Hopkins University and are freely available. Other datapoints had to be gleaned from various news reports or Wikipedia articles. We restrict our analyses to countries where the data we are interested in are available.
The thriller that has kept us on our toes for a good three months has the following plot. The initial number of diagnosed cases is low, there are few or no deaths. But the number of cases is increasing exponentially, with a doubling of the number of cases about every three days.
This exponential growth is a direct consequence of the dynamics of infection. With Covid-19, one infected person infects on average about 3 other people. These in turn each infect 3 people, so by the second generation the original case has already produced 9 cases. In the third generation there are 27, then 81, and so on. Like most exponential growth curves in nature, this is actually the early part of a logistic function: the growth rate falls linearly with the proportion of the population that is already immune. If one of the three people an infected person would otherwise infect is already immune, that person will only infect two.
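The generation arithmetic above can be sketched in a few lines. This is a deliberately crude illustration, not an epidemiological model: the function name, the population of 1000 and the assumption that everyone infected so far counts as immune are ours.

```python
def cases_per_generation(r0, generations, population=None):
    """New cases per generation, starting from a single index case.

    With population=None growth is unchecked (r0 ** generation).
    Otherwise each case infects r0 * (share still susceptible) people,
    which produces the logistic-style slowdown.
    """
    new_cases = [1.0]
    total_infected = 1.0
    for _ in range(generations):
        if population is None:
            susceptible_share = 1.0
        else:
            # Assumption: everyone infected so far is no longer susceptible.
            susceptible_share = max(0.0, 1.0 - total_infected / population)
        next_cases = new_cases[-1] * r0 * susceptible_share
        if population is not None:
            # Never infect more people than are left.
            next_cases = min(next_cases, population - total_infected)
        new_cases.append(next_cases)
        total_infected += next_cases
    return new_cases


unchecked = cases_per_generation(r0=3, generations=4)
damped = cases_per_generation(r0=3, generations=4, population=1000)
print(unchecked)  # [1.0, 3.0, 9.0, 27.0, 81.0]
```

In the damped run every generation is slightly smaller than its unchecked counterpart, and the gap widens as more of the population has been infected.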
In the initial phase, however, this decline in growth plays practically no role, and we are dealing with unchecked, exponential growth. This means that the situation is critical long before it looks critical. For example, an intensive care unit where only 20% of the beds are occupied by Covid-19 patients could reach the limits of its capacity just one week later if no measures are taken to curb the spread of the virus.
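The one-week figure follows directly from the three-day doubling time. A minimal sketch of the arithmetic, assuming pure exponential growth and a hypothetical helper name:

```python
import math


def days_until_full(occupied_share, doubling_time_days):
    """Days until occupancy reaches 100% under pure exponential growth."""
    return doubling_time_days * math.log2(1.0 / occupied_share)


# 20% occupancy, doubling every three days: 20% -> 40% -> 80% -> full.
print(round(days_until_full(0.20, 3.0), 1))  # 7.0
```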
On the other hand, this dynamic is also very predictable. As can be seen from the excellent exponential fit of the above plots, it was possible to predict relatively precisely in which time frame the number of cases would threaten to overload the health care system.
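An exponential fit of this kind can be obtained with an ordinary least-squares line through the logarithm of the case counts. The sketch below is one common way to do it, not necessarily the procedure behind the plots; the data are synthetic (100 cases doubling every three days) and all names are illustrative.

```python
import math


def fit_exponential(days, cases):
    """Least-squares line through (day, log(cases)).

    Returns (initial_cases, daily_growth_rate) of the model
    cases = initial_cases * exp(daily_growth_rate * day).
    """
    logs = [math.log(c) for c in cases]
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, logs))
    rate = sxy / sxx
    initial = math.exp(mean_y - rate * mean_x)
    return initial, rate


# Synthetic, made-up series: 100 cases doubling every three days.
days = list(range(10))
cases = [100 * 2 ** (d / 3) for d in days]

initial, rate = fit_exponential(days, cases)
doubling_time = math.log(2) / rate
forecast_day_14 = initial * math.exp(rate * 14)
print(round(doubling_time, 2), round(forecast_day_14))
```

With real, noisy counts the fit would of course be less clean, and the forecast only holds as long as growth stays unchecked.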
Unfortunately, it seems to be a peculiarity of Western democracies that politicians only dare to take radical measures when the necessity of these measures is also largely clear to their voters. Nevertheless, sooner or later, massive measures were taken in all affected countries to slow down the spread of the virus, culminating in extensive social distancing measures up to and including curfew.
Effect of the lockdown
The dates of introduction of curfew are shown as “lockdown” in the following plots. The period in which other measures were already taken before the lockdown is shaded red.
One can see how the case numbers begin to deviate from the exponential curve. Typically, the deviation begins with a certain delay after the first measures taken. In Germany, for example, this was the closure of schools and kindergartens and the associated expansion of remote work, which were carried out almost ten days before the actual lockdown. In some federal states, mass events were also banned much earlier.
In comparison to an exponential course, it is evident that the number of cases at this point in time is already at least an order of magnitude lower than would have been expected with unchecked growth. This is very relevant because, with a certain lag, the number of deaths rises as a percentage of the number of cases, as we can see in figure 3.
Not all countries implemented strict lockdown rules. Relative to other European countries, Sweden followed a special path with “recommendations” from the government and fewer restrictions.
Compared to its Scandinavian neighbors, both the number of cases and the number of deaths remain at a high level. (The very last peak in the number of cases, however, is likely due to older data points being integrated at a later date.)
Heterogeneity in the fatality rate
As it is not yet possible to estimate the long-term health effects of Covid-19, the number of expected deaths is the main factor driving policy. One problem is that, for a variety of reasons, it is very difficult to calculate the actual mortality rate of Covid-19 patients.
The ratio of diagnosed deaths to cases varies greatly from one country to another. At one end of the scale is Iceland, for example, with a fatality rate of 0.5%, while at the other end are countries such as Sweden or Italy with fatality rates above 13%. This heterogeneity in the data has led to some speculation.
The fatality rate in Germany is significantly lower than in France, and in France again slightly lower than in Italy. The same pattern is also found in the German, French and Italian-speaking parts of Switzerland. Are cultural or even genetic differences responsible for this? The quality of the health system? The behavior of the population? The frequency of multi-generational households?
Are we now dealing with different virus strains that may have different degrees of lethality? Or is everything a question of data quality? Do some countries deliberately report false figures to make their situation look better? Are the criteria for which patients are counted as Covid-19 cases simply too different?
Since Covid-19 patients often die weeks after infection, the mortality rate at the beginning of an epidemic is underestimated and the ratio of deaths to cases increases significantly over time. This was clearly observed in Germany. The initial mortality rate was around 0.5% and then over time increased to about 4%.
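To see why the lag alone pushes the early ratio down, consider a toy outbreak in which every death occurs a fixed number of days after diagnosis. The 14-day lag, the 4% true rate and the shape of the outbreak are illustrative assumptions, not estimates from our data.

```python
def naive_fatality_rates(daily_cases, true_rate, lag_days):
    """Cumulative deaths / cumulative cases for each day of a toy outbreak."""
    cum_cases = 0.0
    cum_deaths = 0.0
    ratios = []
    for day, new_cases in enumerate(daily_cases):
        cum_cases += new_cases
        if day >= lag_days:
            # Deaths today stem from the cases diagnosed lag_days ago.
            cum_deaths += true_rate * daily_cases[day - lag_days]
        ratios.append(cum_deaths / cum_cases)
    return ratios


# Made-up outbreak: 20 days of growth (doubling every 3 days),
# followed by 40 days of slowly declining daily cases.
growth = [10 * 2 ** (d / 3) for d in range(20)]
decline = [growth[-1] * 0.9 ** d for d in range(1, 41)]
ratios = naive_fatality_rates(growth + decline, true_rate=0.04, lag_days=14)

print(f"day 19: {ratios[19]:.2%}, day 59: {ratios[-1]:.2%}")
```

During the growth phase the naive ratio stays far below the assumed true rate, and it only approaches that rate once the case numbers stop growing.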
Moreover, it is clear that not all cases are recorded, otherwise the epidemic would quickly be over. This under-sampling of cases and also of deaths could explain some of the differences.
Stage of the epidemic
In the following, we want to analyze how strong the influence of these last two points is on the heterogeneity of fatality rates in different countries. To do so, we first look at the relationship between the onset of the epidemic and the mortality rate and then try to quantify the under-sampling of cases. To this end, we are creating two data sets, one consisting of countries and the other of US states, for which we have been able to identify the under-sampling and epidemic-stage indicators we use.
To compare the stage of the epidemic in different countries, we define the day when the cumulative number of deaths in a given country reaches 80 as the beginning of the epidemic in that country. We see that the number of days that have passed since the beginning of the epidemic is clearly and significantly correlated with the death rate. In fact, the stage of the epidemic alone explains more than a third of the variance in our country data.
We can validate this result directly on our second data set: Among the US states, the epidemic stage also explains about one third of the variance in the fatality rate.
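The two steps described above, fixing the start of the epidemic at 80 cumulative deaths and measuring how much variance the elapsed days explain, might look roughly as follows. The numbers are invented for illustration and the function names are ours.

```python
def epidemic_start(cumulative_deaths, threshold=80):
    """Index of the first day with at least `threshold` cumulative deaths."""
    for day, deaths in enumerate(cumulative_deaths):
        if deaths >= threshold:
            return day
    return None  # threshold not reached yet


def r_squared(xs, ys):
    """Squared Pearson correlation, i.e. the share of variance in ys
    explained by a simple linear fit on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


# Made-up cumulative death counts for one country.
start = epidemic_start([0, 5, 30, 79, 80, 150])

# Made-up values for four countries: days since start, fatality rate in %.
days_since_start = [10, 20, 30, 40]
fatality_pct = [1.0, 2.5, 2.0, 4.0]
explained = r_squared(days_since_start, fatality_pct)
print(start, round(explained, 2))
```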
In order to quantify the impact of under-sampling, we are compiling various indicators for under-sampling. The stronger the under-sampling, the more likely it is that only the severe cases are covered. Serious cases have certain characteristics that we can use as indicators for under-sampling. For example, more than 60% of deaths are male and the average age of the victims is relatively high. It is therefore an indicator for under-sampling if the cases recorded are disproportionately male or old.
Another indicator is somewhat more straightforward: the more severe the under-sampling, the higher the ratio of positive tests to tests performed.
This gives us three indicators of under-sampling:
- the deviation of the average age of diagnosed cases from the average age of the population,
- the numerical ratio of men to women among diagnosed cases, and
- the ratio of the number of tests performed to the number of diagnosed cases.
After we have gathered the relevant data, we must first check whether our data contain any meaningful information at all regarding the fatality rate. We find that our indicator “percentage of male cases” does not correlate with the fatality rate, so it is excluded from the analysis.
Our other two indicators show a robust correlation with the fatality rate. This makes them suitable as indicators of under-sampling. The Italian outlier is conspicuous. Our interpretation is that there the overburdening of the healthcare system has led to a doubling of the mortality rate. In order not to distort our results, we exclude Italy from our further analyses.
Using the US states, we can then validate the correlation of our indicators with the mortality rate. This validation is necessary because we test several hypotheses and our data set is not very large. For example, the correlation of the age difference with the mortality rate in our country data set has a p-value of only 0.10, and we can only establish the statistical significance of this correlation with the US data set.
Principal component analysis of the under-sampling indicators
Our indicators are of course also influenced by factors other than under-sampling. The percentage of positive tests also depends on the general population coverage and the quality of diagnostic tools or contact tracing. The average age of those infected could depend strongly on the circles in which the virus first spread and whether, for example, retirement homes could be successfully isolated and whether deaths in such homes are even reported.
We therefore conduct a principal component analysis for all the countries for which we have identified our indicators. The principal component resulting from this analysis should describe the common factor of our indicators and thus be a more robust measure of under-sampling than any single indicator.
And indeed, our principal component for both countries and US states correlates much more strongly with the mortality rate than the individual indicators.
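For the two indicators that remain, the principal component analysis described above has a simple closed form, which the following sketch uses. It relies on the fact that two standardized variables with correlation r have a correlation matrix with eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2) and eigenvalues 1 + r and 1 - r. The data are made up, and we use the share of positive tests as the second indicator so that both point in the same direction; with more than two indicators a general PCA routine would be needed.

```python
import math


def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / std for v in values]


def first_principal_component(indicator_a, indicator_b):
    """Scores on the first principal component of two indicators,
    plus the share of total variance that component carries."""
    za, zb = standardize(indicator_a), standardize(indicator_b)
    r = sum(a * b for a, b in zip(za, zb)) / (len(za) - 1)
    # If the indicators are negatively correlated, flip the second one.
    sign = 1.0 if r >= 0 else -1.0
    scores = [(a + sign * b) / math.sqrt(2) for a, b in zip(za, zb)]
    explained_share = (1 + abs(r)) / 2
    return scores, explained_share


# Made-up values: age gap of cases vs. population (years), share of positive tests.
age_gap = [2.0, 5.0, 9.0, 12.0, 15.0]
positive_rate = [0.03, 0.08, 0.10, 0.18, 0.22]
scores, explained_share = first_principal_component(age_gap, positive_rate)
print([round(s, 2) for s in scores], round(explained_share, 2))
```

The resulting scores can then be correlated with the fatality rate in the same way as a single indicator.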
Correcting for correlated predictors
However, there is a methodological problem here: since under-sampling also changes during the epidemic, our principal under-sampling component also contains some of the information we have already used to explain one third of the heterogeneity via the epidemic stage.
For our analysis to be statistically sound, we remove the influence of the “day of the epidemic” from the indicators. For this purpose, we calculate a linear fit between the “day of the epidemic” and the respective indicator and extrapolate all indicator values up to the same “day of the epidemic”. Thus, we obtain adjusted indicators that do not contain any information about the stage of the epidemic. This allows us to quantify how much of the variance in addition to the stage of the epidemic is explained by our main under-sampling component.
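A minimal sketch of this adjustment, using invented numbers and hypothetical function names: each indicator value is moved along the fitted line to a common reference day, which leaves only the part of the indicator that the day of the epidemic does not explain.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def adjust_to_common_day(days, indicator, reference_day):
    """Shift every indicator value along the fitted trend to `reference_day`."""
    slope, _ = linear_fit(days, indicator)
    return [value + slope * (reference_day - day)
            for day, value in zip(days, indicator)]


# Made-up values: day of the epidemic and share of positive tests per country.
days = [10, 20, 30, 40, 50]
positive_rate = [0.20, 0.17, 0.15, 0.10, 0.08]

adjusted = adjust_to_common_day(days, positive_rate, reference_day=50)
remaining_slope, _ = linear_fit(days, adjusted)
print([round(a, 3) for a in adjusted], abs(remaining_slope) < 1e-9)
```

After the adjustment a linear fit of the indicator against the day of the epidemic has zero slope, so the two are linearly uncorrelated.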
The principal component resulting from these adjusted indicators is now also independent of the stage of the epidemic. As the following plots show, the adjusted principal under-sampling component explains a further 43% of the variance of the fatality rate in our country data set and 50% of the fatality rate differences between the US states.
How much of the variance in fatality rate can we explain?
If we now add the explained variance of our uncorrelated predictors, the stage of the epidemic and under-sampling, we arrive at an explained variance of roughly 80 percent. However, in the country data set we removed an outlier in between these analyses, while in our validation set we used the stage-of-epidemic relation of all US states to remove this factor from the under-sampling indicators of the smaller US data set.
These choices likely inflate the explained variance. In fact, a multiple regression only finds an explained variance of roughly 60 percent for our various data sets and their combinations. However, considering that the data are very noisy, this is still a far-reaching explanation for the heterogeneity in the data that has led to so much speculation.
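Why explained variances may simply be added for uncorrelated predictors, and why any remaining overlap lowers the total, can be seen from the textbook formula for a regression on two predictors. The correlation values below are invented for illustration and are not our results.

```python
def two_predictor_r_squared(r_y1, r_y2, r_12):
    """R^2 of a linear regression of y on two predictors, computed from
    the pairwise correlations (y, x1), (y, x2) and (x1, x2)."""
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


# Uncorrelated predictors: the explained variances add up.
print(round(two_predictor_r_squared(0.60, 0.65, 0.0), 4))  # 0.7825

# Same individual correlations, but the predictors overlap.
print(round(two_predictor_r_squared(0.60, 0.65, 0.5), 4))  # 0.5233
```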
Effect of average age
In fact, we can probably explain a little more of the variance if we take into account the age of the population. Since Covid-19 is so much more lethal in the older age groups than in the younger ones, it is not surprising that the percentage of the population older than 65 is also related to the death rate. However, a clear correlation can only be seen if the extreme outliers Japan and Florida are excluded. With more robust data, we could probably explain a few percent of the variance here, since the “percentage over 65” is independent of our other factors.
Of course, even with such a simple analysis there are caveats. For example, it is possible that the actual causal factor behind the differences in fatality rate correlates very strongly with our indicators. The quality of the health care system, for instance, could be very strongly related both to the success of treatment and to the extent of testing.
Additionally, our stage-of-epidemic measure is somewhat confounded by the possibility that countries that experience the epidemic later might profit from best practices that weren’t known earlier. Untangling these factors would be an interesting direction for further research.
Our analysis therefore does not exclude other causal factors, but only makes them less plausible to varying degrees.
It should also be pointed out that we limit ourselves to developed countries and that, due to the poor data situation, it is much more unclear how the situation is developing in Latin America or Africa, for example.
Estimating the fatality rate
But what is the actual fatality rate? What percentage of all cases would be fatal if we diagnosed every single infection? Our data allow us to try to answer this question. Since we have modelled the relationship between the age difference between the diagnosed cases and the average population and the mortality rate, we can extrapolate what mortality rate would be expected if this age difference were 0 years, i.e. if the cases were representative of the total population in terms of age.
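The extrapolation amounts to reading off the fitted line at an age difference of zero, which is simply its intercept. A sketch with invented data points (the numbers are not our measurements and the names are illustrative):

```python
def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predicted_rate(slope, intercept, age_gap_years):
    """Fatality rate the fitted line predicts for a given age gap."""
    return slope * age_gap_years + intercept


# Made-up values: age gap between diagnosed cases and population (years)
# and the observed fatality rate in percent.
age_gap_years = [5, 10, 15, 20]
fatality_pct = [3.0, 5.5, 7.0, 10.0]

slope, intercept = linear_fit(age_gap_years, fatality_pct)
rate_at_zero_gap = predicted_rate(slope, intercept, 0)
print(round(rate_at_zero_gap, 2))  # 0.75
```

The same function can be evaluated at any other age gap if a different reference population seems more appropriate.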
However, we note that the extrapolated fatality rate of 2.3% for the countries and even 2.9% for the US states seems unrealistically high. Serological tests in the USA, Germany, France and Spain all point to a fatality rate of about 1%. If unrecorded deaths are considered via excess mortality, this fatality rate might go up to 1.5%. Still, our estimates seem to be too high by a factor of two.
This is probably due to the fact that in our analysis we assume that the age distribution of those infected reflects the age distribution of the total population. This is probably not the case. Younger people are more socially active and are more likely to take the risk of a, for them, relatively harmless infection. Therefore, they are probably slightly overrepresented among the infected. To arrive at the correct death rate, we would likely have to extrapolate to an age difference of roughly -5 years.
However, it could be argued that our values more accurately reflect the lethality of SARS-CoV-2, because they give the mortality rate for a representative cross-section of the whole population.
Overall, these analyses make it clear that the lethality of SARS-CoV-2 is much more similar throughout the developed world than a first glance at the data suggests. Neither existing immunity, genetic differences, nor certain social structures can have a major influence on the data we are investigating, since the apparent differences are explained to a very large extent by the epidemic stage and the extent of under-sampling. How many people in a country will eventually die of Covid-19 therefore depends primarily on the extent to which the spread of the virus can be contained.