In February 2020, when the disease reached Europe, it became apparent to me that our timid hopes that the epidemic would subside and finally be buried in Chinese soil were ruined. It was already evident from the Chinese statistics that the virus was lethal enough to scare and mild enough to pass unnoticed in many cases, thus guaranteeing its effective dissemination. The only question was when it would reach each next country.
Another question was that of individual risk, especially the risk of a lethal outcome for someone who contracted the virus. An average figure of around 5% circulated in late January and early February. It was known that males were more susceptible to fatal outcomes. By February, it was also evident that the virus was deadly not only to the elderly: the middle-aged were significantly affected as well.
The mortality contingency tables from China showed an abrupt rise in the death toll in each successive age group, starting from middle age. It disturbed me how an individual of 40 or 65 years of age, left alone with those contingency tables, could figure out his or her individual risk when that risk was lumped together with the risk of people up to 20 years younger or older, and the person in question was right in between two age groups with dramatically different fatality risks.
I believe that the people who summarize and publish those pivoted tables or bar charts often don't realize that it is virtually impossible for a statistician or data scientist to work with them meaningfully. The unpivoted, "raw" datasets had not been published, as is unfortunately habitual in primary research. I hoped it might be different this time, when the wellbeing of the whole world was at risk. But no: even now, in early May, the majority of countries do not publish anonymized case-level datasets. Reading peer-reviewed journal articles, we may indirectly conclude that those datasets are made available exclusively to privileged academic institutions, which thus maintain their monopoly on scientific research even in these difficult circumstances. I consider this both strange and disturbing.
After an extensive search focused on countries with freely available COVID-19 testing and, at the same time, a comparatively widespread outbreak (particularly South Korea, Germany, and the United States), I managed to find only one good case-level dataset, published and maintained by a team of data scientists and data engineers collaborating in the Kaggle project Data Science for COVID-19 (DS4C), hosted at https://www.kaggle.com/kimjihoo/coronavirusdataset
Having obtained data suitable for analysis, I needed to decide which tools to use. The most significant risk marker seemed to be age, which already incorporates the duration of potentially risk-relevant chronic diseases, such as COPD and coronary heart disease, as well as the cumulative history of smoking, and may well be associated with the body mass index; treating all those factors separately in a model would therefore lead to pronounced multicollinearity (mutual correlation of predictors). The second known risk factor was gender.
My primary task was to model the dependency of the risk of death on age. This looks like a candidate for standard logistic regression, with either a logit or a probit model underneath. But there were two obstacles. First, age is a strictly positive variable, so it doesn't fit the model directly. Second, logistic regression has little in common with ordinary regression in terms of computational simplicity: maximum likelihood estimation is actually working under the hood. I realized that any Python or R logistic regression package that aims to attract a large enough user base must be relatively fast and is thus forced to make a huge sacrifice by dramatically reducing its precision.
So, if I wanted to do it right, I had to write a new library tailored to this particular task.
In order to circumvent the first obstacle, I took the logarithm of the age variable, which in turn was expanded into a polynomial to allow for a fairly complex functional relationship between age and risk.
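A minimal sketch of this transformation, assuming NumPy (the function and variable names here are illustrative, not the actual library's API):

```python
import numpy as np

def log_age_design_matrix(age, degree=5):
    """Columns are powers of log(age): 1, log(age), ..., log(age)**degree."""
    log_age = np.log(np.asarray(age, dtype=float))  # age must be strictly positive
    return np.column_stack([log_age ** k for k in range(degree + 1)])

# Example: four ages mapped to a 6-column design matrix (the intercept is
# included as the zeroth power).
X = log_age_design_matrix([30, 45, 60, 80])
print(X.shape)  # (4, 6)
```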
And, of course, since my goal was not to attract users but to produce a more precise estimate, I didn't spare the CPU resources needed to fit the sigmoid curve of the relationship. I was ready to allocate hours or days, if necessary, for model fitting, provided I could see the progress.
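The library's internals are not shown here, but a deliberately unhurried maximum-likelihood fit with visible progress might look roughly like the following sketch, which uses scipy.optimize with a derivative-free method, tight tolerances, and synthetic stand-in data (all of it my assumption, not the package's actual code):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic stand-in data (illustrative only): ages and 0/1 death outcomes.
age = rng.uniform(20, 90, size=2000)
y = (rng.random(2000) < 1.0 / (1.0 + np.exp(8.0 - 0.1 * age))).astype(float)

# Powers of log(age), as in the transformation sketched above.
X = np.column_stack([np.log(age) ** k for k in range(6)])

def neg_log_likelihood(beta):
    """Negative Bernoulli log-likelihood of a logit model p = sigmoid(X @ beta),
    written in the numerically stable form y*z - log(1 + exp(z))."""
    z = X @ beta
    return -np.sum(y * z - np.logaddexp(0.0, z))

def report_progress(beta, _n=[0]):
    _n[0] += 1
    if _n[0] % 1000 == 0:  # show progress during a long fit
        print(f"iter {_n[0]:6d}  -loglik = {neg_log_likelihood(beta):.8f}")

result = minimize(neg_log_likelihood, np.zeros(X.shape[1]),
                  method="Nelder-Mead", callback=report_progress,
                  options={"maxiter": 100_000, "maxfev": 100_000,
                           "xatol": 1e-10, "fatol": 1e-10})
print("maximized log-likelihood:", -result.fun)
```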
As a result, the maximized log-likelihood of my model with the 5th-order internal polynomial was 3.7 higher than that achieved with a probit model from the regular Python logistic regression package statsmodels.discrete.discrete_model with the same 5th-order internal polynomial (but, of course, without the enclosing logarithm). Since exp(3.7) ≈ 40, this translates into a more than 40 times higher likelihood of the model.
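For reference, such a baseline fit could be reproduced along these lines with statsmodels (a sketch under my own assumptions: `age` and `y` are the arrays from the previous snippet, and age is rescaled to keep the raw polynomial terms well-conditioned, a detail the original comparison doesn't specify):

```python
import numpy as np
import statsmodels.api as sm

# 5th-order polynomial of (rescaled) raw age, without the enclosing logarithm.
X_raw = np.column_stack([(age / 100.0) ** k for k in range(6)])
probit = sm.Probit(y, X_raw).fit(method="bfgs", maxiter=1000, disp=0)
print("probit log-likelihood:", probit.llf)

# A gap of 3.7 in log-likelihood means a likelihood ratio of
# exp(3.7) ≈ 40.4 in favour of the better model.
print(np.exp(3.7))
```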
The first version of the package was finalized and published under the GPLv3 license on GitHub on April 3, 2020.
Link to the plot