Fitting the SIR model of disease to data in Python
Introduction and the problem
The SIR model for spread of disease was first proposed in 1927 in a collection of three articles in the Proceedings of the Royal Society by Anderson Gray McKendrick and William Ogilvy Kermack; the resulting theory is known as Kermack–McKendrick theory; now considered a subclass of a more general theory known as compartmental models in epidemiology. The three original articles were republished in 1991, in a special issue of the Bulletin of Mathematical Biology.
The SIR model is so named because it assumes a static population, so no births or deaths, divided into three mutually exclusive classes: those susceptible to the disease; those infected with the disease, and those recovered with immunity from the disease. This model is clearly not applicable to all possible epidemics: there may be births and deaths, people may be re-infected, and so on. More complex models take these and other factors into account.
The SIR model consists of three non-linear ordinary differential equations, parameterized by two growth factors \(\beta\) and \(\gamma\):
\begin{eqnarray*} \frac{dS}{dt}&=&-\frac{\beta IS}{N}\\ \frac{dI}{dt}&=&\frac{\beta IS}{N}-\gamma I\\ \frac{dR}{dt}&=&\gamma I \end{eqnarray*}
Here \(N\) is the population, and since each of \(S\), \(I\) and \(R\) represent the number of people in mutually exclusive sets, we should have \(S+I+R=N\). Note that the right hand sides of the equations sum to zero, hence \[ \frac{dS}{dt}+\frac{dI}{dt}+\frac{dR}{dt}=\frac{dN}{dt}=0 \] which indicates that the population is constant.
The problem here is to see if values of \(\beta\) and \(\gamma\) can be found which will provide a close fit to a well-known epidemiological case study: that of influenza in a British boarding school. This was described in a 1978 issue of the British Medical Journal.
This should provide a good test of the SIR model, as it satisfies all of the criteria. In this case there was a total population of 763, and an outbreak of 14 days, with infected numbers as:
Day: | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | ||
Infections: | 1 | 3 | 6 | 25 | 73 | 222 | 294 | 258 | 237 | 191 | 125 | 69 | 27 | 11 | 4 |
Some years ago, on my old (and now taken off line) blog, I explored using Julia for this task, which was managed easily (once I got the hang of Julia syntax). However, that was five years ago, and when I tried to recreate it I found that Julia has changed over the last five years, and my original comments and code no longer worked. So I decided to experiment with Python instead, in which I have more expertise, or at least, experience.
Using Python
Setup and some initial computation
We start by importing some of the modules and functions we need, and define the data from the table above:
1import matplotlib.pyplot as plt
2import numpy as np
3from scipy.integrate import solve_ivp
4
5data = [1, 3, 6, 25, 73, 222, 294, 258, 237, 191, 125, 69, 27, 11, 4]
Now we enter the SIR system with some (randomly chosen) values of \(\beta\) and
\(\gamma\), using syntax conformable with the solver solve_ivp
:
1beta,gamma = [0.01,0.1]
2
3def SIR(t,y):
4 S = y[0]
5 I = y[1]
6 R = y[2]
7 return([-beta*S*I, beta*S*I-gamma*I, gamma*I])
We can now solve this system using solve\_ivp
and plot the results:
1sol = solve_ivp(SIR,[0,14],[762,1,0],t_eval=np.arange(0,14.2,0.2))
2
3fig = plt.figure(figsize=(12,4))
4plt.plot(sol.t,sol.y[0])
5plt.plot(sol.t,sol.y[1])
6plt.plot(sol.t,sol.y[2])
7plt.plot(np.arange(0,15),data,"k*:")
8plt.grid("True")
9plt.legend(["Susceptible","Infected","Removed","Original Data"])
Note that we have used the t_eval
argument in our call to solve_ivp
which
allows us to exactly specify the points at which the solution will be given.
This will allow us to align points in the computed values of \(I\) with the
original data.
The output is this plot:
There is a nicer plot, with more attention paid to setup and colours,
here.
This use of scipy to solve the SIR equations uses the odeint
tool. This
works perfectly well, but I believe it is being deprecated in favour of
solve_ivp
.
Fitting the data
To find values \(\beta\) and \(\gamma\) with a better fit to the data, we start by defining a function which gives the sum of squared differences between the data points, and the corresponding values of \(I\). The problem will then be minimize that function.
1def sumsq(p):
2 beta, gamma = p
3 def SIR(t,y):
4 S = y[0]
5 I = y[1]
6 R = y[2]
7 return([-beta*S*I, beta*S*I-gamma*I, gamma*I])
8 sol = solve_ivp(SIR,[0,14],[762,1,0],t_eval=np.arange(0,14.2,0.2))
9 return(sum((sol.y[1][::5]-data)**2))
As you see, we have just wrapped the SIR definition and its solution inside a calling function whose variables are the parameters of the SIR equations.
To minimize this function we need to import another python function first:
1from scipy.optimize import minimize
2
3msol = minimize(sumsq,[0.001,1],method='Nelder-Mead')
4msol.x
with output:
1array([0.00218035, 0.44553886])
To see if this does provide a better fit, we can simply run the solver with these values, and plot them, as we did at the beginning:
1beta,gamma = msol.x
2def SIR(t,y):
3 S = y[0]
4 I = y[1]
5 R = y[2]
6 return([-beta*S*I, beta*S*I-gamma*I, gamma*I])
1sol = solve_ivp(SIR,[0,14],[762,1,0],t_eval=np.arange(0,14.2,0.2))
1fig = plt.figure(figsize=(10,4))
2plt.plot(sol.t,sol.y[0],"b-")
3plt.plot(sol.t,sol.y[1],"r-")
4plt.plot(sol.t,sol.y[2],"g-")
5plt.plot(np.arange(0,15),data,"k*:")
6plt.legend(["Susceptible","Infected","Removed","Original Data"])
with output:
and as you see, a remarkably close fit!