Python: Fastest way to perform millions of simple linear regression with 1 exogenous variable only

I am performing component wise regression on a time series data. This is basically where instead of regressing y against x₁, x₂, ..., x_N, we would regress y against x₁ only, y against x₂ only, ..., and take the regression that reduces the sum of square residues the most and add it as a base learner. This is repeated M times such that the final model is the sum of many many simple linear regression of the form y against x_i (1 exogenous variable only), basically gradient boosting using linear regression as the base learners.

The problem is that since I am performing a rolling window regression on the time series data, I have to do N × M × T regressions which is more than a million OLS. Though each OLS is very fast, it takes a few hours to run on my weak laptop.

Currently, I am using statsmodels.OLS.fit() as the way to get my parameters for each y against x_i linear regression as such. The z_matrix is the data matrix and the i represents the i^th column to slice for the regression. The number of rows is about 100 and z_matrix is about size 100 × 500.

 ols_model = sm.OLS(endog=endog, exog=self.z_matrix[:, i][..., None]).fit() return ols_model.params, ols_model.ssr, ols_model.fittedvalues[..., None]

I have read from a previous post in 2016 Fastest way to calculate many regressions in python? that using repeated calls to statsmodels is not efficient and I tried one of the answers which suggested numpy's pinv which is unfortunately slower:

 # slower: 40sec vs 30sec for statsmodel for 100 repeated runs of 150 linear regressions params = np.linalg.pinv(self.z_matrix[:, [i]]).dot(endog) y_hat = self.z_matrix[:, [i]]@params ssr = sum((y_hat-endog)**2) return params, ssr, y_hat

Does anyone have any better suggestions to speed up the computation of the linear regression? I just need the estimated parameters, sum of square residues, and predicted ŷ value. Thank you!

1 Answer

Here is one way since you are always running regressions without a constant. This code runs around 900K models in about 0.5s. It retains the sse, the predicted values for each of the 900K regressions, and the estimated parameters.

The big idea is to exploit the math behind regressions of one variable on another, which is the ratio of a cross-product to an inner product (which the model does not contain a constant). This could be modified to also include a constant by using a moving window demean to estimate the intercept.

import numpy as np
from statsmodels.regression.linear_model import OLS
import datetime
gen = np.random.default_rng(20210514)
# Number of observations
n = 1000
# Number of predictors
m = 1000
# Window size
w = 100
# Simulate data
y = gen.standard_normal((n, 1))
x = gen.standard_normal((n, m))
now = datetime.datetime.now()
# Compute rolling covariance and variance-like terms
# These assume the model is y = x*b + e w/o a constant
c = np.r_[np.zeros((1, m)), np.cumsum(x * y, axis=0)]
v = np.r_[np.zeros((1, m)), np.cumsum(x * x, axis=0)]
c_trimmed = c[w:] - c[:-w]
v_trimmed = v[w:] - v[:-w]
# Parameters are just the ratio
params = c_trimmed / v_trimmed
# Build a selector array to quickly reshape y and the columns of x
step = np.arange(m - w + 1)
sel = np.arange(w)
locs = step[:, None] + sel
# Get the blocked reshape of y. It has n - w + 1 rows with window observations
# and looks like
# [[y[0],y[1],...,y[99]],
# [y[1],y[2],...,y[100]],
# ...,
# [y[900],y[901],...,y[999]],
y_block = y[locs, 0]
# Storage for the predicted values and the sse
y_pred = np.empty((x.shape[1],) + y_block.shape)
sse = np.empty((m - w + 1, n))
# Easiest to loop over columns.
# Could do broadcasting tricks, but noth worth the trouble since number of columns is modest
for i in range(x.shape[0]): # Reshape a columns of x like y x_block = x[locs, i] # Get the parameters and make sure it is 2d with shape (m-w+1, 1) # so the broadcasting works p = params[:, i][:, None] # Get the predicted value y_pred[i] = x_block * p # And the sse sse[:, i] = ((y_block - y_pred[i]) ** 2).sum(1)
print(f"Time: {(datetime.datetime.now() - now).total_seconds()}s")
# Some test code
# Test any single observation
start = 124
assert start <= m - w
column = 342
assert column < x.shape[1]
res = OLS(y[start : start + 100], x[start : start + 100, [column]]).fit()
np.testing.assert_allclose(res.params[0], params[start, column])
np.testing.assert_allclose(res.fittedvalues, y_pred[column, start])
np.testing.assert_allclose(res.ssr, sse[start, column])

Star Pop News

Python: Fastest way to perform millions of simple linear regression with 1 exogenous variable only

1 Answer

Your Answer

Sign up or log in

Post as a guest

You Might Also Like

How do I display EXP gain when killing a creep?

Can you lose a purchased horse in Skyrim?

Disable mob spawn in a specific area of a multiplayer map

Can I change the controls for Super Mario 64?