Python: Fastest way to perform millions of simple linear regression with 1 exogenous variable only

I am performing component wise regression on a time series data. This is basically where instead of regressing y against x1, x2, ..., xN, we would regress y against x1 only, y against x2 only, ..., and take the regression that reduces the sum of square residues the most and add it as a base learner. This is repeated M times such that the final model is the sum of many many simple linear regression of the form y against xi (1 exogenous variable only), basically gradient boosting using linear regression as the base learners.

The problem is that since I am performing a rolling window regression on the time series data, I have to do N × M × T regressions which is more than a million OLS. Though each OLS is very fast, it takes a few hours to run on my weak laptop.

Currently, I am using statsmodels.OLS.fit() as the way to get my parameters for each y against xi linear regression as such. The z_matrix is the data matrix and the i represents the ith column to slice for the regression. The number of rows is about 100 and z_matrix is about size 100 × 500.

 ols_model = sm.OLS(endog=endog, exog=self.z_matrix[:, i][..., None]).fit() return ols_model.params, ols_model.ssr, ols_model.fittedvalues[..., None]

I have read from a previous post in 2016 Fastest way to calculate many regressions in python? that using repeated calls to statsmodels is not efficient and I tried one of the answers which suggested numpy's pinv which is unfortunately slower:

 # slower: 40sec vs 30sec for statsmodel for 100 repeated runs of 150 linear regressions params = np.linalg.pinv(self.z_matrix[:, [i]]).dot(endog) y_hat = self.z_matrix[:, [i]]@params ssr = sum((y_hat-endog)**2) return params, ssr, y_hat

Does anyone have any better suggestions to speed up the computation of the linear regression? I just need the estimated parameters, sum of square residues, and predicted ŷ value. Thank you!

3

1 Answer

Here is one way since you are always running regressions without a constant. This code runs around 900K models in about 0.5s. It retains the sse, the predicted values for each of the 900K regressions, and the estimated parameters.

The big idea is to exploit the math behind regressions of one variable on another, which is the ratio of a cross-product to an inner product (which the model does not contain a constant). This could be modified to also include a constant by using a moving window demean to estimate the intercept.

import numpy as np
from statsmodels.regression.linear_model import OLS
import datetime
gen = np.random.default_rng(20210514)
# Number of observations
n = 1000
# Number of predictors
m = 1000
# Window size
w = 100
# Simulate data
y = gen.standard_normal((n, 1))
x = gen.standard_normal((n, m))
now = datetime.datetime.now()
# Compute rolling covariance and variance-like terms
# These assume the model is y = x*b + e w/o a constant
c = np.r_[np.zeros((1, m)), np.cumsum(x * y, axis=0)]
v = np.r_[np.zeros((1, m)), np.cumsum(x * x, axis=0)]
c_trimmed = c[w:] - c[:-w]
v_trimmed = v[w:] - v[:-w]
# Parameters are just the ratio
params = c_trimmed / v_trimmed
# Build a selector array to quickly reshape y and the columns of x
step = np.arange(m - w + 1)
sel = np.arange(w)
locs = step[:, None] + sel
# Get the blocked reshape of y. It has n - w + 1 rows with window observations
# and looks like
# [[y[0],y[1],...,y[99]],
# [y[1],y[2],...,y[100]],
# ...,
# [y[900],y[901],...,y[999]],
y_block = y[locs, 0]
# Storage for the predicted values and the sse
y_pred = np.empty((x.shape[1],) + y_block.shape)
sse = np.empty((m - w + 1, n))
# Easiest to loop over columns.
# Could do broadcasting tricks, but noth worth the trouble since number of columns is modest
for i in range(x.shape[0]): # Reshape a columns of x like y x_block = x[locs, i] # Get the parameters and make sure it is 2d with shape (m-w+1, 1) # so the broadcasting works p = params[:, i][:, None] # Get the predicted value y_pred[i] = x_block * p # And the sse sse[:, i] = ((y_block - y_pred[i]) ** 2).sum(1)
print(f"Time: {(datetime.datetime.now() - now).total_seconds()}s")
# Some test code
# Test any single observation
start = 124
assert start <= m - w
column = 342
assert column < x.shape[1]
res = OLS(y[start : start + 100], x[start : start + 100, [column]]).fit()
np.testing.assert_allclose(res.params[0], params[start, column])
np.testing.assert_allclose(res.fittedvalues, y_pred[column, start])
np.testing.assert_allclose(res.ssr, sse[start, column])

Your Answer

Sign up or log in

Sign up using Google Sign up using Facebook Sign up using Email and Password

Post as a guest

By clicking “Post Your Answer”, you agree to our terms of service, privacy policy and cookie policy

You Might Also Like