How to Use Regression Analysis in Data Analytics: A Simple Guide


Statistics is a big part of big data analytics. To make data-driven decisions, it may be necessary to work through all the available data using regression analysis. Regression analysis measures how closely one or more independent variables are related to a dependent variable, and it estimates both the strength and the direction of that relationship.


There are many types of regression analysis, but linear regression is the easiest to use and interpret. It is very helpful for hypothesis testing, which is one of the areas that can help companies make better decisions. Linear regression is also flexible and can be used to analyse a wide variety of relationships.


 

How does it work?


The manager usually first needs to decide which variables should be tested for their impact on the dependent variable. Data on those variables is then gathered. There are also methods of checking if the right variables are included in the regression analysis such as


Regression analysis formula: Y = mX + b, where Y is the dependent variable, X is the independent variable, m is the slope of the regression line and b is the constant (intercept) of the equation. The independent variables are used to explain the factors that influence the dependent variable.
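
As a minimal sketch of the formula above, the slope m and constant b can be computed directly from the data with the standard least-squares formulas. The function name fit_line and the advertising and sales figures are made up for illustration; they are not from a real dataset.

```python
# Minimal sketch: fit Y = mX + b by least squares with the standard library only.
# All data values below are hypothetical.

def fit_line(xs, ys):
    """Return (m, b) for the least-squares line through the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope = co-variation of X and Y divided by the variation of X.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    # The fitted line always passes through the point of means.
    b = y_mean - m * x_mean
    return m, b

ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical independent variable X
sales = [3.1, 4.9, 7.2, 8.8, 11.0]     # hypothetical dependent variable Y

m, b = fit_line(ad_spend, sales)
print(f"slope m = {m:.3f}, constant b = {b:.3f}")
```

Here a positive m would indicate that sales tend to rise with advertising spend, and the size of m gives the estimated change in Y for a one-unit change in X.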


The simplest form of linear regression is single-variable (simple) linear regression. However, a more common form is multiple linear regression, where the relationship between several independent variables and the dependent variable is estimated.


Y = β0 + β1X1 + β2X2 + β3X3 + ... + βkXk + u


where the error term u contains factors other than X1, ..., Xk that affect Y.
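
To make the multiple-variable equation concrete, here is a rough sketch that estimates β0, β1, ..., βk by solving the least-squares normal equations with plain Python. The helper names solve and ols are illustrative, and the data are generated without an error term (u = 0) purely so the known coefficients can be recovered exactly; real data will not behave this neatly.

```python
# Sketch: estimate Y = β0 + β1·X1 + ... + βk·Xk by solving (X'X)β = X'Y.
# Helper names and data are hypothetical.

def solve(a, b):
    """Solve the linear system a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def ols(rows, ys):
    """Return [β0, β1, ..., βk] given one tuple of X values per observation."""
    design = [[1.0] + list(r) for r in rows]   # leading 1.0 is the intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)

rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]   # (X1, X2) pairs
ys = [2 + 3 * x1 - 1 * x2 for x1, x2 in rows]             # true β = (2, 3, -1), no error term

betas = ols(rows, ys)
print([round(v, 6) for v in betas])
```

In practice a statistics package would do this step for you; the sketch only shows what is being solved behind the scenes.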


There are a number of ways to estimate a linear regression. The most popular is the ordinary least squares (OLS) method. It estimates the unknown coefficients by minimising the sum of squared residuals, where the residuals are, visually, the vertical distances between the data points and the trend line.
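
One way to see what "minimising the sum of squared residuals" means is to fit a line and then nudge its slope and constant slightly: every nudged line should have a larger sum of squared residuals. The sketch below does this with made-up data; the function names are illustrative.

```python
# Sketch: the least-squares line has a smaller sum of squared residuals
# than nearby lines. Data are hypothetical.

def fit_line(xs, ys):
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    return m, y_mean - m * x_mean

def sum_squared_residuals(xs, ys, m, b):
    # A residual is the vertical distance between a data point and the line.
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

m, b = fit_line(xs, ys)
best = sum_squared_residuals(xs, ys, m, b)

# Try eight nearby lines by shifting the slope and/or the constant by 0.1.
nudged = [
    sum_squared_residuals(xs, ys, m + dm, b + db)
    for dm in (-0.1, 0.0, 0.1)
    for db in (-0.1, 0.0, 0.1)
    if (dm, db) != (0.0, 0.0)
]
print(f"least-squares line: {best:.4f}, best nearby line: {min(nudged):.4f}")
```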


It is important to make sure the right variables are included. Otherwise, this may cause problems such as omitted variable bias, where the regression model leaves out one or more relevant variables. When an omitted variable is correlated with the included variables, the estimated effects of those included variables are biased.
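
Omitted variable bias is easiest to see in a simulation. The sketch below generates made-up data in which Y depends on both X1 and X2, and X2 is correlated with X1. Regressing Y on X1 alone pushes the X1 slope away from its true value of 2, while including X2 brings it back. The coefficients, sample size and seed are arbitrary assumptions.

```python
# Sketch: simulated omitted variable bias. All numbers are hypothetical.
import random

def centered_sum(a, b):
    """Sum of products of deviations from the means."""
    a_mean, b_mean = sum(a) / len(a), sum(b) / len(b)
    return sum((p - a_mean) * (q - b_mean) for p, q in zip(a, b))

random.seed(0)
n = 5000
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.5 * v + random.gauss(0, 1) for v in x1]        # X2 is correlated with X1
y = [1 + 2 * a + 3 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

s11, s22, s12 = centered_sum(x1, x1), centered_sum(x2, x2), centered_sum(x1, x2)
s1y, s2y = centered_sum(x1, y), centered_sum(x2, y)

# Model that omits X2: the X1 slope also absorbs part of the effect of X2.
short_slope = s1y / s11

# Model that includes X2 (closed-form solution for two regressors).
full_slope = (s22 * s1y - s12 * s2y) / (s11 * s22 - s12 ** 2)

print(f"X1 slope with X2 omitted:  {short_slope:.2f}")
print(f"X1 slope with X2 included: {full_slope:.2f}")
```

With these settings the slope from the short model should land near 2 + 3 × 0.5 = 3.5 rather than 2, which is the bias.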


However, if we add an irrelevant variable, the OLS estimators remain unbiased, but their variances generally increase.
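
The effect of an irrelevant variable can also be sketched by simulation. Below, X2 has no effect on Y but is correlated with X1. Across many simulated samples, the X1 slope is centred on its true value of 2 in both models, but it varies more from sample to sample when X2 is included. The sample size, number of repetitions and seed are arbitrary assumptions.

```python
# Sketch: an irrelevant regressor leaves the estimate unbiased but noisier.
# All numbers are hypothetical.
import random
import statistics

def centered_sum(a, b):
    a_mean, b_mean = sum(a) / len(a), sum(b) / len(b)
    return sum((p - a_mean) * (q - b_mean) for p, q in zip(a, b))

def slope_estimates(n, rng):
    """Return the X1 slope without and with the irrelevant X2 in the model."""
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    x2 = [v + 0.5 * rng.gauss(0, 1) for v in x1]    # correlated with X1
    y = [1 + 2 * v + rng.gauss(0, 1) for v in x1]   # X2 plays no role in Y
    s11, s22, s12 = centered_sum(x1, x1), centered_sum(x2, x2), centered_sum(x1, x2)
    s1y, s2y = centered_sum(x1, y), centered_sum(x2, y)
    without_x2 = s1y / s11
    with_x2 = (s22 * s1y - s12 * s2y) / (s11 * s22 - s12 ** 2)
    return without_x2, with_x2

rng = random.Random(1)
results = [slope_estimates(50, rng) for _ in range(500)]
without_x2 = [r[0] for r in results]
with_x2 = [r[1] for r in results]

print(f"mean slope:     {statistics.mean(without_x2):.3f} vs {statistics.mean(with_x2):.3f}")
print(f"slope variance: {statistics.variance(without_x2):.4f} vs {statistics.variance(with_x2):.4f}")
```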


 

Results of the model


We can look at the goodness of fit of the model using R squared. An important fact is that R^2 never decreases as the number of independent variables increases. This follows from the algebraic fact that the sum of squared residuals never increases when additional regressors (variables) are added to the model.
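
The following sketch illustrates the point with made-up data: a second regressor that is pure noise, unrelated to Y by construction, still does not lower R squared. The variable names, coefficients and seed are assumptions for illustration.

```python
# Sketch: R squared does not fall when a useless regressor is added.
# All numbers are hypothetical.
import random

def mean(v):
    return sum(v) / len(v)

def centered_sum(a, b):
    a_mean, b_mean = mean(a), mean(b)
    return sum((p - a_mean) * (q - b_mean) for p, q in zip(a, b))

def r_squared(y, fitted):
    ssr = sum((a - f) ** 2 for a, f in zip(y, fitted))   # sum of squared residuals
    sst = sum((a - mean(y)) ** 2 for a in y)             # total sum of squares
    return 1 - ssr / sst

random.seed(42)
n = 40
x1 = [random.uniform(0, 10) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]              # pure noise regressor
y = [3 + 1.5 * a + random.gauss(0, 2) for a in x1]

# Model with X1 only.
b1 = centered_sum(x1, y) / centered_sum(x1, x1)
b0 = mean(y) - b1 * mean(x1)
r2_one = r_squared(y, [b0 + b1 * a for a in x1])

# Model with X1 and X2 (closed-form solution for two regressors).
s11, s22, s12 = centered_sum(x1, x1), centered_sum(x2, x2), centered_sum(x1, x2)
s1y, s2y = centered_sum(x1, y), centered_sum(x2, y)
det = s11 * s22 - s12 ** 2
c1 = (s22 * s1y - s12 * s2y) / det
c2 = (s11 * s2y - s12 * s1y) / det
c0 = mean(y) - c1 * mean(x1) - c2 * mean(x2)
r2_two = r_squared(y, [c0 + c1 * a + c2 * b for a, b in zip(x1, x2)])

print(f"R squared with X1 only:   {r2_one:.4f}")
print(f"R squared with X1 and X2: {r2_two:.4f}")
```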


One way of imposing a penalty for adding regressors that are not useful is the adjusted R squared. It is a modified version of R squared that adjusts for the number of variables in the model, so it increases only if the new term improves the model more than would be expected by chance.
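
The usual formula is adjusted R squared = 1 - (1 - R^2)(n - 1)/(n - k - 1), where n is the number of observations and k is the number of independent variables. A small sketch, using made-up R squared values, shows how a tiny gain in R squared can still lower the adjusted figure:

```python
# Sketch: adjusted R squared penalises extra regressors.
# The R squared values, n and k below are hypothetical.

def adjusted_r_squared(r2, n, k):
    """n = number of observations, k = number of independent variables (intercept excluded)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

before = adjusted_r_squared(0.800, n=30, k=3)   # model with three variables
after = adjusted_r_squared(0.801, n=30, k=4)    # a fourth variable barely improves the fit

print(f"adjusted R squared with 3 variables: {before:.4f}")
print(f"adjusted R squared with 4 variables: {after:.4f}")
```

Plain R squared rose from 0.800 to 0.801 in this example, yet the adjusted value falls, which signals that the fourth variable is not pulling its weight.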


After running the regression, you will find it useful to get the test statistic, which is used for hypothesis testing. To test whether a slope is zero, the test statistic (t-statistic) is defined as the slope of the sample regression line divided by the standard error of the slope. This can be computed easily using statistical software such as Excel, SPSS and R.
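
For a single-variable regression, the calculation can be sketched by hand as follows. The standard error of the slope is taken as the square root of (sum of squared residuals / (n - 2)) divided by the variation in X, which is the usual textbook formula under standard OLS assumptions. The function name and data are made up for illustration.

```python
# Sketch: t-statistic for the slope of a simple linear regression.
# Data are hypothetical.
import math

def slope_t_statistic(xs, ys):
    """Return (slope, standard error of the slope, t-statistic for slope = 0)."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_mean - m * x_mean
    ssr = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    # n - 2 degrees of freedom: two coefficients (m and b) were estimated.
    se = math.sqrt(ssr / (n - 2) / sxx)
    return m, se, m / se

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, std_err, t_stat = slope_t_statistic(xs, ys)
print(f"slope = {slope:.3f}, standard error = {std_err:.4f}, t = {t_stat:.2f}")
```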


Hypothesis testing is important because it aims to determine whether an observed effect is likely to be real rather than the result of chance. You first state the null hypothesis and the alternative hypothesis. You then compare the test statistic with the critical value for a chosen significance level, usually 1%, 5% or 10%.


If you reject the null hypothesis, you are claiming that your result is statistically significant, meaning it is unlikely to have happened by luck or chance alone. Hypothesis testing can improve business decisions.
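
The comparison against the 1%, 5% and 10% levels can be sketched as below. This version is a simplification: it uses critical values from the normal distribution, which is a reasonable approximation only for large samples (for small samples the t distribution should be used instead). The function name and the test statistic of 2.1 are made up for illustration.

```python
# Sketch: two-sided test decision at the usual significance levels,
# using a large-sample normal approximation (an assumption).
from statistics import NormalDist

def rejection_levels(t_stat, levels=(0.01, 0.05, 0.10)):
    """Return {level: True if the null hypothesis is rejected at that level}."""
    decisions = {}
    for alpha in levels:
        # Two-sided test: split alpha equally between the two tails.
        critical = NormalDist().inv_cdf(1 - alpha / 2)
        decisions[alpha] = abs(t_stat) > critical
    return decisions

decisions = rejection_levels(2.1)   # 2.1 is a hypothetical test statistic
for alpha, reject in decisions.items():
    print(f"{alpha:.0%} level: {'reject' if reject else 'do not reject'} the null hypothesis")
```

A statistic of 2.1 clears the 10% and 5% critical values (about 1.645 and 1.960) but not the 1% value (about 2.576), so the result would be called significant at the 5% level but not at the 1% level.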


 

How is it applied to businesses?


There are a few examples. Banks can identify the top five factors associated with customers defaulting. Based on this, they can then predict which customers are likely to default on credit card payments. In this way, banks can minimise default among risky customers.


If a company wants to fund a marketing campaign to increase sales, it can run the campaign in one of its regions and collect data. It can then test whether the campaign had the desired results. This helps in decision making where the company only continues the new campaign if it believes sales will rise by more than 25%.
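
One possible way to set up the campaign test is sketched below. It regresses sales growth on a 0/1 variable marking whether a store was in the campaign region, so the slope is the estimated lift from the campaign, and then tests whether that lift exceeds 25 percentage points. Everything here is an assumption for illustration: the store-level growth figures are invented, the 25% threshold is read as a difference in growth rates, and the one-sided 5% critical value uses a normal approximation that is rough for a sample this small.

```python
# Sketch: testing whether a campaign lifts sales growth by more than 25 points.
# All figures are hypothetical.
import math
from statistics import NormalDist

campaign = [31.0, 28.5, 35.2, 30.1, 33.4, 29.8, 32.6, 34.0]   # % growth, campaign region
control = [2.1, 4.3, 1.8, 3.9, 2.7, 3.2, 4.0, 2.4]            # % growth, other regions

ys = campaign + control
ds = [1.0] * len(campaign) + [0.0] * len(control)   # 1 = store was in the campaign

n = len(ys)
d_mean, y_mean = sum(ds) / n, sum(ys) / n
sdd = sum((d - d_mean) ** 2 for d in ds)
sdy = sum((d - d_mean) * (y - y_mean) for d, y in zip(ds, ys))
lift = sdy / sdd                      # slope: difference in average growth
base = y_mean - lift * d_mean         # constant: average growth without the campaign

ssr = sum((y - (base + lift * d)) ** 2 for d, y in zip(ds, ys))
std_err = math.sqrt(ssr / (n - 2) / sdd)

threshold = 25.0                                   # null hypothesis: lift is at most 25
t_stat = (lift - threshold) / std_err
critical = NormalDist().inv_cdf(0.95)              # one-sided test at the 5% level
continue_campaign = t_stat > critical

print(f"estimated lift = {lift:.2f}, t = {t_stat:.2f}, continue campaign: {continue_campaign}")
```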


In manufacturing, regression analysis allows manufacturers to better understand quality data and provides guidance for quality control. It is more useful than relying on a simple arithmetic mean and/or median.


 

Linear regression is a very powerful statistical technique that can be used for analysing causal relationships and predicting the dependent variable. You will still always need to lay your intuition on top of the data, which means asking if the results fit your understanding of the situation.





 

