You are viewing the RapidMiner Studio documentation for version 9.10.

Polynomial Regression (RapidMiner Studio Core)

Synopsis

This operator generates a polynomial regression model from the given ExampleSet. Polynomial regression is considered to be a special case of multiple linear regression.

Description

Polynomial regression is a form of linear regression in which the relationship between the independent variable x and the dependent variable y is modeled as an nth order polynomial. In RapidMiner, y is the label attribute and x is the set of regular attributes that are used for the prediction of y. Polynomial regression fits a nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted E(y | x), and has been used to describe nonlinear phenomena such as the growth rate of tissues and the progression of disease epidemics. Although polynomial regression fits a nonlinear model to the data, as a statistical estimation problem it is linear, in the sense that the regression function E(y | x) is linear in the unknown parameters that are estimated from the data. For this reason, polynomial regression is considered to be a special case of multiple linear regression.

The goal of regression analysis is to model the expected value of a dependent variable y in terms of the value of an independent variable (or vector of independent variables) x. In simple linear regression, the following model is used:

y = w0 + ( w1 * x )

In this model, for each unit increase in the value of x, the conditional expectation of y increases by w1 units.

In many settings, such a linear relationship may not hold. For example, if we are modeling the yield of a chemical synthesis in terms of the temperature at which the synthesis takes place, we may find that the yield improves by increasing amounts for each unit increase in temperature. In this case, we might propose a quadratic model of the form:

y = w0 + (w1 * x) + (w2 * x^2)

In this model, when the temperature is increased from x to x + 1 units, the expected yield changes by w1 + w2 + 2 (w2 * x). The fact that the change in yield depends on x is what makes the relationship nonlinear (this must not be confused with saying that this is nonlinear regression; on the contrary, this is still a case of linear regression). In general, we can model the expected value of y as an nth order polynomial, yielding the general polynomial regression model:
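The change in expected yield quoted above follows from evaluating the quadratic model at x + 1 and subtracting its value at x. The short derivation below writes the quadratic model in a single variable x, with coefficients w0, w1 and w2:

```latex
\begin{align*}
E(y \mid x+1) - E(y \mid x)
  &= \bigl[w_0 + w_1 (x+1) + w_2 (x+1)^2\bigr] - \bigl[w_0 + w_1 x + w_2 x^2\bigr] \\
  &= w_1 + w_2 (2x + 1) \\
  &= w_1 + w_2 + 2 w_2 x
\end{align*}
```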

y = w0 + (w1 * x) + (w2 * x^2) + . . . + (wn * x^n)
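Because the model is linear in its coefficients, the coefficients of a single-variable polynomial can be estimated with ordinary least squares on the powers of x. The following is a minimal sketch in plain Python under that assumption; it is not the fitting procedure used by the Polynomial Regression operator, and the names `solve_linear_system`, `fit_polynomial` and `predict` as well as the sample data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a * w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [a | b] so rows can be reduced in place.
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        # Use the row with the largest pivot to limit round-off error.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution.
    w = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - s) / m[i][i]
    return w


def fit_polynomial(xs, ys, degree):
    """Return [w0, w1, ..., wn] minimizing the squared error of a polynomial fit."""
    size = degree + 1
    # Each row holds x^0 .. x^degree, so the model is linear in w.
    rows = [[x ** p for p in range(size)] for x in xs]
    # Normal equations: (X^T X) w = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(size)] for i in range(size)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(size)]
    return solve_linear_system(xtx, xty)


def predict(w, x):
    """Evaluate w0 + w1 * x + ... + wn * x^n."""
    return sum(wp * x ** p for p, wp in enumerate(w))


# Made-up data generated from y = 1 + 2x + 0.5x^2 (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x + 0.5 * x ** 2 for x in xs]
w = fit_polynomial(xs, ys, degree=2)
```

The normal equations become poorly conditioned as the degree grows, so this approach is only suitable for small degrees and well-scaled inputs.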

Regression is a technique used for numerical prediction. It is a statistical method that estimates the strength of the relationship between one dependent variable (i.e. the label attribute) and a series of other variables known as independent variables (the regular attributes). Just as classification is used for predicting categorical labels, regression is used for predicting continuous values. For example, we may wish to predict the salary of university graduates with 5 years of work experience, or the potential sales of a new product given its price. Regression is often used to determine how much specific factors, such as the price of a commodity, interest rates, or particular industries or sectors, influence the price movement of an asset.

Differentiation

Linear Regression

Polynomial regression is a form of linear regression in which the relationship between the independent variable x and the dependent variable y is modeled as an nth order polynomial.

Input

  • training set (Data Table)

    This input port expects an ExampleSet. This operator cannot handle nominal attributes; it can only be applied to data sets with numeric attributes. You may therefore need to apply the Nominal to Numerical operator before using this operator.
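The kind of conversion involved can be illustrated outside RapidMiner. The sketch below shows dummy (one-hot) coding, one common way to turn a nominal attribute into numeric ones. It does not reproduce the Nominal to Numerical operator itself, and the function name `dummy_code` and the sample values are hypothetical.

```python
def dummy_code(values):
    """Map each nominal value to a row of 0/1 indicators, one column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical nominal attribute with three distinct values.
categories, rows = dummy_code(["red", "green", "red", "blue"])
```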

Output

  • model (Model)

    The regression model is delivered from this output port. This model can now be applied to unseen data sets.

  • example set (Data Table)

    The ExampleSet that was given as input is passed without any modifications to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.

Parameters

  • max_iterations

    This parameter specifies the maximum number of iterations to be used for the model fitting. Range: integer

  • replication_factor

    This parameter specifies the number of times each input variable is replicated, i.e. how many different degrees and coefficients can be applied to each variable. Range: integer

  • max_degree

    This parameter specifies the maximal degree to be used for the final polynomial. Range: integer

  • min_coefficient

    This parameter specifies the minimum number to be used for the coefficients and the offset. Range: real

  • max_coefficient

    This parameter specifies the maximum number to be used for the coefficients and the offset. Range: real

  • use_local_random_seed

    This parameter indicates if a local random seed should be used for randomization. Using the same value of the local random seed will produce the same randomization. Range: boolean

  • local_random_seed

    This parameter specifies the local random seed. This parameter is only available if the use_local_random_seed parameter is set to true. Range: integer
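As a rough illustration of how these parameters could fit together, the sketch below evaluates a model in which each attribute appears replication_factor times, each time with its own coefficient and degree, plus an offset. This functional form is an assumption made for illustration and is not taken from the operator's implementation; `PolynomialTerm`, `evaluate` and `within_bounds`, and all numeric values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PolynomialTerm:
    attribute: int      # index of the input attribute this term uses
    coefficient: float  # assumed to lie in [min_coefficient, max_coefficient]
    degree: int         # assumed to be at most max_degree


def evaluate(terms, offset, example):
    """Assumed model form: offset + sum of coefficient * attribute^degree."""
    return offset + sum(t.coefficient * example[t.attribute] ** t.degree for t in terms)


def within_bounds(terms, offset, min_coefficient, max_coefficient, max_degree):
    """Check that all coefficients, the offset and all degrees respect the limits."""
    values = [t.coefficient for t in terms] + [offset]
    coefficients_ok = all(min_coefficient <= v <= max_coefficient for v in values)
    degrees_ok = all(t.degree <= max_degree for t in terms)
    return coefficients_ok and degrees_ok


# Two attributes with a replication factor of 2 give four terms (illustrative values).
terms = [
    PolynomialTerm(attribute=0, coefficient=2.0, degree=1),
    PolynomialTerm(attribute=0, coefficient=-0.5, degree=2),
    PolynomialTerm(attribute=1, coefficient=1.0, degree=1),
    PolynomialTerm(attribute=1, coefficient=0.25, degree=3),
]
prediction = evaluate(terms, offset=3.0, example=[2.0, 4.0])
valid = within_bounds(terms, 3.0, min_coefficient=-10.0, max_coefficient=10.0, max_degree=3)
```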

Tutorial Processes

Applying the Polynomial Regression operator on the Polynomial data set

The 'Polynomial' data set is loaded using the Retrieve operator. The Split Data operator is applied to it to split the ExampleSet into training and testing data sets. The Polynomial Regression operator is applied to the training data set with default values for all parameters. The regression model generated by the Polynomial Regression operator is applied to the testing data set using the Apply Model operator. The labeled data set generated by the Apply Model operator is provided to the Performance (Regression) operator, whose absolute error and prediction average parameters are set to true. The resulting Performance Vector therefore contains the absolute error and the prediction average of the labeled data set.

The absolute error is calculated by adding the absolute differences between the predicted values and the actual values of the label attribute, and dividing this sum by the total number of predictions. The prediction average is calculated by adding all actual label values and dividing this sum by the total number of examples. You can verify this from the results in the Results Workspace.
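The two performance criteria described above can be checked by hand. The sketch below computes them in plain Python from a few made-up label and prediction values; the function names are hypothetical and the numbers are not taken from the 'Polynomial' data set.

```python
def absolute_error(labels, predictions):
    """Mean of |prediction - label| over all examples."""
    return sum(abs(p - y) for y, p in zip(labels, predictions)) / len(labels)


def prediction_average(labels):
    """Mean of the actual label values, as described in the tutorial text."""
    return sum(labels) / len(labels)


# Illustrative values only.
labels = [10.0, 20.0, 30.0, 40.0]
predictions = [12.0, 18.0, 33.0, 39.0]

mae = absolute_error(labels, predictions)  # (2 + 2 + 3 + 1) / 4
avg = prediction_average(labels)           # (10 + 20 + 30 + 40) / 4
```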