RapidMiner Studio documentation, version 8.0
Correlation Matrix
(Concurrency)
Synopsis
This operator determines the correlation between all attributes and can produce a weights vector based on these correlations. Correlation is a statistical technique that can show whether and how strongly pairs of attributes are related.
Description
A correlation is a number between -1 and +1 that measures the degree of association between two attributes (call them X and Y). A positive value for the correlation implies a positive association. In this case large values of X tend to be associated with large values of Y and small values of X tend to be associated with small values of Y. A negative value for the correlation implies a negative or inverse association. In this case large values of X tend to be associated with small values of Y and vice versa.
Suppose we have two attributes X and Y, with means X' and Y' and standard deviations S(X) and S(Y) respectively. The correlation is computed by summing the products (X(i)-X').(Y(i)-Y') for i from 1 to n and then dividing this sum by the product (n-1).S(X).S(Y), where n is the total number of examples and i is the index of summation. There are other formulas and definitions, but let us stick to this one for simplicity.
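The formula above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the operator's actual implementation; the function names and the sample values are made up for the example, and S(X) and S(Y) are assumed to be sample standard deviations (divisor n-1), which is what makes the result fall between -1 and +1.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sample_std(values):
    """Sample standard deviation (divides by n - 1); an assumption of this sketch."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(x, y):
    """Sum of (X(i)-X').(Y(i)-Y') divided by (n-1).S(X).S(Y)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mx, my = mean(x), mean(y)
    total = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    # Note: a constant attribute has S = 0, so this division would fail.
    return total / ((len(x) - 1) * sample_std(x) * sample_std(y))

# Illustrative values only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.0, 4.0, 6.0, 8.0, 10.0]    # rises with x
y_down = [10.0, 8.0, 6.0, 4.0, 2.0]  # falls as x rises

r_up = correlation(x, y_up)      # close to +1
r_down = correlation(x, y_down)  # close to -1
```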
As discussed earlier a positive value for the correlation implies a positive association. Suppose that an X value was above average, and that the associated Y value was also above average. Then the product (X(i)-X').(Y(i)-Y') would be the product of two positive numbers which would be positive. If the X value and the Y value were both below average, then the product above would be of two negative numbers, which would also be positive. Therefore, a positive correlation is evidence of a general tendency that large values of X are associated with large values of Y and small values of X are associated with small values of Y.
As discussed earlier a negative value for the correlation implies a negative or inverse association. Suppose that an X value was above average, and that the associated Y value was instead below average. Then the product (X(i)-X').(Y(i)-Y') would be the product of a positive and a negative number which would make the product negative. If the X value was below average and the Y value was above average, then the product above would also be negative. Therefore, a negative correlation is evidence of a general tendency that large values of X are associated with small values of Y and small values of X are associated with large values of Y.
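The sign argument in the two paragraphs above can be checked by looking at the individual products (X(i)-X').(Y(i)-Y'). The following sketch uses made-up values and a hypothetical helper name; it only lists the products so that their signs can be inspected.

```python
def deviation_products(x, y):
    """Return the products (X(i)-X').(Y(i)-Y') for every example i."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    return [(xi - mx) * (yi - my) for xi, yi in zip(x, y)]

# Illustrative values only
x = [1.0, 2.0, 4.0, 5.0]
y_same = [10.0, 20.0, 40.0, 50.0]      # above-average X goes with above-average Y
y_opposite = [50.0, 40.0, 20.0, 10.0]  # above-average X goes with below-average Y

same = deviation_products(x, y_same)      # [40.0, 10.0, 10.0, 40.0]
opp = deviation_products(x, y_opposite)   # [-40.0, -10.0, -10.0, -40.0]
```

Every product in the first list is positive, so their sum (and therefore the correlation) is positive; every product in the second list is negative, so the correlation is negative.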
This operator can be used to create a correlation matrix that shows the correlations of all the attributes of the input ExampleSet. Please note that this operator performs a data scan for each attribute combination and might therefore take some time for ExampleSets that are not held in memory. The operator can also return an attribute weights vector based on the correlations. Using this weights vector, highly correlated attributes can be removed from the ExampleSet with the help of the Select by Weights operator. A simpler way to remove highly correlated attributes is the Remove Correlated Attributes operator. Correlated attributes are usually removed because they behave similarly and have a similar impact on prediction calculations, so keeping all of them is redundant. Removing correlated attributes saves storage space and reduces the computation time of complex algorithms.
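As a rough illustration of the idea, the sketch below builds a correlation matrix for every attribute pair and then drops attributes that are highly correlated with one that is already kept. The column names, the sample values, the 0.95 threshold and the greedy removal strategy are all assumptions made for this example; they are not taken from the Correlation Matrix or Remove Correlated Attributes operators.

```python
import math
from itertools import combinations

def correlation(x, y):
    """Pearson correlation; algebraically the same as the (n-1).S(X).S(Y) form."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """One pass per attribute pair; the result is symmetric with 1.0 on the diagonal."""
    names = list(columns)
    matrix = {name: {name: 1.0} for name in names}
    for a, b in combinations(names, 2):
        r = correlation(columns[a], columns[b])
        matrix[a][b] = r
        matrix[b][a] = r
    return matrix

def drop_correlated(columns, threshold=0.95):
    """Greedy sketch: keep an attribute only if |r| with every kept attribute is <= threshold."""
    kept = []
    for name in columns:
        if all(abs(correlation(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical data: temp_f is a linear function of temp_c, so the two are redundant
temp_c = [20.0, 22.0, 25.0, 30.0, 35.0]
columns = {
    "temp_c": temp_c,
    "temp_f": [c * 9 / 5 + 32 for c in temp_c],
    "humidity": [70.0, 65.0, 80.0, 60.0, 75.0],
}

matrix = correlation_matrix(columns)
kept = drop_correlated(columns)  # ['temp_c', 'humidity']
```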
Input
example set (Data Table)
This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as input.
Output
example set (Data Table)
The ExampleSet that was given as input is passed without changing to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.
matrix (Numerical Matrix)
The correlations of all attributes of the input ExampleSet are calculated and the resultant correlation matrix is returned from this port.
weights (Attribute Weights)
The attribute weights vector based on the correlations of the attributes is delivered through this output port.
Parameters
- attribute_filter_type: This parameter allows you to select the attribute selection filter, i.e. the method you want to use for selecting the required attributes. It has the following options:
  - all: This option simply selects all the attributes of the ExampleSet. This is the default option.
  - single: This option allows selection of a single attribute. When this option is selected another parameter (attribute) becomes visible in the Parameters panel.
  - subset: This option allows selection of multiple attributes through a list. All attributes of the ExampleSet are present in the list; required attributes can be easily selected. This option will not work if the meta data is not known. When this option is selected another parameter (attributes) becomes visible in the Parameters panel.
  - regular_expression: This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
  - value_type: This option allows selection of all the attributes of a particular type. It should be noted that types are hierarchical. For example, real and integer types both belong to the numeric type. Users should have a basic understanding of the type hierarchy when selecting attributes through this option. When it is selected some other parameters (value type, use value type exception) become visible in the Parameters panel.
  - block_type: This option works similarly to the value_type option. It allows selection of all the attributes of a particular block type. When this option is selected some other parameters (block type, use block type exception) become visible in the Parameters panel.
  - no_missing_values: This option simply selects all the attributes of the ExampleSet which don't contain a missing value in any example. Attributes that have even a single missing value are removed.
  - numeric_value_filter: When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples all satisfy the mentioned numeric condition are selected. Please note that all nominal attributes are also selected irrespective of the given numerical condition.
- attribute: The desired attribute can be selected with this parameter. The attribute name can be selected from the drop-down box of the attribute parameter if the meta data is known. Range: string
- attributes: The required attributes can be selected with this parameter. This opens a new window with two lists. All attributes are present in the left list and can be shifted to the right list, which is the list of selected attributes that the operator will work on. Range: string
- regular_expression: The attributes whose names match this expression will be selected. Regular expressions are a very powerful tool but can be hard to get right for beginners. It is best to specify the regular expression through the edit and preview regular expression menu, which lets you try different expressions and preview the results at the same time. Range: string
- use_except_expression: If enabled, an exception to the first regular expression can be specified. When this option is selected another parameter (except regular expression) becomes visible in the Parameters panel. Range: boolean
- except_regular_expression: This option allows you to specify a regular expression. Attributes matching this expression will be filtered out even if they match the first expression (the expression that was specified in the regular expression parameter). Range: string
- value_type: The type of attributes to be selected can be chosen from a drop-down list. One of the following types can be chosen: nominal, text, binominal, polynominal, file_path. Range: selection
- use_value_type_exception: If enabled, an exception to the selected type can be specified. When this option is selected another parameter (except value type) becomes visible in the Parameters panel. Range: boolean
- except_value_type: The attributes matching this type will be removed from the final output even if they matched the previously mentioned type, i.e. the value of the value type parameter. One of the following types can be selected here: nominal, text, binominal, polynominal, file_path. Range: selection
- block_type: The block type of attributes to be selected can be chosen from a drop-down list. The only possible value here is 'single_value'. Range: selection
- use_block_type_exception: If enabled, an exception to the selected block type can be specified. When this option is selected another parameter (except block type) becomes visible in the Parameters panel. Range: boolean
- except_block_type: The attributes matching this block type will be removed from the final output even if they matched the previously mentioned block type. Range: selection
- numeric_condition: The numeric condition for testing examples of numeric attributes is specified here. For example, the numeric condition '> 6' will keep all nominal attributes and all numeric attributes having a value greater than 6 in every example. A combination of conditions is possible: '> 6 && < 11' or '<= 5 || < 0'. However, && and || cannot be used together in one numeric condition. Conditions like '(> 0 && < 2) || (>10 && < 12)' are not allowed because they use both && and ||. Use a blank space after '>', '=' and '<', e.g. '<5' will not work, so use '< 5' instead. Range: string
- include_special_attributes: The special attributes are attributes with special roles which identify the examples. In contrast, regular attributes simply describe the examples. Special attributes are: id, label, prediction, cluster, weight and batch. Range: boolean
- invert_selection: If this parameter is set to true, it acts as a NOT gate and reverses the selection. In that case all the selected attributes are unselected and previously unselected attributes are selected. For example, if attribute 'att1' is selected and attribute 'att2' is unselected before this parameter is checked, then after checking it 'att1' will be unselected and 'att2' will be selected. Range: boolean
- normalize_weights: This parameter indicates if the weights of the resultant attribute weights vector should be normalized. If set to true, all weights are normalized such that the minimum weight is 0 and the maximum weight is 1. Range: boolean
- squared_correlation: This parameter indicates if the squared correlation should be calculated. If set to true, the correlation matrix shows squares of correlations instead of simple correlations. Range: boolean
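The effect of the last two parameters in the list above can be sketched as follows. This is a minimal illustration, assuming that normalization means plain min-max scaling (as the description of normalize_weights suggests) and that squared correlation means squaring each matrix entry. The function names and all values are hypothetical, and how the operator derives the raw weights from the correlations is not covered here.

```python
def normalize_weights(weights):
    """Min-max scale weights so that the smallest becomes 0 and the largest 1."""
    low, high = min(weights.values()), max(weights.values())
    if high == low:
        # Assumption of this sketch: identical weights are returned unchanged
        return dict(weights)
    return {name: (w - low) / (high - low) for name, w in weights.items()}

def square_matrix(matrix):
    """Replace every correlation r by r * r; the sign is lost, values lie in [0, 1]."""
    return [[r * r for r in row] for row in matrix]

# Hypothetical raw weights and correlations
raw_weights = {"att1": 2.0, "att2": 3.0, "att3": 6.0}
normalized = normalize_weights(raw_weights)
# {'att1': 0.0, 'att2': 0.25, 'att3': 1.0}

correlations = [[1.0, -0.5], [-0.5, 1.0]]
squared = square_matrix(correlations)
# [[1.0, 0.25], [0.25, 1.0]]
```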
Tutorial Processes
Correlation matrix of the Golf data set
The 'Golf' data set is loaded using the Retrieve operator. A breakpoint is inserted here so that you can view the ExampleSet. As you can see, the ExampleSet has four regular attributes: 'Outlook', 'Temperature', 'Humidity' and 'Wind'. The Correlation Matrix operator is applied to it. The weights vector generated by this operator is provided to the Select by Weights operator along with the 'Golf' data set. The parameters of the Select by Weights operator are adjusted such that the attributes with weights greater than 0.5 are selected and all other attributes are removed. This is why the resultant ExampleSet does not have the 'Temperature' attribute (weight=0). The correlation matrix, weights vector and the resultant ExampleSet can be viewed in the Results Workspace.
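The weight-based selection step of this process can be pictured with the short sketch below. It only mimics the "greater than 0.5" rule described above; the function name, the attribute names and the weights are placeholders invented for the example and are not the values the operator produces for the 'Golf' data set.

```python
def select_by_weight(weights, threshold=0.5):
    """Keep the attributes whose weight is strictly greater than the threshold."""
    return [name for name, w in weights.items() if w > threshold]

# Placeholder weights, not real output of the Correlation Matrix operator
weights = {"att1": 0.0, "att2": 0.8, "att3": 1.0}

selected = select_by_weight(weights)  # ['att2', 'att3']
```

An attribute with weight 0, like 'att1' here, falls below the threshold and is removed, which matches what the process does to 'Temperature'.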