RapidMiner Studio documentation, version 9.8

Discretize by Binning (RapidMiner Studio Core)

Synopsis

This operator discretizes the selected numerical attributes into a user-specified number of bins. Bins of equal range are generated automatically; the number of values in each bin may vary.

Description

This operator discretizes the selected numerical attributes into nominal attributes by simple binning. The number of bins parameter specifies the required number of bins. The range of numerical values is partitioned into segments of equal size, and each segment represents a bin. Each numerical value is assigned to the bin whose segment covers it. Each range is named automatically; the naming format can be changed with the range name type parameter. Values falling in the range of a bin are named after that range.

This operator also allows you to apply binning only to a range of values. This is enabled with the define boundaries parameter. The min value and max value parameters define the boundaries of the range. If any values are less than the min value parameter, a separate range is created for them. Similarly, if any values are greater than the max value parameter, a separate range is created for them. Discretization by binning is then performed only on the values within the specified boundaries.
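The equal-range partitioning described above can be illustrated with a minimal Python sketch. This is not the operator's implementation: the function name `discretize_by_binning` is hypothetical, the 'range1', 'range2', ... names follow the default naming used in the tutorial process, and treating the upper edge of each bin as inclusive is an assumption taken from the tutorial's description.

```python
import math

def discretize_by_binning(values, number_of_bins):
    """Map each numerical value to a nominal range name using equal-width bins."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / number_of_bins
    names = []
    for value in values:
        if width == 0:
            index = 1  # all values identical: everything falls into the first bin
        else:
            # Assumption: upper edges are inclusive, so a value on an edge
            # belongs to the lower bin; the minimum is clamped into bin 1.
            index = min(max(math.ceil((value - lowest) / width), 1), number_of_bins)
        names.append(f"range{index}")
    return names

print(discretize_by_binning([1, 2, 3, 4, 10], 3))
# ['range1', 'range1', 'range1', 'range1', 'range3']
```

The sample values span 1 to 10, so each of the three bins is 3 wide. Four values land in the first bin, none in the second and one in the third, which shows why the number of values per bin may vary.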

Differentiation

Discretize by Frequency

The Discretize By Frequency operator creates bins in such a way that the number of unique values in each bin is (almost) equal.

Discretize by Size

The Discretize By Size operator creates bins in such a way that each bin has a user-specified size (i.e. number of examples).

Discretize by Entropy

The discretization is performed by selecting bin boundaries such that the entropy is minimized in the induced partitions.

Discretize by User Specification

This operator discretizes the selected numerical attributes into user-specified classes.
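To make the contrast between equal-range bins and size-based bins concrete, the following Python sketch counts the examples per bin under each idea. It only illustrates the two concepts; the helper names `equal_width_counts` and `fixed_size_groups` are hypothetical, and details such as tie handling in the actual operators are not modeled.

```python
import math
from collections import Counter

def equal_width_counts(values, number_of_bins):
    """Count how many values fall into each equal-width bin."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / number_of_bins
    counts = Counter()
    for value in values:
        # Assumption: upper edges are inclusive; the minimum is clamped into bin 1.
        index = min(max(math.ceil((value - lowest) / width), 1), number_of_bins)
        counts[index] += 1
    return [counts[index] for index in range(1, number_of_bins + 1)]

def fixed_size_groups(values, size):
    """Sort the values and cut them into consecutive groups of `size` examples."""
    ordered = sorted(values)
    return [ordered[start:start + size] for start in range(0, len(ordered), size)]

values = [0, 1, 2, 3, 4, 12]
print(equal_width_counts(values, 3))  # [5, 0, 1]
print(fixed_size_groups(values, 2))   # [[0, 1], [2, 3], [4, 12]]
```

With equal-width bins the counts are uneven because one large value stretches the range; with fixed-size groups every bin holds the same number of examples, but the bins cover ranges of different widths.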

Input

  • example set (Data Table)

    This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as input. Meta data must be attached to the input data because attributes are specified in the meta data. The Retrieve operator provides meta data along with the data. Note that the input ExampleSet should contain at least one numerical attribute; otherwise, using this operator does not make sense.

Output

  • example set (Data Table)

    The selected numerical attributes are converted into nominal attributes by binning and the resultant ExampleSet is delivered through this port.

  • original (Data Table)

    The ExampleSet that was given as input is passed to the output through this port without changes. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.

  • preprocessing model

    This port delivers the preprocessing model, which has information regarding the parameters of this operator in the current process.

Parameters

  • create_view It is possible to create a View instead of changing the underlying data. Simply select this parameter to enable this option. The transformation that would be normally performed directly on the data will then be computed every time a value is requested and the result is returned without changing the data. Range: boolean
  • attribute_filter_type This parameter allows you to select the attribute selection filter, that is, the method you want to use for selecting attributes. It has the following options:
    • all: This option simply selects all the attributes of the ExampleSet. This is the default option.
    • single: This option allows selection of a single attribute. When this option is selected another parameter (attribute) becomes visible in the Parameters panel.
    • subset: This option allows selection of multiple attributes through a list. All attributes of ExampleSet are present in the list; required attributes can be easily selected. This option will not work if meta data is not known. When this option is selected another parameter becomes visible in the Parameters panel.
    • regular_expression: This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
    • value_type: This option allows selection of all the attributes of a particular type. Note that types are hierarchical. For example, the real and integer types both belong to the numeric type. Users should have a basic understanding of the type hierarchy when selecting attributes through this option. When this option is selected some other parameters (value type, use value type exception) become visible in the Parameters panel.
    • block_type: This option is similar in working to the value_type option. This option allows selection of all the attributes of a particular block type. It should be noted that block types may be hierarchical. For example value_series_start and value_series_end block types both belong to the value_series block type. When this option is selected some other parameters (block type, use block type exception) become visible in the Parameters panel.
    • no_missing_values: This option simply selects all the attributes of the ExampleSet which don't contain a missing value in any example. Attributes that have even a single missing value are removed.
    • numeric_value_filter: When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples all satisfy the mentioned numeric condition are selected. Please note that all nominal attributes are also selected irrespective of the given numerical condition.
    Range: selection
  • attribute The required attribute can be selected from this option. The attribute name can be selected from the drop down box of the parameter attribute if the meta data is known. Range: string
  • attributes The required attributes can be selected from this option. This opens a new window with two lists. All attributes are present in the left list and can be shifted to the right list, which is the list of selected attributes. Range: string
  • regular_expression The attributes whose names match this expression will be selected. Regular expressions are a very powerful tool but need a detailed explanation for beginners. It is always good to specify the regular expression through the edit and preview regular expression menu. This menu gives a good idea of regular expressions and also allows you to try different expressions and preview the results simultaneously. Range: string
  • use_except_expression If enabled, an exception to the first regular expression can be specified. When this option is selected another parameter (except regular expression) becomes visible in the Parameters panel. Range: boolean
  • except_regular_expression This option allows you to specify a regular expression. Attributes matching this expression will be filtered out even if they match the first regular expression (the regular expression that was specified in the regular expression parameter). Range: string
  • value_type The type of attributes to be selected can be chosen from a drop down list. Range: selection
  • use_value_type_exception If enabled, an exception to the selected type can be specified. When this option is enabled, another parameter (except value type) becomes visible in the Parameters panel. Range: boolean
  • except_value_type The attributes matching this type will not be selected even if they match the previously mentioned type, i.e. the value type parameter's value. Range: selection
  • block_type The block type of attributes to be selected can be chosen from a drop down list. Range: selection
  • use_block_type_exception If enabled, an exception to the selected block type can be specified. When this option is selected another parameter (except block type) becomes visible in the Parameters panel. Range: boolean
  • except_block_type The attributes matching this block type will not be selected even if they match the previously mentioned block type, i.e. the block type parameter's value. Range: selection
  • numeric_condition The numeric condition for testing examples of numeric attributes is specified here. For example the numeric condition '> 6' will keep all nominal attributes and all numeric attributes having a value greater than 6 in every example. A combination of conditions is possible: '> 6 && < 11' or '<= 5 || < 0'. However, && and || cannot be used together in one numeric condition. Conditions like '(> 0 && < 2) || (>10 && < 12)' are not allowed because they use both && and ||. Use a blank space after '>', '=' and '<'; for example, '<5' will not work, so use '< 5' instead. Range: string
  • include_special_attributes Special attributes are attributes with special roles which identify the examples. In contrast, regular attributes simply describe the examples. Special attributes are: id, label, prediction, cluster, weight and batch. By default all special attributes are selected irrespective of the conditions in the Select Attribute operator. If this parameter is set to true, special attributes are also tested against the conditions specified in the Select Attribute operator, and only those attributes that satisfy the conditions are selected. Range: boolean
  • invert_selection If this parameter is set to true, it acts as a NOT gate and reverses the selection. In that case all the selected attributes are unselected and previously unselected attributes are selected. For example, if attribute 'att1' is selected and attribute 'att2' is unselected before this parameter is checked, then after checking it 'att1' will be unselected and 'att2' will be selected. Range: boolean
  • number_of_bins This parameter specifies the number of bins which should be used for each attribute. Range: integer
  • define_boundaries The Discretize by Binning operator allows you to apply binning only to a range of values. If this parameter is set to true, discretization by binning is performed only on the values that are within the specified boundaries. The lower and upper limits of the boundary are specified by the min value and max value parameters respectively. Range: boolean
  • min_value This parameter is only available when the define boundaries parameter is set to true. It is used to specify the lower limit value for the binning range. Range: real
  • max_value This parameter is only available when the define boundaries parameter is set to true. It is used to specify the upper limit value for the binning range. Range: real
  • range_name_type This parameter is used to change the naming format for ranges. The 'long', 'short' and 'interval' formats are available. Range: selection
  • automatic_number_of_digits This is an expert parameter. It is only available when the range name type parameter is set to 'interval'. It indicates if the number of digits should be automatically determined for the range names. Range: boolean
  • number_of_digits This is an expert parameter. It is used to specify the minimum number of digits used for the interval names. Range: integer
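The interplay of the regular expression, except regular expression and invert selection parameters described above can be sketched in Python. This is an illustration only: the function `select_attributes` is hypothetical, Python's `re` syntax differs in places from the regular expressions RapidMiner accepts, and requiring the expression to match the whole attribute name is an assumption.

```python
import re

def select_attributes(names, regular_expression,
                      except_regular_expression=None, invert_selection=False):
    """Return the attribute names chosen by the regular-expression filter."""
    selected = []
    for name in names:
        # Assumption: the expression must match the entire attribute name.
        matches = re.fullmatch(regular_expression, name) is not None
        # The except expression filters out names that matched the first expression.
        if matches and except_regular_expression is not None:
            if re.fullmatch(except_regular_expression, name) is not None:
                matches = False
        # Inverting the selection keeps exactly the names that were not matched.
        if matches != invert_selection:
            selected.append(name)
    return selected

# Attribute names chosen for illustration.
names = ["Outlook", "Temperature", "Humidity", "Wind", "Play"]
print(select_attributes(names, "Temperature|Humidity|Wind", "W.*"))
# ['Temperature', 'Humidity']
print(select_attributes(names, "Temperature|Humidity|Wind", "W.*", invert_selection=True))
# ['Outlook', 'Wind', 'Play']
```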

Tutorial Processes

Discretizing numerical attributes of the 'Golf' data set by Binning

The focus of this Example Process is the binning procedure. For understanding the parameters related to attribute selection please study the Example Process of the Select Attributes operator.

The 'Golf' data set is loaded using the Retrieve operator. The Discretize by Binning operator is applied on it. The 'Temperature' and 'Humidity' attributes are selected for discretization. The number of bins parameter is set to 2. The define boundaries parameter is set to true. The min value and max value parameters are set to 70 and 80 respectively. Thus binning will be performed only in the range from 70 to 80.

As the number of bins parameter is set to 2, the range will be divided into two equal segments: the first from 70 to 75 and the second from 75 to 80. There will be a separate range for all values that fall below the binning range, i.e. at or below the min value of 70. This range is automatically named 'range1'. The first and second segments of the binning range are named 'range2' and 'range3' respectively. There will be a separate range for all values that are greater than the max value parameter, i.e. greater than 80. This range is automatically named 'range4'.

Run the process and compare the original data set with the discretized one. You can see that the values less than or equal to 70 in the original data set are named 'range1' in the discretized data set. The values greater than 70 and less than or equal to 75 are named 'range2'. The values greater than 75 and less than or equal to 80 are named 'range3'. The values greater than 80 are named 'range4'.
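The range assignment in this tutorial can be reproduced with a short Python sketch. It is not the operator's code: the function `bin_with_boundaries` is hypothetical, the sample values are made up for illustration rather than taken from the 'Golf' data set, and the edge handling (upper edges inclusive) is inferred from the results described above.

```python
import math

def bin_with_boundaries(value, min_value, max_value, number_of_bins):
    """Name the range of a value when binning is limited to [min_value, max_value]."""
    if value <= min_value:
        return "range1"  # separate range below the binning range
    if value > max_value:
        return f"range{number_of_bins + 2}"  # separate range above the binning range
    width = (max_value - min_value) / number_of_bins
    # Upper edges are inclusive, so 75 belongs to the first inner segment.
    index = min(max(math.ceil((value - min_value) / width), 1), number_of_bins)
    return f"range{index + 1}"

# Illustrative values, using the tutorial's settings: min 70, max 80, 2 bins.
for value in [64, 70, 72, 75, 76, 80, 85]:
    print(value, bin_with_boundaries(value, 70, 80, 2))
# 64 range1
# 70 range1
# 72 range2
# 75 range2
# 76 range3
# 80 range3
# 85 range4
```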