matrix - What method does RapidMiner use to calculate the correlation matrix, and why am I getting negative correlations for two categorical/nominal attributes?

Question

I am hoping someone can answer this for me as I am stuck.

What method does RapidMiner use for its correlation matrix? An answer covering all combinations of data types would be nice, but the most important case for me is nominal/categorical attributes.

I am using RapidMiner to build a correlation matrix and have been careful to properly label all attributes as numbers, binominal, polynominal, etc. My matrix shows negative correlations for some of the nominal/nominal combinations of attributes, which doesn't make sense given the methods I would normally expect to be chosen for this (Phi, Cramer's V, Contingency Coefficient). I thought the correlation had to be positive for these tests, and it doesn't make sense to have a "negative" correlation between categories like gender and city, as that would suggest an order in the data.

Is another test used, or dummy coding, or something else? And if dummy coding is used, how reliable is the value obtained?

Thank you in advance to anyone who can help me. Hate to admit when I am lost, but here I am needing a map :)

score 0 · Accepted Answer

I've included the XML of a process that calculates a correlation matrix for an example set containing nominal values, and then again for the same example set with the nominals converted to numbers. Both runs produce the same matrix, so the operator appears to treat nominal values as simple integers, i.e. value1 becomes 0, value2 becomes 1, and so on.
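
To make the "value1 becomes 0, value2 becomes 1" idea concrete, here is a minimal Python sketch of that kind of integer coding. The function name `encode_unique_integers` and the example values are hypothetical, and assigning indices in order of first appearance is an assumption; the actual operator may order them differently.

```python
def encode_unique_integers(values):
    """Map each distinct nominal value to an integer index.

    Assumption: indices follow the order of first appearance.
    The real operator may assign them in a different order.
    """
    mapping = {}
    codes = []
    for value in values:
        if value not in mapping:
            # The next free index is the number of values seen so far
            mapping[value] = len(mapping)
        codes.append(mapping[value])
    return codes, mapping


cities = ["Berlin", "Paris", "Berlin", "Rome", "Paris"]
codes, mapping = encode_unique_integers(cities)
print(codes)    # [0, 1, 0, 2, 1]
print(mapping)  # {'Berlin': 0, 'Paris': 1, 'Rome': 2}
```

Note that the integers carry an order that the original categories do not have, which matters for any calculation done on them afterwards.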

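According to the help for the Correlation Matrix operator, the mean of each attribute is subtracted from each of that attribute's values. For each pair of attributes, these differences are multiplied together and summed over all examples. The sum is then divided by (number of examples - 1) times the product of the two attributes' standard deviations. I managed to recreate the calculation in a spreadsheet, which is how I know the standard deviation used is the sample one rather than the population one. In other words, this is the Pearson correlation applied to the integer codes, so negative values can occur for nominal attributes, and their sign depends only on the arbitrary order of the codes.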
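
The calculation described above can be sketched in a few lines of Python. This is only an illustration of the formula as I understand it from the help text, not RapidMiner's actual code; the function name `pearson_sample` and the toy data are made up for the example.

```python
import math


def pearson_sample(xs, ys):
    """Pearson correlation using sample (n - 1) standard deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum over all examples of the product of deviations from the means
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sample standard deviations (divide by n - 1, not n)
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cross / ((n - 1) * sd_x * sd_y)


# Two binominal attributes already coded as integers (hypothetical data)
gender = [0, 1, 0, 1, 1, 0]
city = [0, 1, 0, 1, 1, 0]
# Same data, but with the two city codes swapped
city_flipped = [1 - c for c in city]

r_same = pearson_sample(gender, city)
r_flipped = pearson_sample(gender, city_flipped)
print(round(r_same, 6), round(r_flipped, 6))  # 1.0 -1.0
```

Swapping which category gets 0 and which gets 1 flips the sign without changing the data, so the sign of a nominal/nominal correlation computed this way should not be interpreted.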

Here's the process:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.1.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.1.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="generate_nominal_data" compatibility="7.1.001" expanded="true" height="68" name="Generate Nominal Data" width="90" x="45" y="85">
        <parameter key="number_examples" value="20"/>
        <parameter key="number_of_attributes" value="3"/>
        <parameter key="number_of_values" value="3"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.1.001" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="85">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="label"/>
        <parameter key="invert_selection" value="true"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="7.1.001" expanded="true" height="103" name="Multiply" width="90" x="313" y="85"/>
      <operator activated="true" class="nominal_to_numerical" compatibility="7.1.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="447" y="289">
        <parameter key="coding_type" value="unique integers"/>
        <list key="comparison_groups"/>
      </operator>
      <operator activated="true" class="correlation_matrix" compatibility="7.1.001" expanded="true" height="103" name="Correlation Matrix" width="90" x="581" y="85"/>
      <operator activated="true" class="correlation_matrix" compatibility="7.1.001" expanded="true" height="103" name="Correlation Matrix (2)" width="90" x="581" y="289"/>
      <connect from_op="Generate Nominal Data" from_port="output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Correlation Matrix" to_port="example set"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Nominal to Numerical" to_port="example set input"/>
      <connect from_op="Nominal to Numerical" from_port="example set output" to_op="Correlation Matrix (2)" to_port="example set"/>
      <connect from_op="Correlation Matrix" from_port="example set" to_port="result 1"/>
      <connect from_op="Correlation Matrix" from_port="matrix" to_port="result 2"/>
      <connect from_op="Correlation Matrix (2)" from_port="example set" to_port="result 3"/>
      <connect from_op="Correlation Matrix (2)" from_port="matrix" to_port="result 4"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
      <portSpacing port="sink_result 5" spacing="0"/>
    </process>
  </operator>
</process>

Hope it helps as a start.
