Sunday, May 13, 2007

Pearson's Correlation Coefficient

Reference: http://davidmlane.com/hyperstat/A51911.html

- Designated by r
- Correlation reflects the linear relationship between two variables.
- -1 <= r <= +1
(from a perfect negative linear relationship to a perfect positive linear relationship)
- 0: no linear correlation
- r = (sum of the products of the paired z-scores) / N
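
The z-score formula above can be sketched in a few lines of Python. This is a minimal illustration, not a library routine: the function name `pearson_r` is made up for this note, and it assumes the z-scores use the population standard deviation (dividing by N), which is what makes the division by N in the formula come out right.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r as the mean product of paired z-scores (population form)."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Assumption: population standard deviations (divide by N, not N - 1).
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if sd_x == 0 or sd_y == 0:
        raise ValueError("r is undefined when a variable has zero variance")
    # Sum of the products of the z-scores, divided by N.
    return sum(
        ((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
        for x, y in zip(xs, ys)
    ) / n

print(pearson_r([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))
```

For the data shown, the deviations from the means give a product sum of 8 and sums of squares of 10 each, so r should come out at about 0.8.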


Clustering Techniques Taxonomy

Techniques:

1. Hierarchical Techniques:
(A) Agglomerative:
Start with N clusters, end with one cluster.
1. Nearest-Neighbor/Single link method
Distance between groups is the distance between their closest pairs.

2. Furthest-neighbor/Complete linkage
Distance between groups is the distance between their most remote pairs.

3. Centroid Cluster Analysis
Groups defined by the distance between their centroids.

4. Median Cluster Analysis
Unlike Centroid, ignores the group size by assuming groups are of equal size. Works only for distance measures.

5. Group Average Method
Distance between groups is the average of the distances between all pairs of individuals in two groups.

6. Ward's Method
Combine the pair of clusters whose merger gives the minimum increase in the error sum of squares.
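
The between-group distances used by several of the agglomerative methods above can be sketched as follows. This is only an illustration under stated assumptions: Euclidean distance between points, clusters given as lists of coordinate tuples, and function names invented for this note. The Ward function uses the standard identity that the increase in the error sum of squares from merging two groups is n_a * n_b / (n_a + n_b) times the squared distance between their centroids.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def single_link(a, b):
    """Nearest neighbor: distance between the closest pair."""
    return min(math.dist(p, q) for p in a for q in b)

def complete_link(a, b):
    """Furthest neighbor: distance between the most remote pair."""
    return max(math.dist(p, q) for p in a for q in b)

def group_average(a, b):
    """Average of the distances between all cross-group pairs."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def centroid_distance(a, b):
    """Distance between the two group centroids."""
    return math.dist(centroid(a), centroid(b))

def ward_increase(a, b):
    """Increase in the error sum of squares if a and b were merged."""
    n_a, n_b = len(a), len(b)
    return n_a * n_b / (n_a + n_b) * centroid_distance(a, b) ** 2

a = [(0, 0), (0, 2)]
b = [(3, 0), (3, 2)]
for measure in (single_link, complete_link, group_average,
                centroid_distance, ward_increase):
    print(measure.__name__, measure(a, b))
```

An agglomerative pass would evaluate one of these measures for every pair of current clusters, merge the pair with the smallest value, and repeat until one cluster remains.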

(B) Divisive
Start with one cluster, end with N clusters.
1. Monothetic
Split on the presence or absence of a single specified attribute. Requires binary data.
A. Association Analysis
B. Automatic Interaction Detector Method (A.I.D.)

2. Polythetic
Methods based on the values taken by all the attributes.

2. Optimization-partitioning techniques
- clusters formed by optimizing a clustering criterion
- unlike hierarchical, the clusters of entities can change. Poor partitioning can be corrected at a later stage.

Typical steps:
* Initiate clusters (select the initial clusters)
* Allocate entities to the initiated clusters
* Reallocate entities to other clusters

Selecting Initial Clusters:
- random selection
- set up cluster centers regularly spaced at intervals of one standard deviation on each variable
- use prior knowledge

Reallocation:
- consider each entity in turn for reassignment to another cluster
- decide based on optimization criterion
- continue until stabilization
Issue: Local minima
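
The allocate/reallocate loop can be sketched with a k-means-style procedure. This is one possible instance, not the only one: it assumes the optimization criterion is the within-cluster sum of squared Euclidean distances, it reassigns all entities in a batch on each pass rather than strictly one at a time, and the name `partition` is made up for this note.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def partition(points, initial_centers, max_iter=100):
    """Allocate points to the nearest center, then reallocate until stable."""
    centers = list(initial_centers)  # copy so the caller's list is untouched
    assignment = None
    for _ in range(max_iter):
        # Allocate (or reallocate) every entity to its nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # stabilization: no entity changed cluster
        assignment = new_assignment
        # Recompute each center; an empty cluster keeps its old center.
        for k in range(len(centers)):
            members = [p for p, c in zip(points, assignment) if c == k]
            if members:
                centers[k] = centroid(members)
    return assignment, centers

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
labels, centers = partition(points, [(0, 0), (1, 0)])
print(labels, centers)
```

Because the loop only ever moves to a nearby better partition, the result depends on the initial centers, which is the local minima issue noted above. Running it from several different starting points and keeping the best result is a common workaround.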

3. Density or mode seeking techniques
- clusters formed by searching for regions containing a relatively dense concentration of entities

4. Clumping techniques
- clusters overlap
