normalize vectors
In data mining tasks, normalizing the attributes is usually a prerequisite for getting meaningful results. For example, when you calculate the Euclidean distance, attributes with relatively large numerical values have more influence on the result than attributes with relatively small values.
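To make the effect concrete, here is a small sketch with two made-up records. The attribute names (income, age) and the numbers are hypothetical and chosen only for illustration; they do not come from any real data set.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical records: (annual income, age)
p = (50000.0, 25.0)
q = (51000.0, 60.0)

raw_distance = euclidean(p, q)

# Share of the squared distance contributed by each attribute
income_share = (p[0] - q[0]) ** 2 / raw_distance ** 2
age_share = (p[1] - q[1]) ** 2 / raw_distance ** 2

print(raw_distance, income_share, age_share)
```

Although the two ages differ a lot relative to a typical age range, the income attribute accounts for almost all of the distance simply because its values are numerically larger.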
Normalization formula:
normalized_value = (value - min) / (max - min), where
min: the minimum of the attribute
max: the maximum of the attribute
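As a worked example, the formula can be applied to a single attribute in plain Python. The function name min_max_normalize and the sample values are assumptions made for this sketch; mapping a constant attribute (max equal to min) to 0.0 is also just one possible convention, since the formula itself is undefined in that case.

```python
def min_max_normalize(values):
    """Apply (value - min) / (max - min) to every value of one attribute."""
    lo = min(values)
    hi = max(values)
    span = hi - lo
    if span == 0:
        # Assumption: a constant attribute carries no information, map it to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

ages = [20, 30, 40, 60]          # hypothetical attribute values
normalized = min_max_normalize(ages)
print(normalized)                # [0.0, 0.25, 0.5, 1.0]
```

The minimum maps to 0, the maximum maps to 1, and everything else lands proportionally in between.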
I wrote two versions of the normalization function. The first uses map() with a lambda function; the second uses the numpy.tile() function and computes on the whole matrix at once. The first is more concise and needs less memory than the second.
import numpy as np

def autoNorm(dataMatrix):
    '''
    normalize the data matrix column by column
    (Method: use map() and a lambda)
    return:
        normed data matrix: np.ndarray. values are in [0,1]
    '''
    minVals = dataMatrix.min(0)  # Conny: in dataMatrix.min(0), 0 means per column, 1 means per row
    maxVals = dataMatrix.max(0)
    ranges = maxVals - minVals  # named `ranges` so the built-in range() is not shadowed
    # Conny: map() does not return a numpy array (a list in Python 2, an iterator in Python 3),
    # so collect the rows with list() before converting.
    normed = list(map(lambda record: (record - minVals) / ranges, dataMatrix))
    return np.array(normed)  # Conny: np.array() turns the list of rows into a 2-D ndarray
import numpy as np

def autoNorm2(dataMatrix):
    '''
    normalize the data matrix column by column
    (Method: use numpy.tile() and calculate with the whole matrix)
    return:
        normed data matrix: np.ndarray. values are in [0,1]
    '''
    minVals = dataMatrix.min(0)
    maxVals = dataMatrix.max(0)
    ranges = maxVals - minVals
    numberOfRecords = dataMatrix.shape[0]
    # Conny: numpy.tile(vector, (numberOfRows, 1)) stacks the vector numberOfRows times
    # vertically and once horizontally, giving a matrix of the same shape as dataMatrix.
    normDataSet = dataMatrix - np.tile(minVals, (numberOfRecords, 1))
    normDataSet = normDataSet / np.tile(ranges, (numberOfRecords, 1))
    return normDataSet
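As a side note, NumPy broadcasting can express the same whole-matrix calculation without building the tiled copies. The sketch below is an alternative, not one of the two versions described above; the name auto_norm_broadcast and the sample data are assumptions made for illustration, and it assumes every column has max greater than min.

```python
import numpy as np

def auto_norm_broadcast(data):
    """Min-max normalize each column, relying on broadcasting instead of np.tile()."""
    data = np.asarray(data, dtype=float)
    min_vals = data.min(axis=0)              # shape (n_columns,)
    ranges = data.max(axis=0) - min_vals     # shape (n_columns,)
    # The 1-D vectors are broadcast across every row of the 2-D array
    return (data - min_vals) / ranges

# Hypothetical data: 3 records, 2 attributes
data = np.array([[1.0, 200.0],
                 [2.0, 400.0],
                 [3.0, 800.0]])
normed = auto_norm_broadcast(data)

# Same calculation written with np.tile(), for comparison
n = data.shape[0]
tiled = (data - np.tile(data.min(0), (n, 1))) / np.tile(data.max(0) - data.min(0), (n, 1))

print(np.allclose(normed, tiled))
```

Both expressions give the same normalized matrix; the broadcasting form simply avoids allocating the two tiled intermediates.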