normalize vectors

In data mining tasks, normalizing the attributes is usually a prerequisite for getting a meaningful result. For example, if you calculate the Euclidean distance, attributes with relatively large numerical values will have more influence on the result than attributes with relatively small values.
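To make the scale effect concrete, here is a small sketch. The records and attribute names (age in years, income in dollars) are made up for illustration and are not from any real data set.

```python
import numpy as np

# Hypothetical records: [age in years, income in dollars]
a = np.array([25.0, 50000.0])
b = np.array([60.0, 51000.0])
c = np.array([26.0, 58000.0])

def euclidean(u, v):
    """Euclidean distance between two 1-D vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))

dist_ab = euclidean(a, b)  # ages differ by 35 years, incomes by 1000
dist_ac = euclidean(a, c)  # ages differ by 1 year, incomes by 8000

# The income column decides the comparison almost on its own:
# a ends up closer to b than to c although a and c are nearly the same age.
print(dist_ab, dist_ac)
print(dist_ab < dist_ac)
```

Without normalization, the age difference barely changes either distance, because the income differences are several orders of magnitude larger.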

Formula for normalization:

normalized_value = (value - min) / (max - min), where

min: the minimum of the attribute

max: the maximum of the attribute
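As a quick worked example, the formula can be applied to a single attribute with plain Python. The helper name min_max and the sample values are assumptions made for illustration only.

```python
def min_max(value, lo, hi):
    """Apply normalized_value = (value - min) / (max - min) to one number."""
    return (value - lo) / (hi - lo)

# Hypothetical values of one attribute
ages = [25, 60, 26, 40]
lo, hi = min(ages), max(ages)

normalized = [min_max(v, lo, hi) for v in ages]
print(normalized)
```

The smallest value maps to 0, the largest to 1, and everything else falls in between. The formula is undefined when max equals min, so a constant attribute needs separate handling.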

I wrote two versions of the normalization function. The first uses map() with a lambda function; the second uses the numpy.tile() function and calculates on the whole matrix at once. The first one is more concise and needs less memory than the second one, because numpy.tile() builds full-size copies of the minimum and range vectors.

# Version 1: map() with a lambda
import numpy as np

def autoNorm(dataMatrix):
    '''
    normalize the data matrix
    (Method: use map() and a lambda)
    return:
        normed data matrix: np.ndarray. values are in [0,1]
    '''
    minVals = dataMatrix.min(0)  # Conny: axis 0 gives the minimum of each column, axis 1 the minimum of each row
    maxVals = dataMatrix.max(0)
    ranges = maxVals - minVals  # per-column value range (avoids shadowing the built-in range)
    normedRows = map(lambda record: (record - minVals) / ranges, dataMatrix)  # Conny: map() returns a list (Python 2) or an iterator (Python 3), not a NumPy array
    return np.array(list(normedRows))  # Conny: collect the rows and turn them into an ndarray

# Version 2: numpy.tile() on the whole matrix
import numpy as np

def autoNorm2(dataMatrix):
    '''
    normalize the data matrix
    (Method: use numpy.tile() and calculate with the whole matrix)
    return:
        normed data matrix: np.ndarray. values are in [0,1]
    '''
    minVals = dataMatrix.min(0)
    maxVals = dataMatrix.max(0)
    ranges = maxVals - minVals
    numberOfRecords = dataMatrix.shape[0]
    normDataSet = dataMatrix - np.tile(minVals, (numberOfRecords, 1))
    normDataSet = normDataSet / np.tile(ranges, (numberOfRecords, 1))  # Conny: numpy.tile(vector, (repeats along rows, repeats along columns))
    return normDataSet
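
For comparison, NumPy broadcasting can subtract and divide a row vector across every row of a matrix without numpy.tile(). The following is only a sketch of that alternative: the function name auto_norm_broadcast, the sample data, and the choice to map constant columns to 0 are my assumptions, not part of the two versions above.

```python
import numpy as np

def auto_norm_broadcast(data_matrix):
    """Min-max normalize each column of a 2-D array using broadcasting."""
    data = np.asarray(data_matrix, dtype=float)
    min_vals = data.min(axis=0)            # per-column minimum
    ranges = data.max(axis=0) - min_vals   # per-column value range
    # Assumption: a constant column (range 0) is mapped to 0
    # instead of triggering a division by zero.
    safe_ranges = np.where(ranges == 0, 1.0, ranges)
    # The 1-D vectors are broadcast across all rows of data.
    return (data - min_vals) / safe_ranges

# Hypothetical sample: three records, three attributes (the last is constant)
sample = np.array([[1.0, 200.0, 5.0],
                   [2.0, 400.0, 5.0],
                   [3.0, 300.0, 5.0]])
normed = auto_norm_broadcast(sample)
print(normed)
```

This avoids building the tiled copies of the minimum and range vectors, while still operating on the whole matrix at once.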
