# Archive for March, 2015

# Use numpy.ndarray.argsort() and list.sort() to get the rank of each number

I have a list and I want to get the rank of each member. This is important for many data mining algorithms; in kNN (k-Nearest-Neighbor), for example, you want to know the rank of a certain record.

For example:

myList = [2, 1, 7, 3, 6]

My expected result:

rank = [1, 0, 4, 2, 3]

Take care: I mean the rank of each element, not the indices that would sort the list (which is what argsort returns).

Solution 1:

Step 1: Get the indices that sort myList

- Method 1: use numpy.ndarray.argsort() to get an array of the sorted indices

np.array(myList).argsort()

Out: array([1, 0, 3, 4, 2])

- Method 2: use list.sort(key=…) on a list of indices

indices = list(range(5))

indices.sort(key=myList.__getitem__)

indices is now [1, 0, 3, 4, 2]

Step 2: Turn the sorted indices into ranks

indices = np.array(myList).argsort()

[indices.tolist().index(i) for i in range(5)]

Out: [1, 0, 4, 2, 3]

Solution 2:

You can do everything in one step. Note that this sorts the list again for every element, so it is only suitable for short lists:

ranking = [sorted(myList).index(each) for each in myList]
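The two steps of Solution 1 can also be combined without numpy. The following is a minimal sketch, assuming Python 3; the helper name rank_of_each is made up for illustration and does not come from the original post.

```python
def rank_of_each(values):
    """Return the rank (0 = smallest) of every element in values."""
    # Step 1: the indices that would sort the list (what argsort returns)
    order = sorted(range(len(values)), key=values.__getitem__)
    # Step 2: invert that permutation, so ranks[i] is the position
    # of values[i] in the sorted order
    ranks = [0] * len(values)
    for position, index in enumerate(order):
        ranks[index] = position
    return ranks


myList = [2, 1, 7, 3, 6]
ranks = rank_of_each(myList)
print(ranks)  # [1, 0, 4, 2, 3]
```

This sorts only once, whereas calling index() inside a loop searches the list again for every element. The two approaches also treat duplicates differently: here equal values get distinct consecutive ranks, while Solution 2 gives equal values the same rank.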

# sort 2d array in python

There are many ways to sort a 2D array.

In general, you can use sorted(), numpy.ndarray.sort(),…

But you should also understand

- what list1.sort(list2) means
- ravel() vs. flatten()
- what the key function in sorted() returns
- what np.lexsort() and np.argsort() return
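As one illustration of the points above, here is a minimal sketch of sorting a list of rows with sorted() and list.sort() using a key function. The table data is made up, and plain Python 3 lists are used instead of numpy arrays.

```python
from operator import itemgetter

# Made-up example data: each inner list is one row
table = [[3, 'c', 9.0],
         [1, 'a', 2.5],
         [2, 'b', 9.0]]

# The key function returns the value each row is compared by.
# sorted() leaves the input untouched and returns a new list.
by_first = sorted(table, key=itemgetter(0))

# A tuple key sorts by several columns: third column descending,
# then first column ascending to break ties
by_third_then_first = sorted(table, key=lambda row: (-row[2], row[0]))

# list.sort() sorts in place and returns None
returned = table.sort(key=itemgetter(1))
print(returned)  # None
```

The difference between the two calls matters in practice: assigning the result of list.sort() to a variable is a common mistake, because the sorted data stays in the original list.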

# the crazy * in zip()

I just wanted to reshape a list without using numpy, and I found this piece of crazy code on Stack Overflow:

    l = ['a', 'b', 'c', 'd', 'e', 'f']
    list(zip(*[iter(l)]*2))

Output: [('a', 'b'), ('c', 'd'), ('e', 'f')]

After thinking about it for a long time, I finally understood the code. Python programmers are really …
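For readers who want the trick spelled out, here is a step-by-step sketch, assuming Python 3. The names it, same_twice, chunks and chunk_size are chosen for this illustration only.

```python
l = ['a', 'b', 'c', 'd', 'e', 'f']

it = iter(l)           # one iterator over the list
same_twice = [it] * 2  # a list holding the SAME iterator twice, not two copies
assert same_twice[0] is same_twice[1]

# zip(*same_twice) is zip(it, it): for every output tuple, zip pulls
# two consecutive items from the one shared iterator
pairs = list(zip(*same_twice))
print(pairs)  # [('a', 'b'), ('c', 'd'), ('e', 'f')]


def chunks(seq, chunk_size):
    """General form of the idiom: group seq into tuples of chunk_size items."""
    return list(zip(*[iter(seq)] * chunk_size))
```

Because zip() stops as soon as one of its inputs is exhausted, leftover items that do not fill a complete tuple are silently dropped.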

# the asterisk * in python

The asterisk in Python is often very confusing for beginners. Sometimes it appears once before a variable, like *args; sometimes it appears twice, like **kwargs.

Usually it appears in a function definition:

def func(*args, **kwargs): pass

The * and ** prefixes allow a function to accept an arbitrary number of arguments. Here *args collects the positional arguments into a tuple, and **kwargs collects the keyword arguments into a dictionary.

    def func(*args, **kwargs):
        for member in args:
            print(member)
        for member in kwargs:
            print(member, "\t", kwargs[member])

Let's look at *args first. Call the function:

func(1,2,3)

You will get the output:

1

2

3

You can also call the function with *args. Here args should be a tuple or a list, and the * unpacks it into separate positional arguments:

    args = (1,2,3)
    func(*args)

You will get the same output:

1

2

3

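The same applies to **kwargs:

func(name = "pangpang", age = "12", hobby = "sleeping")

You will get output like this (Python 3.6 and later print the entries in the order they were passed; older versions may print them in a different order):

name pangpang

age 12

hobby sleeping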

This is equivalent to

    kwargs = {"name":"pangpang","age":"12", "hobby":"sleeping"}
    func(**kwargs)

You will get the same output as above.

I put the code in a Gist; you can download it and try it.
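To see the packing and unpacking side by side, here is a small sketch, assuming Python 3. The function describe and the variables numbers and person are made up for this example; they are not the code from the Gist.

```python
def describe(*args, **kwargs):
    """Return the received arguments so the packing is visible."""
    return args, kwargs


# Positional arguments are packed into a tuple, keyword arguments into a dict
print(describe(1, 2, 3, name="pangpang"))  # ((1, 2, 3), {'name': 'pangpang'})

# At the call site, * unpacks a sequence and ** unpacks a dict
numbers = [1, 2, 3]
person = {"name": "pangpang", "age": "12"}
unpacked = describe(*numbers, **person)
written_out = describe(1, 2, 3, name="pangpang", age="12")
print(unpacked == written_out)  # True
```

So the same symbol does opposite jobs depending on where it appears: in a definition it collects arguments, in a call it spreads them out.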

# normalize vectors

In data mining tasks, normalizing the attributes is usually a prerequisite for getting a meaningful result. For example, if you calculate the Euclidean distance, attributes with relatively large numerical values will influence the result more than attributes with relatively small values.

Formula for normalization:

normalized_value = (value - min) / (max - min), where

min: the minimum of the attribute

max: the maximum of the attribute

I wrote two versions of the normalization function. The first one uses map and a lambda function; the second one uses numpy.tile() and computes all elements on the whole matrix at once. The first one is more concise and needs less memory than the second one.
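The formula can be sketched in plain Python 3 as follows. This is not the author's code, only an illustration in the spirit of the map and lambda version; the function names, the sample data and the handling of constant columns are assumptions made for this example.

```python
def normalize_column(values):
    """Min-max normalize a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal, so the formula would divide by zero;
        # returning zeros is a choice made for this sketch
        return [0.0 for _ in values]
    return list(map(lambda v: (v - lo) / (hi - lo), values))


def normalize_rows(rows):
    """Normalize every attribute (column) of a list of records."""
    columns = zip(*rows)  # transpose: rows -> columns
    normalized = [normalize_column(list(column)) for column in columns]
    return [list(row) for row in zip(*normalized)]  # transpose back


# Made-up records with two attributes on very different scales
data = [[3.0, 104.0],
        [18.0, 90.0],
        [8.0, 97.0]]
result = normalize_rows(data)
print(result)
```

After normalization both attributes lie between 0 and 1, so neither dominates a distance calculation just because of its scale.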

# Solve the path problem in eclipse pydev

If you copy a bundle of Python code into Eclipse, PyDev usually cannot find the modules because their folder is not on the Python path. In this case, you have two ways to deal with the problem:

- right click the folder -> PyDev -> Set as Source Folder (add to python path)
- insert the path in code (after import sys): sys.path.insert(0, "your file path") or sys.path.append("your file path")

For example: sys.path.insert(0, "../smartphone")
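A relative path such as "../smartphone" is resolved against the current working directory, which may differ depending on how the script is started. The following sketch, assuming Python 3, converts the folder to an absolute path first and avoids adding it twice; the helper name add_to_python_path is made up for this example.

```python
import os
import sys


def add_to_python_path(folder):
    """Put folder at the front of sys.path unless it is already listed."""
    absolute = os.path.abspath(folder)
    if absolute not in sys.path:
        sys.path.insert(0, absolute)
    return absolute


added = add_to_python_path("../smartphone")
print(added in sys.path)  # True
```

Using insert(0, ...) makes Python search this folder before the others, while append() would search it last.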

# Different ways to calculate the euclidean distance in python

Python already offers many ways to compute the Euclidean distance, so you don't actually need to implement it yourself. But doing it by yourself is a very good programming exercise.

Summary:

- no numpy
- numpy.dot(vector.T, vector)
- numpy.linalg.norm(vector, ord, axis)
- scipy.spatial.distance.euclidean(vector1, vector2)
- matplotlib.pyplot.mlab.dist()
- It is also important to take care of error handling; this is a useful link for learning how to deal with errors.

I wrote the following script, just try it!

    import numpy as np
    import sys
    import math
    import matplotlib.pyplot as plt
    from scipy.spatial import distance

    vector1 = [3.0, 104.0]
    vector2 = [18.0, 90.0]
    vectorOther = [1.0, 2.0, 3.0]
    v1 = np.array(vector1)
    v2 = np.array(vector2)
    vOther = np.array(vectorOther)


    def diff_Length_Error():
        raise RuntimeWarning("The length of the two vectors are not the same!")


    def euclidean0_0(vector1, vector2):
        '''calculate the euclidean distance
        input: numpy.arrays or lists
        return: 1. squared distance, 2. euclidean distance
        '''
        quar_distance = 0
        try:
            if len(vector1) != len(vector2):
                diff_Length_Error()
            zipVector = zip(vector1, vector2)
            for member in zipVector:
                quar_distance += (member[1] - member[0]) ** 2
            return quar_distance, math.sqrt(quar_distance)
        except Exception as err:
            # report the problem and signal failure with (-1, -1)
            sys.stderr.write('WARNING: %s\n' % str(err))
            return -1, -1


    def euclidean0_1(vector1, vector2):
        '''calculate the euclidean distance, no numpy
        input: numpy.arrays or lists
        return: euclidean distance
        '''
        dist = [(a - b) ** 2 for a, b in zip(vector1, vector2)]
        dist = math.sqrt(sum(dist))
        return dist


    def euclidean2(vector1, vector2):
        '''calculate the euclidean distance, use numpy.dot() function
        input: numpy.arrays or lists
        return: 1. squared distance, 2. euclidean distance
        '''
        try:
            vector1, vector2 = list_to_npArray(vector1, vector2)
            diff = vector2 - vector1
            squareDistance = np.dot(diff.T, diff)
            return squareDistance, math.sqrt(squareDistance)
        except TypeError as e:
            print("Type error: {}".format(e))
            raise
        except ValueError as e:
            print("Value error: {}".format(e))
            raise
        except:
            print("Unexpected error:", sys.exc_info()[0])
            raise


    def euclidean3(vector1, vector2):
        '''use numpy.linalg.norm to calculate the euclidean distance.'''
        vector1, vector2 = list_to_npArray(vector1, vector2)
        # the third argument is the axis: "0" means along the columns, "1" along the rows
        dist = np.linalg.norm(vector1 - vector2, 2, 0)
        return dist


    def euclidean4(vector1, vector2):
        '''use scipy to calculate the euclidean distance.'''
        dist = distance.euclidean(vector1, vector2)
        return dist


    def euclidean5(vector1, vector2):
        '''use matplotlib.mlab to calculate the euclidean distance.'''
        vector1, vector2 = list_to_npArray(vector1, vector2)
        # mlab.dist was available in older matplotlib releases;
        # it may be missing from the version you have installed
        dist = plt.mlab.dist(vector1, vector2)
        return dist


    def list_to_npArray(vector1, vector2):
        '''convert the lists to numpy arrays'''
        if type(vector1) == list:
            vector1 = np.array(vector1)
        if type(vector2) == list:
            vector2 = np.array(vector2)
        return vector1, vector2
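The length check in euclidean0_0 reports a mismatch on stderr and returns (-1, -1). An alternative, sketched here under the assumption of Python 3 and the standard library only, is to raise ValueError and let the caller decide what to do. The name euclidean_checked is made up for this example.

```python
import math


def euclidean_checked(vector1, vector2):
    """Euclidean distance that raises ValueError on a length mismatch."""
    if len(vector1) != len(vector2):
        raise ValueError(
            "vectors differ in length: %d != %d" % (len(vector1), len(vector2)))
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vector1, vector2)))


# Same sample vectors as in the script: the result is sqrt(421), about 20.518
print(euclidean_checked([3.0, 104.0], [18.0, 90.0]))

# The caller handles the error explicitly
try:
    euclidean_checked([3.0, 104.0], [1.0, 2.0, 3.0])
except ValueError as err:
    print("WARNING: %s" % err)
```

Raising an exception keeps the return type consistent, so a caller can never mistake -1 for a real distance.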