Archive for March, 2015


Use np.ndarray.argsort() and list.sort() to get the rank of a number

I have a list and I want to get the rank of each member. This is important for many data mining algorithms; for example, in kNN (k-Nearest-Neighbor) you want to know the rank of a certain record.

For example:

myList = [2, 1, 7, 3, 6]

My expected result:

rank =[1, 0, 4, 2, 3]

Take care: I mean the rank of each value, not the indices that would sort the list.

Solution 1:

Step 1: Get the sorted indices of myList

  • Method 1: use numpy.ndarray.argsort() to get an array of the sorted indices
    Out: array([1,0,3,4,2])
  • Method 2: use list.sort(key=…) on a list of indices
    indices = list(range(5))
    Out: [1,0,3,4,2]
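
Both methods can be sketched as follows in Python 3. The lambda used as the key in Method 2 is an assumption about what the elided key stood for; it simply looks up the value that belongs to each index.

```python
import numpy as np

myList = [2, 1, 7, 3, 6]

# Method 1: argsort returns the indices that would sort the array.
indices_np = np.array(myList).argsort()

# Method 2: sort a list of positions, using the value at each position as key.
# (Assumed key function; the original key is not shown.)
indices_py = list(range(len(myList)))
indices_py.sort(key=lambda i: myList[i])

print(indices_np.tolist())  # [1, 0, 3, 4, 2]
print(indices_py)           # [1, 0, 3, 4, 2]
```

Both results say the same thing: the smallest value sits at position 1, the next one at position 0, and so on.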

Step 2: Turn the indices into ranks

  • indices = np.array(myList).argsort()
    [indices.tolist().index(i) for i in range(5)]

Solution 2:

You can do everything in one go:

  • ranking = [sorted(myList).index(each) for each in myList]
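
The one-liner sorts the list again for every element. A sketch of two cheaper alternatives, assuming all values are distinct (as in the example): applying argsort twice inverts the sorting permutation, and a dictionary built from one sorted pass does the same job without numpy.

```python
import numpy as np

myList = [2, 1, 7, 3, 6]

# Applying argsort twice turns "indices that sort" into "rank of each element".
rank_np = np.argsort(np.argsort(myList)).tolist()

# Pure Python: sort once, remember each value's position, then look it up.
# Assumes distinct values; duplicates would all map to the same position.
position = {value: i for i, value in enumerate(sorted(myList))}
rank_py = [position[value] for value in myList]

print(rank_np)  # [1, 0, 4, 2, 3]
print(rank_py)  # [1, 0, 4, 2, 3]
```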

sort 2d array in python

There are many ways to sort a 2D array.

In general, you can use sorted(), np.ndarray.sort(), …

But you should also understand:

  • what list1.sort(list2) means
  • ravel() vs. flatten()
  • what the key function in sorted() returns
  • what np.lexsort() and np.argsort() return
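
A small Python 3 sketch of some of these points. The sample data and variable names are made up for illustration; the list1.sort(list2) question is left out here.

```python
import numpy as np

rows = [[3, 1], [1, 2], [3, 0], [1, 1]]

# sorted() calls the key function once per row and orders the rows
# by the value it returns (here: the second column).
by_second = sorted(rows, key=lambda row: row[1])

arr = np.array(rows)

# np.argsort returns the indices that would sort the given 1D data.
order_first = np.argsort(arr[:, 0], kind="stable")

# np.lexsort also returns indices; the LAST key is the primary one.
# Here: sort by column 0, break ties with column 1.
order_lex = np.lexsort((arr[:, 1], arr[:, 0]))
sorted_lex = arr[order_lex]

# flatten() always returns a copy; ravel() returns a view when it can.
flat_copy = arr.flatten()
flat_view = arr.ravel()
flat_copy[0] = 99   # arr is unchanged
flat_view[0] = -1   # arr[0, 0] changes as well

print(by_second)            # [[3, 0], [3, 1], [1, 1], [1, 2]]
print(order_lex.tolist())   # [3, 1, 2, 0]
print(sorted_lex.tolist())  # [[1, 1], [1, 2], [3, 0], [3, 1]]
print(arr[0, 0])            # -1
```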


the crazy * in zip()

I just wanted to reshape a list without using numpy, and I found a piece of crazy code on Stack Overflow:

l = ['a', 'b', 'c', 'd', 'e', 'f']

output: [('a', 'b'), ('c', 'd'), ('e', 'f')]

After long consideration, I understood the code at last. Python programmers are really …
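
The snippet itself is not shown here. A well-known idiom that produces exactly this output, and possibly the one meant, is zip(*[iter(l)] * 2). A Python 3 sketch of why it works (in Python 3, zip returns an iterator, hence the list() call):

```python
l = ['a', 'b', 'c', 'd', 'e', 'f']

# Create ONE iterator and put the same object into a list twice.
it = iter(l)
same_iterator_twice = [it] * 2

# The * unpacks the list, so this is zip(it, it). For every output tuple,
# zip asks the same iterator for two items in a row, which pairs up
# consecutive elements.
pairs = list(zip(*same_iterator_twice))

print(pairs)  # [('a', 'b'), ('c', 'd'), ('e', 'f')]
```

The trick depends on both entries being the same iterator object; zip(iter(l), iter(l)) with two separate iterators would give ('a', 'a'), ('b', 'b'), and so on.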


the asterisk * in python

The asterisk in Python is usually very confusing for beginners. Sometimes it appears once before a variable, like *args; sometimes it appears twice, like **kwargs.

Usually it appears in function definition:

def func(*args1, **args2):

* and ** allow an arbitrary number of arguments. Inside the function, *args collects the positional arguments into a tuple, and **kwargs collects the keyword arguments into a dictionary.

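def func(*args, **kwargs):
    # args is a tuple of the positional arguments
    for member in args:
        print(member)
    # kwargs is a dictionary of the keyword arguments
    for member in kwargs:
        print("%s\t%s" % (member, kwargs[member]))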

Let's look at *args first. Call the function:

func(1, 2, 3)

You will get the output:

1
2
3

You can also call the function with *args. Here args should be a tuple or a list; the * unpacks the tuple or list into separate arguments:

args = (1,2,3)
func(*args)

You will get the same output.

**kwargs works analogously:

func(name = "pangpang", age = "12", hobby = "sleeping")

You will get the output (the order of the lines depends on the Python version):

hobby sleeping
age 12
name pangpang

This is equivalent to

kwargs = {"name":"pangpang","age":"12", "hobby":"sleeping"}
func(**kwargs)

You will get the same output as above.
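
To check what the function really receives, here is a small Python 3 sketch. The helper name describe is made up for this illustration; it just returns what it was given.

```python
def describe(*args, **kwargs):
    """Return the collected positional and keyword arguments unchanged."""
    return args, kwargs

# Positional arguments arrive as a tuple, keyword arguments as a dict.
args_seen, kwargs_seen = describe(1, 2, 3, name="pangpang", age="12")
print(type(args_seen).__name__, args_seen)      # tuple (1, 2, 3)
print(type(kwargs_seen).__name__, kwargs_seen)  # dict {'name': 'pangpang', 'age': '12'}

# Unpacking at the call site gives the same result as writing the arguments out.
args = [1, 2, 3]
kwargs = {"name": "pangpang", "age": "12"}
same = describe(*args, **kwargs)
print(same == (args_seen, kwargs_seen))         # True
```

Note that args is a tuple inside the function even when a list was unpacked at the call site.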

I put the code in a Gist; you can download it and try it.


normalize vectors

In data mining tasks, normalizing the attributes is usually a prerequisite for getting a meaningful result. For example, if you calculate the Euclidean distance, attributes with relatively large numerical values will have more influence on the result than attributes with relatively small values.

Formula for normalization:

normalized_value = (value - min) / (max - min), where

min: the minimum of the attribute

max: the maximum of the attribute

I wrote two versions of the normalization function. The first one uses map and a lambda function; the second one uses numpy.tile() and computes all elements on the whole matrix at once. The first one is more concise and needs less memory than the second one.
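
The two functions themselves are not reproduced here. The following Python 3 sketch shows what such a pair could look like; the names normalize_map and normalize_tile and the sample data are assumptions, and both versions assume that max > min in every column (otherwise the division fails).

```python
import numpy as np

data = [[1.0, 200.0], [2.0, 400.0], [3.0, 800.0]]

def normalize_map(rows):
    """Min-max normalize each column with map and lambda (no numpy)."""
    columns = list(zip(*rows))            # transpose: one tuple per attribute
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    return [
        list(map(lambda v, lo, hi: (v - lo) / (hi - lo), row, mins, maxs))
        for row in rows
    ]

def normalize_tile(rows):
    """Min-max normalize each column by tiling min and range to matrix shape."""
    matrix = np.array(rows, dtype=float)
    n_rows = matrix.shape[0]
    mins = matrix.min(axis=0)
    ranges = matrix.max(axis=0) - mins
    # np.tile repeats the 1D vectors n_rows times so they match the matrix.
    return (matrix - np.tile(mins, (n_rows, 1))) / np.tile(ranges, (n_rows, 1))

print(normalize_map(data)[1])  # [0.5, 0.3333333333333333]
print(normalize_tile(data)[1])
```

The tiled version builds two extra matrices of the same size as the data, which is where the additional memory goes.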


Solve the path problem in eclipse pydev

If you copy a bundle of Python code into Eclipse, PyDev usually cannot find the paths. In this case, you have two ways to deal with the problem.

  1. Right-click the folder -> PyDev -> Set as Source Folder (add to PYTHONPATH)
  2. Insert the path in code: sys.path.insert(0, "your file path") or sys.path.append("your file path")
    for example: sys.path.insert(0, "../smartphone")
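
A minimal sketch of the second way. The folder "../smartphone" is the example from above; a relative path like this is resolved against the current working directory, so converting it to an absolute path first makes the behaviour easier to predict.

```python
import os
import sys

# "../smartphone" is only the example folder; replace it with your own.
source_dir = os.path.abspath("../smartphone")

# Position 0 means this folder is searched first, before other entries.
if source_dir not in sys.path:
    sys.path.insert(0, source_dir)

print(sys.path[0])
```

With sys.path.append() the folder is searched last instead, which matters only when a module name exists in more than one place.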

Different ways to calculate the euclidean distance in python

There are already many ways to compute the Euclidean distance in Python, so you don't actually need to write it yourself. But doing it yourself is a very good programming exercise.


I wrote the following script, just try it!

import numpy as np
import sys
import math
import matplotlib.pyplot as plt
from scipy.spatial import distance

vector1 = [3.0, 104.0]
vector2 = [18.0, 90.0]
vectorOther = [1.0, 2.0, 3.0]

v1 = np.array(vector1)
v2 = np.array(vector2)
vOther = np.array(vectorOther)

def diff_Length_Error():
    raise RuntimeWarning("The lengths of the two vectors are not the same!")

def euclidean0_0(vector1, vector2):
    ''' calculate the euclidean distance
    input: numpy.arrays or lists
    return: 1. squared distance, 2. euclidean distance
    '''
    quar_distance = 0
    try:
        # vectors of different length cannot be compared element by element
        if len(vector1) != len(vector2):
            diff_Length_Error()
        zipVector = zip(vector1, vector2)

        # sum up the squared differences of each pair of elements
        for member in zipVector:
            quar_distance += (member[1] - member[0]) ** 2

        return quar_distance, math.sqrt(quar_distance)

    except Exception as err:
        sys.stderr.write('WARNING: %s\n' % str(err))
        return -1, -1

def euclidean0_1(vector1, vector2):
    '''calculate the euclidean distance, no numpy
    input: numpy.arrays or lists
    return: euclidean distance
    '''
    dist = [(a - b)**2 for a, b in zip(vector1, vector2)]
    dist = math.sqrt(sum(dist))
    return dist

def euclidean2(vector1, vector2):
    '''calculate the euclidean distance, use function
    input: numpy.arrays or lists
    return: 1. squared distance, 2. euclidean distance
    '''
    try:
        if type(vector1) == list:
            vector1 = np.array(vector1)
        if type(vector2) == list:
            vector2 = np.array(vector2)
        diff = vector2 - vector1
        # squared distance as the dot product of diff with itself
        # (np.dot is assumed here; the call is garbled in the source)
        squareDistance = np.dot(diff, diff)
        return squareDistance, math.sqrt(squareDistance)
    except TypeError as e:
        print("Type error: {}".format(e))
    except ValueError as e:
        print("Value error: {}".format(e))
    except Exception:
        print("Unexpected error: {}".format(sys.exc_info()[0]))

def euclidean3(vector1, vector2):
    ''' use numpy.linalg.norm to calculate the euclidean distance. '''
    vector1, vector2 = list_to_npArray(vector1, vector2)
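    # second argument 2: use the 2-norm; third argument: the axis along which
    # the norm is computed (0 for a 1D vector; for a matrix, 0 gives one norm
    # per column and 1 gives one norm per row)
    distance = np.linalg.norm(vector1-vector2, 2, 0)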
    return distance

def euclidean4(vector1, vector2):
    ''' use scipy to calculate the euclidean distance. '''
    dist = distance.euclidean(vector1, vector2)
    return dist

def euclidean5(vector1, vector2):
    ''' use matplotlib.mlab to calculate the euclidean distance. '''
    vector1, vector2 = list_to_npArray(vector1, vector2)
    # note: mlab.dist is only available in old matplotlib releases
    dist = plt.mlab.dist(vector1, vector2)
    return dist

def list_to_npArray(vector1, vector2):
    '''convert the list to numpy array'''
    if type(vector1) == list:
        vector1 = np.array(vector1)
    if type(vector2) == list:
        vector2 = np.array(vector2)
    return vector1, vector2
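
The script above only defines the functions. As a quick sanity check, here is a minimal Python 3 sketch that uses just the standard library and the two sample vectors from the top of the script. The function name euclidean is made up for this example, and it raises ValueError on a length mismatch instead of returning -1.

```python
import math

def euclidean(vector1, vector2):
    """Euclidean distance between two equal-length sequences of numbers."""
    if len(vector1) != len(vector2):
        raise ValueError("The lengths of the two vectors are not the same!")
    # square root of the sum of squared element-wise differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vector1, vector2)))

vector1 = [3.0, 104.0]
vector2 = [18.0, 90.0]

d = euclidean(vector1, vector2)
print(d)  # sqrt(15**2 + 14**2) = sqrt(421), roughly 20.518

# Cross-check for the 2D case with the standard library.
print(math.hypot(18.0 - 3.0, 90.0 - 104.0))
```

Whichever of the variants you try, it should return this value for the two sample vectors.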