E.g.

X= [[1,0,0,0,1], [1,0,0,1,1],[1,0,0,0,0],[1,0,0,1,1]]

The output should be

X= [[1,0,1], [1,1,1],[1,0,0],[1,1,1]]

Best answer

I am not aware of a single function that does this in one line, but there are several ways to do it. The following code keeps only the columns that contain a 1 in more than 60% of the rows and drops the rest.

>>> X
<3x3 sparse matrix of type '<type 'numpy.int32'>'
        with 6 stored elements in Compressed Sparse Row format>
>>> X.toarray()
array([[1, 0, 1],
       [0, 0, 1],
       [1, 1, 1]])

>>> import numpy as np
>>> import scipy.stats
>>> row, col = X.nonzero()  # row and column indices of the nonzero entries
>>> row
array([0, 0, 1, 2, 2, 2])
>>> col
array([0, 2, 2, 0, 1, 2])
>>> t = []
>>> for v, vv in scipy.stats.itemfreq(col):  # (column index, count) pairs; keep columns with a 1 in >60% of rows
...     if float(vv) / len(np.unique(row)) > 0.6:
...         t.append(v)
...
>>> t
[0, 2]

>>> X[:, t]
<3x2 sparse matrix of type '<type 'numpy.int32'>'
        with 5 stored elements in Compressed Sparse Row format>
>>> X[:, t].toarray()
array([[1, 1],
       [0, 1],
       [1, 1]])

>>> X  # the original matrix is left unchanged
<3x3 sparse matrix of type '<type 'numpy.int32'>'
        with 6 stored elements in Compressed Sparse Row format>
>>> X.toarray()
array([[1, 0, 1],
       [0, 0, 1],
       [1, 1, 1]])
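The same selection can be sketched without the loop. `scipy.stats.itemfreq` has since been removed from SciPy, so this version counts the nonzero entries per column with `np.bincount` instead. The function name `frequent_columns` and the parameter `min_fraction` are made up for illustration. One assumption differs from the transcript above: the count is divided by the total number of rows, not by the number of rows that contain at least one nonzero value. The two agree for this example.

```python
import numpy as np
from scipy import sparse


def frequent_columns(X, min_fraction=0.6):
    """Return indices of columns whose share of nonzero rows exceeds min_fraction.

    Hypothetical helper; works for SciPy sparse matrices and 2-D NumPy arrays.
    """
    n_rows, n_cols = X.shape
    _, col = X.nonzero()                         # column index of every nonzero entry
    counts = np.bincount(col, minlength=n_cols)  # nonzero entries per column
    # Assumption: the denominator is the total number of rows.
    return np.flatnonzero(counts / n_rows > min_fraction)


X = sparse.csr_matrix(np.array([[1, 0, 1],
                                [0, 0, 1],
                                [1, 1, 1]]))

keep = frequent_columns(X)
X_reduced = X[:, keep]       # X itself is not modified

print(keep)                  # [0 2]
print(X_reduced.toarray())
```

Because the threshold is a parameter, calling `frequent_columns(X, min_fraction=0.0)` would select every column that has at least one nonzero entry.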

If you want to remove the columns that are 0 in every row:

>>> X = np.array([[1,0,0,0,1], [1,0,0,1,1], [1,0,0,0,0], [1,0,0,1,1]])
>>> X
array([[1, 0, 0, 0, 1],
       [1, 0, 0, 1, 1],
       [1, 0, 0, 0, 0],
       [1, 0, 0, 1, 1]])
>>> row, col = np.nonzero(X)
>>> c = np.unique(col)  # columns that contain at least one nonzero value
>>> c
array([0, 3, 4], dtype=int64)
>>> X[:, c]
array([[1, 0, 1],
       [1, 1, 1],
       [1, 0, 0],
       [1, 1, 1]])
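For a dense NumPy array, a boolean mask gives the same result in a single indexing step. This is only a sketch of an alternative, and the names `mask` and `X_reduced` are illustrative. For a SciPy sparse matrix, a mask such as `X.getnnz(axis=0) > 0` should play the same role, with the caveat that it also counts explicitly stored zeros.

```python
import numpy as np

X = np.array([[1, 0, 0, 0, 1],
              [1, 0, 0, 1, 1],
              [1, 0, 0, 0, 0],
              [1, 0, 0, 1, 1]])

# True for every column that has at least one nonzero entry.
mask = X.any(axis=0)

# Boolean indexing keeps only the columns where mask is True.
X_reduced = X[:, mask]
print(X_reduced)
```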