Skip to content

How to filter a Pandas series while writing to list

An answer to this question on Stack Overflow.

Question

For a given column 'A' in a pandas dataframe 'tbl', I have been trying to write the column .tolist() and filter out the items that are '.'. This works, but doesn't seem very readable:

list_of_A = tbl['A'][~tbl['A'].isin(['.'])].tolist()

Also, checking a list of one seems needlessly slow, though str.contains('.') seems like it would be slower due to pattern matching. Am I missing a better method?

Update: @jpp, @piRSquared and @Scott-Boston all had good methods, so it came down to a test:

>>> tbl = pd.DataFrame(np.random.randn(50000, 3), columns=list('ABC'))
>>> tbl.loc[tbl.sample(10000).index, 'A'] = '.'
>>> min(timeit.repeat("list_of_A = tbl.loc[tbl['A'].ne('.'), 'A'].tolist()", repeat=1000, number=100, globals=globals()))
0.37328900000102294
>>> min(timeit.repeat("list_of_A = tbl.A.values[tbl.A.values != '.'].tolist()", repeat=1000, number=100, globals=globals()))
0.1470019999997021
>>> min(timeit.repeat("tbl.query('A != \".\"')['A'].tolist()", repeat=1000, number=100, globals=globals()))
0.45748099999946135

Discovering .loc has opened up a world of possibilities for me, but to smash 'n' grab a filtered column list, looks like converting to an ndarray is fastest.

Answer

Using a temporary variable and two lines of code makes this more readable:

exclude   = tbl['A'].isin(['.'])
list_of_A = tbl['A'][~exclude].tolist()