Subset dask dataframe by column position

Once I have a dask dataframe, how can I selectively pull columns into an in-memory pandas DataFrame? Say I have an N x M dataframe. How can I create an N x m dataframe, where m << M and the choice of columns is arbitrary?

from sklearn.datasets import load_iris
import pandas as pd
import dask.dataframe as dd
d = load_iris()
df = pd.DataFrame(d.data)
ddf = dd.from_pandas(df, chunksize=100)

What I would like to do:

in_memory = ddf.iloc[:,2:4].compute()

What I have been able to do:

ddf.map_partitions(lambda x: x.iloc[:,2:4]).compute()

map_partitions works but it was quite slow on a file that wasn't very large. I hope I am missing something very obvious.

1 Answer

Although iloc is not implemented for dask dataframes, you can get the same result easily enough by looking up the column names at those positions and selecting by name:

cols = list(ddf.columns[2:4])
ddf[cols].compute()

This has the additional benefit that dask immediately knows the types of the selected columns and needs to do no additional work. With the map_partitions variant, dask at the least needs to check the data types produced, since the function you call is completely arbitrary.
