Spark - SELECT WHERE or filtering?

What's the difference between selecting with a where clause and filtering in Spark?
Are there any use cases in which one is more appropriate than the other one?

When do I use

DataFrame newdf = df.select(df.col("*")).where(df.col("somecol").leq(10))

and when is

DataFrame newdf = df.select(df.col("*")).filter("somecol <= 10")

more appropriate?

2 Answers

According to spark documentation "where() is an alias for filter()"

filter(condition)Filters rows using the given condition.where() is an alias for filter().

Parameters: condition – a Column of types.BooleanType or a string of SQL expression.

>>> df.filter(df.age > 3).collect()
[Row(age=5, name=u'Bob')]
>>> df.where(df.age == 2).collect()
[Row(age=2, name=u'Alice')]
>>> df.filter("age > 3").collect()
[Row(age=5, name=u'Bob')]
>>> df.where("age = 2").collect()
[Row(age=2, name=u'Alice')]

As Yaron mentioned, there isn't any difference between where and filter.

filter is an overloaded method that takes a column or string argument. The performance is the same, regardless of the syntax you use.

We can use explain() to see that all the different filtering syntaxes generate the same Physical Plan. Suppose you have a dataset with person_name and person_country columns. All of the following code snippets will return the same Physical Plan below:

df.where("person_country = 'Cuba'").explain()
df.where($"person_country" === "Cuba").explain()
df.where('person_country === "Cuba").explain()
df.filter("person_country = 'Cuba'").explain()

These all return this Physical Plan:

== Physical Plan ==
*(1) Project [person_name#152, person_country#153]
+- *(1) Filter (isnotnull(person_country#153) && (person_country#153 = Cuba)) +- *(1) FileScan csv [person_name#152,person_country#153] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/Users/matthewpowers/Documents/code/my_apps/mungingdata/spark2/src/test/re..., PartitionFilters: [], PushedFilters: [IsNotNull(person_country), EqualTo(person_country,Cuba)], ReadSchema: struct<person_name:string,person_country:string>

The syntax doesn't change how filters are executed under the hood, but the file format / database that a query is executed on does. Spark will execute the same query differently on Postgres (predicate pushdown filtering is supported), Parquet (column pruning), and CSV files. See here for more details.

Star Pop News

Spark - SELECT WHERE or filtering?

2 Answers

Your Answer

Sign up or log in

Post as a guest

You Might Also Like

How do I get to the second floor of the Lecture Building?

Order of Endings in NG for Sekiro

How to get the new minecraft launcher GUI

How can I safely pilot the Cyclops?