PySpark Cheat Sheet


PySpark is a powerful open-source data processing framework that lets developers work with large datasets in a distributed computing environment. It is built on top of Apache Spark, a fast, general-purpose cluster computing system with in-memory data processing capabilities.

PySpark is the Python API for Spark, which makes it easy to write Spark applications in Python. It offers a wide range of features, including support for SQL queries, machine learning algorithms, graph processing, and streaming data processing.

One of the key benefits of PySpark is its ability to handle large datasets efficiently. It distributes data across multiple nodes in a cluster, allowing data to be processed in parallel. This makes it well suited to big data applications, where traditional single-machine tools struggle with the volume of data.

This cheat sheet provides a quick reference for common PySpark commands and functions, organized into tables by theme.
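
Every snippet below assumes an active SparkSession named spark. A minimal bootstrap might look like the following sketch; the application name and the local[*] master are placeholder choices, not requirements:

```python
from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession; "local[*]" runs Spark on all local cores.
# The app name is just a label shown in the Spark UI and logs.
spark = (
    SparkSession.builder
    .appName("pyspark-cheat-sheet")   # arbitrary label
    .master("local[*]")               # omit when submitting to a cluster
    .getOrCreate()
)
```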

Table of Contents

- DataFrames
- SQL
- Aggregation
- Joins
- Window Functions
- Machine Learning
- Reference

DataFrames

| Command | Description |
| --- | --- |
| spark.read.csv(path) | Read a CSV file into a DataFrame |
| df.show() | Display the first 20 rows of a DataFrame |
| df.printSchema() | Print the schema of a DataFrame |
| df.select(col) | Select a column from a DataFrame |
| df.filter(condition) | Filter rows based on a condition |
| df.groupBy(col) | Group rows by a column |
| df.orderBy(col) | Sort rows by a column |
| df.join(other_df, on) | Join two DataFrames on a column |
| df.withColumn(col, expr) | Add a new column to a DataFrame |
| df.drop(col) | Drop a column from a DataFrame |
| df.na.fill(value) | Fill missing values with a specified value |
| df.write.csv(path) | Write a DataFrame to a CSV file |
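
A short sketch tying several of these commands together. The file paths, column names, and filter condition are illustrative assumptions, not part of any standard dataset:

```python
from pyspark.sql import functions as F

# Read a CSV file with a header row and inferred column types
# (data/people.csv and its columns name/age/city are hypothetical).
df = spark.read.csv("data/people.csv", header=True, inferSchema=True)

df.printSchema()   # inspect the inferred schema
df.show(5)         # preview the first 5 rows

# Select, filter, derive a new column, then drop one
adults = (
    df.select("name", "age", "city")
      .filter(F.col("age") >= 18)
      .withColumn("age_plus_one", F.col("age") + 1)
      .drop("city")
)

# Fill missing ages, sort, and write the result back out
adults.na.fill({"age": 0}).orderBy("age") \
      .write.csv("out/adults", header=True, mode="overwrite")
```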

SQL

| Command | Description |
| --- | --- |
| spark.sql(query) | Run a SQL query against registered views/tables and return the result as a DataFrame |
| df.createOrReplaceTempView(view_name) | Create (or replace) a temporary view of a DataFrame |
| spark.catalog.listTables() | List all tables and views in the current database |
| spark.catalog.dropTempView(view_name) | Drop a temporary view |
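
A brief sketch of the SQL workflow, reusing the hypothetical df from the DataFrames example; the view name and query are placeholders:

```python
# Register the DataFrame as a temporary view so SQL can reference it
df.createOrReplaceTempView("people")

# Run a SQL query against the view; the result is a new DataFrame
over_30 = spark.sql("SELECT name, age FROM people WHERE age > 30")
over_30.show()

# Inspect and clean up catalog entries
print(spark.catalog.listTables())
spark.catalog.dropTempView("people")
```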

Aggregation

| Command | Description |
| --- | --- |
| from pyspark.sql import functions as F | Import the built-in aggregate functions |
| df.count() | Count the number of rows in a DataFrame |
| df.agg(F.sum(col)) | Calculate the sum of a column |
| df.agg(F.avg(col)) | Calculate the average of a column |
| df.agg(F.min(col)) | Find the minimum value of a column |
| df.agg(F.max(col)) | Find the maximum value of a column |
| df.agg(expr, ...) | Perform multiple aggregations on a DataFrame |
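
A sketch of these aggregations, again using the hypothetical people DataFrame (columns age and city are assumptions from the earlier example):

```python
from pyspark.sql import functions as F

# Whole-DataFrame aggregations: row count plus several aggregates in one pass
print(df.count())
df.agg(
    F.sum("age").alias("total_age"),
    F.avg("age").alias("average_age"),
    F.min("age").alias("youngest"),
    F.max("age").alias("oldest"),
).show()

# The same aggregates computed per group
df.groupBy("city").agg(F.avg("age").alias("average_age")).show()
```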

Joins

| Command | Description |
| --- | --- |
| df.join(other_df, on) | Join two DataFrames on a column |
| df.crossJoin(other_df) | Perform a cross join (Cartesian product) between two DataFrames |
| df.union(other_df) | Combine two DataFrames with the same schema |
| df.intersect(other_df) | Find the common rows between two DataFrames |
| df.subtract(other_df) | Find the rows in the first DataFrame that are not in the second (except is a reserved word in Python, so PySpark exposes subtract and exceptAll instead) |
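
A sketch of these operations on two DataFrames; the shared id column and the assumption that both DataFrames have the same schema are illustrative:

```python
# Inner join on a shared column; pass how="left", "outer", etc. for other join types
joined = df.join(other_df, on="id", how="inner")

# Cartesian product of the two DataFrames (can be very large)
pairs = df.crossJoin(other_df)

# Set-style operations require both DataFrames to share the same schema
both      = df.union(other_df)       # all rows from both (duplicates kept)
common    = df.intersect(other_df)   # distinct rows present in both
only_left = df.subtract(other_df)    # distinct rows in df that are not in other_df
```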

Window Functions

| Command | Description |
| --- | --- |
| from pyspark.sql.window import Window | Import the Window class |
| Window.partitionBy(col) | Define the partitioning column for a window |
| Window.orderBy(col) | Define the ordering column for a window |
| Window.rowsBetween(start, end) | Define the range of rows for a window |
| df.withColumn(col, expr.over(window)) | Add a column computed with a window function |
| df.selectExpr(expr) | Select columns and apply SQL expressions |
| df.selectExpr("avg(col) OVER (PARTITION BY partition_col ORDER BY order_col ROWS BETWEEN start AND end)") | Apply a window function through a SQL expression |

Machine Learning

| Command | Description |
| --- | --- |
| from pyspark.ml.feature import VectorAssembler | Import the VectorAssembler class |
| VectorAssembler(inputCols, outputCol) | Combine multiple columns into a single feature vector |
| from pyspark.ml.classification import LogisticRegression | Import the LogisticRegression class |
| LogisticRegression(featuresCol, labelCol) | Create a logistic regression estimator (call .fit(df) to train a model) |
| from pyspark.ml.evaluation import BinaryClassificationEvaluator | Import the BinaryClassificationEvaluator class |
| BinaryClassificationEvaluator(labelCol, rawPredictionCol) | Evaluate the performance of a binary classification model |
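
A compact sketch of the pipeline implied by this table. The input columns (age, income), the binary label column, and the train/test split are illustrative assumptions:

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Assemble raw numeric columns into a single features vector
# (the input column names and the "label" column are hypothetical)
assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")
data = assembler.transform(df).select("features", "label")

train, test = data.randomSplit([0.8, 0.2], seed=42)

# Fit a logistic regression model on the training split
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(train)

# Score the test split and evaluate with area under the ROC curve (the default metric)
predictions = model.transform(test)
evaluator = BinaryClassificationEvaluator(labelCol="label",
                                          rawPredictionCol="rawPrediction")
print(evaluator.evaluate(predictions))
```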

Reference:

https://spark.apache.org/docs/latest/api/python/