PySpark Cheat Sheet | Cheatsheetindex

Pyspark is a powerful open-source data processing framework that allows developers to work with large datasets in a distributed computing environment. It is built on top of Apache Spark, a fast and general-purpose cluster computing system that provides in-memory data processing capabilities.

Pyspark provides a Python API for Spark, which makes it easy for developers to write Spark applications using Python. It offers a wide range of features, including support for SQL queries, machine learning algorithms, graph processing, and streaming data processing.

One of the key benefits of Pyspark is its ability to handle large datasets efficiently. It can distribute data across multiple nodes in a cluster, allowing for parallel processing of data. This makes it ideal for big data applications, where traditional data processing tools may struggle to handle the volume of data.

This cheat sheet provides a quick reference for PySpark commands and functions. It is divided into different tables based on the themes.

DataFrames
SQL
Aggregation
Joins
Window Functions
Machine Learning

DataFrames

Command	Description
`spark.read.csv(path)`	Read a CSV file into a DataFrame
`df.show()`	Display the first 20 rows of a DataFrame
`df.printSchema()`	Print the schema of a DataFrame
`df.select(col)`	Select a column from a DataFrame
`df.filter(condition)`	Filter rows based on a condition
`df.groupBy(col)`	Group rows by a column
`df.orderBy(col)`	Sort rows by a column
`df.join(other_df, on)`	Join two DataFrames on a column
`df.withColumn(col, expr)`	Add a new column to a DataFrame
`df.drop(col)`	Drop a column from a DataFrame
`df.na.fill(value)`	Fill missing values with a specified value
`df.write.csv(path)`	Write a DataFrame to a CSV file

SQL

Command	Description
`spark.sql(query)`	Execute a SQL query on a DataFrame
`df.createOrReplaceTempView(view_name)`	Create a temporary view of a DataFrame
`spark.catalog.listTables()`	List all tables in the current database
`spark.catalog.dropTempView(view_name)`	Drop a temporary view

Aggregation

Command	Description
`df.count()`	Count the number of rows in a DataFrame
`df.sum(col)`	Calculate the sum of a column
`df.avg(col)`	Calculate the average of a column
`df.min(col)`	Find the minimum value of a column
`df.max(col)`	Find the maximum value of a column
`df.agg(expr)`	Perform multiple aggregations on a DataFrame

Joins

Command	Description
`df.join(other_df, on)`	Join two DataFrames on a column
`df.crossJoin(other_df)`	Perform a cross join between two DataFrames
`df.union(other_df)`	Combine two DataFrames with the same schema
`df.intersect(other_df)`	Find the common rows between two DataFrames
`df.except(other_df)`	Find the rows in the first DataFrame that are not in the second DataFrame

Window Functions

Command	Description
`from pyspark.sql.window import Window`	Import the Window class
`Window.partitionBy(col)`	Define the partitioning column for a window
`Window.orderBy(col)`	Define the ordering column for a window
`Window.rowsBetween(start, end)`	Define the range of rows for a window
`df.withColumn(col, expr)`	Add a new column to a DataFrame
`df.selectExpr(expr)`	Select columns and apply expressions
`df.selectExpr(""avg(col) OVER (PARTITION BY partition_col ORDER BY order_col ROWS BETWEEN start AND end)"")`	Apply a window function to a DataFrame

Machine Learning

Command	Description
`from pyspark.ml.feature import VectorAssembler`	Import the VectorAssembler class
`VectorAssembler(inputCols, outputCol)`	Combine multiple columns into a single vector
`from pyspark.ml.classification import LogisticRegression`	Import the LogisticRegression class
`LogisticRegression(featuresCol, labelCol)`	Train a logistic regression model
`from pyspark.ml.evaluation import BinaryClassificationEvaluator`	Import the BinaryClassificationEvaluator class
`BinaryClassificationEvaluator(labelCol, rawPredictionCol)`	Evaluate the performance of a binary classification model

Reference:

https://spark.apache.org/docs/latest/api/python/

Table of Contents

DataFrames

SQL

Aggregation

Joins

Window Functions

Machine Learning

Related Posts

Streamlit Cheat Sheet

NumPy Cheat Sheet

Python Regex Cheat Sheet

Python Data Structures Cheat Sheet