Skip to content

Spark

Installation

pip install "sqlframe[spark]"

Enabling SQLFrame

SQLFrame can be enabled by directly importing the sqlframe.spark package. If converting a PySpark pipeline, all pyspark.sql should be replaced with sqlframe.spark. In addition, many classes will have a Spark prefix. For example, SparkDataFrame instead of DataFrame.

# PySpark import
# from pyspark.sql import SparkSession
# from pyspark.sql import functions as F
# from pyspark.sql.dataframe import DataFrame
# SQLFrame import
from sqlframe.spark import SparkSession
from sqlframe.spark import functions as F
from sqlframe.spark import SparkDataFrame

Creating a Session

SQLFrame's SparkSession is created the same way you would normally create a SparkSession. The configuration you apply to the builder will be applied the the SparkSession that SQLFrame will create.

from sqlframe.spark import SparkSession

spark = SparkSession.builder.appName("MyApp").getOrCreate()

# Now you can use SQLFrame

Example Usage

from sqlframe.spark import SparkSession
from sqlframe.spark import functions as F


spark = SparkSession.builder.appName("MyApp").getOrCreate()

df = (
    spark.table("insights.covid19_metrics_by_country")
    .where(F.col("date").between("2021-01-01", "2021-12-31"))
    .groupby("continent", "country")
    .agg(
        F.sum("total_case").alias("total_cases"),
        F.sum("total_deaths").alias("total_deaths"),
        (F.sum("total_deaths").cast("float") / F.nullif(F.sum("total_case"), F.lit(0))).alias("death_rate"),
    )
    .orderBy(F.col("death_rate").desc_nulls_last())
)
# or do `print(df.sql())` to print the SQL 
df.show(5)
"""
+---------------+---------+-------------+--------------+---------------+
|   continent   | country | total_cases | total_deaths |   death_rate  |
+---------------+---------+-------------+--------------+---------------+
|      Asia     |  Yemen  |   2372121   |    475483    |  0.2004463516 |
| South America |   Peru  |  675530177  |   62217492   | 0.09210172116 |
| North America |  Mexico |  1039445357 |   89674958   | 0.08627193089 |
|     Africa    |  Sudan  |   13109385  |    929258    | 0.07088494235 |
|      Asia     |  Syria  |   9963515   |    654570    | 0.06569669439 |
+---------------+---------+-------------+--------------+---------------+
"""

Catalog Class

Column Class

DataFrame Class

Functions

All functions through 4.0 are supported. Please open an issue if you encounter a function not implemented.

GroupedData Class

DataFrameReader Class

DataFrameWriter Class

SparkSession Class

DataTypes

Window Class

WindowSpec Class