Spark
Installation
Enabling SQLFrame
SQLFrame can be enabled by directly importing the sqlframe.spark package.
If converting a PySpark pipeline, all pyspark.sql should be replaced with sqlframe.spark.
In addition, many classes will have a Spark prefix.
For example, SparkDataFrame instead of DataFrame.
# PySpark import
# from pyspark.sql import SparkSession
# from pyspark.sql import functions as F
# from pyspark.sql.dataframe import DataFrame
# SQLFrame import
from sqlframe.spark import SparkSession
from sqlframe.spark import functions as F
from sqlframe.spark import SparkDataFrame
Creating a Session
SQLFrame's SparkSession is created the same way you would normally create a SparkSession. The configuration you apply to the builder will be applied the the SparkSession that SQLFrame will create.
from sqlframe.spark import SparkSession
spark = SparkSession.builder.appName("MyApp").getOrCreate()
# Now you can use SQLFrame
Example Usage
from sqlframe.spark import SparkSession
from sqlframe.spark import functions as F
spark = SparkSession.builder.appName("MyApp").getOrCreate()
df = (
spark.table("insights.covid19_metrics_by_country")
.where(F.col("date").between("2021-01-01", "2021-12-31"))
.groupby("continent", "country")
.agg(
F.sum("total_case").alias("total_cases"),
F.sum("total_deaths").alias("total_deaths"),
(F.sum("total_deaths").cast("float") / F.nullif(F.sum("total_case"), F.lit(0))).alias("death_rate"),
)
.orderBy(F.col("death_rate").desc_nulls_last())
)
# or do `print(df.sql())` to print the SQL
df.show(5)
"""
+---------------+---------+-------------+--------------+---------------+
| continent | country | total_cases | total_deaths | death_rate |
+---------------+---------+-------------+--------------+---------------+
| Asia | Yemen | 2372121 | 475483 | 0.2004463516 |
| South America | Peru | 675530177 | 62217492 | 0.09210172116 |
| North America | Mexico | 1039445357 | 89674958 | 0.08627193089 |
| Africa | Sudan | 13109385 | 929258 | 0.07088494235 |
| Asia | Syria | 9963515 | 654570 | 0.06569669439 |
+---------------+---------+-------------+--------------+---------------+
"""
Catalog Class
- add_table
- SQLFrame Specific: Adds a table to known schemas that SQLFrame tracks
- currentCatalog
- currentDatabase
- databaseExists
- functionExists
- getDatabase
- getFunction
- getTable
- get_columns
- SQLFrame Specific: Similar to
listColumnsbut returns SQLGlot expressions instead
- SQLFrame Specific: Similar to
- get_columns_from_schema
- SQLFrame Specific: Gets the columns from the known schemas to SQLFrame
- listCatalogs
- listColumns
- listDatabases
- listFunctions
- listTables
- setCurrentCatalog
- setCurrentDatabase
- tableExists
Column Class
- alias
- alias
- asc
- asc_nulls_first
- asc_nulls_last
- between
- cast
- desc
- desc_nulls_first
- desc_nulls_last
- endswith
- ilike
- isNotNull
- isNull
- isin
- like
- otherwise
- over
- rlike
- sql
- SQLFrame Specific: Get the SQL representation of a given column
- startswith
- substr
- when
DataFrame Class
- agg
- alias
- approxQuantile
- cache
- coalesce
- collect
- columns
- copy
- corr
- count
- cov
- createOrReplaceTempView
- crossJoin
- cube
- distinct
- drop
- dropDuplicates
- drop_duplicates
- dropna
- exceptAll
- explain
- fillna
- filter
- first
- groupBy
- groupby
- head
- intersect
- intersectAll
- join
- limit
- lineage
- Get lineage for a specific column. Returns a SQLGlot Node. Can be used to get lineage SQL or HTML representation.
- na
- orderBy
- persist
- replace
- select
- show
- Vertical Argument is not Supported
- sort
- sql
- SQLFrame Specific: Get the SQL representation of a given DataFrame
- stat
- toDF
- toPandas
- union
- unionAll
- unionByName
- unpivot
- where
- withColumn
- withColumnRenamed
- write
Functions
All functions through 4.0 are supported. Please open an issue if you encounter a function not implemented.
GroupedData Class
DataFrameReader Class
DataFrameWriter Class
- csv
- insertInto
- json
- mode
- parquet
- save
- saveAsTable
- sql
- SQLFrame Specific: Get the SQL representation of the DataFrame
SparkSession Class
DataTypes
- ArrayType
- BinaryType
- BooleanType
- ByteType
- CharType
- DataType
- DateType
- DecimalType
- DoubleType
- FloatType
- IntegerType
- LongType
- Row
- ShortType
- StringType
- StructField
- StructType
- TimestampNTZType
- TimestampType
- VarcharType
Window Class
WindowSpec Class
- orderBy
- partitionBy
- rangeBetween
- rowsBetween
- sql
- SQLFrame Specific: Get the SQL representation of the WindowSpec