Standalone

SQLFrame can be used in stand-alone mode which means it can do the query generation without needing access to a database. Since SQLFrame doesn't have access to a database to get column information, it needs to be explicitly provided instead. Also any operation that requires an actual connection to the database (like executing the query) will not for this session.

Installation

pip install sqlframe

Enabling SQLFrame

SQLFrame can be used in two ways:

Directly importing the sqlframe.standalone package
Using the activate function to allow for continuing to use pyspark.sql but have it use SQLFrame behind the scenes.

Import

If converting a PySpark pipeline, all pyspark.sql should be replaced with sqlframe.standalone. In addition, many classes will have a Standalone prefix. For example, StandaloneDataFrame instead of DataFrame.

# PySpark import
# from pyspark.sql import SparkSession
# from pyspark.sql import functions as F
# from pyspark.sql.dataframe import DataFrame
# SQLFrame import
from sqlframe.standalone import StandaloneSession
from sqlframe.standalone import functions as F
from sqlframe.standalone import StandaloneDataFrame

Activate

If you would like to continue using pyspark.sql but have it use SQLFrame behind the scenes, you can use the activate function.

from sqlframe import activate
activate("standalone")

from pyspark.sql import SparkSession

Creating a Session

Standalone supports defining both an input_dialect and an output_dialect which can be different from each other. input_dialect is the dialect used when using the DataFrame API and output_dialect is the dialect used when generating the SQL query. For example if you want Snowflake behavior of converting lowercase unquoted columns to uppercase, then you would set input_dialect to snowflake. If you plan on running the query against BigQuery, then you would set output_dialect to bigquery. Default is spark for both input and output dialects.

ImportActivate

from sqlframe.standalone import StandaloneSession

session = StandaloneSession.builder.config(map={"sqlframe.input.dialect": 'duckdb', "sqlframe.output.dialect": 'bigquery'}).getOrCreate()

from sqlframe import activate
activate("standalone", config={"sqlframe.input.dialect": 'duckdb', "sqlframe.output.dialect": 'duckdb'})

from pyspark.sql import SparkSession
session = SparkSession.builder.getOrCreate()

Accessing Tables

PySpark DataFrame API, and currently SQLFrame, requires that a table can be access to get it's schema information. Other engine-specific SQLFrame Sessions, like DuckDBSession or BigQuerySession, do this at runtime automatically. However, in Standalone mode, you need to provide the schema information explicitly since it does not have this connection information.

from sqlframe.standalone import StandaloneSession

session = StandaloneSession()

session.catalog.add_table("test_table", column_mapping={"a": "int", "b": "int", "c": "int"})
df = session.read.table("test_table")
df.sql()

Output:

"""
SELECT
  `test_table`.`a` AS `a`,
  `test_table`.`b` AS `b`,
  `test_table`.`c` AS `c`
FROM `test_table` AS `test_table`
"""

Supported PySpark API Methods

See something that you would like to see supported? Open an issue!

Column Class

alias
alias
asc
asc_nulls_first
asc_nulls_last
between
cast
desc
desc_nulls_first
desc_nulls_last
endswith
ilike
isNotNull
isNull
isin
like
otherwise
over
rlike
sql
- SQLFrame Specific: Get the SQL representation of a given column
startswith
substr
when

DataFrame Class

agg
alias
approxQuantile
cache
coalesce
collect
columns
copy
corr
count
cov
createOrReplaceTempView
crossJoin
cube
distinct
drop
dropDuplicates
drop_duplicates
dropna
exceptAll
explain
fillna
filter
first
groupBy
groupby
head
intersect
intersectAll
join
limit
lineage
Get lineage for a specific column. Returns a SQLGlot Node. Can be used to get lineage SQL or HTML representation.
na
orderBy
persist
replace
select
show
- Vertical Argument is not Supported
sort
sql
- SQLFrame Specific: Get the SQL representation of a given DataFrame
stat
toDF
toPandas
union
unionAll
unionByName
unpivot
where
withColumn
withColumnRenamed
write

Functions

All functions through 4.0 are supported. Please open an issue if you encounter a function not implemented.

GroupedData Class

SparkSession Class

DataTypes

Window Class

WindowSpec Class

orderBy
partitionBy
rangeBetween
rowsBetween
sql
- SQLFrame Specific: Get the SQL representation of the WindowSpec