Standalone
SQLFrame can be used in stand-alone mode which means it can do the query generation without needing access to a database. Since SQLFrame doesn't have access to a database to get column information, it needs to be explicitly provided instead. Also any operation that requires an actual connection to the database (like executing the query) will not for this session.
Installation
Enabling SQLFrame
SQLFrame can be used in two ways:
- Directly importing the
sqlframe.standalonepackage - Using the activate function to allow for continuing to use
pyspark.sqlbut have it use SQLFrame behind the scenes.
Import
If converting a PySpark pipeline, all pyspark.sql should be replaced with sqlframe.standalone.
In addition, many classes will have a Standalone prefix.
For example, StandaloneDataFrame instead of DataFrame.
# PySpark import
# from pyspark.sql import SparkSession
# from pyspark.sql import functions as F
# from pyspark.sql.dataframe import DataFrame
# SQLFrame import
from sqlframe.standalone import StandaloneSession
from sqlframe.standalone import functions as F
from sqlframe.standalone import StandaloneDataFrame
Activate
If you would like to continue using pyspark.sql but have it use SQLFrame behind the scenes, you can use the activate function.
Creating a Session
Standalone supports defining both an input_dialect and an output_dialect which can be different from each other.
input_dialect is the dialect used when using the DataFrame API and output_dialect is the dialect used when generating the SQL query.
For example if you want Snowflake behavior of converting lowercase unquoted columns to uppercase, then you would set input_dialect to snowflake.
If you plan on running the query against BigQuery, then you would set output_dialect to bigquery.
Default is spark for both input and output dialects.
Accessing Tables
PySpark DataFrame API, and currently SQLFrame, requires that a table can be access to get it's schema information. Other engine-specific SQLFrame Sessions, like DuckDBSession or BigQuerySession, do this at runtime automatically. However, in Standalone mode, you need to provide the schema information explicitly since it does not have this connection information.
from sqlframe.standalone import StandaloneSession
session = StandaloneSession()
session.catalog.add_table("test_table", column_mapping={"a": "int", "b": "int", "c": "int"})
df = session.read.table("test_table")
df.sql()
Output:
"""
SELECT
`test_table`.`a` AS `a`,
`test_table`.`b` AS `b`,
`test_table`.`c` AS `c`
FROM `test_table` AS `test_table`
"""
Supported PySpark API Methods
See something that you would like to see supported? Open an issue!
Column Class
- alias
- alias
- asc
- asc_nulls_first
- asc_nulls_last
- between
- cast
- desc
- desc_nulls_first
- desc_nulls_last
- endswith
- ilike
- isNotNull
- isNull
- isin
- like
- otherwise
- over
- rlike
- sql
- SQLFrame Specific: Get the SQL representation of a given column
- startswith
- substr
- when
DataFrame Class
- agg
- alias
- approxQuantile
- cache
- coalesce
- collect
- columns
- copy
- corr
- count
- cov
- createOrReplaceTempView
- crossJoin
- cube
- distinct
- drop
- dropDuplicates
- drop_duplicates
- dropna
- exceptAll
- explain
- fillna
- filter
- first
- groupBy
- groupby
- head
- intersect
- intersectAll
- join
- limit
- lineage
- Get lineage for a specific column. Returns a SQLGlot Node. Can be used to get lineage SQL or HTML representation.
- na
- orderBy
- persist
- replace
- select
- show
- Vertical Argument is not Supported
- sort
- sql
- SQLFrame Specific: Get the SQL representation of a given DataFrame
- stat
- toDF
- toPandas
- union
- unionAll
- unionByName
- unpivot
- where
- withColumn
- withColumnRenamed
- write
Functions
All functions through 4.0 are supported. Please open an issue if you encounter a function not implemented.
GroupedData Class
SparkSession Class
DataTypes
- ArrayType
- BinaryType
- BooleanType
- ByteType
- CharType
- DataType
- DateType
- DecimalType
- DoubleType
- FloatType
- IntegerType
- LongType
- Row
- ShortType
- StringType
- StructField
- StructType
- TimestampNTZType
- TimestampType
- VarcharType
Window Class
WindowSpec Class
- orderBy
- partitionBy
- rangeBetween
- rowsBetween
- sql
- SQLFrame Specific: Get the SQL representation of the WindowSpec