Skip to content

Standalone

SQLFrame can be used in stand-alone mode which means it can do the query generation without needing access to a database. Since SQLFrame doesn't have access to a database to get column information, it needs to be explicitly provided instead. Also any operation that requires an actual connection to the database (like executing the query) will not for this session.

Installation

pip install sqlframe

Enabling SQLFrame

SQLFrame can be used in two ways:

  • Directly importing the sqlframe.standalone package
  • Using the activate function to allow for continuing to use pyspark.sql but have it use SQLFrame behind the scenes.

Import

If converting a PySpark pipeline, all pyspark.sql should be replaced with sqlframe.standalone. In addition, many classes will have a Standalone prefix. For example, StandaloneDataFrame instead of DataFrame.

# PySpark import
# from pyspark.sql import SparkSession
# from pyspark.sql import functions as F
# from pyspark.sql.dataframe import DataFrame
# SQLFrame import
from sqlframe.standalone import StandaloneSession
from sqlframe.standalone import functions as F
from sqlframe.standalone import StandaloneDataFrame

Activate

If you would like to continue using pyspark.sql but have it use SQLFrame behind the scenes, you can use the activate function.

from sqlframe import activate
activate("standalone")

from pyspark.sql import SparkSession

Creating a Session

Standalone supports defining both an input_dialect and an output_dialect which can be different from each other. input_dialect is the dialect used when using the DataFrame API and output_dialect is the dialect used when generating the SQL query. For example if you want Snowflake behavior of converting lowercase unquoted columns to uppercase, then you would set input_dialect to snowflake. If you plan on running the query against BigQuery, then you would set output_dialect to bigquery. Default is spark for both input and output dialects.

from sqlframe.standalone import StandaloneSession

session = StandaloneSession.builder.config(map={"sqlframe.input.dialect": 'duckdb', "sqlframe.output.dialect": 'bigquery'}).getOrCreate()
from sqlframe import activate
activate("standalone", config={"sqlframe.input.dialect": 'duckdb', "sqlframe.output.dialect": 'duckdb'})

from pyspark.sql import SparkSession
session = SparkSession.builder.getOrCreate()

Accessing Tables

PySpark DataFrame API, and currently SQLFrame, requires that a table can be access to get it's schema information. Other engine-specific SQLFrame Sessions, like DuckDBSession or BigQuerySession, do this at runtime automatically. However, in Standalone mode, you need to provide the schema information explicitly since it does not have this connection information.

from sqlframe.standalone import StandaloneSession

session = StandaloneSession()

session.catalog.add_table("test_table", column_mapping={"a": "int", "b": "int", "c": "int"})
df = session.read.table("test_table")
df.sql()

Output:

"""
SELECT
  `test_table`.`a` AS `a`,
  `test_table`.`b` AS `b`,
  `test_table`.`c` AS `c`
FROM `test_table` AS `test_table`
"""

Supported PySpark API Methods

See something that you would like to see supported? Open an issue!

Column Class

DataFrame Class

Functions

All functions through 4.0 are supported. Please open an issue if you encounter a function not implemented.

GroupedData Class

SparkSession Class

DataTypes

Window Class

WindowSpec Class