DuckDB

Installation

pip install "sqlframe[duckdb]"

Enabling SQLFrame

SQLFrame can be used in two ways:

- Directly importing the sqlframe.duckdb package.
- Calling the activate function, which lets you keep importing from pyspark.sql while SQLFrame runs behind the scenes.

Import

If converting a PySpark pipeline, replace all pyspark.sql imports with sqlframe.duckdb. In addition, many classes have a DuckDB prefix. For example, use DuckDBDataFrame instead of DataFrame.

# PySpark import
# from pyspark.sql import SparkSession
# from pyspark.sql import functions as F
# from pyspark.sql.dataframe import DataFrame
# SQLFrame import
from sqlframe.duckdb import DuckDBSession
from sqlframe.duckdb import functions as F
from sqlframe.duckdb import DuckDBDataFrame

Activate

If you would like to continue using pyspark.sql but have it use SQLFrame behind the scenes, you can use the activate function.

from sqlframe import activate
activate("duckdb")

from pyspark.sql import SparkSession

SparkSession will now be a SQLFrame DuckDBSession object and everything will be run on DuckDB directly.

See activate configuration for information on how to pass in a connection and config options.

Creating a Session

SQLFrame uses the duckdb package to connect to DuckDB. A DuckDBSession, which implements the PySpark Session API, can be created by passing in a duckdb.Connection object or by allowing SQLFrame to create a connection for you. By default, SQLFrame will create a connection to an in-memory database.

Import + Without Providing Connection

from sqlframe.duckdb import DuckDBSession

session = DuckDBSession()

Import + With Providing Connection

import duckdb
from sqlframe.duckdb import DuckDBSession

conn = duckdb.connect(database=":memory:")
session = DuckDBSession(conn=conn)

Activate + Without Providing Connection

from sqlframe import activate
activate("duckdb")

from pyspark.sql import SparkSession

session = SparkSession.builder.getOrCreate()

Activate + With Providing Connection

import duckdb
from sqlframe import activate
conn = duckdb.connect(database=":memory:")
activate("duckdb", conn=conn)

from pyspark.sql import SparkSession

session = SparkSession.builder.getOrCreate()

Using DuckDB Unique Functions

DuckDB may have a function that isn't represented within the PySpark API. If that is the case, you can call it directly using PySpark's call_function function.

from sqlframe.duckdb import DuckDBSession
from sqlframe.duckdb import functions as F

session = DuckDBSession()
(
    session.table("example.table")
    .select(F.call_function("CURRENT_SETTING", F.lit("access_mode")).alias("access_mode_value"))
    .show()
)

Example Usage

from sqlframe.duckdb import DuckDBSession
from sqlframe.duckdb import functions as F

session = DuckDBSession()

df_employee = session.createDataFrame(
    [
        {"id": 1, "fname": "Jack", "lname": "Shephard", "age": 37, "store_id": 1},
        {"id": 2, "fname": "John", "lname": "Locke", "age": 65, "store_id": 2},
        {"id": 3, "fname": "Kate", "lname": "Austen", "age": 37, "store_id": 3},
        {"id": 4, "fname": "Claire", "lname": "Littleton", "age": 27, "store_id": 1},
        {"id": 5, "fname": "Hugo", "lname": "Reyes", "age": 29, "store_id": 3},
    ]
)
df_store = session.createDataFrame(
    [
        {"store_id": 1, "store_name": "The Hatch"},
        {"store_id": 2, "store_name": "The Pearl"},
        {"store_id": 3, "store_name": "The Swan"},
    ]
)

(
    df_employee
    .join(df_store, on="store_id")
    .groupBy("store_name")
    .agg(F.count("*").alias("total_employees"))
    .show()
)

Supported PySpark API Methods

See something that you would like to see supported? Open an issue!

Catalog Class

add_table
- SQLFrame Specific: Adds a table to known schemas that SQLFrame tracks
currentCatalog
currentDatabase
databaseExists
functionExists
getDatabase
getFunction
getTable
get_columns
- SQLFrame Specific: Similar to listColumns but returns SQLGlot expressions instead
get_columns_from_schema
- SQLFrame Specific: Gets the columns from the known schemas to SQLFrame
listCatalogs
listColumns
listDatabases
listFunctions
listTables
setCurrentCatalog
setCurrentDatabase
tableExists

Column Class

alias
asc
asc_nulls_first
asc_nulls_last
between
cast
contains
desc
desc_nulls_first
desc_nulls_last
endswith
ilike
isNotNull
isNull
isin
like
otherwise
over
rlike
sql
- SQLFrame Specific: Get the SQL representation of a given column
startswith
when

DataFrame Class

agg
alias
approxQuantile
cache
coalesce
collect
columns
copy
corr
count
cov
createOrReplaceTempView
crossJoin
cube
distinct
drop
dropDuplicates
drop_duplicates
dropna
exceptAll
explain
fillna
filter
first
groupBy
groupby
head
intersect
intersectAll
join
limit
lineage
- SQLFrame Specific: Get lineage for a specific column. Returns a SQLGlot Node. Can be used to get lineage SQL or HTML representation.
na
orderBy
persist
printSchema
replace
schema
select
show
- Vertical Argument is not Supported
sort
sql
- SQLFrame Specific: Get the SQL representation of the DataFrame
stat
toArrow
- SQLFrame Specific Argument: batch_size sets the number of rows to read per batch and returns a RecordBatchReader
toDF
toPandas
union
unionAll
unionByName
unpivot
where
withColumn
withColumnRenamed
withColumnsRenamed
write

Functions

abs
acos
add_months
aggregate
any_value
- Always ignores nulls
approxCountDistinct
approx_count_distinct
array
array_compact
array_contains
array_distinct
array_except
array_intersect
array_join
array_max
array_min
array_prepend
array_position
array_remove
array_reverse
- SQLFrame Specific: Works like reverse but only for arrays
array_size
array_sort
array_union
arrays_overlap
arrays_zip
asc
asc_nulls_first
asc_nulls_last
ascii
asin
atan
atan2
avg
base64
bin
bit_and
bit_count
bit_length
bit_or
bit_xor
bitmap_bit_position
bitwiseNOT
bitwise_not
bool_and
bool_or
btrim
call_function
cbrt
ceil
ceiling
char
char_length
character_length
coalesce
col
collect_list
collect_set
collate
concat
- Only works on strings (does not work on arrays)
concat_ws
contains
- Only works on strings (does not support binary)
convert_timezone
corr
cos
cot
count
countDistinct
count_distinct
count_if
covar_pop
covar_samp
create_map
cume_dist
curdate
current_catalog
current_date
current_time
current_timestamp
current_user
date_add
dateadd
date_diff
datediff
date_format
date_from_unix_date
date_sub
date_trunc
day
dayofmonth
dayofweek
dayofyear
dayname
decode
degrees
dense_rank
desc
desc_nulls_first
desc_nulls_last
e
element_at
- Only works on strings (does not work on arrays)
encode
endswith
exp
explode
expm1
expr
extract
factorial
filter
first
flatten
- If an array is None, it is ignored and results are still returned, whereas PySpark returns None
floor
format_string
from_unixtime
get_json_object
- Values are returned quoted while Spark strips the quotes
greatest
grouping
grouping_id
hash
- Uses a different hash algorithm than Spark
hex
hour
initcap
input_file_name
instr
isnan
isnull
json_object_keys
kurtosis
lag
last
last_value
last_day
lcase
lead
least
left
length
levenshtein
like
lit
ln
localtimestamp
locate
log
log10
log1p
log2
lower
lpad
ltrim
make_date
map_from_arrays
map_keys
max
max_by
md5
mean
median
min
min_by
minute
mode
month
monthname
months_between
- Rounded whole number is returned
nanvl
nth_value
ntile
nullifzero
overlay
percent_rank
percentile
percentile_approx
product
position
pow
quarter
radians
rand
rank
reduce
regexp
regexp_extract
regexp_extract_all
regexp_like
regexp_replace
repeat
replace
reverse
- Only works on strings (does not work on arrays). Use SQLFrame specific array_reverse to reverse an array.
right
rint
rlike
round
row_number
rpad
rtrim
second
sequence
session_user
shiftLeft
shiftRight
shiftleft
shiftright
sign
signum
sin
size
skewness
slice
sort_array
soundex
split
split_part
sqrt
stddev
stddev_pop
stddev_samp
struct
substring
sum
sumDistinct
sum_distinct
tan
tanh
timestamp_add
timestamp_diff
timestamp_seconds
toDegrees
to_binary
to_date
to_timestamp
to_timestamp_ntz
to_unix_timestamp
- The values must match the format string (null will not be returned if they do not)
toRadians
transform
translate
trim
trunc
try_divide
try_element_at
try_to_timestamp
typeof
ucase
unbase64
unhex
unix_micros
unix_millis
unix_seconds
unix_timestamp
uuid
upper
var_pop
var_samp
variance
weekofyear
when
year
zeroifnull

GroupedData Class

DataFrameReader Class

DataFrameWriter Class

csv
insertInto
json
mode
parquet
save
saveAsTable
sql
- SQLFrame Specific: Get the SQL representation of the DataFrame

SparkSession Class

DataTypes

Window Class

WindowSpec Class

orderBy
partitionBy
rangeBetween
rowsBetween
sql
- SQLFrame Specific: Get the SQL representation of the WindowSpec

Extra Functionality not Present in PySpark

SQLFrame supports the following extra functionality that is not present in PySpark.

Table Class

SQLFrame provides a Table class that supports extra DML operations like update and delete. This class is returned when using the table function from the DataFrameReader class.

import duckdb
from sqlframe.duckdb import DuckDBSession

conn = duckdb.connect(database=":memory:")
session = DuckDBSession(conn=conn)

df_employee = session.createDataFrame(
    [
        {"id": 1, "fname": "Jack", "lname": "Shephard", "age": 37, "store_id": 1},
        {"id": 2, "fname": "John", "lname": "Locke", "age": 65, "store_id": 2},
        {"id": 3, "fname": "Kate", "lname": "Austen", "age": 37, "store_id": 3},
        {"id": 4, "fname": "Claire", "lname": "Littleton", "age": 27, "store_id": 1},
        {"id": 5, "fname": "Hugo", "lname": "Reyes", "age": 29, "store_id": 3},
    ]
)

df_employee.write.mode("overwrite").saveAsTable("employee")

table_employee = session.table("employee")  # This object is of type DuckDBTable

Update Statement

The update method of the Table class is equivalent to the UPDATE table_name statement used in standard SQL.

# Generates a `LazyExpression` object which can be executed using the `execute` method
update_expr = table_employee.update(
    set_={"age": table_employee["age"] + 1},
    where=table_employee["id"] == 1,
)

# Executes the update statement
update_expr.execute()

# Show the result
table_employee.show()

Output:

+----+--------+-----------+-----+----------+
| id | fname  |   lname   | age | store_id | 
+----+--------+-----------+-----+----------+
| 1  |  Jack  |  Shephard |  38 |    1     |
| 2  |  John  |   Locke   |  65 |    2     |
| 3  |  Kate  |   Austen  |  37 |    3     |
| 4  | Claire | Littleton |  27 |    1     |
| 5  |  Hugo  |   Reyes   |  29 |    3     |
+----+--------+-----------+-----+----------+

Delete Statement

The delete method of the Table class is equivalent to the DELETE FROM table_name statement used in standard SQL.

# Generates a `LazyExpression` object which can be executed using the `execute` method
delete_expr = table_employee.delete(
    where=table_employee["id"] == 1,
)

# Executes the delete statement
delete_expr.execute()

# Show the result
table_employee.show()

Output:

+----+--------+-----------+-----+----------+
| id | fname  |   lname   | age | store_id | 
+----+--------+-----------+-----+----------+
| 2  |  John  |   Locke   |  65 |    2     |
| 3  |  Kate  |   Austen  |  37 |    3     |
| 4  | Claire | Littleton |  27 |    1     |
| 5  |  Hugo  |   Reyes   |  29 |    3     |
+----+--------+-----------+-----+----------+