Stabilizing the Spatial Lakehouse: Configuring Spark 3.5.5, Iceberg, and Nessie

At LurraData Lab, our core operations heavily rely on processing geospatial data. We require a robust, distributed architecture to handle everything from Earth Observation satellite imagery (for agriculture and environment) to vector data analytics. Our stack of choice is Apache Spark combined with Apache Sedona.

Recently, we set out to build a modern Data Lakehouse architecture using Apache Iceberg for table formatting and Project Nessie for Git-like catalog versioning.

The Bleeding Edge: Spark 4.0.1 Wall

Our initial ambition was to future-proof the lab by deploying the latest Spark 4.0.1 alongside Sedona and Nessie 0.107+.

However, we quickly hit a wall. While the individual projects are fantastic, the cross-compatibility ecosystem for Spark 4.x is still catching up. Specifically, we encountered a lack of compiled connector components and dependency conflicts between Sedona’s spatial libraries and Nessie’s catalog requirements on the Spark 4 JVM.

The Pivot: In data engineering, stability trumps novelty. We decided to reconfigure and step back to the Long-Term Support (LTS) ecosystem: Spark 3.5.5 and Nessie 0.104.x.

Baseline Testing with Alex Merced’s Tutorial

To validate this new configuration and ensure catalog operations (branching, merging) worked flawlessly, we used a great tutorial by Alex Merced from Dremio, "A Notebook for Getting Started with Project Nessie, Apache Iceberg, and Apache Spark", as our reference baseline.

A huge thanks to Alex for putting that together; it's a brilliant starting point. However, because software evolves rapidly, we had to make a few tweaks to get it running on our specific 3.5.5 stack.

The Working Configuration

Here is the exact PySpark session configuration (executed in Jupyter Lab) that successfully binds Spark, Iceberg, and Nessie together (Sedona is not yet part of this session and is layered on afterwards):

from pyspark.sql import SparkSession

# Define Nessie and Iceberg versions
NESSIE_VERSION = "0.104.0"
ICEBERG_VERSION = "1.5.0"

spark = SparkSession.builder \
    .appName("LurraLab-Nessie-Iceberg-Sedona") \
    .config("spark.jars.packages", 
            f"org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{ICEBERG_VERSION},"
            f"org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:{NESSIE_VERSION}") \
    .config("spark.sql.extensions", 
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions") \
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.nessie.uri", "http://localhost:19120/api/v1") \
    .config("spark.sql.catalog.nessie.ref", "main") \
    .config("spark.sql.catalog.nessie.authentication.type", "NONE") \
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog") \
    .config("spark.sql.catalog.nessie.warehouse", "file:///home/jovyan/workspace/warehouse") \
    .getOrCreate()
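
If you need to bump versions or point at a different Nessie endpoint, it can help to generate these settings from one place so the package coordinates and catalog keys stay in sync. The following is a minimal sketch under that assumption; the helper names `build_nessie_conf` and `create_session` are ours (hypothetical), while the keys and default values are the ones from the configuration above.

```python
def build_nessie_conf(nessie_version="0.104.0",
                      iceberg_version="1.5.0",
                      uri="http://localhost:19120/api/v1",
                      warehouse="file:///home/jovyan/workspace/warehouse",
                      catalog="nessie",
                      ref="main"):
    """Return the Spark settings from the session above as a plain dict.

    Hypothetical helper: it only assembles strings and does not touch Spark.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    packages = ",".join([
        f"org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{iceberg_version}",
        "org.projectnessie.nessie-integrations:"
        f"nessie-spark-extensions-3.5_2.12:{nessie_version}",
    ])
    extensions = ",".join([
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "org.projectnessie.spark.extensions.NessieSparkSessionExtensions",
    ])
    return {
        "spark.jars.packages": packages,
        "spark.sql.extensions": extensions,
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.uri": uri,
        f"{prefix}.ref": ref,
        f"{prefix}.authentication.type": "NONE",
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        f"{prefix}.warehouse": warehouse,
    }


def create_session(conf, app_name="LurraLab-Nessie-Iceberg-Sedona"):
    """Apply the dict to a SparkSession builder (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily on purpose

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


conf = build_nessie_conf()
# spark = create_session(conf)  # uncomment inside Jupyter Lab
```

Keeping the versions as function arguments makes it easier to test a different Iceberg or Nessie release without hand-editing several strings.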

The Missing Steps

While following Alex’s tutorial to test the DDL/DML operations, we encountered two minor roadblocks that we had to resolve:

1. Namespace Creation

In newer versions of Nessie/Iceberg, you cannot create a table directly in the catalog root without explicitly defining a namespace first. The original tutorial skipped this. Before creating the sales table, you must run:

# Crucial missing step in modern Iceberg/Nessie configs
spark.sql("CREATE NAMESPACE IF NOT EXISTS nessie.sales_namespace")

Then adjust your table creation scripts to reference nessie.sales_namespace.sales.
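
As a rough illustration of that adjustment, the sketch below pairs the namespace statement with a table definition that uses the fully qualified name. The column types are our assumption, inferred from the mock CSV later in this post rather than taken from the original tutorial, and the helper names are hypothetical.

```python
NAMESPACE = "nessie.sales_namespace"
TABLE = f"{NAMESPACE}.sales"


def namespace_and_table_ddl(namespace=NAMESPACE, table=TABLE):
    """Return the DDL statements in the order they need to run."""
    return [
        # The namespace has to exist before the table is created in it
        f"CREATE NAMESPACE IF NOT EXISTS {namespace}",
        # Column types are assumed; adjust them to your real schema
        f"CREATE TABLE IF NOT EXISTS {table} ("
        "transaction_id INT, store_id STRING, amount DOUBLE, `date` DATE"
        ") USING iceberg",
    ]


def apply_ddl(spark, statements):
    """Run each statement through an existing SparkSession."""
    for statement in statements:
        spark.sql(statement)


statements = namespace_and_table_ddl()
# apply_ddl(spark, statements)  # run inside the configured session
```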

2. The Missing salesdata.csv

The original dataset linked in the tutorial was no longer available. To save future engineers some time, here is a mock snippet of the salesdata.csv we used to successfully test the ingestion:

transaction_id,store_id,amount,date
1001,A1,250.50,2026-03-01
1002,B2,15.00,2026-03-01
1003,A1,99.99,2026-03-02
1004,C3,1200.00,2026-03-02
1005,B2,45.50,2026-03-03
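
To reproduce the file locally, something like the following sketch can write the mock rows to disk and sanity-check them with the standard library before handing the path to Spark. The helper names and the temporary directory are our own choices (hypothetical), and `load_into_iceberg` assumes the session and namespace described earlier already exist.

```python
import csv
import os
import tempfile
from pathlib import Path

MOCK_CSV = """transaction_id,store_id,amount,date
1001,A1,250.50,2026-03-01
1002,B2,15.00,2026-03-01
1003,A1,99.99,2026-03-02
1004,C3,1200.00,2026-03-02
1005,B2,45.50,2026-03-03
"""


def write_mock_csv(path):
    """Write the mock sales rows to `path` and return it as a Path."""
    target = Path(path)
    target.write_text(MOCK_CSV, encoding="utf-8")
    return target


def read_rows(path):
    """Parse the CSV into a list of dicts keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def load_into_iceberg(spark, path, table="nessie.sales_namespace.sales"):
    """Read the CSV with Spark and (re)create the Iceberg table from it."""
    df = spark.read.csv(str(path), header=True, inferSchema=True)
    df.writeTo(table).createOrReplace()


csv_path = write_mock_csv(os.path.join(tempfile.mkdtemp(), "salesdata.csv"))
rows = read_rows(csv_path)
total_amount = sum(float(row["amount"]) for row in rows)
# load_into_iceberg(spark, csv_path)  # run inside the configured session
```

Note that `createOrReplace()` lets Spark derive the table schema from the inferred CSV types; if you created the table with explicit DDL first, appending to it may be the better fit.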

Validation and Next Steps

With the namespace created and the dummy data in place, Alex’s branching and merging examples worked perfectly. We successfully created a dev branch, ingested the CSV, and merged it back to main with full isolation.

What’s Next for LurraData Lab?

Now that the foundational Data Lakehouse is stable on Spark 3.5.5, our next step is to overlay Apache Sedona. We will be ingesting our EkoGuarda wildfire GeoJSON datasets into Iceberg tables and utilizing Nessie branches to experiment with different spatial indexing strategies without breaking our main production data. Stay tuned for the spatial benchmarks.
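
For reference, the dev-branch workflow mentioned in this section (create a branch, ingest on it, merge back to main) can be sketched with the Nessie Spark SQL extensions roughly as follows. The wrapper function `run_on_branch` and the `ingest` callback are hypothetical names of ours; the SQL statements assume the catalog is registered as `nessie`, as in the configuration above.

```python
def run_on_branch(spark, ingest, branch="dev", base="main", catalog="nessie"):
    """Run `ingest(spark)` on an isolated Nessie branch, then merge it back.

    Hypothetical wrapper around the Nessie SQL extension statements.
    """
    # Create the working branch from the base branch if it is missing
    spark.sql(f"CREATE BRANCH IF NOT EXISTS {branch} IN {catalog} FROM {base}")
    # Point the session at the branch so writes stay isolated from base
    spark.sql(f"USE REFERENCE {branch} IN {catalog}")
    # Caller-supplied work, for example loading salesdata.csv into a table
    ingest(spark)
    # Publish the branch's commits to the base branch
    spark.sql(f"MERGE BRANCH {branch} INTO {base} IN {catalog}")
    # Switch the session back so later reads see the merged state
    spark.sql(f"USE REFERENCE {base} IN {catalog}")


# Example (inside the configured session):
# run_on_branch(spark, lambda s: s.sql(
#     "INSERT INTO nessie.sales_namespace.sales "
#     "VALUES (1006, 'A1', 10.0, DATE'2026-03-04')"))
```

Until the merge statement runs, queries against main should not see the rows written on dev, which is the isolation property we relied on during testing.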

Citations

https://github.com/ThaliaBarrera/lakehouse-capstone

https://community.dremio.com/t/write-in-dremio-nessie-catalog-through-spark/12146?page=2

A Notebook for Getting Started with Project Nessie, Apache Iceberg, and Apache Spark By Alex Merced

“A step-by-step guide to configuring your local environment to test Iceberg table formats with Nessie catalog branching…”