We adopted a practical Medallion-style approach, segmenting our data flows into Bronze, Silver, and Gold layers. As part of this redesign, we needed to optimize how curated data was exported to Snowflake, and that's when we hit performance issues with external tables. I know the common suggestion is to use the COPY command, but it comes with limitations: it lacks support for schema evolution, doesn't offer native parallel copy for all formats, and its cost gets buried inside compute usage, offering no granular cost visibility.
Because Snowflake remained our primary analytical engine, we wanted to export, or offload, curated data back into Snowflake to leverage its performance for downstream analytics.
Our priorities were practical: support near real-time ingestion, avoid manual schema updates through schema evolution, ensure visibility through monitoring, handle the complexity of frequently refreshing files from the source system, and build a system with minimal tech debt and low ongoing maintenance overhead. Snowpipe offered the automation we needed, but it brought its own unique challenges and nuances.
This blog shares our hands-on experience—what worked, what didn’t, and how we designed a pipeline that’s now running reliably in production.
If you’re unfamiliar with the Medallion architecture and how it structures data flows across Bronze, Silver, and Gold layers, check out this detailed primer: The Medallion Masterstroke: How Databricks Rewired the Data World One Bronze Layer at a Time. It’s a foundational pattern that shaped our approach to scalable data ingestion and transformation.
Architecture Choices and Initial Decisions
Quick Questions During POC
During our initial evaluation of Snowpipe, we documented several key implementation questions and their answers. Here’s a summary of the most frequently asked questions we encountered while validating the feasibility of Snowpipe for our use case:
| Question | Observation / Answer |
|---|---|
| Automated or Manual? | Snowpipe supports automated ingestion using S3 event notifications and SQS. Manual triggering is also possible. |
| Setup Overhead? | Simple to configure. You create a Snowpipe linked to a stage and SQS. Setup takes only a few minutes. |
| Performance? | Near real-time with low ingestion latency. Snowpipe handles distributed and parallel loading efficiently. |
| Managed Table vs External Table Performance? | Managed tables are significantly faster due to result caching. Queries on external tables showed much slower performance. |
| File Format Support? | Supports structured (CSV) and semi-structured (JSON, Parquet, Avro) formats. We used Parquet and JSON heavily. |
| Success Notifications? | Not natively supported. Workarounds involve COPY_HISTORY and stage comparisons. |
| Error Notifications? | Can be pushed to SNS. Needs notification integration in Snowflake and downstream consumers. |
| Cost Visibility? | Cost is tied to compute resource usage and number of files. Roughly $0.06 per 1000 files of 10–100MB each. |
| Error Handling Options? | ON_ERROR options include SKIP_FILE, CONTINUE, SKIP_FILE_10, and SKIP_FILE_10%. Use SKIP_FILE for maximum safety. |
| Schema Evolution Support? | Yes, by setting ENABLE_SCHEMA_EVOLUTION = TRUE on the table. Columns are added automatically. |
| Reloading a File with Same Name? | Not possible within 14 days due to Snowpipe metadata. Needs unique filenames or a delay. |
| Historical Data Load? | Snowpipe supports up to 7 days of history via ALTER PIPE ... REFRESH. Beyond that, manual COPY commands are needed. |
| S3 File Deletion Impact? | Snowpipe does not handle deletes by default. Requires stream-task workflows or stage comparisons to reflect deletions. |
| Stage Purge Support? | Files are not auto-deleted. Use S3 lifecycle policies or explicit purge logic. |
| Multiple Tables with One SQS? | Supported. A single SQS queue can service multiple tables as long as stage paths are partitioned distinctly. |
These early insights helped shape our implementation strategy and avoided pitfalls that are often only discovered in production.
Why Didn’t We Choose Apache Iceberg?
About 1.5 years ago, we were at a crossroads while designing our ingestion pipeline: should we invest in Apache Iceberg or stick with Snowflake-native capabilities like Snowpipe?
While Iceberg is an excellent open table format, it wasn’t the right fit at that time for several reasons:
- Snowflake's support for Iceberg was limited. Native integration had only just begun and required workarounds.
- Schema evolution in Iceberg was rudimentary compared to Snowflake's seamless handling.
- Operational complexity, especially for a small team, was significantly higher with Iceberg.
To understand what makes Apache Iceberg powerful and where it shines today, check out my deep-dive blog post: Apache Iceberg: The Data Lake Breakthrough That’s Reshaping the Big Data Landscape
Our decision favored Snowflake’s managed experience and focus on automation—even if it meant dealing with certain limitations more creatively. That choice paid off in our production systems.
The Promise and Challenge of Snowpipe
Snowpipe is a managed ingestion service in Snowflake designed to load files as soon as they land in your cloud storage. It supports micro-batch ingestion and near real-time visibility of data.
Key features that stood out to us:
- Auto-ingestion with S3 event triggers.
- Parallel, distributed data copy into managed tables.
- Native integration with Snowflake Streams and Tasks.
- Support for structured and semi-structured formats (like JSON and Parquet).
- Schema evolution via a table property: we enabled ENABLE_SCHEMA_EVOLUTION = TRUE at the table level. This allowed incoming files to introduce new columns without needing engineers to update the schema manually. It was a huge win for agility.
However, Snowpipe also brought a few constraints:
- Its append-only nature means it won't reload the same file again.
- Deletes and updates require orchestration outside Snowpipe.
- Monitoring and observability must be built using metadata and system views.
Solution Architecture and Components
Snowpipe as a Smart Email Inbox: An Analogy You Won’t Forget
Think of your Snowflake ingestion pipeline like a smart email system—organized, reactive, and automatic.
| Stage | Email Analogy | What It Actually Does |
|---|---|---|
| Landing Table | Your inbox | Every incoming email (file) lands here. It's append-only. |
| Stream | New mail notification | Instantly detects unread messages, no polling. |
| Task | Smart filters or auto-label rules | Triggered when new mail arrives, no fixed schedule. |
| Stored Procedure | Rules engine to clean + tag mail | Removes spam, identifies duplicates, adds tags, etc. |
| Clone Table | Archive folders with shared access | Cleaned, organized emails saved here, easy to query safely. |
Why it works:
- It's event-driven, not scheduled.
- Snowflake Streams work like notifications: they fire only when needed.
- The system cleans and enriches incoming data, just like Gmail tags and filters mail automatically.
Although this architecture may look complex, setting it up involves just a few SQL statements – creating the stage, pipe, stream, task, and wrapping it in a stored procedure. Once automated, it’s surprisingly lightweight to operate.
Snowpipe POC Environment Setup
For those looking to experiment or validate this architecture, here’s how to quickly set up a working POC environment in Snowflake.
Step-by-Step SQL Commands for POC
-- Step 1: Create a managed table
CREATE OR REPLACE TABLE stg.nik_table_snowpipe_employee_test_parquet (
first_name VARCHAR,
last_name VARCHAR,
date_of_joining VARCHAR,
filename STRING
);
-- Step 2: Enable schema evolution
ALTER TABLE stg.nik_table_snowpipe_employee_test_parquet
SET ENABLE_SCHEMA_EVOLUTION = TRUE;
-- Step 3: Create an external stage pointing to your S3 location
CREATE OR REPLACE STAGE stg.nik_stage_snowpipe_employee_test_parquet
STORAGE_INTEGRATION = INTEGRATION_NAME
URL = 's3://your-bucket-name/snowpipe/snowpipe_parquet'
FILE_FORMAT = (
TYPE = 'parquet'
COMPRESSION = 'AUTO'
NULL_IF = ('NULL', 'null', '\N', 'None', 'NONE')
);
-- Step 4: Create the Snowpipe
CREATE OR REPLACE PIPE stg.nik_snowpipe_employee_test_parquet
AUTO_INGEST = TRUE
AS
COPY INTO stg.nik_table_snowpipe_employee_test_parquet
FROM @stg.nik_stage_snowpipe_employee_test_parquet/latest/part_year=2025
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
-- Step 5: Manually refresh Snowpipe (for first-time load or testing)
ALTER PIPE stg.nik_snowpipe_employee_test_parquet REFRESH;
-- Optional: Check loaded data
SELECT * FROM stg.nik_table_snowpipe_employee_test_parquet;
✅ Note: Replace the S3 bucket path and integration name with your actual resources.
This setup will allow you to simulate file ingestion via Snowpipe, validate schema evolution, and test how metadata like filename is captured for downstream deduplication or partitioning.
Designing a Resilient Snowpipe Architecture
- Deduplication stored procedure: We implemented a JavaScript-based stored procedure that encapsulates the complex deduplication logic: pausing the Snowpipe, refreshing the stage, detecting and deleting old or duplicate files, and finally cloning the cleaned data into a final table. Pausing the pipe is a crucial step, since Snowpipe limits concurrent executions; it avoids data races during deduplication and gives us better control over ingestion frequency (e.g., running every 60 seconds), which improves consistency and reduces cost from unnecessary micro-batch triggers. This approach minimizes downtime and automates clean ingestion.
Sample Stored Procedure for Deduplication
Before jumping into the code, here are some important assumptions and design considerations tied to this deduplication logic:
- Partitioning Strategy: In our implementation, all incoming files follow a common partitioning convention using paths like part_year=YYYY/part_month=MM/... . This allowed us to extract and embed these values dynamically into the final deduplicated table using computed columns.
- Metadata Fields Are Available: The filename, file_last_modified, and file_scan_time fields are critical. These are either automatically captured by Snowflake (with new COPY options) or appended during ingestion.
- Time-based Deduplication Window: We scoped duplicate detection to the last 7–8 days of data, under the assumption that only recent files could be refreshed. This helped us scale efficiently without full-table scans.
- Pausing the Pipe Temporarily: This ensures Snowpipe doesn't interfere with deduplication. Snowpipe's concurrency limitations could otherwise lead to data inconsistencies if new files land mid-procedure.
- Final Table Cloning Strategy: Instead of overwriting production tables, we used a CLONE command to create a deduplicated snapshot. This provides a safe rollback option and preserves user access rights seamlessly.
These assumptions helped generalize the solution while keeping it adaptable to different table structures and source systems.
CREATE OR REPLACE PROCEDURE SNOWPIPE_LN_TABLE_DEDUP_SP(
DELAY STRING,
SF_DB STRING,
SF_SCHEMA STRING,
SF_DB_STAGE STRING,
INSTANCE_NAME STRING,
OUTPUT_FORMAT STRING,
STREAM_NAME STRING,
LN_TBL_NAME STRING,
SNWPIPE_NAME STRING,
FINAL_TBL_NAME STRING,
TMP_TBL STRING,
PART_YEAR STRING DEFAULT 'func_year',
PART_MONTH STRING DEFAULT 'func_month',
PART_DAY STRING DEFAULT 'func_day',
PART_HOUR STRING DEFAULT 'func_hour'
)
RETURNS STRING
LANGUAGE JAVASCRIPT
EXECUTE AS CALLER
AS
$$
try {
var log = '';
snowflake.execute({ sqlText: 'USE DATABASE ' + SF_DB });
snowflake.execute({ sqlText: 'USE SCHEMA ' + SF_SCHEMA });
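// Pause the pipe so no new files land mid-deduplication (Snowpipe limits concurrent executions), then wait briefly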
snowflake.execute({ sqlText: 'ALTER PIPE ' + SNWPIPE_NAME + ' SET PIPE_EXECUTION_PAUSED = TRUE;' });
snowflake.execute({ sqlText: "SELECT SYSTEM$WAIT(" + DELAY + ", 'SECONDS');" });
snowflake.execute({ sqlText: `ALTER STAGE ${SF_DB_STAGE} REFRESH SUBPATH = 'data/${INSTANCE_NAME}.${OUTPUT_FORMAT}';` });
snowflake.execute({ sqlText: 'DROP TABLE IF EXISTS ' + TMP_TBL });
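// Build a temp table of outdated or duplicate file versions, scoped to partition paths modified in the last 8 days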
var dedupQuery = `
CREATE OR REPLACE TEMP TABLE ${TMP_TBL} AS (
WITH recent_modified_paths AS (
SELECT REPLACE(filename, SPLIT_PART(filename, '/', -1), '') AS path
FROM ${LN_TBL_NAME}
WHERE file_last_modified >= DATEADD(day, -8, CURRENT_DATE)
GROUP BY REPLACE(filename, SPLIT_PART(filename, '/', -1), '')
),
all_files AS (
SELECT f.filename, r.path, f.file_last_modified, f.file_scan_time
FROM ${LN_TBL_NAME} f
JOIN recent_modified_paths r
ON REPLACE(f.filename, SPLIT_PART(f.filename,'/',-1), '') = r.path
WHERE f.filename IS NOT NULL
GROUP BY f.filename, r.path, f.file_last_modified, f.file_scan_time
),
duplicates AS (
SELECT filename, file_last_modified, file_scan_time,
CASE WHEN LAG(filename, 1) OVER (PARTITION BY path ORDER BY filename, file_last_modified DESC, file_scan_time DESC) = filename THEN TRUE ELSE FALSE END AS is_duplicate_and_old_file,
DATEDIFF(second, file_last_modified, FIRST_VALUE(file_last_modified) OVER (PARTITION BY path ORDER BY file_last_modified DESC, file_scan_time DESC)) AS time_diff_secs,
DENSE_RANK() OVER (PARTITION BY path ORDER BY file_last_modified DESC, file_scan_time DESC) AS rn
FROM all_files
QUALIFY rn > 1
)
SELECT * FROM duplicates
WHERE is_duplicate_and_old_file = FALSE AND time_diff_secs > 1800
UNION ALL
SELECT * FROM duplicates
WHERE is_duplicate_and_old_file = TRUE
);`;
snowflake.execute({ sqlText: dedupQuery });
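// Delete rows belonging to the outdated/duplicate file versions from the landing table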
snowflake.execute({
sqlText: `DELETE FROM ${LN_TBL_NAME} WHERE (filename, file_last_modified, file_scan_time) IN (
SELECT filename, file_last_modified, file_scan_time FROM ${TMP_TBL});`
});
snowflake.execute({ sqlText: 'DROP TABLE IF EXISTS ' + TMP_TBL });
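// Clone the cleaned landing table into the final table; the clone inherits existing grants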
snowflake.execute({ sqlText: `CREATE OR REPLACE TABLE ${FINAL_TBL_NAME} CLONE ${LN_TBL_NAME};` });
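// Derive partition columns (year/month/day/hour) from the filename path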
var alterQuery = `
ALTER TABLE ${FINAL_TBL_NAME} ADD COLUMN IF NOT EXISTS ${PART_YEAR} STRING AS
CASE WHEN POSITION('${PART_YEAR}=' IN filename) = 0 THEN NULL
ELSE SPLIT_PART(SPLIT_PART(filename, '${PART_YEAR}=', -1), '/', 0) END,
IF NOT EXISTS ${PART_MONTH} STRING AS
CASE WHEN POSITION('${PART_MONTH}=' IN filename) = 0 THEN NULL
ELSE SPLIT_PART(SPLIT_PART(filename, '${PART_MONTH}=', -1), '/', 0) END,
IF NOT EXISTS ${PART_DAY} STRING AS
CASE WHEN POSITION('${PART_DAY}=' IN filename) = 0 THEN NULL
ELSE SPLIT_PART(SPLIT_PART(filename, '${PART_DAY}=', -1), '/', 0) END,
IF NOT EXISTS ${PART_HOUR} STRING AS
CASE WHEN POSITION('${PART_HOUR}=' IN filename) = 0 THEN NULL
ELSE SPLIT_PART(SPLIT_PART(filename, '${PART_HOUR}=', -1), '/', 0) END;`;
snowflake.execute({ sqlText: alterQuery });
snowflake.execute({ sqlText: 'ALTER PIPE ' + SNWPIPE_NAME + ' SET PIPE_EXECUTION_PAUSED = FALSE;' });
return 'Procedure completed successfully.';
} catch (err) {
return 'Procedure failed. Error: ' + err.message;
}
$$;
To solve these limitations, we designed a Snowpipe-centric ingestion framework that layered in flexibility, observability, and control.
Here’s what the stack included:
- S3 event notifications: Trigger Snowpipe on new file arrivals.
- Directory Tables in Snowflake: Handled automatically by Snowflake to maintain metadata of ingested files.
- Streams in Snowflake: Monitor changes to staging tables.
- Tasks in Snowflake: Automate deduplication workflows.
- Stored Procedures: House complex business logic.
- Clone Tables: Provide stable, deduplicated snapshots to consumers and save the headache of preserving permissions.
- Schema evolution enabled at the table level, reducing manual overhead.
Dynamic Tables in Snowflake (Evaluated but Deferred)
Dynamic Tables are Snowflake’s attempt at simplifying incremental transformations. We explored using them for deduplication logic. However:
- Lack of schema evolution support was a blocker.
- Limited join flexibility made them less suitable for complex use cases.
- Better suited for downstream transformations, not raw ingestion cleanup.
We expect to revisit them once the ecosystem matures further.
Handling Refreshing Files, Deduplication, and Deletes
A Near-Give-Up Moment: When Snowpipe Almost Didn’t Make the Cut
There was a point during the implementation where we seriously considered shelving Snowpipe altogether. The lack of native metadata capture—such as filename and load timestamps—made deduplication logic brittle and overly reliant on external mechanisms. We needed a more resilient way to uniquely identify files ingested into the managed table.
Fortunately, Snowflake released a major update just in time. With the April 30–May 7, 2024 (8.17) release, Snowpipe began supporting a new COPY option that automatically includes metadata like filenames, file modification timestamps, and more. This fundamentally simplified our deduplication logic.
It allowed us to embed filename metadata directly into the target table without relying on fragile stage queries or external joins. This single improvement significantly boosted pipeline resilience and reduced engineering overhead.
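For illustration, here is a minimal sketch of how the POC pipe definition could be extended with that metadata option. The exact column mapping is our assumption for this example; the filename, file_last_modified, and file_scan_time columns must already exist on the target table, and the option requires MATCH_BY_COLUMN_NAME.
-- Sketch: pipe definition with metadata capture (assumes the metadata columns exist on the table)
CREATE OR REPLACE PIPE stg.nik_snowpipe_employee_test_parquet
AUTO_INGEST = TRUE
AS
COPY INTO stg.nik_table_snowpipe_employee_test_parquet
FROM @stg.nik_stage_snowpipe_employee_test_parquet/latest/part_year=2025
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
INCLUDE_METADATA = (
filename = METADATA$FILENAME,
file_last_modified = METADATA$FILE_LAST_MODIFIED,
file_scan_time = METADATA$START_SCAN_TIME
);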
Strategy 1: COPY_HISTORY + Temp Table
We leveraged Snowflake's COPY_HISTORY() function to track all files ingested by Snowpipe. Here's how we managed deduplication:
- Extract all versions of a file in a given partition.
- Identify the most recent version using DENSE_RANK().
- Delete older versions from the managed table.
- Drop the temp table to maintain hygiene.
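The following is a condensed sketch of those four steps, reusing the POC table names from earlier. Treat the temp table names and the 7-day window as illustrative assumptions rather than our exact production code.
-- Sketch of Strategy 1 (illustrative names; adjust to your tables)
-- 1. Collect the file versions Snowpipe has loaded into this table over the last 7 days
CREATE OR REPLACE TEMP TABLE tmp_file_versions AS
SELECT file_name, last_load_time
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
TABLE_NAME => 'STG.NIK_TABLE_SNOWPIPE_EMPLOYEE_TEST_PARQUET',
START_TIME => DATEADD(day, -7, CURRENT_TIMESTAMP())
));
-- 2. Rank versions within each partition path; anything beyond rank 1 is an older version
CREATE OR REPLACE TEMP TABLE tmp_old_versions AS
SELECT file_name
FROM (
SELECT file_name,
DENSE_RANK() OVER (
PARTITION BY REPLACE(file_name, SPLIT_PART(file_name, '/', -1), '')
ORDER BY last_load_time DESC
) AS rn
FROM tmp_file_versions
) AS ranked
WHERE rn > 1;
-- 3. Delete rows that came from the outdated file versions
-- (depending on how filename was captured, the paths may need normalization before matching)
DELETE FROM stg.nik_table_snowpipe_employee_test_parquet
WHERE filename IN (SELECT file_name FROM tmp_old_versions);
-- 4. Drop the temp tables to maintain hygiene
DROP TABLE IF EXISTS tmp_file_versions;
DROP TABLE IF EXISTS tmp_old_versions;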
Strategy 2: Stream + Task + Stored Procedure
One of the biggest challenges we overcame was deduplicating data at scale—even in pipelines ingesting millions of records daily. To manage this effectively:
- We used Streams to track only new data arriving into the managed landing table, which drastically reduced the processing footprint.
- The Stored Procedure was built with robust partition-aware logic that filtered only the recently modified files (typically from the last 7–8 days), avoiding unnecessary scans over older partitions.
- Metadata columns like filename, file_last_modified, and file_scan_time helped us surgically identify outdated duplicates.
- A window-function-based approach using DENSE_RANK and LAG ensured precise detection of duplicate records by partition.
- The deduplication logic operated on a temporary table, allowing fast intermediate processing without impacting production tables.
- We wrapped all logic in an event-driven model with Snowflake Tasks, ensuring real-time cleanup without manual triggers.
These strategies combined allowed our platform to scale gracefully, with near-zero lag and no compromise on correctness—even as ingestion volume surged across multiple data domains.
We moved toward a more event-driven architecture:
- Stream detects inserts into the managed table.
- Stored Procedure runs deduplication logic using partition + filename.
- Task is triggered when the stream detects new rows.
- Snowpipe is paused temporarily to avoid race conditions and resumed post-cleanup.
This gave us minimal latency between ingestion and cleanup, making it production-ready for near real-time systems.
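As a reference, here is a minimal sketch of the Stream and Task wiring around the deduplication procedure. The object names, the warehouse, and the parameter values passed to SNOWPIPE_LN_TABLE_DEDUP_SP are hypothetical placeholders and would need to match your own stage, pipe, and table names.
-- Sketch: event-driven cleanup (hypothetical object names)
-- Stream that captures only new rows landing in the managed table
CREATE OR REPLACE STREAM stg.nik_stream_employee_landing
ON TABLE stg.nik_table_snowpipe_employee_test_parquet
APPEND_ONLY = TRUE;
-- Task that runs only when the stream has data and calls the dedup procedure
-- Note: the stream must be consumed in a DML statement for the WHEN condition to reset
CREATE OR REPLACE TASK stg.nik_task_employee_dedup
WAREHOUSE = my_wh  -- placeholder warehouse
SCHEDULE = '1 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('STG.NIK_STREAM_EMPLOYEE_LANDING')
AS
CALL SNOWPIPE_LN_TABLE_DEDUP_SP(
'60', 'MY_DB', 'STG',
'STG.NIK_STAGE_SNOWPIPE_EMPLOYEE_TEST_PARQUET',
'employee_test', 'parquet',
'STG.NIK_STREAM_EMPLOYEE_LANDING',
'STG.NIK_TABLE_SNOWPIPE_EMPLOYEE_TEST_PARQUET',
'STG.NIK_SNOWPIPE_EMPLOYEE_TEST_PARQUET',
'STG.NIK_TABLE_EMPLOYEE_FINAL',
'TMP_EMPLOYEE_DEDUP'
);
-- Tasks are created suspended; resume to activate
ALTER TASK stg.nik_task_employee_dedup RESUME;
The SCHEDULE acts as a polling interval, while SYSTEM$STREAM_HAS_DATA ensures the task only spins up the warehouse when there is actually something to clean.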
Monitoring, Notifications, and Historical Data
Snowflake Monitoring and Observability for Snowpipe
Debugging ingestion pipelines without observability is a nightmare. So we built:
- Snowsight dashboards for real-time views on load time, lag, and failures.
- Logging inside Stored Procedures to trace dedup steps.
- Error tables to isolate unprocessed files.
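A few queries we found useful for this kind of observability, shown here as a minimal sketch that reuses the POC pipe and table names; adjust identifiers to your environment.
-- Current state of the pipe (running vs. paused, pending file count)
SELECT SYSTEM$PIPE_STATUS('STG.NIK_SNOWPIPE_EMPLOYEE_TEST_PARQUET');
-- Snowpipe credits and file counts over the last 7 days (ACCOUNT_USAGE data may lag by a few hours)
SELECT pipe_name,
SUM(credits_used) AS credits_used,
SUM(files_inserted) AS files_inserted
FROM SNOWFLAKE.ACCOUNT_USAGE.PIPE_USAGE_HISTORY
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY pipe_name;
-- Files that failed to load into the landing table in the last 24 hours
SELECT file_name, last_load_time, first_error_message
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
TABLE_NAME => 'STG.NIK_TABLE_SNOWPIPE_EMPLOYEE_TEST_PARQUET',
START_TIME => DATEADD(hour, -24, CURRENT_TIMESTAMP())
))
WHERE first_error_message IS NOT NULL;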
Snowpipe Success and Error Notifications
Snowpipe doesn’t natively support real-time success notifications. Our workarounds included:
- JOINs on COPY_HISTORY + STAGE metadata to identify unmatched files.
- An SQS-based notification pipeline to alert on pending or failed files.
For errors, Snowpipe does allow SNS hooks. We subscribed email and Slack endpoints to stay on top of issues—especially critical during production cutovers.
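As a rough sketch of that error-notification wiring (the integration name and ARNs below are placeholders, not our actual resources), the pattern looks like this:
-- Outbound notification integration pointing at an SNS topic (placeholder ARNs)
CREATE OR REPLACE NOTIFICATION INTEGRATION nik_snowpipe_error_int
ENABLED = TRUE
TYPE = QUEUE
NOTIFICATION_PROVIDER = AWS_SNS
DIRECTION = OUTBOUND
AWS_SNS_TOPIC_ARN = 'arn:aws:sns:us-east-1:111111111111:snowpipe-errors'
AWS_SNS_ROLE_ARN = 'arn:aws:iam::111111111111:role/snowpipe-error-role';
-- Attach the integration to the pipe so load errors are pushed to SNS
ALTER PIPE stg.nik_snowpipe_employee_test_parquet
SET ERROR_INTEGRATION = nik_snowpipe_error_int;
Email and Slack consumers then subscribe to the SNS topic on the AWS side.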
Historical Data Handling in Snowpipe
Snowpipe only supports up to 7 days of metadata for a refresh. For older data:
- We used manual COPY statements from external stages.
- Built a templated query system to ingest specific files as needed.
- Planned to automate this via Airflow + S3 inventory.
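For files older than the 7-day refresh window, a manual backfill is just a targeted COPY against the same stage. The path and PATTERN below are illustrative, not our actual layout:
-- Sketch: manual backfill for historical partitions (illustrative path and pattern)
COPY INTO stg.nik_table_snowpipe_employee_test_parquet
FROM @stg.nik_stage_snowpipe_employee_test_parquet/latest/part_year=2024
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
PATTERN = '.*part_month=1[0-2]/.*[.]parquet'
FORCE = FALSE; -- FALSE skips files already recorded in load metadata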
Lessons Learned and Final Thoughts
Key Lessons from Running Snowpipe in Production
- Schema evolution support saved us significant manual effort. By enabling ENABLE_SCHEMA_EVOLUTION = TRUE, we ensured that any newly added fields in the incoming files would be reflected in the target table without needing constant engineering intervention. This made the pipeline flexible and aligned with agile data development practices.
- Clone tables helped maintain data consistency while preserving access rights. Instead of overwriting production tables directly after deduplication, we created clones to reflect the cleaned data. This approach provided a snapshot of trusted data without breaking existing grants or roles. Clone tables in Snowflake inherit access privileges by default, which meant consumers could continue querying without disruption, even as the underlying data evolved. It also gives us a consistent, query-optimized view of cleaned data.
- Access control must be granular to allow tasks to function independently.
- Error handling must be centralized, not scattered.
- Dedup logic must evolve with partition changes.
And most importantly: never assume append-only ingestion is enough if your upstream system isn’t truly append-only.
Final Thoughts: Scaling Snowpipe Successfully
What started as a standard Snowpipe implementation evolved into a robust ingestion architecture. We didn’t just load data—we added safety nets, observability layers, schema flexibility, and logic that respects the realities of modern data pipelines.
The key takeaway? Snowpipe remains a powerful ingestion mechanism (though in 2025 it is starting to face serious competition from Apache Iceberg), but to make it production-grade, you need thoughtful design and real-world validation. It wasn't just about getting data in; it was about making ingestion resilient, maintainable, and repeatable. What started as a POC is now a core part of our platform's ingestion fabric.
If you’re exploring ingestion at scale in Snowflake, especially for frequently refreshed datasets, we hope this guide helps you build not just a pipeline—but a platform.
Stay tuned !!
#Snowpipe #Snowflake #StreamingData #RealTimeIngestion #SnowflakeStreams #SnowflakeTasks #S3 #DataEngineering #DataPipeline #DataIngestion #SchemaEvolution #SnowflakeMonitoring #SnowflakeTutorial