Data Engineering·June 25, 2026·13 min read

AWS Athena Query Optimization: Scan Less, Pay Less

Q: How is Amazon Athena priced?

On the standard model, Athena charges about $5 per terabyte of data scanned, rounded up to the nearest 10 MB with a 10 MB minimum per query. DDL statements and failed queries are free, and identical re-runs can be served from cache at no cost via query result reuse. There's also a provisioned-capacity option billed per DPU-hour for steady, heavy workloads.

AWS Athena query optimization comes down to one thing: scanning less data. How partitioning, Parquet, bucketing and projection cut what you scan, and the bill.

Amazon AthenaQuery OptimizationCost OptimizationPartitioningParquetAWS

Amazon Athena makes it easy to forget you're paying for anything. There's no cluster to size, nothing left running overnight, so the cost stays invisible until the monthly bill lands. And when a query costs more than it should, it's almost never because the SQL was clever or complicated. It's because the query read far more data than it had to.

Which is the whole secret to AWS Athena query optimization. Athena prices on one thing for most teams: the volume of data a query scans, at roughly $5 per terabyte. Not the rows you get back. Not how long the query runs. The bytes it reads off S3. I run Athena as the query layer over a production S3 data lake I built for a sales-engagement platform, and nearly every speed or cost win there comes back to the same move: make each query touch less data.

The one number that sets your Athena bill

On the standard pricing model, Athena bills the bytes a query scans, rounded up, with a 10 MB minimum per query. DDL statements are free. Failed queries are free. And if you re-run an identical query, query result reuse can serve it from cache for $0.

Athena charges for what it reads off S3, not what it returns. Every technique below is a way to shrink the highlighted box.

That framing changes how you think about a slow or pricey query. You're not tuning a database engine. You're trying to put less data in front of it. So the highest-leverage decisions all happen at the storage layer, before a single line of SQL runs. That is where the real wins live, and it's the part most guides rush past on the way to "use LIMIT."

Partition so queries skip what they don't need

Partitioning splits a table into folders on S3 by a column you filter on a lot, usually a date. When a query filters on that column, Athena reads only the matching folders and ignores the rest. A year of logs partitioned by day means a single-day query reads one folder instead of 365.

partition_and_prune.sql

-- Table partitioned by event date
CREATE EXTERNAL TABLE events (
  user_id string,
  action  string,
  payload string
)
PARTITIONED BY (event_date string)
STORED AS PARQUET
LOCATION 's3://my-lake/events/';
 
-- GOOD: filters the partition column, so Athena prunes to one day
SELECT action, count(*)
FROM events
WHERE event_date = '2026-06-23'   -- reads one partition, not the whole table
GROUP BY action;

The catch is that pruning only happens when you filter on the partition column itself. Filter on a normal column and Athena still scans everything, then throws most of it away. That single mistake is behind a huge share of "but I partitioned it, why is it still slow" tickets.

The other trap is the opposite of too little: partitioning so finely that you create millions of tiny files. Partition a low-volume table by hour and you get thousands of near-empty folders, and the overhead of listing them eats the savings. Partition by the grain your queries actually filter on, and no finer.

Store it as Parquet, not CSV

Partitioning decides which files Athena opens. Columnar storage decides how much of each file it reads. CSV and JSON are row formats, so a query for three columns still drags every column off disk. Parquet stores data by column, so Athena reads only the three you asked for and skips the rest. Add compression and the bytes scanned fall off a cliff.

Same data, three-column query	Stored size	Bytes scanned	Cost / query
CSV, uncompressed	~100 GB	~100 GB	~$0.50
CSV + gzip	~30 GB	~30 GB	~$0.15
Parquet + Snappy	~15 GB	~1–3 GB	~$0.005–0.015

Representative figures from published AWS and community benchmarks; your ratios depend on column count and data. The direction is the reliable part.

Converting is a one-statement job with CTAS, and you can partition on the way out:

csv_to_parquet.sql

CREATE TABLE events_parquet
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',
  partitioned_by = ARRAY['event_date'],
  external_location = 's3://my-lake/events-parquet/'
) AS
SELECT user_id, action, payload, event_date
FROM events_csv;

If you make one change from this whole article, make it this one. Switching a hot table from CSV to compressed Parquet routinely cuts data scanned by 90% or more, which means a query that cost dollars now costs cents, and runs faster while it's at it.

Keep files in the 128 MB–1 GB range

There's a quieter cost between partitioning and format: file size. Athena parallelises across files, so a handful of giant files leaves workers idle, and a swarm of tiny files buries the query in open-and-close overhead. Streaming and frequent micro-batches love to produce the second kind, thousands of few-kilobyte objects that are death by a thousand reads.

The sweet spot is files in the 128 MB to 1 GB range. If an upstream job is spraying tiny files, compact them on a schedule with a CTAS or INSERT INTO that rewrites a partition into fewer, larger Parquet files. It's unglamorous housekeeping, and it's often the difference between a query that feels instant and one that stalls for no obvious reason.

Bucketing: the lever most guides skip

Partitioning works when you filter on a low-cardinality column like a date. It falls apart on something like user_id, where partitioning would create a folder per user. That's what bucketing is for. You hash a high-cardinality column into a fixed number of buckets, and Athena can prune to the right bucket on a filter and line buckets up cheaply on a join.

bucketing.sql

CREATE TABLE orders_bucketed
WITH (
  format = 'PARQUET',
  bucketed_by = ARRAY['customer_id'],
  bucket_count = 64,
  external_location = 's3://my-lake/orders-bucketed/'
) AS
SELECT * FROM orders;

Use it for the columns you join or filter on that have too many distinct values to partition. Skip it when cardinality is low (partition instead) or when the table is small enough that it won't matter. And don't over-bucket: too many buckets recreates the small-files problem you just solved. Bucketing is a precision tool, not a default.

Partition projection: speed at scale, with one sharp edge

When a table has thousands of partitions, Athena spends real time just looking them up in the Glue Data Catalog before it can run anything. Partition projection removes that step. You describe the partition scheme in the table properties (a date range, an integer range, a set of values) and Athena calculates the partitions in memory instead of fetching them. On heavily partitioned tables it's the difference between a query that plans in seconds and one that plans in minutes.

partition_projection.sql

ALTER TABLE events SET TBLPROPERTIES (
  'projection.enabled'           = 'true',
  'projection.event_date.type'   = 'date',
  'projection.event_date.range'  = '2024-01-01,NOW',
  'projection.event_date.format' = 'yyyy-MM-dd',
  'storage.location.template'    = 's3://my-lake/events/${event_date}/'
);

Here's the edge nobody warns you about. Projection makes Athena trust your description of the partitions, so if you set the range wider than the data that actually exists, Athena will dutifully generate and probe partitions that aren't there. I've seen someone set a range ending in NOW+10YEARS "to be safe," and turn a fast table into a slow one, because every query now enumerates a decade of empty days. Set the range to the data you genuinely have, and widen it deliberately. Projection also fits poorly when partitions are sparse or irregular, since it assumes a predictable pattern. For those, stick with the catalog.

When a query is still slow: a quick diagnosis

When something's slow, resist the urge to rewrite the SQL first. Look at bytes scanned in the query history, then walk it back to the cause. Most slowdowns sort into four buckets.

Diagnose from bytes scanned, not from the SQL. The fix almost always lives in the storage layout, not the query text.

A few query-side habits help once the layout is right. Name your columns instead of SELECT *, so a columnar format can actually skip the rest. Put the larger table on the left of a join and the smaller on the right, so Athena distributes the small one. Reach for approx_distinct over COUNT(DISTINCT ...) when an estimate is fine. And turn on result reuse for dashboards that fire the same query all day. These are real, but they're the last 10%. The layout is the other 90%.

The anti-patterns that quietly cost you

Most wasted Athena spend isn't exotic. It's the same handful of habits, repeated across a lot of queries:

SELECT * in production. It cancels the whole point of columnar storage and reads every column, every run.
Filtering on a non-partition column and expecting pruning. Athena scans the full table and filters after.
Over-partitioning into millions of tiny files, so listing overhead costs more than the scan you saved.
Running MSCK REPAIR TABLE on a table with 100k+ partitions, where it crawls for an age. Use explicit ALTER TABLE ADD PARTITION, or switch to projection.
Projecting an absurd date range so Athena probes years of partitions that don't exist.
Ignoring result reuse for dashboards that re-run identical queries hundreds of times a day, paying full freight each time.

None of these need a re-architecture to fix. They need someone to look at what the queries are actually reading.

How I'd actually approach an Athena bill

The order matters more than the tricks. Measure first: pull the query history and sort by data scanned, because a small number of queries usually account for most of the cost. Fix the storage layout for those tables before touching any SQL. Then validate the only metric that pays the bill, bytes scanned, before and after.

I leaned on exactly this when I built a dynamic ETL pipeline for a sales-engagement SaaS, moving millions of records a day onto an S3 lake with Athena as the cheap, serverless query layer over it. What kept it cheap wasn't query tuning. It was the layout: Apache Hudi upserts so we rewrote only the rows that changed, Parquet on S3, and a partition scheme that matched how the data was actually queried. Data-prep time fell about 40%, and the query layer stayed cheap because nothing ever scanned more than it needed.

It's the same lesson from a 3-tier Glue pipeline I built for an environmental-analytics company: the win came from loading incrementally instead of reprocessing history, not from clever compute. Good incremental design and cheap querying turn out to be the same thing. If you want the upstream side of this, where the cost lives in the jobs rather than the queries, the AWS Glue cost guide is the companion piece.

Frequently asked questions

How is Amazon Athena priced?

On the standard model, Athena charges about $5 per terabyte of data scanned, rounded up to the nearest 10 MB with a 10 MB minimum per query. DDL statements and failed queries are free, and identical re-runs can be served from cache at no cost via query result reuse. There's also a provisioned-capacity option billed per DPU-hour for steady, heavy workloads.

Does partitioning actually reduce Athena cost?

Yes, when you filter on the partition column. Athena then reads only the matching folders instead of the whole table, so it scans fewer bytes and bills less. The two ways it goes wrong are filtering on a non-partition column (no pruning happens) and partitioning so finely that you drown in tiny files.

Is Parquet always better than CSV for Athena?

For analytics, almost always. Parquet is columnar and compressed, so a query reads only the columns it needs rather than every byte of every row, which typically cuts data scanned by 90% or more. The rare exception is a tiny, one-off dataset where the conversion isn't worth the effort.

When should I use Athena partition projection?

Use it when a table has thousands of partitions and a predictable scheme (a date or integer range), where catalog lookups have become the bottleneck. Avoid it for sparse or irregular partitions, and keep the projected range no wider than the data you actually have, or Athena will waste time probing partitions that don't exist.

Why is my Athena query slow even though the table is partitioned?

Usually one of three things: you're filtering on a column that isn't the partition key (so nothing prunes), the files are tiny or still in a row format like CSV, or a join is spilling. Start from the bytes-scanned figure in the query history and work back to which of those it is.

The takeaway

Athena's pricing is honest in a way that's easy to miss. It charges you for what you make it read. So almost every optimization is the same instinct wearing a different hat: read less. Partition so queries skip the irrelevant, store columnar so they read only what they ask for, keep files a sensible size, and the engine feels fast and cheap even on terabytes. Get the layout wrong and a serverless query service will still find a way to scan a fortune.

If your S3-and-Athena bill is climbing faster than your data is, that's the kind of work I do for a living. See how I package that work, or tell me about your setup and I'll take a look. For the layers either side of the query engine, the companion reads are cutting an AWS Glue bill by 50%+ and choosing and migrating a cloud data warehouse.

Mirza Hammad Tariq

AWS Data Engineer with 5+ years building production-grade ETL pipelines, cloud data warehouses, and scalable data architectures in Python, SQL, Dagster, and AWS.

Work With Me

Related case studies

Data EngineeringDelivered

Dynamic, Fully-Automated ETL Pipeline on AWS

100% API automation processing millions of records daily

Built a dynamic ETL pipeline on AWS for a sales-engagement SaaS platform, using Glue, Redshift, Apache Hudi and Athena to automate extraction of milli…

−40%automation

PythonPySparkAWSAWS Glue

View Case Study

Data EngineeringDelivered

Cloud-Native ETL for Environmental Analytics

A 3-tier AWS Glue pipeline for assessment, measurement & analysis data

Built a 3-tier data architecture on AWS for an environmental-solutions provider, using AWS Glue, S3, Redshift, Lambda and PySpark to lift data-process…

+30%AWS Glue

PythonAWS GlueAWS S3Amazon Redshift

View Case Study

Data EngineeringDelivered

Enterprise Data Warehouse Migration — Sybase to Amazon Redshift

Re-platforming a tier-1 banking EDW onto a cloud-native AWS warehouse

Led the migration of a major retail bank’s on-premises Enterprise Data Warehouse from Sybase to Amazon Redshift on AWS, re-engineering 800+ stored pro…

+40%Redshift compute

PythonSQLPL-SQLAmazon Redshift

View Case Study

Keep reading

Continue reading

Data EngineeringJun 17, 2026

AWS Glue Cost Optimization: Why Your Bill Exploded and How to Cut It

A data engineer's guide to AWS Glue cost optimization: why DPU billing causes bill shock, the 8 traps that inflate your spend, and how to cut Glue costs 50%+.

AWS GlueCost OptimizationData EngineeringETL

Read Article11 min read

Data EngineeringJun 17, 2026

Cloud Data Warehouse Migration: Snowflake vs Redshift vs BigQuery

A cloud data warehouse migration guide to Snowflake vs Redshift vs BigQuery vs Databricks: how to choose on cost, lock-in and performance, and how to de-risk the move.

Data WarehouseSnowflakeRedshiftBigQuery

Read Article14 min read

AI EngineeringJun 12, 2026

AI Cost Optimization: How We Cut a Document AI Bill by 99%

A practical AI cost optimization guide built on a real case study: how task-based model routing cut one document platform's AI spend by 99% and trims most LLM bills 30 to 50%.

AI CostsLLMCost OptimizationAI Strategy

Read Article12 min read

Taking on new projects · Outside IR35

Have a data pipeline or warehouse problem worth solving?

From messy source data to analytics-ready warehouses that cut cost. Let's scope it. I reply within one business day.

Start a Project Connect on LinkedIn