Amazon Athena is an interactive question service that makes it simple to research knowledge in Amazon Easy Storage Service (Amazon S3) utilizing customary SQL. Athena is serverless, so there isn’t any infrastructure to handle, and also you pay just for the queries that you just run.
Amazon Ion is a richly typed, self-describing, hierarchical knowledge serialization format providing interchangeable binary and textual content representations. The textual content format extends JSON (that means all JSON recordsdata are legitimate Ion recordsdata), and is simple to learn and creator, supporting fast prototyping. The binary illustration is environment friendly to retailer, transmit, and skip-scan parse. The wealthy kind system gives unambiguous semantics for long-term preservation of knowledge that may survive a number of generations of software program evolution.
Athena now helps querying and writing knowledge in Ion format. The Ion format is at present utilized by inner Amazon groups, by exterior providers comparable to Amazon Quantum Ledger Database (Amazon QLDB) and Amazon DynamoDB (which may be exported into Ion), and within the open-source SQL question language PartiQL.
On this put up, we focus on use circumstances and the distinctive options Ion gives, adopted by examples of querying Ion with Athena. For demonstration functions, we use the remodeled model of the Metropolis Tons San Francisco dataset.
Options of Ion
On this part, we focus on a number of the distinctive options that Ion gives:
- Kind system
- Twin format
- Effectivity beneficial properties
- Skip scanning
Kind system
Ion extends JSON, including help for extra exact knowledge sorts to enhance interpretability, simplify processing, and keep away from rounding errors. These excessive precision numeric sorts are important for monetary providers, the place fractions of a cent on each transaction add up. Knowledge sorts which are added are arbitrary-size integers, binary floating-point numbers, infinite-precision decimals, timestamps, CLOBS, and BLOBS.
Twin format
Customers may be introduced with a well-known text-based illustration whereas benefiting from the efficiency efficiencies of a binary format. The interoperability between the 2 codecs allows you to quickly uncover, digest, and interpret knowledge in a well-known JSON-like illustration, whereas underlying functions profit from a discount in storage, reminiscence, community bandwidth, and latency from the binary format. This implies you’ll be able to write plain textual content queries that run towards each text-based and binary-based Ion. You possibly can rewrite elements of your knowledge in text-based Ion whenever you want human readable knowledge throughout growth and change to binary in manufacturing.
When debugging a course of, the power for methods engineers to find knowledge and perceive it as shortly as attainable is significant. Ion gives mechanisms to maneuver between binary and a text-based illustration, optimizing for each the human and the machine. Athena helps querying and writing knowledge in each of those Ion codecs. The next is an instance Ion textual content doc taken from the remodeled model of the citylots
dataset:
Effectivity beneficial properties
Binary-encoded Ion reduces file measurement by shifting repeated values, comparable to discipline names, right into a image desk. Image tables scale back CPU and browse latency by limiting the validation of character encoding to the one occasion of the worth within the image desk.
For instance, an organization that operates at Amazon’s scale can produce massive volumes of software logs. When compressing Ion and JSON logs, we observed roughly 35% much less CPU time to compress the log, which produced a median of roughly 26% smaller recordsdata. Log recordsdata are vital when wanted however expensive to retain, so the discount in file sizes mixed with the learn efficiency beneficial properties from image tables helps when dealing with these logs. The next is an instance of file measurement discount with the citylots
JSON dataset when transformed to Ion binary with GZIP and ZSTD compression:
Skip-scanning
In a textual format, each byte have to be learn and interpreted, however as a result of Ion’s binary format is a TLV (type-length-value) encoding, an software could skip over parts that aren’t wanted. This reduces question and software processing prices correlated with the proportion of unexamined fields.
For instance, forensic evaluation of software log knowledge includes studying massive volumes of knowledge the place solely a fraction of the info is required for prognosis. In these situations, skip-scanning permits the binary Ion reader to maneuver previous irrelevant fields with out the price of studying the ingredient saved inside a discipline. This ends in customers experiencing decrease useful resource utilization and faster response instances.
Question Ion datasets utilizing Athena
Athena now helps querying and creating Ion-formatted datasets through an Ion-specific SerDe, which along side IonInputFormat
and IonOutputFormat
means that you can learn and write legitimate Ion knowledge. Deserialization means that you can run SELECT queries on the Ion knowledge in order that it may be queried to realize insights. Serialization by way of CTAS or INSERT INTO queries means that you can copy datasets from present tables’ values or generate new knowledge within the Ion format.
The interchangeable nature of Ion textual content and Ion binary implies that Athena can learn datasets that comprise each varieties of recordsdata. As a result of Ion is a superset of JSON, a desk utilizing the Ion SerDe can even embody JSON recordsdata. Not like the JSON SerDe, the place each new line character signifies a brand new row, the Ion SerDe makes use of a mix of closing brackets and new line characters to find out new rows. Which means that if every JSON file in your supply paperwork isn’t on a single line, these recordsdata can now be learn in Athena through the Ion SerDe.
Create exterior tables
Athena helps querying Ion-based datasets by defining AWS Glue tables with the user-defined metadata. Let’s begin with an instance of making an exterior desk for a dataset saved in Ion textual content. The next is a pattern row from the citylots
dataset:
To create an exterior desk that has its knowledge saved in Ion, you’ve two syntactic choices.
First, you’ll be able to specify STORED AS ION. This can be a extra concise methodology, and is greatest used for easy circumstances, when no extra properties are required. See the next code:
Alternatively, you’ll be able to explicitly specify the Ion classpaths in ROW FORMAT SERDE, INPUTFORMAT, and OUTPUTFORMAT. Not like the primary methodology, you’ll be able to specify a SERDEPROPERTIES clause right here. In our instance DDL, we added a SerDe property that enables values which are outdoors of the Hive knowledge kind ranges to overflow slightly than fail the question:
Athena converts STORED AS ION into the express classpaths, so each tables look comparable within the metastore. If we glance in AWS Glue, we see each tables we simply created have the identical enter format, output format, and SerDe serialization library.
Now that our desk is created, we are able to run customary SELECT queries on the city_lots_ion
desk. Let’s run a question that specifies the block_num
from our instance row of Ion knowledge to confirm that we are able to learn from the desk:
The next screenshot exhibits our outcomes.
Use path extraction to learn from particular fields
Athena helps additional customization of how knowledge is interpreted through SerDe properties. To specify these, you’ll be able to add a WITH SERDEPROPERTIES clause, which is a subfield of the ROW FORMAT SERDE discipline.
In some conditions, we could solely care about some elements of the data. Let’s suppose we don’t need any of the geometry data from the citylots
dataset, and solely want a couple of of the fields in properties. One resolution is to specify a search path utilizing the trail extractor SerDe property:
Path extractors are search paths that Athena makes use of to map the desk columns to places within the particular person doc. Full info on what may be achieved with path extractors is obtainable on GitHub, however for our instance, we concentrate on creating easy paths that use the names of every discipline as an index. On this case, the search path takes the type of a space-delimited set of indexes (and wraps it with parentheses) that point out the situation of every desired piece of knowledge. We map the search paths to desk columns through the use of the trail extractor property.
By default, Athena builds path extractors dynamically based mostly on column names except overridden. Which means that after we run our SELECT question on our city_lots_ion1
desk, Athena builds the next search paths:
Assuming we solely care in regards to the block and lot info from the properties struct, and the geometry kind from the geometry struct, we are able to construct search paths that map the specified fields from the row of knowledge to desk columns. First let’s construct the search paths:
Now let’s map these search paths to desk columns utilizing the trail extractor SerDe property. As a result of the search paths specify the place to search for knowledge, we’re in a position to flatten and rename our datasets to higher serve our objective. For this instance, let’s rename the mapblklot
discipline to map_block_lot
, blklot
to block_lot
, and the geometry kind to form:
Let’s put all of this collectively and create the city_blocks
desk:
Now we are able to run a choose question on the city_blocks
desk, and see the outcomes:
Using search paths on this means allows skip-scan parsing when studying from Ion binary recordsdata, which permits Athena to skip over the unneeded fields and reduces the general time it takes to run the question.
Use CTAS and UNLOAD for knowledge transformation
Athena helps CREATE TABLE AS SELECT (CTAS), which creates a brand new desk in Athena from the outcomes of a SELECT assertion from one other question. Athena additionally helps UNLOAD, which writes question outcomes to Amazon S3 from a SELECT assertion to the desired knowledge format.
Each CTAS and UNLOAD have a property to specify a format and a compression kind. This lets you simply convert Ion datasets to different knowledge codecs, comparable to Parquet or ORC, and vice versa, without having to arrange a fancy extract, remodel, and cargo (ETL) job. That is useful for conditions whenever you need to remodel your knowledge, or know you’ll run repeated queries on a subset of your knowledge and need to use a number of the advantages inherent to columnar codecs. Combining it with path extractors is very useful, as a result of we’re solely storing the info that we’d like within the new format.
Let’s use CTAS to transform our city_blocks
desk from Ion to Parquet, and compress it through GZIP. As a result of we’ve path extractors arrange for the city_blocks
desk, we solely have to convert a small portion of the unique dataset:
We will now run queries towards the city_block_parquet_gzip
desk, and may see the identical end result. To check this out, let’s run the identical SELECT question we ran earlier than on the Parquet desk:
When changing tables from one other format to Ion, Athena helps the next compression codecs: ZSTD, BZIP2, GZIP, SNAPPY, and NONE. Along with including Ion as a brand new format for CTAS, we added the ion_encoding
property, which lets you select whether or not the output recordsdata are created in Ion textual content or Ion binary. This permits for serialization of knowledge from different codecs again into Ion.
Let’s convert the unique city_lots
JSON file again to Ion, however this time we specify that we need to use ZSTD compression and a binary encoding.
The JSON file may be discovered at following location: s3://aws-bigdata-blog/artifacts/athena-ion-blog/city_lots_json/
As a result of Ion is a superset of JSON, we are able to use the Ion SerDe to learn this file:
Now let’s copy this desk into our desired Ion binary kind:
Lastly, let’s run our verification SELECT assertion to confirm every thing was created correctly:
Use UNLOAD to retailer Ion knowledge in Amazon S3
Typically we simply need to reformat the info and don’t have to retailer the extra metadata to question the desk. On this case, we are able to use UNLOAD, which shops the outcomes of the question within the specified format in an S3 bucket.
Let’s check it out, utilizing UNLOAD to transform the drivers_names
desk from Ion to ORC, compress it through ZLIB, and retailer it to an S3 bucket:
If you test in Amazon S3, you’ll find a brand new file within the ORC format.
Conclusion
This put up talked in regards to the new function in Athena that means that you can question and create Ion datasets utilizing customary SQL. We mentioned use circumstances and distinctive options of the Ion format like kind system, twin codecs (Ion textual content and Ion binary), effectivity beneficial properties, and skip-scanning. You may get began with querying an Ion dataset saved in Amazon S3 by merely making a desk in Athena, and likewise changing present datasets to Ion format and vice versa utilizing CTAS and UNLOAD statements.
To be taught extra about querying Ion utilizing Athena, discuss with Amazon Ion Hive SerDe.
References
In regards to the Authors
Pathik Shah is a Sr. Huge Knowledge Architect on Amazon Athena. He joined AWS in 2015 and has been focusing within the massive knowledge analytics house since then, serving to prospects construct scalable and sturdy options utilizing AWS analytics providers.
Jacob Stein works on the Amazon Athena group as a Software program Improvement Engineer. He led the challenge so as to add help for Ion in Athena. He loves engaged on technical issues distinctive to web scale knowledge, and is obsessed with growing scalable options for distributed methods.
Giovanni Matteo Fumarola is the Engineering Supervisor of the Athena Knowledge Lake and Storage group. He’s an Apache Hadoop Committer and PMC member. He has been focusing within the massive knowledge analytics house since 2013.
Pete Ford is a Sr. Technical Program Supervisor at Amazon.