Saturday, August 20, 2022
HomeBig DataAnalyze Amazon Ion datasets utilizing Amazon Athena

Analyze Amazon Ion datasets utilizing Amazon Athena


Amazon Athena is an interactive question service that makes it simple to research knowledge in Amazon Easy Storage Service (Amazon S3) utilizing customary SQL. Athena is serverless, so there isn’t any infrastructure to handle, and also you pay just for the queries that you just run.

Amazon Ion is a richly typed, self-describing, hierarchical knowledge serialization format providing interchangeable binary and textual content representations. The textual content format extends JSON (that means all JSON recordsdata are legitimate Ion recordsdata), and is simple to learn and creator, supporting fast prototyping. The binary illustration is environment friendly to retailer, transmit, and skip-scan parse. The wealthy kind system gives unambiguous semantics for long-term preservation of knowledge that may survive a number of generations of software program evolution.

Athena now helps querying and writing knowledge in Ion format. The Ion format is at present utilized by inner Amazon groups, by exterior providers comparable to Amazon Quantum Ledger Database (Amazon QLDB) and Amazon DynamoDB (which may be exported into Ion), and within the open-source SQL question language PartiQL.

On this put up, we focus on use circumstances and the distinctive options Ion gives, adopted by examples of querying Ion with Athena. For demonstration functions, we use the remodeled model of the Metropolis Tons San Francisco dataset.

Options of Ion

On this part, we focus on a number of the distinctive options that Ion gives:

  • Kind system
  • Twin format
  • Effectivity beneficial properties
  • Skip scanning

Kind system

Ion extends JSON, including help for extra exact knowledge sorts to enhance interpretability, simplify processing, and keep away from rounding errors. These excessive precision numeric sorts are important for monetary providers, the place fractions of a cent on each transaction add up. Knowledge sorts which are added are arbitrary-size integers, binary floating-point numbers, infinite-precision decimals, timestamps, CLOBS, and BLOBS.

Twin format

Customers may be introduced with a well-known text-based illustration whereas benefiting from the efficiency efficiencies of a binary format. The interoperability between the 2 codecs allows you to quickly uncover, digest, and interpret knowledge in a well-known JSON-like illustration, whereas underlying functions profit from a discount in storage, reminiscence, community bandwidth, and latency from the binary format. This implies you’ll be able to write plain textual content queries that run towards each text-based and binary-based Ion. You possibly can rewrite elements of your knowledge in text-based Ion whenever you want human readable knowledge throughout growth and change to binary in manufacturing.

When debugging a course of, the power for methods engineers to find knowledge and perceive it as shortly as attainable is significant. Ion gives mechanisms to maneuver between binary and a text-based illustration, optimizing for each the human and the machine. Athena helps querying and writing knowledge in each of those Ion codecs. The next is an instance Ion textual content doc taken from the remodeled model of the citylots dataset:

{ "kind": "Characteristic"
, "properties": { "MAPBLKLOT": "0004002"
                 ,"BLKLOT": "0004002"
                 ,"BLOCK_NUM": "0004"
                 , "LOT_NUM": "002"
                 , "FROM_ST": "0"
                 , "TO_ST": "0"
                 , "STREET": "UNKNOWN"
                 , "ST_TYPE": null
                 , "ODD_EVEN": "E" }
, "geometry": { "kind": "Polygon"
               , "coordinates": [ [ [ -122.415701204606876, 37.808327252671461, 0.0 ],
                                    [ -122.415760743593196, 37.808630700240904, 0.0 ],
                                    [ -122.413787891332404, 37.808566801319841, 0.0 ],
                                    [ -122.415701204606876, 37.808327252671461, 0.0 ] ] ] } }

Effectivity beneficial properties

Binary-encoded Ion reduces file measurement by shifting repeated values, comparable to discipline names, right into a image desk. Image tables scale back CPU and browse latency by limiting the validation of character encoding to the one occasion of the worth within the image desk.

For instance, an organization that operates at Amazon’s scale can produce massive volumes of software logs. When compressing Ion and JSON logs, we observed roughly 35% much less CPU time to compress the log, which produced a median of roughly 26% smaller recordsdata. Log recordsdata are vital when wanted however expensive to retain, so the discount in file sizes mixed with the learn efficiency beneficial properties from image tables helps when dealing with these logs. The next is an instance of file measurement discount with the citylots JSON dataset when transformed to Ion binary with GZIP and ZSTD compression:

77MB    citylots.ion
 17MB    citylots.ion.gz
 15MB    citylots.ion.zst
181MB    citylots.json
 22MB    citylots.json.gz
 18MB    citylots.json.zst

Skip-scanning

In a textual format, each byte have to be learn and interpreted, however as a result of Ion’s binary format is a TLV (type-length-value) encoding, an software could skip over parts that aren’t wanted. This reduces question and software processing prices correlated with the proportion of unexamined fields.

For instance, forensic evaluation of software log knowledge includes studying massive volumes of knowledge the place solely a fraction of the info is required for prognosis. In these situations, skip-scanning permits the binary Ion reader to maneuver previous irrelevant fields with out the price of studying the ingredient saved inside a discipline. This ends in customers experiencing decrease useful resource utilization and faster response instances.

Question Ion datasets utilizing Athena

Athena now helps querying and creating Ion-formatted datasets through an Ion-specific SerDe, which along side IonInputFormat and IonOutputFormat means that you can learn and write legitimate Ion knowledge. Deserialization means that you can run SELECT queries on the Ion knowledge in order that it may be queried to realize insights. Serialization by way of CTAS or INSERT INTO queries means that you can copy datasets from present tables’ values or generate new knowledge within the Ion format.

The interchangeable nature of Ion textual content and Ion binary implies that Athena can learn datasets that comprise each varieties of recordsdata. As a result of Ion is a superset of JSON, a desk utilizing the Ion SerDe can even embody JSON recordsdata. Not like the JSON SerDe, the place each new line character signifies a brand new row, the Ion SerDe makes use of a mix of closing brackets and new line characters to find out new rows. Which means that if every JSON file in your supply paperwork isn’t on a single line, these recordsdata can now be learn in Athena through the Ion SerDe.

Create exterior tables

Athena helps querying Ion-based datasets by defining AWS Glue tables with the user-defined metadata. Let’s begin with an instance of making an exterior desk for a dataset saved in Ion textual content. The next is a pattern row from the citylots dataset:

{
    kind:"Characteristic",
    properties:{
        mapblklot:"0579021",
        blklot:"0579024",
        block_num:"0579",
        lot_num:"024",
        from_st:"2160",
        to_st:"2160",
        avenue:"PACIFIC",
        st_type:"AVE",
        odd_even:"E"
    },
    geometry:{
        kind:"Polygon",coordinates:[[[-122.4308798855922, ...]]]
    }
}

To create an exterior desk that has its knowledge saved in Ion, you’ve two syntactic choices.

First, you’ll be able to specify STORED AS ION. This can be a extra concise methodology, and is greatest used for easy circumstances, when no extra properties are required. See the next code:

CREATE EXTERNAL TABLE city_lots_ion1 (
  kind STRING, 
  properties struct<
    mapblklot:string,
    blklot:string,
    block_num:string,
    lot_num:string,
    from_st:string,
    to_st:string,
    avenue:string,
    st_type:string,
    odd_even:string>, 
  geometry struct<
    kind:string,
    coordinates:array<array<array<decimal(18,15)>>>,
    multi_coordinates:array<array<array<array<decimal(18,15)>>>>>
)
STORED AS ION
LOCATION 's3://aws-bigdata-blog/artifacts/athena-ion-blog/city_lots_ion_binary/'

Alternatively, you’ll be able to explicitly specify the Ion classpaths in ROW FORMAT SERDE, INPUTFORMAT, and OUTPUTFORMAT. Not like the primary methodology, you’ll be able to specify a SERDEPROPERTIES clause right here. In our instance DDL, we added a SerDe property that enables values which are outdoors of the Hive knowledge kind ranges to overflow slightly than fail the question:

CREATE EXTERNAL TABLE city_lots_ion2(
  kind STRING, 
  properties struct<
    mapblklot:string,
    blklot:string,
    block_num:string,
    lot_num:string,
    from_st:string,
    to_st:string,
    avenue:string,
    st_type:string,
    odd_even:string>, 
  geometry struct<
    kind:string,
    coordinates:array<array<array<decimal(18,15)>>>,
    multi_coordinates:array<array<array<array<decimal(18,15)>>>>>
)
ROW FORMAT SERDE 
  'com.amazon.ionhiveserde.IonHiveSerDe'
WITH SERDEPROPERTIES (
 'ion.fail_on_overflow'='false'
 )
STORED AS INPUTFORMAT 
  'com.amazon.ionhiveserde.codecs.IonInputFormat' 
OUTPUTFORMAT 
  'com.amazon.ionhiveserde.codecs.IonOutputFormat'
LOCATION
  's3://aws-bigdata-blog/artifacts/athena-ion-blog/city_lots_ion_binary/'

Athena converts STORED AS ION into the express classpaths, so each tables look comparable within the metastore. If we glance in AWS Glue, we see each tables we simply created have the identical enter format, output format, and SerDe serialization library.

Now that our desk is created, we are able to run customary SELECT queries on the city_lots_ion desk. Let’s run a question that specifies the block_num from our instance row of Ion knowledge to confirm that we are able to learn from the desk:

-- QUERY
SELECT * FROM city_lots_ion1 WHERE properties.block_num='0579';

The next screenshot exhibits our outcomes.

Use path extraction to learn from particular fields

Athena helps additional customization of how knowledge is interpreted through SerDe properties. To specify these, you’ll be able to add a WITH SERDEPROPERTIES clause, which is a subfield of the ROW FORMAT SERDE discipline.

In some conditions, we could solely care about some elements of the data. Let’s suppose we don’t need any of the geometry data from the citylots dataset, and solely want a couple of of the fields in properties. One resolution is to specify a search path utilizing the trail extractor SerDe property:

-- Path Extractor property
ion.<column>.path_extractor = <search path>

Path extractors are search paths that Athena makes use of to map the desk columns to places within the particular person doc. Full info on what may be achieved with path extractors is obtainable on GitHub, however for our instance, we concentrate on creating easy paths that use the names of every discipline as an index. On this case, the search path takes the type of a space-delimited set of indexes (and wraps it with parentheses) that point out the situation of every desired piece of knowledge. We map the search paths to desk columns through the use of the trail extractor property.

By default, Athena builds path extractors dynamically based mostly on column names except overridden. Which means that after we run our SELECT question on our city_lots_ion1 desk, Athena builds the next search paths:

Default Extractors generated by Athena for city_lots_ion1.
-- Extracts the 'kind' discipline to the 'kind' column
    'ion.kind.path_extractor' = '(kind)'

-- Extracts the 'properties' discipline to the 'properties' column
    'ion.properties.path_extractor' = '(properties)'

-- Extracts the 'geometry' discipline to the 'geometry' column
    'ion.geometry.path_extractor' = '(geometry)'

Assuming we solely care in regards to the block and lot info from the properties struct, and the geometry kind from the geometry struct, we are able to construct search paths that map the specified fields from the row of knowledge to desk columns. First let’s construct the search paths:

(properties mapblklot) - Search path for the mapblklot discipline within the properties struct
(properties blklot) - Search path for the blklot discipline within the properties struct
(properties block_num) - Search path for the block_num discipline within the properties struct
(properties lot_num) - Search path for the lot_num discipline within the properties struct
(geometry kind) - Search path for the kind discipline within the geometry struct

Now let’s map these search paths to desk columns utilizing the trail extractor SerDe property. As a result of the search paths specify the place to search for knowledge, we’re in a position to flatten and rename our datasets to higher serve our objective. For this instance, let’s rename the mapblklot discipline to map_block_lot, blklot to block_lot, and the geometry kind to form:

 'ion.map_block_lot.path_extractor' = '(properties mapblklot)'
 'ion.block_lot.path_extractor' = '(properties blklot)'
 'ion.block_num.path_extractor' = '(properties block_num)'
 'ion.lot_num.path_extractor' = '(properties lot_num)'
 'ion.form.path_extractor' = '(geometry kind)'

Let’s put all of this collectively and create the city_blocks desk:

CREATE EXTERNAL TABLE city_blocks (
    map_block_lot STRING,
    block_lot STRING,
    block_num STRING,
    lot_num STRING,
    form STRING
)
ROW FORMAT SERDE
 'com.amazon.ionhiveserde.IonHiveSerDe'
WITH SERDEPROPERTIES (
 'ion.map_block_lot.path_extractor' = '(properties mapblklot)',
 'ion.block_lot.path_extractor' = '(properties blklot)', 
 'ion.block_num.path_extractor' = '(properties block_num)',
 'ion.lot_num.path_extractor' = '(properties lot_num)',
 'ion.form.path_extractor' = '(geometry kind)'
 )
STORED AS ION
LOCATION 's3://aws-bigdata-blog/artifacts/athena-ion-blog/city_lots_ion_binary/'

Now we are able to run a choose question on the city_blocks desk, and see the outcomes:

-- Choose Question
SELECT * FROM city_blocks WHERE block_num='0579';

Using search paths on this means allows skip-scan parsing when studying from Ion binary recordsdata, which permits Athena to skip over the unneeded fields and reduces the general time it takes to run the question.

Use CTAS and UNLOAD for knowledge transformation

Athena helps CREATE TABLE AS SELECT (CTAS), which creates a brand new desk in Athena from the outcomes of a SELECT assertion from one other question. Athena additionally helps UNLOAD, which writes question outcomes to Amazon S3 from a SELECT assertion to the desired knowledge format.

Each CTAS and UNLOAD have a property to specify a format and a compression kind. This lets you simply convert Ion datasets to different knowledge codecs, comparable to Parquet or ORC, and vice versa, without having to arrange a fancy extract, remodel, and cargo (ETL) job. That is useful for conditions whenever you need to remodel your knowledge, or know you’ll run repeated queries on a subset of your knowledge and need to use a number of the advantages inherent to columnar codecs. Combining it with path extractors is very useful, as a result of we’re solely storing the info that we’d like within the new format.

Let’s use CTAS to transform our city_blocks desk from Ion to Parquet, and compress it through GZIP. As a result of we’ve path extractors arrange for the city_blocks desk, we solely have to convert a small portion of the unique dataset:

CREATE TABLE city_blocks_parquet_gzip
WITH (format="PARQUET", write_compression='GZIP')
AS SELECT * FROM city_blocks;

We will now run queries towards the city_block_parquet_gzip desk, and may see the identical end result. To check this out, let’s run the identical SELECT question we ran earlier than on the Parquet desk:

SELECT * FROM city_blocks_parquet_gzip WHERE block_num='0579';

When changing tables from one other format to Ion, Athena helps the next compression codecs: ZSTD, BZIP2, GZIP, SNAPPY, and NONE. Along with including Ion as a brand new format for CTAS, we added the ion_encoding property, which lets you select whether or not the output recordsdata are created in Ion textual content or Ion binary. This permits for serialization of knowledge from different codecs again into Ion.

Let’s convert the unique city_lots JSON file again to Ion, however this time we specify that we need to use ZSTD compression and a binary encoding.

The JSON file may be discovered at following location: s3://aws-bigdata-blog/artifacts/athena-ion-blog/city_lots_json/

As a result of Ion is a superset of JSON, we are able to use the Ion SerDe to learn this file:

CREATE EXTERNAL TABLE city_blocks_json_ion_serde (
    map_block_lot STRING,
    block_lot STRING,
    block_num STRING,
    lot_num STRING,
    form STRING
)
ROW FORMAT SERDE
'com.amazon.ionhiveserde.IonHiveSerDe'
WITH SERDEPROPERTIES (
'ion.map_block_lot.path_extractor' = '(properties mapblklot)',
'ion.block_lot.path_extractor' = '(properties blklot)',
'ion.block_num.path_extractor' = '(properties block_num)',
'ion.lot_num.path_extractor' = '(properties lot_num)',
'ion.form.path_extractor' = '(geometry kind)'
)
STORED AS ION
LOCATION 's3://aws-bigdata-blog/artifacts/athena-ion-blog/city_lots_json/'

Now let’s copy this desk into our desired Ion binary kind:

CREATE TABLE city_blocks_ion_zstd
WITH (format="ION", write_compression='ZSTD', ion_encoding='BINARY')
AS SELECT * FROM city_blocks_parquet_gzip

Lastly, let’s run our verification SELECT assertion to confirm every thing was created correctly:

SELECT * FROM city_blocks_ion_zstd WHERE block_num='0579'; 

Use UNLOAD to retailer Ion knowledge in Amazon S3

Typically we simply need to reformat the info and don’t have to retailer the extra metadata to question the desk. On this case, we are able to use UNLOAD, which shops the outcomes of the question within the specified format in an S3 bucket.

Let’s check it out, utilizing UNLOAD to transform the drivers_names desk from Ion to ORC, compress it through ZLIB, and retailer it to an S3 bucket:

UNLOAD (SELECT * FROM city_blocks_ion_zstd WHERE block_num='0579') 
TO 's3://<your-s3-bucket>/athena-ion-blog/unload/orc_zlib/'
WITH (format="ORC", compression='ZLIB')

If you test in Amazon S3, you’ll find a brand new file within the ORC format.

Conclusion

This put up talked in regards to the new function in Athena that means that you can question and create Ion datasets utilizing customary SQL. We mentioned use circumstances and distinctive options of the Ion format like kind system, twin codecs (Ion textual content and Ion binary), effectivity beneficial properties, and skip-scanning. You may get began with querying an Ion dataset saved in Amazon S3 by merely making a desk in Athena, and likewise changing present datasets to Ion format and vice versa utilizing CTAS and UNLOAD statements.

To be taught extra about querying Ion utilizing Athena, discuss with Amazon Ion Hive SerDe.

References


In regards to the Authors

Pathik Shah is a Sr. Huge Knowledge Architect on Amazon Athena. He joined AWS in 2015 and has been focusing within the massive knowledge analytics house since then, serving to prospects construct scalable and sturdy options utilizing AWS analytics providers.

Jacob Stein works on the Amazon Athena group as a Software program Improvement Engineer. He led the challenge so as to add help for Ion in Athena. He loves engaged on technical issues distinctive to web scale knowledge, and is obsessed with growing scalable options for distributed methods.

Giovanni Matteo Fumarola is the Engineering Supervisor of the Athena Knowledge Lake and Storage group. He’s an Apache Hadoop Committer and PMC member. He has been focusing within the massive knowledge analytics house since 2013.

Pete Ford is a Sr. Technical Program Supervisor at Amazon.

RELATED ARTICLES

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Most Popular

Recent Comments