Thursday, July 7, 2022

Optimizing Hive on Tez Performance


Tuning Hive on Tez queries can never be done with a one-size-fits-all approach. The performance of queries depends on the size of the data, file types, query design, and query patterns. During performance testing, evaluate and validate configuration parameters and any SQL modifications. It is advisable to make one change at a time during performance testing of the workload, and it is best to assess the impact of tuning changes in your development and QA environments before using them in production environments. Cloudera WXM can assist in evaluating the benefits of query changes during performance testing.

Tuning Guidelines

It has been observed across several migrations from CDH distributions to CDP Private Cloud that Hive on Tez queries tend to perform slower compared to older execution engines like MR or Spark. This is usually caused by differences in out-of-the-box tuning behavior between the different execution engines. Additionally, users may have completed tuning in the legacy distribution that is not automatically reflected in the conversion to Hive on Tez. For users upgrading from an HDP distribution, this discussion also helps to review and validate whether the properties are correctly configured for performance in CDP.

The steps below help you identify the areas to focus on that might degrade performance.

Step 1: Verify and validate the YARN Capacity Scheduler configurations. A misconfigured queue configuration can impact query performance due to an arbitrary cap on the resources available to the user. Validate the user-limit factor, min-user-limit percent, and maximum capacity. (Refer to the YARN – The Capacity Scheduler blog to understand these configuration settings.)

Step 2: Review the relevance of any safety valves (the non-default values for Hive and HiveServer2 configurations) for Hive and Hive on Tez. Remove any legacy and outdated properties.

Step 3: Identify the area of slowness, such as map tasks, reduce tasks, and joins.

  1. Review the generic Tez engine and platform tunable properties.
  2. Review the map tasks and tune (increase or decrease) the task counts as required.
  3. Review the reduce tasks and tune (increase or decrease) the task counts as required.
  4. Review any concurrency-related issues. There are two types of concurrency issues:
    • Concurrency among users within a queue. This can be tuned using the user limit factor of the YARN queue (refer to the details in the Capacity Scheduler blog).
    • Concurrency across pre-warmed containers for Hive on Tez sessions, as discussed in detail below.

Understanding parallelization in Tez

Before changing any configurations, you must understand the mechanics of how Tez works internally. For example, this includes understanding how Tez determines the correct number of mappers and reducers. Reviewing the Tez architecture design and the details of how initial task parallelism and auto-reduce parallelism work will help you optimize query performance.

Understanding numbers of mappers

Tez determines the number of mapper tasks using the initial input data for the job. In Tez, the number of tasks is determined by the grouping splits, which is equivalent to the number of mappers determined by the input splits in MapReduce jobs.

  • tez.grouping.min-size and tez.grouping.max-size determine the number of mappers. The default value for min-size is 16 MB and for max-size is 1 GB.
  • Tez determines the number of tasks such that the data per task is in line with the grouping max/min size.
  • Decreasing tez.grouping.max-size increases the number of tasks/mappers.
  • Increasing tez.grouping.max-size decreases the number of tasks.
  • Consider the following example:
    • Input data (input shards/splits) – 1000 files (around 1.5 MB each)
    • Total data size would be 1000 * 1.5 MB = ~1.5 GB
    • Tez might try processing this data with at least two tasks because the maximum data per task could be 1 GB. Eventually, Tez could force 1000 files (splits) to be combined into two tasks, leading to slower execution times.
    • If tez.grouping.max-size is reduced from 1 GB to 100 MB, the number of mappers could increase to 15, providing better parallelism. Performance then improves because the work is spread across 15 concurrent tasks instead of two.
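
The arithmetic in the example above can be sketched in a few lines of Python. This is only a rough lower-bound estimate, assuming that the task count is driven by the max-size limit alone; the function name is hypothetical, and the real Tez split grouper also weighs min-size, available capacity, and data locality.

```python
import math

def estimate_grouped_tasks(total_mb: float, max_size_mb: float) -> int:
    """Rough lower bound on mapper tasks: no task may hold more than max_size_mb.

    Simplifying assumption: only tez.grouping.max-size is considered.
    """
    return max(1, math.ceil(total_mb / max_size_mb))

total_mb = 1000 * 1.5  # 1000 files of about 1.5 MB each

print(estimate_grouped_tasks(total_mb, 1024))  # tez.grouping.max-size = 1 GB
print(estimate_grouped_tasks(total_mb, 100))   # tez.grouping.max-size = 100 MB
```

With the 1 GB limit the estimate is two tasks; with the 100 MB limit it is 15, matching the figures in the example.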

The above is an example scenario; however, in a production environment where one uses binary file formats like ORC or Parquet, determining the number of mappers depending on storage type, split strategy file, or HDFS block boundaries could get complicated.

Note: A higher degree of parallelism (e.g. a high number of mappers/reducers) does not always translate to better performance, since it could lead to fewer resources per task and higher resource wastage due to task overhead.

Understanding the numbers of reducers

Tez uses a number of mechanisms and settings to determine the number of reducers required to complete a query.

  • Tez determines the reducers automatically based on the data (number of bytes) to be processed.
  • If hive.tez.auto.reducer.parallelism is set to true, Hive estimates data size and sets parallelism estimates. Tez will sample source vertices' output sizes and adjust the estimates at runtime as necessary.
  • By default the maximum number of reducers is set to 1009 (hive.exec.reducers.max).
  • Hive/Tez estimates the number of reducers using the following formula and then schedules the Tez DAG:

Max(1, Min(hive.exec.reducers.max [1009], ReducerStage estimate/hive.exec.reducers.bytes.per.reducer))  x  hive.tez.max.partition.factor [2]

  • The following three parameters can be tweaked to increase or decrease the number of reducers:
    1. hive.exec.reducers.bytes.per.reducer
      Size per reducer. Change this to a smaller value to increase parallelism or change it to a larger value to decrease parallelism. Default Value = 256 MB [i.e. if the input size is 1 GB then 4 reducers will be used]
    2. hive.tez.min.partition.factor Default Value = 0.25
    3. hive.tez.max.partition.factor Default Value = 2.0
      Increase for more reducers. Decrease for fewer reducers.
  • Users can manually set the number of reducers by using mapred.reduce.tasks. This is not recommended and you should avoid using it.
  • Recommendations:
    • Avoid setting the reducers manually.
    • Adding more reducers does not always guarantee better performance.
    • Depending on the reduce stage estimates, tweak the hive.exec.reducers.bytes.per.reducer parameter to a lower or higher value if you want to increase or decrease the number of reducers.
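
The reducer estimate formula given earlier in this section can be written out as a small Python function to see how the parameters interact. This is a sketch of the formula as stated here, not Hive's actual implementation; the function name is hypothetical, and rounding the division up is an assumption.

```python
import math

MB = 1024 * 1024

def estimate_reducers(stage_estimate_bytes: int,
                      bytes_per_reducer: int = 256 * MB,   # hive.exec.reducers.bytes.per.reducer
                      max_reducers: int = 1009,            # hive.exec.reducers.max
                      max_partition_factor: float = 2.0) -> int:
    """Max(1, Min(max_reducers, estimate / bytes_per_reducer)) x max partition factor."""
    # Assumption: the division is rounded up to a whole reducer.
    by_size = math.ceil(stage_estimate_bytes / bytes_per_reducer)
    base = max(1, min(max_reducers, by_size))
    return int(base * max_partition_factor)

print(estimate_reducers(1024 * MB))                              # default 256 MB per reducer
print(estimate_reducers(1024 * MB, bytes_per_reducer=128 * MB))  # smaller value, more reducers
```

For a 1 GB reduce stage estimate, the division step yields 4, and the formula as written then multiplies by the max partition factor of 2, so the sketch returns 8. Halving the bytes per reducer doubles the result. When auto reducer parallelism is enabled, Tez may still adjust the estimate at runtime.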

Concurrency 

This section aims to help in understanding and tuning concurrent sessions for Hive on Tez, such as running multiple Tez AM containers. The properties below help to understand the default queues and how the number of sessions behaves.

  • hive.server2.tez.default.queues: A list of comma-separated values corresponding to YARN queues for which to maintain a Tez session pool.
  • hive.server2.tez.sessions.per.default.queue: The number of Tez sessions (DAGAppMaster) to maintain in the pool per YARN queue.
  • hive.server2.tez.initialize.default.sessions: If enabled, HiveServer2 (HS2), at startup, will launch all necessary Tez sessions within the specified default.queues to meet the sessions.per.default.queue requirements.

When you define the properties listed above, HiveServer2 will create one Tez Application Master (AM) for each default queue, multiplied by the number of sessions, when the HiveServer2 service starts. Hence:

(Tez Sessions)total = (HiveServer2 instances) x (default.queues) x (sessions.per.default.queue)

Understanding via Example:

  • hive.server2.tez.default.queues= “queue1, queue2”
  • hive.server2.tez.sessions.per.default.queue=2
    => HiveServer2 will create 4 Tez AMs (2 for queue1 and 2 for queue2).

Note: The pooled Tez sessions are always running, even on an idle cluster.

If there is continuous usage of HiveServer2, those Tez AMs will keep running, but if your HS2 is idle, those Tez AMs will be killed based on the timeout defined by tez.session.am.dag.submit.timeout.secs.
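
The session count formula and the example above can be checked with a short Python sketch. The function name is hypothetical, and a single HiveServer2 instance is assumed for the example.

```python
def total_tez_sessions(hs2_instances: int, default_queues: list, sessions_per_queue: int) -> int:
    """HiveServer2 instances x number of default queues x sessions per default queue."""
    return hs2_instances * len(default_queues) * sessions_per_queue

# Parse the comma-separated queue list the same way it appears in the property value.
queues = [q.strip() for q in "queue1, queue2".split(",")]

print(total_tez_sessions(1, queues, 2))  # one HiveServer2 instance (assumed)
print(total_tez_sessions(3, queues, 2))  # three HiveServer2 instances
```

One instance gives 4 Tez AMs, as in the example; three instances with the same settings would hold 12, which is worth keeping in mind when sizing the YARN queues.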

Case 1: Queue name is not specified

  • A query will only use a Tez AM from the pool (initialized as described above) if one does not specify a queue name (tez.queue.name). In this case, HiveServer2 will pick one of the idle/available Tez AMs (the queue name here may be randomly selected).
  • If one does not specify a queue name, the query remains in a pending state with HiveServer2 until one of the default Tez AMs from the initialized pool is available to serve the query. There won't be any message in the JDBC/ODBC client or in the HiveServer2 log file. Because no message is generated while the query is pending, the user may think the JDBC/ODBC connection or HiveServer2 is broken, but it is waiting for a Tez AM to execute the query.

Case 2: Queue name is specified

  • If one does specify the queue name, it doesn't matter how many initialized Tez AMs are in use or idle; HiveServer2 will create a new Tez AM for this connection and the query can be executed (if the queue has available resources).

Guidelines/recommendations for concurrency:

  • For use cases or queries where one does not want users limited to the same Tez AM pool, set hive.server2.tez.initialize.default.sessions to false. Disabling this can reduce contention on HiveServer2 and improve query performance.
  • Additionally, increase the number of sessions with hive.server2.tez.sessions.per.default.queue.
  • If there are use cases requiring a separate or dedicated Tez AM pool for each group of users, one will need to have dedicated HiveServer2 services, each with its own default queue name and number of sessions, and ask each group of users to use their respective HiveServer2.

Container reuse and prewarm containers

  • Container reuse:
    This is an optimization that limits the startup time impact on containers. It is turned on by setting tez.am.container.reuse.enabled to true. This saves time interacting with YARN. It also keeps container groups alive, spins up containers faster, and skips YARN queues.
  • Prewarm containers:
    The number of containers is related to the number of YARN execution containers that will be attached to each Tez AM by default. This same number of containers will be held by each AM, even when the Tez AM is idle (not executing queries).
    The downside of this appears in cases where there are too many containers sitting idle and not released, since the containers defined here will be held by the Tez AM even when it is idle. These idle containers continue taking resources in YARN that other applications could potentially utilize.
    The properties below are used to configure prewarm containers:
    • hive.prewarm.enabled
    • hive.prewarm.numcontainers

General Tez tuning parameters

Review the properties listed below as a first-level check when dealing with performance degradation of Hive on Tez queries. You might need to set or tune some of these properties in accordance with your query and data properties. It is best to assess the configuration properties in development and QA environments, and then push them to production environments depending on the results.

  • hive.cbo.enable
    Setting this property to true enables cost-based optimization (CBO). CBO is part of Hive's query processing engine. It is powered by Apache Calcite. CBO generates efficient query plans by examining tables and conditions specified in the query, eventually reducing the query execution time and improving resource utilization.
  • hive.auto.convert.join
    Setting this property to true allows Hive to enable the optimization that converts a common join into a mapjoin based on the input file size.
  • hive.auto.convert.join.noconditionaltask.size
    You will want to perform as many mapjoins as possible in the query. This size configuration enables the user to control what size table can fit in memory. This value represents the sum of the sizes of tables that can be converted to hashmaps that fit in memory.
    The recommendation is to set this to ⅓ the size of hive.tez.container.size.
  • tez.runtime.io.sort.mb
    The size of the sort buffer when output is sorted. The recommendation is to set this to 40% of hive.tez.container.size, up to a maximum of 2 GB. It would rarely need to be above this maximum.
  • tez.runtime.unordered.output.buffer.size-mb
    This is the memory used when the output does not need to be sorted. It is the size of the buffer to use if not writing directly to disk. The recommendation is to set this to 10% of hive.tez.container.size.
  • hive.exec.parallel
    This property enables parallel execution of Hive query stages. By default, this is set to false. Setting this property to true helps to parallelize the independent query stages, resulting in overall improved performance.
  • hive.vectorized.execution.enabled
    Vectorized query execution is a Hive feature that greatly reduces the CPU usage for typical query operations like scans, filters, aggregates, and joins. By default this is set to false. Set this to true.
  • hive.merge.tezfiles
    By default, this property is set to false. Setting this property to true merges the Tez files. Using this property could increase or decrease the execution time of the query depending on the size of the data or the number of files to merge. Assess your query performance in lower environments before using this property.
  • hive.merge.size.per.task
    This property describes the size of the merged files at the end of a job.
  • hive.merge.smallfiles.avgsize
    When the average output file size of a job is less than this number, Hive will start an additional map-reduce job to merge the output files into bigger files. By default, this property is set to 16 MB.
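
The three memory recommendations above are all expressed relative to hive.tez.container.size, so they can be derived together. The Python sketch below is only a starting-point calculator under the ratios stated in this post; the function name is hypothetical, and all values are computed in MB, so check the unit each property expects in your Hive version before applying them (hive.auto.convert.join.noconditionaltask.size, for instance, is typically specified in bytes).

```python
def memory_recommendations(container_size_mb: int) -> dict:
    """Derive suggested values (in MB) from hive.tez.container.size (in MB)."""
    return {
        # One third of the container size for map-join hash tables.
        "hive.auto.convert.join.noconditionaltask.size": container_size_mb // 3,
        # 40% of the container size, capped at 2 GB.
        "tez.runtime.io.sort.mb": min(container_size_mb * 40 // 100, 2048),
        # 10% of the container size.
        "tez.runtime.unordered.output.buffer.size-mb": container_size_mb * 10 // 100,
    }

for name, value_mb in memory_recommendations(4096).items():
    print(f"{name}: {value_mb} MB")
```

For an 8 GB container the sort buffer suggestion hits the 2 GB cap, so only the other two values keep growing with the container size.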

Summary

This blog covered some basic troubleshooting and tuning guidelines for Hive on Tez queries with respect to CDP. As the very first step in query performance analysis, you should verify and validate all the configurations set on Hive and Hive on Tez services. Every change made should be tested to ensure that it makes a measurable and beneficial improvement. Query tuning is a specialized effort, and not all queries perform better by changing the Tez configuration properties. You may encounter scenarios where you need to deep dive into the SQL query to optimize and improve its execution and performance. Contact your Cloudera Account and Professional Services team for guidance if you require additional assistance with performance tuning efforts.
