Performance plays a key role in big data projects because they deal with huge amounts of data. When using Hive, keeping a few things in mind can produce a dramatic change in performance:
- File formats
- Parallel execution
The concept of partitioning in Hive is very similar to what we have in an RDBMS. A table can be partitioned by one or more keys, which determines how the data is stored in the table. For example, if a table has three columns, id, name, and age, and is partitioned by age, all the rows having the same age will be stored together.
So when we query based on an age range, Hive retrieves the data by reading only the relevant partition folders instead of scanning the whole dataset.
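As an illustrative sketch (the table and column names are hypothetical), a table partitioned by age could be declared and queried like this:

```sql
-- Hypothetical example: partition a users table by age.
-- Note that the partition column is declared outside the column list.
CREATE TABLE users (id INT, name STRING)
PARTITIONED BY (age INT);

-- A filter on the partition column lets Hive read only the matching
-- partition directories instead of the whole table.
SELECT name FROM users WHERE age = 30;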
Bucketing is more efficient for sampling; data is segregated at the file level. We need to specify a column on which bucketing takes place (preferably a column without duplicates). A hash function is then applied to that column, modulo the number of buckets.
CLUSTERED BY (column_name) INTO n BUCKETS;
Since a hash is applied, the data is distributed evenly across multiple files.
Be careful when bucketing on BIGINT or STRING columns.
Formula to calculate the number of buckets:
Total data size / HDFS block size.
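The hash-mod idea behind bucketing can be sketched in plain Python (this is an illustration of the concept, not Hive's actual hash implementation; the row data is made up):

```python
# Sketch of hash-mod bucketing: bucket = hash(column_value) % num_buckets.
# The bucket count would come from: total data size / HDFS block size.
NUM_BUCKETS = 4  # illustrative choice

def bucket_for(value, num_buckets=NUM_BUCKETS):
    """Return the bucket (file) index a row lands in, based on its bucketing column."""
    return hash(value) % num_buckets

rows = [("u1", 25), ("u2", 31), ("u3", 25), ("u4", 40)]
buckets = {}
for user_id, age in rows:
    # Rows whose bucketing column hashes to the same value land in the same file.
    buckets.setdefault(bucket_for(user_id), []).append(user_id)
```

Because the bucket is a pure function of the column value, sampling or joining on that column only needs to touch the relevant files.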
Choosing an appropriate file format can have some significant benefits:
1. Faster read times
2. Faster write times
3. Splittable files (so you don’t need to read the whole file, just a part of it)
4. Schema evolution support (allowing you to change the fields in a dataset)
5. Advanced compression support (compress the columnar files with a compression codec without sacrificing these features).
Some file formats are designed for general use (like MapReduce or Spark), others are designed for more specific use cases (like powering a database), and some are designed with specific data characteristics in mind. So there really is quite a lot of choice.
Pointers regarding file formats
- Text is the default; JSON and CSV formats are also available
- Slow to read and write
- Compressed text files can't be split (leading to huge map tasks)
- Need to read/decompress all fields
Sequence files were originally designed for MapReduce, so the integration is smooth. They encode a key and a value for each record and nothing more. Records are stored in a binary format that is smaller than a text-based format would be. Like text files, the format does not encode the structure of the keys and values, so if you make schema migrations they must be additive. Typically if you need to store complex data in a sequence file you do so in the value part while encoding the id in the key. The problem with this is that if you add or change fields in your Writable class it will not be backwards compatible with the data stored in the sequence file. One benefit of sequence files is that they support block-level compression, so you can compress the contents of the file while also maintaining the ability to split the file into segments for multiple map tasks.
- Traditional MapReduce binary file format
- Stores keys and values as classes
- Not a good fit for Hive, which has SQL types
- Hive always stores the entire row as the value
- Default block size is 1 MB
- Need to read and decompress all the fields
- Supports splitting
- Serialization and deserialization framework
- Supports block compression
- Supports schema evolution
- Supports multiple languages (Python, R, etc.)
Parquet is also a columnar file format. Instead of just storing rows of data adjacent to one another, you also store column values adjacent to each other, so datasets are partitioned both horizontally and vertically. This is particularly useful if your data processing framework needs access only to a subset of the data stored on disk, as it can read all values of a single column very quickly without reading whole records. Just like the ORC format, it compresses well and delivers great query performance, especially when querying specific columns.
- Design based on Google's Dremel paper
- Schema segregated into the footer
- Column-major format with row groups
- Simple type model with logical types
- All data pushed to the leaves of the tree
- Integrated compression and indexes
RC stands for Record Columnar File, another binary file format that offers a high compression rate on top of the rows. It is used when we want to perform operations on multiple rows at a time.
- Columns stored separately
- Read and decompress only the needed columns
- Better compression
- Columns stored as binary blobs
- Depends on the metastore to supply data types
- Large blocks: 4 MB by default
- Still needs to search the file for split boundaries
ORC (Optimized Row Columnar):
ORC stands for Optimized Row Columnar, which means it can store data in a more optimized way than the other file formats. ORC reduces the size of the original data by up to 75%; as a result, data processing speeds up, and ORC shows better performance than the Text, Sequence, and RC file formats. An ORC file stores row data in groups called stripes, along with a file footer. We cannot load data into an ORC table directly: first we load the data into another table, and then we overwrite it into the newly created ORC table.
- Columns stored separately
- Knows the types and uses type-specific encoders
- Stores statistics (min, max, sum, count)
- Has a lightweight index
- Can skip over blocks of rows that don't matter
- Larger blocks: 256 MB by default, with an index for block boundaries
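Since data cannot be loaded into an ORC table directly, a common pattern is to stage it through a text table first. A sketch with hypothetical table names and paths:

```sql
-- Staging table holding the raw delimited data.
CREATE TABLE employees_txt (id INT, name STRING, age INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/tmp/employees.csv' INTO TABLE employees_txt;

-- Target table stored as ORC; populate it by overwriting from the staging table.
CREATE TABLE employees_orc (id INT, name STRING, age INT)
STORED AS ORC;

INSERT OVERWRITE TABLE employees_orc SELECT * FROM employees_txt;
```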
Using ORC files improves performance when Hive is reading, writing, and processing data, compared to Text, Sequence, and RC. RC and ORC both outperform the Text and Sequence file formats, and between the two, ORC is better: it takes less time to access the data and less space to store it. However, ORC increases CPU overhead by increasing the time it takes to decompress the relational data. The ORC file format arrived with Hive 0.11 and cannot be used with earlier versions.
Why we need Data Compression
In data-intensive Hadoop workloads, I/O operations and network data transfer take a considerably long time to complete. In addition, the internal MapReduce "shuffle" process is under huge I/O pressure, since it must often "spill out" intermediate data to local disks before advancing from the map phase to the reduce phase.
For any Hadoop cluster, disk I/O and network bandwidth are precious resources that should be allocated accordingly. Using compressed files for storage not only saves disk space but also speeds up data transfer across the network. And when running a large volume of MapReduce jobs, the combination of data compression and decreased network load brings significant performance improvements, as I/O and network resource consumption is reduced throughout the MapReduce pipeline.
Enable compression in hive:
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
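The two flags above only enable compression; the codec itself is chosen through Hadoop properties. A sketch assuming the Snappy codec is installed and available on the cluster:

```sql
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- Pick the codec for the final job output (Snappy assumed available):
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
```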
- Compression ratio is better than Snappy, but write speed is slower than Snappy
- Suitable when we have small blocks of data
- Compression is fast but decompression is slow
- It is 10 times slower than Gzip
- Suitable for text files
- Needs an index, and the codec related to the index must be installed separately
- No need for external indexes
- Compression and decompression library
- High speed with a reasonable compression ratio
- Non-splittable (but splittable in combination with the Parquet file format)
- Better performance with the Sequence and Avro file formats
- Processing performance is better than the others
Sampling is concerned with selecting a subset of data from a large dataset to run queries and verify results. The dataset may be too large to run queries against in full, so in the development and testing phases it is a good idea to run queries on a sample of the dataset.
We can run Hive queries on a sample of data using the TABLESAMPLE clause. Any column can be used for sampling the data. We need to provide the required sample size in the queries.
Sampling by Bucketing
We can use TABLESAMPLE clause to bucket the table on the given column and get data from only some of the buckets.
TABLESAMPLE (BUCKET x OUT OF y [ON colname])
colname indicates the column used to bucket the data into y buckets [1..y]. All the rows that fall in bucket x are returned.
If the table is not bucketed on the column(s) used in sampling, TABLESAMPLE will scan the entire table and fetch the sample.
If the hive table is bucketed on some column(s), then we can directly use that column(s) to get a sample. In this case Hive need not read all the data to generate sample as the data is already organized into different buckets using the column(s) used in the sampling query. Hive will read data only from some buckets as per the size specified in the sampling query.
Block sampling allows Hive to select at least n% data from the whole dataset. Sampling granularity is at the HDFS block size level. If HDFS block size is 64MB and n% of input size is only 10MB, then 64MB of data is fetched.
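Both sampling forms can be sketched as follows (the users table and id column are hypothetical):

```sql
-- Bucket sampling: return the rows that fall in bucket 2 of 10, hashed on id.
SELECT * FROM users TABLESAMPLE (BUCKET 2 OUT OF 10 ON id);

-- Block sampling: fetch roughly 10% of the data, at HDFS-block granularity.
SELECT * FROM users TABLESAMPLE (10 PERCENT);
```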
Enable the Tez engine to increase query performance. Internally, Tez triggers vectorization, which helps increase query retrieval performance.
set hive.execution.engine=tez;
- Tez internally triggers vectorization, and DAGs are created
- Instead of processing a single row at a time, it processes 1024 rows at a time
- Internally, each column across the 1024 rows is treated as a single column vector and iterated over
- Supported only for the ORC format; it won't work with other formats
set hive.vectorized.execution.enabled=true;
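The batching idea can be sketched in plain Python (this illustrates the concept only, not Hive's internals; the table data and the `age + 1` operation are made up):

```python
# Vectorized execution processes batches of 1024 rows, operating on one
# column vector at a time instead of one row at a time.
BATCH_SIZE = 1024  # Hive's default vectorized batch size

def row_at_a_time(rows):
    # One unit of per-row work for every row: (id, age) -> age + 1.
    return [age + 1 for (_id, age) in rows]

def vectorized(rows, batch_size=BATCH_SIZE):
    out = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        ages = [age for (_id, age) in batch]   # column vector for this batch
        out.extend(a + 1 for a in ages)        # tight loop over one column
    return out

rows = [(i, i % 90) for i in range(3000)]
assert row_at_a_time(rows) == vectorized(rows)  # same result, batched layout
```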
By default, Hive executes its job stages one by one; these might be MapReduce, sampling, merge, etc. We can make them execute in parallel by enabling parallel execution.
set hive.exec.parallel=true;
Hive’s Cost-Based Optimizer (CBO) is a core component in Hive’s query processing engine. Powered by Apache Calcite, the CBO optimizes and calculates the cost of various plans for a query.
The main goal of a CBO is to generate efficient execution plans by examining the tables and conditions specified in the query, ultimately cutting down on query execution time and reducing resource utilization. After parsing, a query gets converted to a logical tree (Abstract Syntax Tree) that represents the operations that the query must perform, such as reading a particular table or performing an inner JOIN.
Calcite applies various optimizations, such as query rewriting, JOIN reordering, deriving implied predicates, and JOIN elimination, to produce logically equivalent plans. The current model prefers bushy plans for maximum parallelism. Each logical plan is assigned a cost based on heuristics over the number of distinct values.
Calcite has an efficient plan pruner that selects the cheapest query plan. The chosen logical plan is then converted by Hive to a physical operator tree, optimised, converted to Tez jobs, and then executed on the Hadoop cluster.
set hive.stats.fetch.column.stats=true;
set hive.compute.query.using.stats=true;
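The statistics flags above are usually set alongside the CBO switch itself, and the CBO needs statistics to exist before it can cost plans. A sketch with a hypothetical users table:

```sql
set hive.cbo.enable=true;  -- turn on the cost-based optimizer

-- Gather the table- and column-level statistics the CBO costs plans with:
ANALYZE TABLE users COMPUTE STATISTICS;
ANALYZE TABLE users COMPUTE STATISTICS FOR COLUMNS;
```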