Performance plays a key role in big data projects, since they deal with huge amounts of data. When using Hive, keeping a few things in mind can yield a dramatic change in performance:

  • Partitions
  • Bucketing
  • File formats
  • Compression
  • Sampling
  • Tez
  • Vectorization
  • Parallel execution
  • CBO

Partitions:

The concept of partitioning in Hive is very similar to what we have in RDBMS. A table can be partitioned by one or more keys. This will determine how the data will be stored in the table. …
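The idea behind partition pruning can be sketched in a few lines of plain Python: Hive encodes each distinct partition-key value as its own directory (e.g. `country=US/`), so a query that filters on the partition key only scans one directory instead of the whole table. This is a toy illustration of the storage layout, not Hive itself; the table and column names are made up.

```python
import os
import tempfile

# Toy rows for a hypothetical table partitioned by 'country'.
rows = [
    {"id": 1, "name": "a", "country": "US"},
    {"id": 2, "name": "b", "country": "IN"},
    {"id": 3, "name": "c", "country": "US"},
]

base = tempfile.mkdtemp()
for row in rows:
    # The partition column is encoded in the path, not stored in the file.
    part_dir = os.path.join(base, "country=%s" % row["country"])
    os.makedirs(part_dir, exist_ok=True)
    with open(os.path.join(part_dir, "data.txt"), "a") as f:
        f.write("%d,%s\n" % (row["id"], row["name"]))

# A filter like "WHERE country = 'US'" now only has to read one directory:
print(sorted(os.listdir(base)))  # ['country=IN', 'country=US']
```

The same layout is why over-partitioning hurts: every distinct key value becomes another directory and set of small files.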

Kafka is a publish-subscribe based, durable messaging system for exchanging data between processes, applications, and servers.

Key components/terminologies in Kafka Architecture:

Producers, Topics, Consumers, Brokers, Consumer Groups, Partitions, Offsets, ZooKeeper, Replications, Leaders, Kafka APIs

Kafka basic flow:

Kafka producers write to topics, while Kafka consumers read from topics. Topics represent commit log data structures stored on disk. Kafka adds records written by producers to the ends of those topic commit logs. Topic logs are also made up of multiple partitions, straddling multiple files and potentially multiple cluster nodes. Consumers can use offsets to read from certain locations within topic logs. …
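The flow above can be modelled as a toy commit log in plain Python: producers append records to a partition and get back an offset, and consumers read forward from whatever offset they choose. This is only a sketch of the semantics, not the real Kafka client; the class and record names are made up.

```python
# Toy model of a Kafka topic: one append-only log per partition.
class Topic:
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, partition, record):
        log = self.partitions[partition]
        log.append(record)       # records are only ever appended
        return len(log) - 1      # offset assigned to the new record

    def consume(self, partition, offset):
        # Read everything from `offset` to the end of the log.
        return self.partitions[partition][offset:]

topic = Topic(num_partitions=2)
topic.produce(0, "order-1")
topic.produce(0, "order-2")
topic.produce(1, "order-3")

print(topic.consume(0, 0))  # ['order-1', 'order-2']
print(topic.consume(0, 1))  # ['order-2']
```

Because consumers track their own offsets, two consumers can read the same partition independently, which is what makes replay and consumer groups possible.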

Let’s have a look at a few Hive table configuration properties.

Mutable and Immutable

Hive provides an option to create either a mutable table or an immutable table.

Mutable Table

All tables are mutable by default. A mutable table allows appending new data even when data is already present in the table.

Immutable Table

A table can be created as immutable by setting its immutable table property to true; by default this property is false.

create table if not exists Test_immutable (id int, name string) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile tblproperties ("immutable"="true");

Immutable property allows us to load the data for…

A few key points to remember while building Spark applications to optimise performance:

  1. Spark UI (Monitor and Inspect Jobs).
  2. Level of Parallelism (clusters will not be fully utilised unless the level of parallelism for each operation is high enough. Spark automatically sets the number of partitions of an input file according to its size, and for distributed shuffles such as groupByKey and reduceByKey it uses the largest parent RDD’s number of partitions. You can pass the level of parallelism as a second argument to such operations. In general, 2–3 tasks per CPU core in your cluster are recommended. That said…

Sai Prabhanj Turaga
