Performance plays a key role in big data projects because they deal with huge amounts of data. When you are using Hive, keeping a few things in mind can make a dramatic difference in performance.
The concept of partitioning in Hive is very similar to what we have in RDBMS. A table can be partitioned by one or more keys. This will determine how the data will be stored in the table. …
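As a rough illustration of how partition keys determine storage: Hive keeps each partition in its own subdirectory named key=value under the table's directory. The sketch below only builds such a path as a string. The helper name partition_path, the table location, and the keys are made up for illustration and are not part of Hive.

```python
def partition_path(table_dir, partition_values):
    """Build the directory for one partition, following Hive's key=value layout."""
    # One path segment per partition key, in the order the keys are declared.
    parts = [f"{key}={value}" for key, value in partition_values.items()]
    return "/".join([table_dir.rstrip("/")] + parts)


# Hypothetical table location and partition keys, chosen for illustration only.
path = partition_path("/warehouse/sales", {"country": "IN", "year": 2020})
print(path)  # /warehouse/sales/country=IN/year=2020
```

A query that filters on country and year only needs to read the files under that one directory, which is where the performance benefit of partitioning comes from.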
Kafka is a durable, publish-subscribe messaging system for exchanging data between processes, applications, and servers.
Key components and terms in the Kafka architecture:
Producers, Topics, Consumers, Brokers, Consumer Groups, Partitions, Offsets, ZooKeeper, Replication, Leaders, Kafka APIs
Kafka basic flow:
Kafka producers write to topics, while Kafka consumers read from topics. Topics represent commit log data structures stored on disk. Kafka adds records written by producers to the ends of those topic commit logs. Topic logs are also made up of multiple partitions, straddling multiple files and potentially multiple cluster nodes. Consumers can use offsets to read from certain locations within topic logs. …
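The flow above can be sketched with a toy in-memory model. This is not the Kafka client API: the class TopicLog and its methods are hypothetical names used only to show append-only partitions and offset-based reads.

```python
class TopicLog:
    """Toy in-memory stand-in for a Kafka topic; not the real client API."""

    def __init__(self, num_partitions=2):
        # Each partition is an append-only list; a record's index is its offset.
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, record):
        """Add a record to the end of one partition's log and return its offset."""
        log = self.partitions[partition]
        log.append(record)
        return len(log) - 1

    def read(self, partition, offset, max_records=10):
        """Return records from the given offset onward, without removing them."""
        return self.partitions[partition][offset:offset + max_records]


topic = TopicLog(num_partitions=2)

# Producer side: records always go to the end of a partition's log.
first = topic.append(0, "order-created")
second = topic.append(0, "order-paid")
other = topic.append(1, "user-signed-up")

# Consumer side: each reader chooses its own starting offset.
from_start = topic.read(0, 0)
from_second = topic.read(0, 1)
```

Because reading leaves the log untouched, two consumers can read the same partition from different offsets independently, which is the role offsets play in the description above.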
Let's look at a few Hive table configuration properties.
Hive provides an option to create mutable and immutable tables.
All tables are mutable by default. A mutable table allows data to be appended when data is already present in the table.
A table can be created as immutable by setting its immutable table property to true. By default, this property is false.
create table if not exists Test_immutable (id int, name string) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile tblproperties ("immutable"="true");
Immutable property allows us to load the data for…
A few key points to remember while building Spark applications to optimise performance