Sai Prabhanj Turaga

Spark — “_SUCCESS” file
In Apache Spark, when you write the output of a job to a file system, particularly with file-based output formats (like Parquet, CSV, JSON…
Dec 6, 2023
Spark — Delta Lake
Delta Lake is an open-source storage layer that brings ACID (Atomicity, Consistency, Isolation, Durability) transactions to Apache Spark…
Nov 12, 2023

Spark MLlib — Basic Intro
Apache Spark MLlib (Machine Learning Library) is a scalable and versatile machine learning library built on top of the Apache Spark…
Nov 8, 2023

Spark Streaming — Batch Intervals Tuning
Tuning the batch interval in Spark Streaming is a critical aspect of optimizing the performance and behavior of your real-time data…
Nov 8, 2023

Spark Streaming — Receivers
In Apache Spark Streaming, Receivers are a critical component for ingesting data from various external sources and converting it into a…
Nov 8, 2023

Spark — DStreams
A Discretized Stream (DStream) is a fundamental abstraction in Apache Spark Streaming. It represents a continuous stream of data that is…
Nov 8, 2023

Spark — Monitoring Jobs
Monitoring Spark jobs is crucial for ensuring the health and performance of your Apache Spark applications. Effective monitoring allows…
Nov 7, 2023

Spark — Streaming Basic Intro
Apache Spark Streaming is a real-time data processing and analysis framework built on top of the Apache Spark core. It enables the…
Nov 7, 2023

Spark — User-Defined Functions (UDFs)
User-Defined Functions (UDFs) in Apache Spark provide a way to extend the functionality of Spark by allowing you to define custom…
Nov 7, 2023

Spark — What happens when an RDD dataset fails within a Spark job?
In Apache Spark, when one of the RDD (Resilient Distributed Dataset) partitions or stages of a job fails, the Spark framework has built-in…
Nov 7, 2023

Spark — Debug/Analyze Jobs in Production
If a Spark job in production is stuck or not making progress as expected, it’s essential to diagnose and resolve the issue promptly to…
Nov 7, 2023

Spark — Accumulator
In Apache Spark, an accumulator is a special type of shared variable that is used for aggregating data across multiple tasks or nodes in a…
Nov 7, 2023
Spark — Collect()
The collect action in Apache Spark is used to retrieve all the data from a distributed DataFrame or RDD (Resilient Distributed Dataset)…
Nov 6, 2023
Spark functions — from_json
The from_json function in Apache Spark is used to parse JSON strings in a DataFrame column and convert them into structured data, such as…
Nov 6, 2023
Spark — Handling Duplicates
Handling duplicate rows or specific column duplicates in Spark can be important when working with large datasets. You can identify and…
Nov 6, 2023
Spark Performance Tuning Quick Reference
A few key points to remember while building Spark applications to optimize performance
May 11, 2020
Spark Executor Scale-down
When we are running Spark streaming applications with dynamic allocation of executors, we might observe a scenario where, even though there is…
Aug 18, 2021
Spark SQL Optimization Pointers
Optimizing Spark SQL queries is crucial for improving the performance and efficiency of your Spark jobs. Here are some tips to help you…
Jul 20, 2023

Spark Optimize Intermediate Data
Managing and optimizing intermediate data in a Spark job is crucial to avoid performance bottlenecks and disk space issues. Here are some…
Jul 23, 2023