Sai Prabhanj Turaga

Spark — “_SUCCESS” file
In Apache Spark, when you write the output of a job to a file system, particularly with file-based output formats (like Parquet, CSV, JSON…
Dec 6, 2023
Spark — Delta Lake
Delta Lake is an open-source storage layer that brings ACID (Atomicity, Consistency, Isolation, Durability) transactions to Apache Spark…
Nov 12, 2023

Spark MLlib — Basic Intro
Apache Spark MLlib (Machine Learning Library) is a scalable and versatile machine learning library built on top of the Apache Spark…
Nov 8, 2023

Spark Streaming — Batch Intervals Tuning
Tuning the batch interval in Spark Streaming is a critical aspect of optimizing the performance and behavior of your real-time data…
Nov 8, 2023

Spark Streaming — Receivers
In Apache Spark Streaming, Receivers are a critical component for ingesting data from various external sources and converting it into a…
Nov 8, 2023

Spark — DStreams
A Discretized Stream (DStream) is a fundamental abstraction in Apache Spark Streaming. It represents a continuous stream of data that is…
Nov 8, 2023

Spark — Monitoring Jobs
Monitoring Spark jobs is crucial for ensuring the health and performance of your Apache Spark applications. Effective monitoring allows…
Nov 7, 2023

Spark — Streaming Basic Intro
Apache Spark Streaming is a real-time data processing and analysis framework built on top of the Apache Spark core. It enables the…
Nov 7, 2023

Spark — User-Defined Functions (UDFs)
User-Defined Functions (UDFs) in Apache Spark provide a way to extend the functionality of Spark by allowing you to define custom…
Nov 7, 2023

Spark — What happens when an RDD dataset fails within a Spark job?
In Apache Spark, when one of the RDD (Resilient Distributed Dataset) partitions or stages of a job fails, the Spark framework has built-in…
Nov 7, 2023

Spark — Debug/Analyze Jobs in Production
If a Spark job in production is stuck or not making progress as expected, it’s essential to diagnose and resolve the issue promptly to…
Nov 7, 2023

Spark — Accumulator
In Apache Spark, an accumulator is a special type of shared variable that is used for aggregating data across multiple tasks or nodes in a…
Nov 7, 2023
Spark — Collect()
The collect action in Apache Spark is used to retrieve all the data from a distributed DataFrame or RDD (Resilient Distributed Dataset)…
Nov 6, 2023
Spark functions — from_json
The from_json function in Apache Spark is used to parse JSON strings in a DataFrame column and convert them into structured data, such as…
Nov 6, 2023
Spark — Handling Duplicates
Handling duplicate rows or specific column duplicates in Spark can be important when working with large datasets. You can identify and…
Nov 6, 2023
Spark Performance Tuning Quick Reference
A few key points to remember while building Spark applications to optimize performance
May 11, 2020
Spark Executor Scale-down
When we are running Spark streaming applications with dynamic allocation of executors, we might observe a scenario where, even though there is…
Aug 18, 2021
Spark SQL Optimization Pointers
Optimizing Spark SQL queries is crucial for improving the performance and efficiency of your Spark jobs. Here are some tips to help you…
Jul 20, 2023

Spark Optimize Intermediate Data
Managing and optimizing intermediate data in a Spark job is crucial to avoid performance bottlenecks and disk space issues. Here are some…
Jul 23, 2023