Spark

Blog posts related to Apache Spark

Resolve “Job aborted due to stage failure” in Spark

By Jerry Richard, June 22, 2023

When it comes to troubleshooting Spark issues, one thing you get used to is knowing what the error exactly…

Resolve “Could not find CoarseGrainedScheduler” in Spark

By Jerry Richard, June 21, 2023

In this article, we will learn about the CoarseGrainedScheduler and why we encounter this error in the…

FIX – TypeError: an integer is required (got type bytes)

By Jerry Richard, June 19, 2023

In this article, we will learn about the “TypeError: an integer is required (got type bytes)” that occurs in PySpark…

How to Save DataFrame as a CSV File in Spark

By Jerry Richard, June 15, 2023

Spark provides many APIs to save a DataFrame in multiple formats like CSV, Parquet, Hive tables, etc. In this…

How to Save Spark DataFrame directly to Hive

By Jerry Richard, May 31, 2023

You may have encountered a situation where you wanted to do some manipulation on a Spark DataFrame and…

Spark Driver in Apache Spark and Where Does the Spark Driver Run?

By Jerry Richard, May 26, 2023

The driver is the process that starts the Spark context or session in Spark, which helps in communicating with resource managers and runs tasks in…

What are broadcast variables in Spark

By Jerry Richard, May 26, 2023

Broadcast variables are commonly used by Spark developers to optimize their code for better performance. This article will provide a…

Resolve the “Container killed by YARN for exceeding memory limits” Issue in Hive, Tez, and Spark jobs

By Jerry Richard, May 26, 2023

“Container killed by YARN for exceeding memory limits” usually happens when the JVM usage goes beyond the YARN container memory…

Why Spark/MR Is Not Considering UTF-8 Encoding

By Jerry Richard, May 26, 2023

Reading/writing UTF-8 enabled files: Sometimes we encounter issues in which Spark returns non-ASCII characters in the wrong format…

How to read and write Excel files with Spark

By Jerry Richard, May 26, 2023

Apache Spark is a powerful data processing framework. Commonly, Spark is used to process data stored in various formats, including…

Difference between groupByKey and reduceByKey in Spark

By Jerry Richard, May 26, 2023

groupByKey and reduceByKey are two different operations that transform RDDs (Resilient Distributed Datasets). What is the difference…

Understanding the Spark stack function for pivoting data

By Jerry Richard, May 26, 2023

Hello! If you’re into big data processing, you’ve probably heard of Spark, right? It’s a popular distributed computing framework used…

How to Set Apache Spark Executor and Driver Memory/Cores (pyspark and spark-submit)

By Jerry Richard, May 26, 2023

In many cases, we need to increase the driver/executor memory or cores to improve performance or to avoid out-of-memory issues.

How to Kill a Running YARN Application (Spark, Hive, and Tez)

By Jerry Richard, May 26, 2023

One of the easiest ways to kill a Spark application is by issuing the “yarn application -kill <application ID>” command.

How to Access a Kudu Table from Spark

By Jerry Richard, May 26, 2023

There are multiple use cases where we need to access Kudu from Spark to store and retrieve data. In this…

YARN application stuck in the ACCEPTED state (Includes Spark, Hive, Tez, and MapReduce jobs)

By Jerry Richard, May 26, 2023

Usually, a YARN application gets stuck in the ACCEPTED state when it cannot find enough resources to create a new container in the cluster…

How to Run a Spark Job with the Ozone Filesystem

By Jerry Richard, May 26, 2023

Apache Spark is a popular distributed computing framework for big data processing and Ozone is a distributed object store that…

What Is the Difference between Apache Spark and PySpark

By Jerry Richard, May 26, 2023

Have you been wondering what the difference is between Apache Spark and PySpark, and which one to use for big…

How to Enable Debug Mode in spark-submit, spark-shell, and pyspark

By Jerry Richard, May 26, 2023

We often need to enable the debug log level in Spark to understand and troubleshoot an issue. In this article,…

Script to collect thread dump (Jstack)

By Jerry Richard, May 26, 2023

Jstack is a command-line tool that captures the thread dump of a Java process. Using the thread…

Handling Data Skew in Apache Spark Applications

By Jerry Richard, May 26, 2023

What is data skew? Let’s take a basic example of “CONSTRUCTION WORKERS”. In the above example, skew happened due to…

How to Run the Spark History Server Locally

By Jerry Richard, May 26, 2023

A step-by-step guide to setting up the Spark history server locally (Mac or Windows). This helps to debug…

Difference between DataFrame, Dataset, and RDD in Spark

By Jerry Richard, May 26, 2023

Short history of Spark: Spark was created at Berkeley back in 2009; an evolution of the MapReduce concepts…

What Is the Difference between Cache and Checkpoint in Spark

By Jerry Richard, May 26, 2023

Spark is a data processing framework that helps to process data faster. It uses in-memory computation and multiple nodes to run…

Resolve “Task serialization failed: java.lang.StackOverflowError” in Spark

By Jerry Richard, May 26, 2023

“Task serialization failed: java.lang.StackOverflowError” usually happens when the JVM encounters a situation where it is unable to create a…

How to Enable Kerberos Debugging for Spark Application and “hdfs dfs” Command

By Jerry Richard, May 26, 2023

Kerberos debugging involves enabling the debug log level for the Krb5LoginModule at the JVM level. This helps us to…

Resolve “org.apache.hadoop.hive.serde2.SerDeException: Unexpected tag” in Spark and Hive

By Jerry Richard, May 26, 2023

We usually see the error “org.apache.hadoop.hive.serde2.SerDeException: Unexpected tag” in Spark when trying to connect the Hive…

Total size of serialized results of tasks (1024.5 MB) is bigger than spark.driver.maxResultSize

By Jerry Richard, May 26, 2023

“failure: Total size of serialized results of x tasks (1024.5 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)” in Spark

spark.driver.memoryOverhead and spark.executor.memoryOverhead explained

By Jerry Richard, May 26, 2023

In this article, we will learn about memory overhead configuration in Spark and explore spark.driver.memoryOverhead and spark.executor.memoryOverhead and…

“Futures timed out” Issue in Spark

By Jerry Richard, May 26, 2023

“Futures timed out” is a common error that can occur when running Spark applications. In this article, we will learn,…

How to Access HBase from Spark

By Jerry Richard, May 26, 2023

As we all know, Spark is an open-source, distributed processing framework used in big data. It helps perform analytics on…

Difference between map and flatMap in Spark

By Jerry Richard, May 26, 2023

Apache Spark is a powerful distributed framework that leverages in-memory caching and optimized query execution to produce faster results. The…

How to read and write XML files using Spark?

By Jerry Richard, May 26, 2023

Spark is a powerful framework for processing large datasets in a distributed manner. In this article, we will discuss how…

Difference between map and mapValues functions in Spark

By Jerry Richard, May 26, 2023

Spark is a distributed framework that uses in-memory computation to process large volumes of data much faster. One…

Resolve the “java.lang.OutOfMemoryError: Java heap space” issue in Spark and Hive (Tez and MR)

By Jerry Richard, May 26, 2023

OutOfMemoryError is not a surprise in Spark, as it is a memory-centric framework. To deal with memory issues, we need…