Wednesday, February 4, 2026

Introducing the Apache Spark troubleshooting agent for Amazon EMR and AWS Glue


The newly launched Apache Spark troubleshooting agent can eliminate hours of manual investigation for data engineers and scientists working with Amazon EMR or AWS Glue. Instead of navigating multiple consoles, sifting through extensive log files, and manually analyzing performance metrics, you can now diagnose Spark failures using simple natural language prompts. The agent automatically analyzes your workloads and delivers actionable recommendations, transforming a time-consuming troubleshooting process into a streamlined, efficient experience.

In this post, we show you how the Apache Spark troubleshooting agent helps analyze Apache Spark issues by providing detailed root causes and actionable recommendations. You’ll learn how to streamline your troubleshooting workflow by integrating this agent with your existing monitoring solutions across Amazon EMR and AWS Glue.

Apache Spark powers critical ETL pipelines, real-time analytics, and machine learning workloads across thousands of organizations. However, building and maintaining Spark applications remains an iterative process where developers spend significant time troubleshooting. Spark application developers encounter operational challenges for a few different reasons:

  • Complex connectivity and configuration options to a variety of resources with Spark – Although this makes Spark a popular data processing platform, it often makes it challenging to find the root cause of inefficiencies or failures when Spark settings aren’t optimally or correctly configured.
  • Spark’s in-memory processing model and distributed partitioning of datasets across its workers – Although good for parallelism, this often makes it difficult for users to identify inefficiencies. The result is slow application execution, or failures caused by resource exhaustion (such as out-of-memory and disk exceptions) whose root cause is hard to pin down.
  • Lazy evaluation of Spark transformations – Although lazy evaluation optimizes performance, it makes it challenging to accurately and quickly identify the application code and logic that caused the failure from the distributed logs and metrics emitted from different executors.
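To make the lazy evaluation point concrete, here is a minimal plain-Python analogy (not Spark code). It uses the built-in lazy `map` to show how a faulty transformation only fails when a later step consumes the pipeline. The function names and sample records are made up for illustration.

```python
def parse_amount(record):
    # The bug lives here: non-numeric input raises ValueError.
    return int(record)


def build_pipeline(records):
    # Lazy: map() only records what to do; nothing runs yet.
    parsed = map(parse_amount, records)
    doubled = map(lambda value: value * 2, parsed)
    return doubled


def run_action(pipeline):
    # The "action": consuming the iterator finally executes every step,
    # so the error surfaces here, far from where parse_amount was wired in.
    try:
        return list(pipeline), None
    except ValueError as err:
        return None, err


# Building the pipeline succeeds even though the input contains a bad record.
pipeline = build_pipeline(["1", "2", "oops", "4"])
result, error = run_action(pipeline)
print("result:", result)
print("error raised at action time:", repr(error))
```

In Spark the same effect is spread across executors and log files, which is why tracing a failure back to the transformation that caused it takes effort.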

Apache Spark troubleshooting agent architecture

This section describes the components of the troubleshooting agent and how they connect to your development environment. The troubleshooting agent provides a single conversational entry point for your Spark applications across Amazon EMR, AWS Glue, and Amazon SageMaker Notebooks. Instead of navigating different consoles, APIs, and log locations for each service, you interact with one Model Context Protocol (MCP) server through natural language using any MCP-compatible AI assistant of your choice, including custom agents you develop using frameworks such as Strands Agents.

Operating as a fully managed cloud-hosted MCP server, the agent removes the need to maintain local servers while keeping your data and code isolated and secure in a single-tenant system design. Operations are read-only and backed by AWS Identity and Access Management (IAM) permissions; the agent only has access to the resources and actions your IAM role grants. Additionally, tool calls are automatically logged to AWS CloudTrail, providing full auditability and compliance visibility. This combination of managed infrastructure, granular IAM controls, and CloudTrail integration helps keep your Spark diagnostic workflows secure, compliant, and fully auditable.

The agent builds on years of AWS expertise running millions of Spark applications at scale. It automatically analyzes Spark History Server data, distributed executor logs, configuration patterns, and error stack traces, and extracts relevant features and signals to surface insights that would otherwise require manual correlation across multiple data sources and a deep understanding of Spark and service internals.

Getting started

Complete the following steps to get started with the Apache Spark troubleshooting agent.

Prerequisites

Verify you meet or have completed the following prerequisites.

System requirements:

  • Python 3.10 or higher
  • Install the uv package manager. For instructions, see installing uv.
  • AWS Command Line Interface (AWS CLI) (version 2.30.0 or later) installed and configured with appropriate credentials.

IAM permissions: Your AWS IAM profile needs permissions to invoke the MCP server and access your Spark workload resources. The AWS CloudFormation template in the setup documentation creates an IAM role with the required permissions. You can also manually add the required IAM permissions.

Set up using AWS CloudFormation

First, deploy the AWS CloudFormation template provided in the setup documentation. This template automatically creates the IAM roles with the permissions required to invoke the MCP server.

  1. Deploy the template in the same AWS Region you run your workloads in. For this post, we’ll use us-east-1.
  2. From the AWS CloudFormation Outputs tab, copy and run the environment variable command:
    export SMUS_MCP_REGION=us-east-1 && export IAM_ROLE=arn:aws:iam::111122223333:role/spark-troubleshooting-role-xxxxxx

  3. Configure your AWS CLI profile:
    aws configure set profile.smus-mcp-profile.role_arn ${IAM_ROLE}
    aws configure set profile.smus-mcp-profile.source_profile default
    aws configure set profile.smus-mcp-profile.region ${SMUS_MCP_REGION}
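As a sanity check, it can help to know what those three commands should leave in your AWS CLI config file (typically `~/.aws/config`). The following minimal sketch renders the expected profile section with Python's standard `configparser`. The `render_profile` helper is hypothetical, and the values are the placeholders used in this post.

```python
import configparser
import io


def render_profile(profile, role_arn, source_profile, region):
    """Render the INI section we expect the `aws configure set` calls to create."""
    config = configparser.ConfigParser()
    # Named profiles in the AWS CLI config file use a "profile <name>" header.
    config[f"profile {profile}"] = {
        "role_arn": role_arn,
        "source_profile": source_profile,
        "region": region,
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


text = render_profile(
    profile="smus-mcp-profile",
    role_arn="arn:aws:iam::111122223333:role/spark-troubleshooting-role-xxxxxx",
    source_profile="default",
    region="us-east-1",
)
print(text)
```

If the section in your own config file differs, for example a missing `region` key, rerun the corresponding `aws configure set` command.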

Set up using Kiro CLI

You can use Kiro CLI to interact with the Apache Spark troubleshooting agent directly from your terminal.

Installation and configuration:

  1. Install Kiro CLI.
  2. Add both MCP servers, using the environment variables from the previous Set up using AWS CloudFormation section:
    # Add Spark Troubleshooting MCP Server
    kiro-cli-chat mcp add \
        --name "sagemaker-unified-studio-mcp-troubleshooting" \
        --command "uvx" \
        --args "[\"mcp-proxy-for-aws@latest\",\"https://sagemaker-unified-studio-mcp.${SMUS_MCP_REGION}.api.aws/spark-troubleshooting/mcp\", \"--service\", \"sagemaker-unified-studio-mcp\", \"--profile\", \"smus-mcp-profile\", \"--region\", \"${SMUS_MCP_REGION}\", \"--read-timeout\", \"180\"]" \
        --timeout 180000 \
        --scope global
    # Add Spark Code Recommendation MCP Server
    kiro-cli-chat mcp add \
        --name "sagemaker-unified-studio-mcp-code-rec" \
        --command "uvx" \
        --args "[\"mcp-proxy-for-aws@latest\",\"https://sagemaker-unified-studio-mcp.${SMUS_MCP_REGION}.api.aws/spark-code-recommendation/mcp\", \"--service\", \"sagemaker-unified-studio-mcp\", \"--profile\", \"smus-mcp-profile\", \"--region\", \"${SMUS_MCP_REGION}\", \"--read-timeout\", \"180\"]" \
        --timeout 180000 \
        --scope global

  3. Verify your setup by running the /tools command in Kiro CLI to see the available Apache Spark troubleshooting tools.
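The `--args` value in step 2 is a JSON array passed as a single shell argument, so the nested quotes are easy to get wrong when typing by hand. The following minimal sketch builds the same command line with Python's standard `json` and `shlex` modules, which handle the quoting for you. The `build_mcp_add_command` helper is hypothetical; the flags and URL pattern are copied from the commands in this post.

```python
import json
import shlex


def build_mcp_add_command(name, endpoint_path, region, profile="smus-mcp-profile"):
    """Return a shell-safe `kiro-cli-chat mcp add` command line."""
    url = f"https://sagemaker-unified-studio-mcp.{region}.api.aws/{endpoint_path}/mcp"
    proxy_args = [
        "mcp-proxy-for-aws@latest", url,
        "--service", "sagemaker-unified-studio-mcp",
        "--profile", profile,
        "--region", region,
        "--read-timeout", "180",
    ]
    argv = [
        "kiro-cli-chat", "mcp", "add",
        "--name", name,
        "--command", "uvx",
        # json.dumps produces the array; shlex.join quotes it for the shell.
        "--args", json.dumps(proxy_args),
        "--timeout", "180000",
        "--scope", "global",
    ]
    return shlex.join(argv)


command = build_mcp_add_command(
    name="sagemaker-unified-studio-mcp-troubleshooting",
    endpoint_path="spark-troubleshooting",
    region="us-east-1",
)
print(command)
```

Calling the helper again with `spark-code-recommendation` as the endpoint path and the `-code-rec` name yields the second command.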

Set up using Kiro IDE

Kiro IDE provides a visual development environment with built-in AI assistance for interacting with the Apache Spark troubleshooting agent.

Installation and configuration:

  1. Install Kiro IDE.
  2. MCP configuration is shared across Kiro CLI and Kiro IDE. Open the command palette using Ctrl + Shift + P (Windows / Linux) or Cmd + Shift + P (macOS) and search for Kiro: Open MCP Config.
  3. Verify the contents of your mcp.json match the Set up using Kiro CLI section.
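If you prefer to script the check in step 3, the following minimal sketch looks for both server entries in the text of an mcp.json file. It assumes the file has a top-level `mcpServers` object keyed by server name, with a `command` field per entry; that layout is an assumption, so confirm it against your own file. The `find_problems` helper and the sample config are hypothetical.

```python
import json

EXPECTED_SERVERS = (
    "sagemaker-unified-studio-mcp-troubleshooting",
    "sagemaker-unified-studio-mcp-code-rec",
)


def find_problems(config_text, expected=EXPECTED_SERVERS):
    """Return a list of human-readable problems found in an mcp.json document."""
    # Assumed layout: {"mcpServers": {"<name>": {"command": ..., "args": [...]}}}
    servers = json.loads(config_text).get("mcpServers", {})
    problems = []
    for name in expected:
        entry = servers.get(name)
        if entry is None:
            problems.append(f"{name}: missing")
        elif entry.get("command") != "uvx":
            problems.append(f"{name}: unexpected command {entry.get('command')!r}")
    return problems


# Sample config with only one of the two servers registered.
sample = json.dumps({
    "mcpServers": {
        "sagemaker-unified-studio-mcp-troubleshooting": {"command": "uvx", "args": []},
    }
})
problems = find_problems(sample)
print(problems)
```

An empty list means both servers are present and launched through `uvx`.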

Using the troubleshooting agent

Next, we provide three reference architectures for solutions that use the troubleshooting agent in your existing workflows with ease. We also provide the reference code and AWS CloudFormation templates for these architectures in the Amazon EMR Utilities GitHub repository.

Solution 1 – Conversational troubleshooting: Troubleshooting failed Apache Spark applications with Kiro CLI

When Spark applications fail across your data platform, your debugging approach would typically involve navigating different consoles for Amazon EMR on Amazon EC2, Amazon EMR Serverless, and AWS Glue, manually reviewing Spark History Server logs, checking error stack traces, analyzing resource utilization patterns, then correlating this information to find the root cause and a fix. The Apache Spark troubleshooting agent automates this entire workflow through natural language, providing a unified troubleshooting experience across the three platforms. Simply describe your failed applications, for example:

# Amazon EMR-EC2
Debug my failing Amazon EMR-EC2 step. Cluster id: 'j-xxxxx' Step id: 's-xxxxx'
# Amazon EMR Serverless
Troubleshoot my Amazon EMR Serverless job. Application id: 'xxxxx' Job run id: 'xxxxx'
# AWS Glue
Analyze my failed AWS Glue job. Job name: 'my-etl-job' Job run id: 'jr_xxxxx'
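If you send these prompts from scripts rather than typing them, a small template helper keeps the wording consistent across platforms. The following is a minimal sketch; the platform keys and the `build_prompt` helper are hypothetical, while the prompt wording mirrors the examples above.

```python
PROMPT_TEMPLATES = {
    "emr-ec2": (
        "Debug my failing Amazon EMR-EC2 step. "
        "Cluster id: '{cluster_id}' Step id: '{step_id}'"
    ),
    "emr-serverless": (
        "Troubleshoot my Amazon EMR Serverless job. "
        "Application id: '{application_id}' Job run id: '{job_run_id}'"
    ),
    "glue": (
        "Analyze my failed AWS Glue job. "
        "Job name: '{job_name}' Job run id: '{job_run_id}'"
    ),
}


def build_prompt(platform, **ids):
    """Fill in the troubleshooting prompt for one platform."""
    try:
        template = PROMPT_TEMPLATES[platform]
    except KeyError:
        raise ValueError(f"unsupported platform: {platform}") from None
    # str.format raises KeyError if a required identifier is missing.
    return template.format(**ids)


prompt = build_prompt("glue", job_name="my-etl-job", job_run_id="jr_xxxxx")
print(prompt)
```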

The agent automatically extracts Spark event logs and metrics, analyzes the error patterns, and provides a clear root cause explanation along with recommendations, all through the same conversational interface. The following video demonstrates the complete troubleshooting workflow across Amazon EMR-EC2, Amazon EMR Serverless, and AWS Glue using Kiro CLI:

Solution 2 – Agent-driven notifications: Integrate the Apache Spark troubleshooting agent into a monitoring workflow

In addition to troubleshooting from the command line, the troubleshooting agent can plug into your monitoring infrastructure to provide improved failure notifications.

Production data pipelines require immediate visibility when failures occur. Traditional monitoring systems can alert you when a Spark job fails, but diagnosing the root cause still requires manual investigation and an analysis of what went wrong before remediation can begin.

You can integrate the Apache Spark troubleshooting agent into your existing monitoring workflows to receive root causes and recommendations as soon as you receive a failure notification. Here, we demonstrate two integration patterns that result in automated root cause analysis within your existing workflows.

Apache Airflow integration

This first integration pattern uses Apache Airflow callbacks to automatically trigger troubleshooting when Spark job operators fail.

When any Amazon EMR on Amazon EC2, Amazon EMR Serverless, or AWS Glue job operator fails in an Apache Airflow DAG:

  1. A callback invokes the Spark troubleshooting agent within a separate DAG.
  2. The Spark troubleshooting agent analyzes the issue, establishes the root cause, and identifies code fix recommendations.
  3. The Spark troubleshooting agent sends a comprehensive diagnostic report to a configured Slack channel.
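The callback step above can be sketched as follows. This is not the reference implementation from the repository. It is a minimal illustration that assumes the callback receives Airflow's task context (with `task_instance`, `run_id`, and `exception` entries) and hands a small payload to a separate troubleshooting DAG. The `trigger_dag` callable, the `spark_troubleshooting` DAG id, and the payload keys are hypothetical, and the demo uses a stand-in context so it runs without Airflow installed.

```python
from types import SimpleNamespace


def make_failure_callback(trigger_dag, troubleshooting_dag_id="spark_troubleshooting"):
    """Return a function you could pass as an operator's on_failure_callback."""

    def on_failure(context):
        task_instance = context["task_instance"]
        # Payload handed to the troubleshooting DAG (hypothetical schema).
        conf = {
            "failed_dag_id": task_instance.dag_id,
            "failed_task_id": task_instance.task_id,
            "run_id": context.get("run_id"),
            "exception": str(context.get("exception", "")),
        }
        trigger_dag(troubleshooting_dag_id, conf)
        return conf

    return on_failure


# Stand-ins so the sketch runs without an Airflow installation.
triggered = []
callback = make_failure_callback(lambda dag_id, conf: triggered.append((dag_id, conf)))
fake_context = {
    "task_instance": SimpleNamespace(dag_id="nightly_etl", task_id="run_glue_job"),
    "run_id": "manual__2026-01-01",
    "exception": RuntimeError("Glue job failed"),
}
callback(fake_context)
print(triggered)
```

In a real deployment, `trigger_dag` would wrap whatever mechanism you use to start the troubleshooting DAG, and the callback would be attached to each Spark job operator.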

The solution is available in the Amazon EMR Utilities GitHub repository (documentation) for immediate integration into your existing Apache Airflow deployments with a one-line change to your Airflow DAGs. The following video demonstrates this integration:

Amazon EventBridge integration

For event-driven architectures, this second pattern uses Amazon EventBridge to automatically invoke the troubleshooting agent when Spark jobs fail across your AWS environment.

This integration uses an AWS Lambda function that interacts with the Apache Spark troubleshooting agent through the Strands MCP Client.

When Amazon EventBridge detects failures from Amazon EMR-EC2 steps, Amazon EMR Serverless job runs, or AWS Glue job runs, it triggers the AWS Lambda function, which:

  1. Uses the Apache Spark troubleshooting agent to analyze the failure
  2. Identifies the root cause and generates code fix recommendations
  3. Constructs a comprehensive analysis summary
  4. Sends the summary to Amazon SNS
  5. Delivers the analysis to your configured destinations (email, Slack, or other SNS subscribers)
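The steps above can be sketched as a Lambda-style handler. This is a minimal illustration, not the reference implementation. The `analyze` and `publish` callables are hypothetical stand-ins for the agent call and the Amazon SNS publish. The event `source` values and `detail` field names are assumptions based on the usual EventBridge state-change events for these services, so check them against the events in your own account.

```python
def extract_failure(event):
    """Map an EventBridge state-change event to a platform and its identifiers."""
    source = event.get("source")
    detail = event.get("detail", {})
    if source == "aws.emr":
        ids = {"cluster_id": detail.get("clusterId"), "step_id": detail.get("stepId")}
        return "emr-ec2", ids
    if source == "aws.emr-serverless":
        ids = {
            "application_id": detail.get("applicationId"),
            "job_run_id": detail.get("jobRunId"),
        }
        return "emr-serverless", ids
    if source == "aws.glue":
        ids = {"job_name": detail.get("jobName"), "job_run_id": detail.get("jobRunId")}
        return "glue", ids
    raise ValueError(f"unsupported event source: {source!r}")


def make_handler(analyze, publish):
    """Build a handler from injected analysis and notification callables."""

    def handler(event, context):
        platform, ids = extract_failure(event)
        analysis = analyze(platform, ids)   # steps 1-2: root cause and fixes
        summary = f"Spark failure on {platform} {ids}\n\n{analysis}"  # step 3
        publish(summary)                    # steps 4-5: hand off for delivery
        return {"platform": platform, "summary": summary}

    return handler


# Stand-ins so the sketch runs locally without AWS access.
published = []
handler = make_handler(lambda platform, ids: "root cause: executor OOM", published.append)
event = {
    "source": "aws.glue",
    "detail": {"jobName": "my-etl-job", "jobRunId": "jr_xxxxx", "state": "FAILED"},
}
result = handler(event, None)
print(result["summary"])
```

Injecting `analyze` and `publish` keeps the event parsing testable on its own, separate from the agent and notification clients.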

This serverless approach provides centralized failure analysis across all your Spark platforms without requiring changes to individual pipelines. The following video demonstrates this integration:

A reference implementation of this solution is available in the Amazon EMR Utilities GitHub repository (documentation).

Solution 3 – Intelligent dashboards: Use the Apache Spark troubleshooting agent with Kiro IDE to visualize account-level application failures: what failed, why it failed, and how to fix it

Understanding the health of your Spark workloads across multiple platforms requires consolidating data from Amazon EMR (both EC2 and Serverless) and AWS Glue. Teams often build custom monitoring solutions by writing scripts to query multiple APIs, aggregate metrics, and generate reports, which can be time consuming and require active maintenance.

With Kiro IDE and the Apache Spark troubleshooting agent, you can build comprehensive monitoring dashboards conversationally. Instead of writing custom code to aggregate workload metrics, you can describe what you want to monitor, and the agent generates a complete dashboard showing overall performance metrics, error category distributions for failures, success rates across platforms, and critical failures requiring immediate attention. Unlike traditional dashboards that only show KPIs and metrics on which application failed, this dashboard uses the Spark troubleshooting agent to give users insights into why the applications failed and how they can be fixed. The following video demonstrates building a multi-platform monitoring dashboard using Kiro IDE:

The prompt used in the demo:

Build comprehensive monitoring dashboard for all of my Amazon EMR-EC2 steps, Amazon EMR Serverless jobs, and AWS Glue jobs for the last 30 days. Region: us-east-2.
Execution Plan:
1. List all of my Spark applications across these services from the last 30 days. You can store any intermediate results in files in this folder as .json, but VALIDATE outputs before moving on to the next step. It is important to check the results before considering this done. You can write Python script helpers to achieve this. Handle throttling and other exceptions gracefully. Make sure you cover all platforms: Amazon EMR-EC2, Amazon EMR Serverless, and AWS Glue.
2. Use the spark-troubleshooting-mcp to gather failure insights for each of my applications. Save this as .json as well.
3. Then, use this information to help build the dashboard as HTML. Name the file dashboard.html.
Dashboard Requirements:
- Information from all of my Amazon EMR-EC2, Amazon EMR Serverless, and AWS Glue applications should be present
- overall success rates across platforms
- error category distributions for failures as a pie chart
- failures from last 30 days requiring attention with root causes and recommendations. Include error category and show the root causes and recommendations as they are returned by the spark-troubleshooting-mcp
- configuration comparisons per each platform. Configuration includes versions, worker types / DPUs, etc.
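To give a feel for the kind of helper script the agent might write for the dashboard requirements, here is a minimal sketch that computes per-platform success rates and the error category distribution from collected run records. The record schema (`platform`, `status`, `error_category`) and the sample data are hypothetical; the real intermediate .json files may look different.

```python
from collections import Counter, defaultdict


def summarize(runs):
    """Return (success rate per platform, failure count per error category)."""
    totals = defaultdict(lambda: {"succeeded": 0, "total": 0})
    categories = Counter()
    for run in runs:
        stats = totals[run["platform"]]
        stats["total"] += 1
        if run["status"] == "SUCCEEDED":
            stats["succeeded"] += 1
        else:
            # Failures without a category are grouped under UNKNOWN.
            categories[run.get("error_category", "UNKNOWN")] += 1
    success_rates = {
        platform: stats["succeeded"] / stats["total"]
        for platform, stats in totals.items()
    }
    return success_rates, dict(categories)


# Hypothetical records, shaped like the intermediate output of steps 1 and 2.
runs = [
    {"platform": "glue", "status": "SUCCEEDED"},
    {"platform": "glue", "status": "FAILED", "error_category": "OUT_OF_MEMORY"},
    {"platform": "emr-serverless", "status": "FAILED", "error_category": "OUT_OF_MEMORY"},
    {"platform": "emr-ec2", "status": "SUCCEEDED"},
]
success_rates, categories = summarize(runs)
print(success_rates)
print(categories)
```

The two dictionaries map directly onto the success rate panel and the pie chart requested in the prompt.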

Clean up

To avoid incurring future AWS charges, delete the resources you created during this walkthrough:

  • Delete the AWS CloudFormation stack.
  • If you created an Amazon EventBridge rule for integration, delete those resources.

Conclusion

In this post, we demonstrated how the Apache Spark troubleshooting agent transforms hours of manual investigation into natural language conversations, reducing troubleshooting time from hours to minutes and making Spark expertise accessible to all. By integrating natural language diagnostics into your existing development tools, whether Kiro CLI, Kiro IDE, or other MCP-compatible AI assistants, your teams can focus on building innovative applications instead of debugging failures.


Special thanks

A special thanks to everyone who contributed from engineering and science to the launch of the Spark troubleshooting agent and the remote MCP service: Tony Rusignuolo, Anshi Shrivastava, Martin Ma, Hirva Patel, Pranjal Srivastava, Weijing Cai, Rupak Ravi, Bo Li, Vaibhav Naik, XiaoRun Yu, Tina Shao, Pramod Chunduri, Ray Liu, Yueying Cui, Savio Dsouza, Kinshuk Pahare, Tim Kraska, Santosh Chandrachood, Paul Meighan and Rick Sears.

A special thanks to all of our partners who contributed to the launch of the Spark troubleshooting agent and the remote MCP service: Karthik Prabhakar, Suthan Phillips, Basheer Sheriff, Kamen Sharlandjiev, Archana Inapudi, Vara Bonthu, McCall Peltier, Lydia Kautsky, Larry Weber, Jason Berkovitz, Jordan Vaughn, Amar Wakharkar, Subramanya Vajiraya, Boyko Radulov and Ishan Gaur.

About the authors

Jake Zych

Jake is a Software Development Engineer at AWS Analytics. He has a deep interest in distributed systems and generative AI. In his spare time, Jake likes to create video content and play board games.

Maheedhar Reddy Chappidi

Maheedhar is a Senior Software Development Engineer at AWS Analytics. He’s passionate about building fault-tolerant, reliable distributed systems at scale and generative AI applications for Data Integration. Outside of work, Maheedhar enjoys listening to podcasts and playing with his two-year-old child.

Vishal Kajjam

Vishal is a Senior Software Development Engineer at AWS Analytics. He’s passionate about distributed computing and using ML/AI for designing and building end-to-end solutions to address customers’ data integration needs. In his spare time, he enjoys spending time with family and friends.

Arunav Gupta

Arunav is a Software Development Engineer at AWS Analytics. He’s passionate about generative AI and orchestration and their uses in improving developer quality-of-life. In his free time, Arunav enjoys competing in a karting league and exploring new coffee shops in New York.

Wei Tang

Wei is a Software Development Engineer at AWS Analytics. She is a strong developer with deep interests in solving recurring customer problems with distributed systems and AI/ML.

Andrew Kim

Andrew is a Software Development Engineer at AWS Analytics, with a deep passion for distributed systems architecture and AI-driven solutions, specializing in intelligent data integration workflows and cutting-edge feature development on Apache Spark. Andrew focuses on re-inventing and simplifying solutions to complex technical problems, and he enjoys creating web apps and producing music in his free time.

Jeremy Samuel

Jeremy is a Software Development Engineer at AWS Analytics. He has a strong interest in developing distributed systems and generative AI. In his spare time, he enjoys playing video games and listening to music.

Kartik Panjabi

Kartik is a Software Development Manager at AWS Analytics. His team builds generative AI features for Data Integration and distributed systems for data integration.

Shubham Mehta

Shubham is a Senior Product Manager at AWS Analytics. He leads generative AI feature development across services such as AWS Glue, Amazon EMR, and Amazon MWAA, using AI/ML to simplify and enhance the experience of data practitioners building data applications on AWS.

Vidyashankar Sivakumar

Vidyashankar is an applied scientist in the Data Processing and Experiences group, where he works on DevOps agents that simplify and optimize the customer journey for AWS Big Data processing services such as Amazon EMR and AWS Glue. Outside of work, Vidyashankar enjoys listening to podcasts on current affairs, AI/ML, and AIOps, as well as following cricket.

Muhammad Ali Gulzar

Muhammad is an Amazon Scholar in the Data Processing Agents Science group, and an assistant professor in the Computer Science Department at Virginia Tech. Gulzar’s research interests lie at the intersection of software engineering and big data systems.

Mukul Prasad

Mukul is a Senior Applied Science Manager in the Data Processing and Experiences group. He leads the Data Processing Agents Science group developing DevOps agents to simplify and optimize the customer journey in using AWS Big Data processing services including Amazon EMR, AWS Glue, and Amazon SageMaker Unified Studio. Outside of work, Mukul enjoys food, travel, photography, and cricket.

Mohit Saxena

Mohit is a Senior Software Development Manager at AWS Analytics. He leads development of distributed systems with AI/ML-driven capabilities and agents to simplify and optimize the experience of data practitioners who build big data applications with Apache Spark, Amazon S3, and data lakes/warehouses in the cloud.
