Introduction to .NET Apache Spark

iFour Team - December 04, 2020

Table of Contents

1. What is Apache Spark?
2. Why do we use it?
3. Common big data scenarios
4. Conclusion

.NET Core is a free, open-source platform developed by Microsoft. It is cross-platform and supports most major operating systems, including macOS, Windows, and Linux.

What is Apache Spark?

Apache Spark is an open-source, general-purpose distributed processing engine for analytics over large data sets, commonly terabytes or petabytes of data. Apache Spark can be used for machine learning, real-time stream processing, ad-hoc queries, and other big data workloads.

Why do we use it?

Apache Spark is fast because processing tasks are distributed over a cluster of nodes and data is cached in memory, which reduces computation time. For many workloads it is faster than Hadoop MapReduce, and it can be used from Java, Scala, R, Python, SQL, and .NET.

Common big data scenarios

Spark is a general-purpose distributed processing engine that can be used for the following big data scenarios.

Extract, transform, and load (ETL)

Extract, transform, and load (ETL) is the process of collecting data from one or more sources, modifying the data, and moving it to a new data store. Common ways of transforming data include:

Filtering
Sorting
Aggregating
Joining
Cleaning
Deduplicating
Validating
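
The transformations listed above can be sketched with the .NET for Apache Spark DataFrame API. This is only an illustrative sketch: the file orders.csv, its columns customer and amount, and the output path customer_totals.parquet are assumptions made up for the example, and it needs the environment and the Microsoft.Spark package described later in this article.

```csharp
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

namespace EtlSketch
{
    class Program
    {
        static void Main(string[] args)
        {
            SparkSession spark = SparkSession
                .Builder()
                .AppName("etl_sketch")
                .GetOrCreate();

            // Extract: read a CSV file with a header row.
            // "orders.csv" and its columns are hypothetical.
            DataFrame orders = spark.Read()
                .Option("header", true)
                .Option("inferSchema", true)
                .Csv("orders.csv");

            // Transform: clean, filter, deduplicate, aggregate, and sort.
            DataFrame totals = orders
                .Filter(Col("amount").IsNotNull())      // cleaning
                .Filter(Col("amount").Gt(0))            // filtering
                .DropDuplicates()                       // deduplicating
                .GroupBy("customer")                    // aggregating
                .Agg(Sum(Col("amount")).Alias("total"))
                .OrderBy(Col("total").Desc());          // sorting

            // Load: write the result to a new data store (hypothetical path).
            totals.Write().Mode("overwrite").Parquet("customer_totals.parquet");

            spark.Stop();
        }
    }
}
```

Each call returns a new DataFrame, so the individual steps can be reordered or dropped to match the data you actually have.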

Real-time data stream processing

Apache Spark is used to process large streams of data, such as monitoring streams of sensor data or analyzing financial transactions to detect fraud. Spark Streaming is the component of Apache Spark used for real-time data stream processing. Streaming or real-time data is an example of data in motion.
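
As a rough illustration of stream processing from .NET, the sketch below counts words arriving on a network socket. It uses the Structured Streaming API (ReadStream and WriteStream) rather than the older Spark Streaming API, and the host localhost and port 9999 are assumptions chosen for the example. It requires the environment set up later in this article and something writing text to that socket.

```csharp
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Streaming;
using static Microsoft.Spark.Sql.Functions;

namespace StreamingSketch
{
    class Program
    {
        static void Main(string[] args)
        {
            SparkSession spark = SparkSession
                .Builder()
                .AppName("streaming_sketch")
                .GetOrCreate();

            // Read lines of text from a socket (host and port are assumed values).
            DataFrame lines = spark
                .ReadStream()
                .Format("socket")
                .Option("host", "localhost")
                .Option("port", 9999)
                .Load();

            // Split each incoming line into words and keep a running count per word.
            DataFrame wordCounts = lines
                .Select(Explode(Split(Col("value"), " ")).Alias("word"))
                .GroupBy("word")
                .Count();

            // Print the full, updated result table to the console after each batch.
            StreamingQuery query = wordCounts
                .WriteStream()
                .OutputMode("complete")
                .Format("console")
                .Start();

            // Block until the query is stopped.
            query.AwaitTermination();
        }
    }
}
```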

Batch processing

Batch processing is the processing of large volumes of data at rest: filtering, aggregating, and preparing very large datasets using long-running jobs that run in parallel.

Machine learning through MLlib

Apache Spark's machine learning library, MLlib, contains several machine learning algorithms and utilities used to forecast or predict future behaviors, outcomes, and trends using existing data.

Graph processing through GraphX

A graph data structure is a collection of nodes connected by edges. A graph database is useful when you have hierarchical data or data with interconnected relationships. Spark's GraphX API can be used to process this kind of data.

SQL and structured data processing with Spark SQL

Spark SQL is used when you're working with structured (formatted) data in your Spark application.
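
A minimal sketch of Spark SQL from .NET might look like the following. The file people.json and its fields name and age are hypothetical, chosen only to show how a DataFrame can be registered as a temporary view and queried with SQL. It assumes the environment prepared later in this article.

```csharp
using Microsoft.Spark.Sql;

namespace SparkSqlSketch
{
    class Program
    {
        static void Main(string[] args)
        {
            SparkSession spark = SparkSession
                .Builder()
                .AppName("spark_sql_sketch")
                .GetOrCreate();

            // Load structured data; Spark infers the schema from the JSON records.
            // "people.json" with fields "name" and "age" is a hypothetical input.
            DataFrame people = spark.Read().Json("people.json");

            // Register the DataFrame under a name that SQL queries can refer to.
            people.CreateOrReplaceTempView("people");

            // Run an ordinary SQL query against the view.
            DataFrame adults = spark.Sql(
                "SELECT name, age FROM people WHERE age >= 18 ORDER BY age");

            adults.Show();

            spark.Stop();
        }
    }
}
```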

Get started with .NET for Apache Spark

Prepare your environment for .NET for Apache Spark
Write your first .NET for Apache Spark application
Build and run your .NET for Apache Spark application

Prepare your environment

Step-1 Install .NET

Download the .NET SDK from https://dotnet.microsoft.com/, install it, and add the dotnet toolchain to your PATH.

After installing the .NET Core SDK, open a command prompt and run dotnet to verify that the installation succeeded.

Figure 1 Command prompt

Step-2 Install Java SDK

Download the Java SDK (JDK) from https://www.oracle.com/java/technologies/, install it, and set the required environment variable (typically JAVA_HOME).

After installing the Java SDK, open a command prompt and run java -version to verify that the installation succeeded.

Figure 2 Command prompt

Step-3 Install Compression software

Download and install 7-Zip (https://www.7-zip.org/) or WinZip (https://www.winzip.com/win/en/) to extract the Apache Spark archive, which is downloaded as a .tgz file.

Step-4 Install Apache Spark

Download the latest stable version of Apache Spark from https://spark.apache.org/downloads.html and extract it using 7-Zip. The steps below use spark-3.0.1-bin-hadoop2.7; adjust the file names if you downloaded a different version.

To extract the nested .tar file:

Open the download folder and locate the spark-3.0.1-bin-hadoop2.7.tgz file.
Right-click on the file and select 7-Zip -> Extract here.
spark-3.0.1-bin-hadoop2.7.tar is extracted from the .tgz file you downloaded.

To extract the Apache Spark files:

Right-click on spark-3.0.1-bin-hadoop2.7.tar and select 7-Zip -> Extract files.
Enter C:\bin in the Extract to field.
Uncheck the checkbox below the Extract to field.
Select OK.
The Apache Spark files are extracted to C:\bin\spark-3.0.1-bin-hadoop2.7\

Figure 3 Extraction using 7-ZIP

Step-5 Set the environment variables

Run the following commands to set the environment variables that point to the Apache Spark installation.

setx HADOOP_HOME C:\bin\spark-3.0.1-bin-hadoop2.7\
setx SPARK_HOME C:\bin\spark-3.0.1-bin-hadoop2.7\

Figure 4 Set environment variable

Open a new command prompt and run the following command to verify the installation:

%SPARK_HOME%\bin\spark-submit --version

Figure 5 Verify installation

Step-6 Install .NET for Apache Spark

Download Microsoft.Spark.Worker from https://github.com/dotnet/spark/releases and extract the file.

To extract the Microsoft.Spark.Worker:

Locate the Microsoft.Spark.Worker.netcoreapp3.1.win-x64-1.0.0.zip file that you downloaded.
Right-click on Microsoft.Spark.Worker.netcoreapp3.1.win-x64-1.0.0.zip and select 7-Zip -> Extract files.
Enter C:\bin in the Extract to field.
Uncheck the checkbox below the Extract to field.
Select OK.

Set the environment variable using the following command:

setx DOTNET_WORKER_DIR "C:\bin\Microsoft.Spark.Worker-1.0.0"

Step-7 Install WinUtils

Download WinUtils from https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe.

After downloading, copy this file to C:\bin\spark-3.0.1-bin-hadoop2.7\bin
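
Before moving on, it can help to confirm that the variables set in the steps above are visible to new processes. The following is a minimal sketch and not part of the official setup: a small stand-alone C# console program (the class name EnvironmentCheck is made up for illustration) that checks each variable and looks for winutils.exe under %HADOOP_HOME%\bin. Run it from a command prompt opened after the setx commands, because setx does not change variables in prompts that are already open.

```csharp
using System;
using System.IO;

class EnvironmentCheck
{
    static int Main()
    {
        // The variables this article sets with setx.
        string[] names = { "SPARK_HOME", "HADOOP_HOME", "DOTNET_WORKER_DIR" };
        bool ok = true;

        foreach (string name in names)
        {
            string value = Environment.GetEnvironmentVariable(name);
            if (string.IsNullOrWhiteSpace(value))
            {
                Console.WriteLine($"{name} is not set");
                ok = false;
            }
            else if (!Directory.Exists(value))
            {
                Console.WriteLine($"{name} points to a missing folder: {value}");
                ok = false;
            }
            else
            {
                Console.WriteLine($"{name} = {value}");
            }
        }

        // winutils.exe is expected in the bin folder under HADOOP_HOME.
        string hadoopHome = Environment.GetEnvironmentVariable("HADOOP_HOME");
        if (!string.IsNullOrWhiteSpace(hadoopHome))
        {
            string winutils = Path.Combine(hadoopHome, "bin", "winutils.exe");
            if (!File.Exists(winutils))
            {
                Console.WriteLine($"winutils.exe not found at {winutils}");
                ok = false;
            }
        }

        // Exit code 0 means every check passed.
        return ok ? 0 : 1;
    }
}
```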

Write your .NET for Apache Spark application

Step-1 Create a console app

To create a console app, open the command prompt and run the following command:

dotnet new console -o FirstApacheSparkApp

Figure 6 Create an application

Once the application is created, run cd FirstApacheSparkApp to move into the project folder.

Step-2 Install NuGet Package

Install the Microsoft.Spark package to use .NET for Apache Spark in your application. Open the command prompt and run the following command:

dotnet add package Microsoft.Spark

Step-3 Write your application

Open the Program.cs file in Visual Studio Code or any other editor and replace its contents with the following code:

Program.cs

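using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

namespace FirstApacheSparkApp
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create (or reuse) the Spark session for this application.
            SparkSession spark =
                SparkSession
                    .Builder()
                    .AppName("word_count_sample")
                    .GetOrCreate();

            // Read the input file; each line becomes a row in the "value" column.
            string filePath = "data.txt";
            DataFrame dataFrame = spark.Read().Text(filePath);

            // Split each line into words, put one word per row, then count and sort.
            DataFrame words =
                dataFrame
                    .Select(Split(Col("value"), " ").Alias("words"))
                    .Select(Explode(Col("words")).Alias("word"))
                    .GroupBy("word")
                    .Count()
                    .OrderBy(Col("count").Desc());

            // Print the word counts to the console.
            words.Show();

            // Stop the Spark session.
            spark.Stop();
        }
    }
}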

Step-4 Create a data file

The application processes a file containing text. Create a data.txt file in your application folder with the following content:

data.txt

viratkohli is known as the runmachine
rohitsharma is known as the hitman
hardikpandya is Known as kunfupandya
mahendrasinhDhoni is Known as captain cool
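
To see what the Spark application should compute for this file, the same word count can be sketched in plain C# with LINQ, without Spark. This is only a cross-check under the assumption that data.txt is in the working directory; the class name WordCountCheck is made up for illustration, and it should be run from a separate console project so that it does not clash with the Main method in Program.cs.

```csharp
using System;
using System.IO;
using System.Linq;

class WordCountCheck
{
    static void Main()
    {
        // Split every line on single spaces, group identical words, and count them.
        var counts = File.ReadLines("data.txt")
            .SelectMany(line => line.Split(' '))
            .GroupBy(word => word)
            .Select(group => new { Word = group.Key, Count = group.Count() })
            .OrderByDescending(entry => entry.Count);

        foreach (var entry in counts)
        {
            Console.WriteLine($"{entry.Word}\t{entry.Count}");
        }
    }
}
```

As in the Spark version, the comparison is case sensitive, so "known" and "Known" are counted as two different words.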

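Step-5 Run your .NET for Apache Spark application

Build your application using the following command:

dotnet build

Run your application using the following command. If your project targets a different framework, replace netcoreapp2.1 with the folder name that the build created under bin\Debug: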

%SPARK_HOME%\bin\spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local bin\Debug\netcoreapp2.1\microsoft-spark-3-0_2.12-1.0.0.jar dotnet bin\Debug\netcoreapp2.1\FirstApacheSparkApp.dll

Figure 7 Run your application

Conclusion

In this blog, we discussed .NET for Apache Spark and created a console application with it. .NET for Apache Spark exposes the Spark APIs to .NET applications, so .NET developers can work with Spark alongside the other supported languages such as Java, R, and Python.
