Introduction to .NET Apache Spark



.NET Core is a free, open-source platform developed by Microsoft. It is cross-platform and supports most operating systems, including macOS, Windows, and Linux.

What is Apache Spark?

Apache Spark is an open-source, general-purpose distributed processing engine for analytics over large data sets, commonly terabytes or petabytes of data. It can be used for machine learning, real-time stream processing, ad hoc queries, and other big data workloads.

Why do we use it?

Apache Spark is fast because processing tasks are distributed over a cluster of nodes and data is cached in memory, which reduces computation time. For many workloads it is faster than Hadoop MapReduce, and it can be used from Java, R, Python, SQL, and .NET.

Common big data scenarios

Spark is a general-purpose distributed processing engine that can be used for the following big data scenarios.


Extract, transform, and load (ETL)

Extract, transform, and load (ETL) is the process of collecting data from one or more sources, modifying the data, and moving it to a new data store. The following are common ways of transforming data:

  1. Filtering
  2. Sorting
  3. Aggregating
  4. Joining
  5. Cleaning
  6. Deduplicating
  7. Validating
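
As a rough illustration, several of the transformations listed above might look like this in .NET for Apache Spark. This is only a sketch under assumptions: the input file orders.csv, its columns (customer and amount), and the output folder order_totals are hypothetical and are not part of this walkthrough.

```csharp
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

namespace EtlSketch
{
    class Program
    {
        static void Main(string[] args)
        {
            SparkSession spark = SparkSession
                .Builder()
                .AppName("etl_sketch")
                .GetOrCreate();

            // Extract: read a CSV file with a header row (hypothetical file name).
            DataFrame orders = spark
                .Read()
                .Option("header", true)
                .Option("inferSchema", true)
                .Csv("orders.csv");

            // Transform: clean, deduplicate, filter, aggregate, and sort.
            DataFrame totals = orders
                .Filter(Col("amount").IsNotNull())       // cleaning: drop rows without an amount
                .DropDuplicates()                         // deduplicating: drop identical rows
                .Filter(Col("amount").Gt(0))              // filtering: keep positive amounts only
                .GroupBy("customer")
                .Agg(Sum(Col("amount")).Alias("total"))  // aggregating: total per customer
                .OrderBy(Col("total").Desc());           // sorting: largest total first

            // Load: write the result to a new data store (hypothetical folder name).
            totals.Write().Mode("overwrite").Parquet("order_totals");

            spark.Stop();
        }
    }
}
```

Like the word count application later in this article, this program has to be submitted with spark-submit rather than run directly.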


Real-time data stream processing

Apache Spark is used to process large streams of data, such as monitoring streams of sensor data or analyzing financial transactions to detect fraud. Spark Streaming is the part of Apache Spark used for real-time data stream processing. Streaming or real-time data is an example of data in motion.
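
A minimal streaming sketch in .NET for Apache Spark could look like the following. It uses Spark's Structured Streaming API (ReadStream and WriteStream) and assumes a plain text source is listening on localhost port 9999; that host and port are hypothetical and chosen only for illustration.

```csharp
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Streaming;
using static Microsoft.Spark.Sql.Functions;

namespace StreamingSketch
{
    class Program
    {
        static void Main(string[] args)
        {
            SparkSession spark = SparkSession
                .Builder()
                .AppName("streaming_sketch")
                .GetOrCreate();

            // Read lines of text from a socket as an unbounded DataFrame
            // (host and port are assumptions for this sketch).
            DataFrame lines = spark
                .ReadStream()
                .Format("socket")
                .Option("host", "localhost")
                .Option("port", "9999")
                .Load();

            // Keep a running count of every word seen so far.
            DataFrame counts = lines
                .Select(Explode(Split(Col("value"), " ")).Alias("word"))
                .GroupBy("word")
                .Count();

            // Print the updated counts to the console after each micro-batch.
            StreamingQuery query = counts
                .WriteStream()
                .OutputMode("complete")
                .Format("console")
                .Start();

            // Block until the query is stopped.
            query.AwaitTermination();
        }
    }
}
```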


Batch processing

Batch processing is the processing of big data for filtering, aggregating, and preparing very large datasets using long-running jobs in parallel.


Machine learning through MLlib

Apache Spark's machine learning library, MLlib, contains several machine learning algorithms and utilities used to forecast or predict future behaviors, outcomes, and trends using existing data.


Graph processing through GraphX

A graph data structure is a collection of nodes connected by edges. A graph database is used when you have hierarchical data or data with interconnected relationships. Spark's GraphX API can be used to process this kind of data.


SQL and structured data processing with Spark SQL

Spark SQL is used when you're working with structured (formatted) data in your Spark application.
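
For example, a DataFrame can be registered as a temporary view and then queried with ordinary SQL. The sketch below assumes a hypothetical people.json file whose records have name and age fields; neither the file nor the fields come from this walkthrough.

```csharp
using Microsoft.Spark.Sql;

namespace SparkSqlSketch
{
    class Program
    {
        static void Main(string[] args)
        {
            SparkSession spark = SparkSession
                .Builder()
                .AppName("spark_sql_sketch")
                .GetOrCreate();

            // Load structured data (hypothetical file with "name" and "age" fields).
            DataFrame people = spark.Read().Json("people.json");

            // Expose the DataFrame to SQL under the name "people".
            people.CreateOrReplaceTempView("people");

            // Query the view with plain SQL; the result is another DataFrame.
            DataFrame adults = spark.Sql(
                "SELECT name, age FROM people WHERE age >= 18 ORDER BY age");

            adults.Show();
            spark.Stop();
        }
    }
}
```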

Get started with .NET for Apache Spark

  1. Prepare your environment for .NET for Apache Spark
  2. Write your first .NET for Apache Spark application
  3. Build and run your .NET for Apache Spark application



Prepare your environment

Step-1 Install .NET

Download the .NET SDK from https://dotnet.microsoft.com/, install it, and add the dotnet toolchain to your PATH.

After installing the .NET Core SDK, open a command prompt and run dotnet to verify that the installation succeeded.

Figure 1 Command prompt

Step-2 Install Java SDK

Download the Java SDK from https://www.oracle.com/java/technologies/, install it, and set the required environment variable (typically JAVA_HOME).

After installing the Java SDK, open a command prompt and run java -version to verify that the installation succeeded.

Figure 2 Command prompt

Step-3 Install compression software

Download 7-Zip (https://www.7-zip.org/) or WinZip (https://www.winzip.com/win/en/) to extract the Apache Spark archive, which is downloaded as a .tgz file.

Step-4 Install Apache Spark

Download Apache Spark from https://spark.apache.org/downloads.html and extract the file using 7-Zip. This walkthrough uses spark-3.0.1-bin-hadoop2.7, which matches the file names and the jar file used in the later steps.

To extract the nested .tar file:

  1. Open the download folder and locate the spark-3.0.1-bin-hadoop2.7.tgz file.
  2. Right-click on the file and select 7-Zip -> Extract here.
  3. The spark-3.0.1-bin-hadoop2.7.tar file is extracted from the .tgz file you downloaded.

To extract the Apache Spark files:

  1. Right-click on spark-3.0.1-bin-hadoop2.7.tar and select 7-Zip -> Extract files.
  2. Enter C:\bin in the Extract to field.
  3. Uncheck the checkbox below the Extract to field.
  4. Select OK.
  5. The Apache Spark files are extracted to C:\bin\spark-3.0.1-bin-hadoop2.7\

Figure 3 Extraction using 7-Zip

Step-5 Set the environment variables

Run the following commands to set the environment variables that are used to locate Apache Spark.

setx HADOOP_HOME C:\bin\spark-3.0.1-bin-hadoop2.7\
setx SPARK_HOME C:\bin\spark-3.0.1-bin-hadoop2.7\

Figure 4 Set environment variable

Open a new command prompt (variables set with setx are not visible in the prompt that set them) and run the following command to verify the installation:

%SPARK_HOME%\bin\spark-submit --version

Figure 5 Verify installation

Step-6 Install .NET for Apache Spark

Download the Microsoft.Spark.Worker release from https://github.com/dotnet/spark/releases and extract the file.

To extract the Microsoft.Spark.Worker:

Locate the Microsoft.Spark.Worker.netcoreapp3.1.win-x64-1.0.0.zip file that you downloaded.

  1. Right-click on Microsoft.Spark.Worker.netcoreapp3.1.win-x64-1.0.0.zip and select the 7-Zip -> Extract files option.
  2. Enter C:\bin in the Extract to field.
  3. Uncheck the checkbox below the Extract to field.
  4. Select OK.
  5. Set the environment variable using the following command:

setx DOTNET_WORKER_DIR "C:\bin\Microsoft.Spark.Worker-1.0.0"

Step-7 Install WinUtils

Download WinUtils from https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe.

After downloading, copy this file to C:\bin\spark-3.0.1-bin-hadoop2.7\bin
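
Before moving on, it can help to confirm that the environment variables set above are visible and point to real folders. The following is a small, optional sketch in plain C# (it does not use Spark); the EnvironmentCheck namespace and the CheckDirectoryVariable helper are names made up for this illustration.

```csharp
using System;
using System.IO;

namespace EnvironmentCheck
{
    class Program
    {
        // Returns a description of the problem, or an empty string
        // when the variable is set and points to an existing folder.
        static string CheckDirectoryVariable(string name)
        {
            var value = Environment.GetEnvironmentVariable(name);
            if (string.IsNullOrWhiteSpace(value))
            {
                return name + " is not set.";
            }
            if (!Directory.Exists(value))
            {
                return name + " points to a missing folder: " + value;
            }
            return "";
        }

        static void Main(string[] args)
        {
            // The three variables set in the steps above.
            foreach (string name in new[] { "SPARK_HOME", "HADOOP_HOME", "DOTNET_WORKER_DIR" })
            {
                string problem = CheckDirectoryVariable(name);
                Console.WriteLine(problem.Length == 0 ? name + " looks OK." : problem);
            }

            // winutils.exe is expected in the bin folder under SPARK_HOME.
            var sparkHome = Environment.GetEnvironmentVariable("SPARK_HOME");
            if (!string.IsNullOrWhiteSpace(sparkHome))
            {
                string winutils = Path.Combine(sparkHome, "bin", "winutils.exe");
                Console.WriteLine(File.Exists(winutils)
                    ? "winutils.exe found."
                    : "winutils.exe not found at " + winutils);
            }
        }
    }
}
```

Run it from a command prompt opened after the setx commands, since a prompt that was already open will not see the new values.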

Write a .NET for Apache Spark app

Step-1 Create a console app

To create a console app, open a command prompt and run the following command.

dotnet new console -o FirstApacheSparkApp

Figure 6 Create an application

Once the application is created, type cd FirstApacheSparkApp to move into the project folder.



Step-2 Install the NuGet package

Install the Microsoft.Spark package to use .NET for Apache Spark in your application. In the command prompt, run the following command:

dotnet add package Microsoft.Spark

Step-3 Write your application

Open the Program.cs file in Visual Studio Code or any other editor and replace its contents with the following code:

Program.cs

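using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

namespace FirstApacheSparkApp
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create (or reuse) the Spark session for this application.
            SparkSession spark =
                SparkSession
                    .Builder()
                    .AppName("word_count_sample")
                    .GetOrCreate();

            // Read the input file; each line becomes one row in the "value" column.
            string filePath = "data.txt";
            DataFrame dataFrame = spark.Read().Text(filePath);

            // Split each line into words, put each word on its own row,
            // count the occurrences of each word, and sort by count (highest first).
            DataFrame words =
                dataFrame
                    .Select(Split(Col("value"), " ").Alias("words"))
                    .Select(Explode(Col("words")).Alias("word"))
                    .GroupBy("word")
                    .Count()
                    .OrderBy(Col("count").Desc());

            // Print the result to the console.
            words.Show();

            // Stop the Spark session.
            spark.Stop();
        }
    }
}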
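
To make the pipeline above easier to follow, here is the same split, group, count, and sort logic written with plain LINQ, without Spark, over the four lines used as data.txt in the next step. It is only an in-memory sketch for understanding what the Spark query computes; the WordCountLinqSketch namespace is a made-up name.

```csharp
using System;
using System.Linq;

namespace WordCountLinqSketch
{
    class Program
    {
        static void Main(string[] args)
        {
            // The same four lines that the walkthrough stores in data.txt.
            string[] lines =
            {
                "viratkohli is known as the runmachine",
                "rohitsharma is known as the hitman",
                "hardikpandya is Known as kunfupandya",
                "mahendrasinhDhoni is Known as captain cool"
            };

            var counts = lines
                .SelectMany(line => line.Split(' '))                    // Split + Explode: one word per item
                .GroupBy(word => word)                                  // GroupBy("word"), case-sensitive
                .Select(group => new { Word = group.Key, Count = group.Count() })
                .OrderByDescending(entry => entry.Count);               // OrderBy(count desc)

            foreach (var entry in counts)
            {
                Console.WriteLine($"{entry.Word}: {entry.Count}");
            }
        }
    }
}
```

For this input, "is" and "as" are each counted four times, while "known" and "Known" are counted separately (twice each) because the comparison is case-sensitive, which is also how the Spark version groups words.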
Step-4 Create a data file

The application processes a text file. Create a data.txt file in your application folder with the following content.

data.txt

viratkohli is known as the runmachine
rohitsharma is known as the hitman
hardikpandya is Known as kunfupandya
mahendrasinhDhoni is Known as captain cool
Step-5 Run your .NET for Apache Spark application

Build your application using the following command:

dotnet build

Run your application using the following command:

%SPARK_HOME%\bin\spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local bin\Debug\netcoreapp2.1\microsoft-spark-3-0_2.12-1.0.0.jar dotnet bin\Debug\netcoreapp2.1\FirstApacheSparkApp.dll

The netcoreapp2.1 folder and the jar file name depend on your project's target framework and on the Microsoft.Spark package version, so adjust both paths to match your build output.

Figure 7 Run your application

Conclusion

In this blog, we discussed .NET for Apache Spark and created a console application with it. .NET for Apache Spark gives .NET applications access to the Spark API. Spark itself can be used from several languages, such as .NET, R, and Python.
