iFour Team - December 04, 2020
.NET Core is a free, open-source, cross-platform framework developed by Microsoft. It runs on most major operating systems, including macOS, Windows, and Linux.
Apache Spark is an open-source, general-purpose distributed processing engine for analytics over large data sets, commonly terabytes or petabytes of data. Apache Spark can be used for machine learning, real-time streams, ad-hoc queries, and big data processing.
Apache Spark is fast because processing tasks are distributed over a cluster of nodes and data is cached in memory to reduce computation time. It is often faster than Hadoop MapReduce and can be used with Java, Scala, R, Python, SQL, and .NET.
Spark is a general-purpose distributed processing engine that can be used for the following big data scenarios.
Extract, transform, and load (ETL) is the process of collecting data from one or more sources, modifying the data, and moving it to a new data store. The following are common ways of transforming data:
Filtering
Sorting
Aggregating
Joining
Cleaning
Deduplicating
Validating
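As a rough illustration of several of these transformations, the sketch below chains cleaning, filtering, deduplicating, aggregating, and sorting with the .NET for Apache Spark DataFrame API. The input file sales.csv, its columns (region, amount), and the output folder sales_totals are hypothetical names chosen only for this example, and the program needs a working Spark installation to run.

```csharp
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

namespace EtlSketch
{
    class Program
    {
        static void Main(string[] args)
        {
            SparkSession spark = SparkSession
                .Builder()
                .AppName("etl_sketch")
                .GetOrCreate();

            // Extract: read a CSV file with a header row (hypothetical file and columns).
            DataFrame sales = spark.Read()
                .Option("header", "true")
                .Option("inferSchema", "true")
                .Csv("sales.csv");

            // Transform: clean, filter, deduplicate, aggregate, and sort.
            DataFrame totals = sales
                .Filter(Col("amount").IsNotNull())   // cleaning: drop rows without an amount
                .Filter(Col("amount").Gt(0))         // filtering: keep positive amounts only
                .DropDuplicates()                    // deduplicating: remove identical rows
                .GroupBy("region")                   // aggregating: total amount per region
                .Agg(Sum(Col("amount")).Alias("total_amount"))
                .OrderBy(Col("total_amount").Desc()); // sorting: largest total first

            // Load: write the result to a new data store (hypothetical output folder).
            totals.Write().Mode("overwrite").Parquet("sales_totals");

            spark.Stop();
        }
    }
}
```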
Apache Spark is used to process large streams of data, such as monitoring streams of sensor data or analyzing financial transactions to detect fraud. Spark's streaming support is used for real-time data stream processing. Streaming or real-time data is an example of data in motion.
Batch processing handles big data at rest: it filters, aggregates, and prepares very large datasets using long-running jobs that run in parallel.
Apache Spark's machine learning library, MLlib, contains several machine learning algorithms and utilities used to forecast or predict future behaviors, outcomes, and trends using existing data.
A graph data structure is a collection of nodes connected by edges. A graph database is used when you have hierarchical data or data with interconnected relationships. Spark's GraphX API can be used to process this kind of data.
Spark SQL is used when you're working with structured (formatted) data in your Spark application.
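A minimal sketch of stream processing from .NET, assuming a plain-text source that listens on localhost port 9999 (for example one started with a tool such as netcat). The host, port, and application name are placeholders, and the example uses Spark's Structured Streaming API to keep a running word count that is printed to the console.

```csharp
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Streaming;
using static Microsoft.Spark.Sql.Functions;

namespace StreamingSketch
{
    class Program
    {
        static void Main(string[] args)
        {
            SparkSession spark = SparkSession
                .Builder()
                .AppName("streaming_word_count_sketch")
                .GetOrCreate();

            // Read lines of text as an unbounded stream (placeholder host and port).
            DataFrame lines = spark
                .ReadStream()
                .Format("socket")
                .Option("host", "localhost")
                .Option("port", "9999")
                .Load();

            // Split each incoming line into words and keep a running count per word.
            DataFrame wordCounts = lines
                .Select(Explode(Split(Col("value"), " ")).Alias("word"))
                .GroupBy("word")
                .Count();

            // Print the full, updated result table to the console after each batch.
            StreamingQuery query = wordCounts
                .WriteStream()
                .OutputMode("complete")
                .Format("console")
                .Start();

            // Block until the streaming query is stopped.
            query.AwaitTermination();
        }
    }
}
```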
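For instance, a structured file can be registered as a temporary view and queried with ordinary SQL. The sketch below assumes a hypothetical people.json file whose records have name and age fields; neither the file nor the field names come from this article.

```csharp
using Microsoft.Spark.Sql;

namespace SqlSketch
{
    class Program
    {
        static void Main(string[] args)
        {
            SparkSession spark = SparkSession
                .Builder()
                .AppName("spark_sql_sketch")
                .GetOrCreate();

            // Load structured data; Spark infers the schema from the JSON records.
            DataFrame people = spark.Read().Json("people.json");

            // Expose the DataFrame to SQL under the name "people".
            people.CreateOrReplaceTempView("people");

            // Query the view with SQL; the result is again a DataFrame.
            DataFrame adults = spark.Sql(
                "SELECT name, age FROM people WHERE age >= 18 ORDER BY age");

            adults.Show();
            spark.Stop();
        }
    }
}
```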
Get started with .NET for Apache Spark
Prepare your environment for .NET for Apache Spark
Write your first .NET for Apache Spark application
Build and run your .NET for Apache Spark application
Download the .NET Core SDK from https://dotnet.microsoft.com/, install it, and add the dotnet toolchain to your PATH.
After installing the .NET Core SDK, open a command prompt and run dotnet to verify that the installation succeeded.
Figure 1 Command prompt
Download the Java SE Development Kit (JDK) from https://www.oracle.com/java/technologies/, install it, and set the required environment variable (typically JAVA_HOME).
After installing the JDK, open a command prompt and run java -version to verify that the installation succeeded.
Figure 2 Command prompt
Download 7-Zip (https://www.7-zip.org/) or WinZip (https://www.winzip.com/win/en/) to extract the Apache Spark archive, which is downloaded as a .tgz file.
Download the latest stable version of Apache Spark from https://spark.apache.org/downloads.html and extract the file using 7-Zip.
To extract the nested .tar file:
Open the download folder and locate the spark-3.0.1-bin-hadoop2.7.tgz file.
Right-click on the file and select 7-Zip -> Extract here.
spark-3.0.1-bin-hadoop2.7.tar is created alongside the .tgz file you downloaded.
To extract the Apache Spark files:
Right-click on spark-3.0.1-bin-hadoop2.7.tar and select 7-Zip -> Extract files.
Enter C:\bin in the Extract to field.
Uncheck the checkbox below the Extract to field.
Select OK.
The Apache Spark files are extracted to C:\bin\spark-3.0.1-bin-hadoop2.7\
Figure 3 Extraction using 7-ZIP
Run the following commands to set the environment variables used to locate Apache Spark:
setx HADOOP_HOME C:\bin\spark-3.0.1-bin-hadoop2.7\
setx SPARK_HOME C:\bin\spark-3.0.1-bin-hadoop2.7\
Figure 4 Set environment variable
Open a new command prompt and run the following command to verify that the installation succeeded:
%SPARK_HOME%\bin\spark-submit --version
Figure 5 Verify installation
Download the Microsoft.Spark.Worker release from https://github.com/dotnet/spark/releases and extract the file.
To extract the Microsoft.Spark.Worker:
Locate the Microsoft.Spark.Worker.netcoreapp3.1.win-x64-1.0.0.zip file that you downloaded.
Right-click on Microsoft.Spark.Worker.netcoreapp3.1.win-x64-1.0.0.zip and select 7-Zip -> Extract files.
Enter C:\bin in the Extract to field.
Uncheck the checkbox below the Extract to field.
Select OK.
Set the DOTNET_WORKER_DIR environment variable using the following command:
setx DOTNET_WORKER_DIR "C:\bin\Microsoft.Spark.Worker-1.0.0"
Download winutils.exe from https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe.
After downloading, copy this file to C:\bin\spark-3.0.1-bin-hadoop2.7\bin
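Before moving on, it can help to confirm that the setup is visible to new processes. The following is a small, optional pre-flight check written as a plain C# console program; it is only a sketch, and the names EnvCheck and FindProblems are invented for this example. It reports any of the three variables set above that are missing or that point to a folder that does not exist, and it looks for winutils.exe in the bin folder under HADOOP_HOME.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

namespace SparkEnvCheck
{
    static class EnvCheck
    {
        // Environment variables set during the preparation steps.
        static readonly string[] Required = { "SPARK_HOME", "HADOOP_HOME", "DOTNET_WORKER_DIR" };

        // Returns the names of required variables that are unset or
        // whose value is not an existing directory.
        public static List<string> FindProblems(
            Func<string, string> lookup, Func<string, bool> dirExists)
        {
            var problems = new List<string>();
            foreach (string name in Required)
            {
                string value = lookup(name);
                if (string.IsNullOrWhiteSpace(value) || !dirExists(value))
                {
                    problems.Add(name);
                }
            }
            return problems;
        }
    }

    class Program
    {
        static void Main()
        {
            List<string> problems = EnvCheck.FindProblems(
                Environment.GetEnvironmentVariable, Directory.Exists);

            foreach (string name in problems)
            {
                Console.WriteLine($"{name} is not set or does not point to an existing folder.");
            }

            // winutils.exe is expected in the bin folder under HADOOP_HOME.
            string hadoopHome = Environment.GetEnvironmentVariable("HADOOP_HOME");
            if (!string.IsNullOrWhiteSpace(hadoopHome) &&
                !File.Exists(Path.Combine(hadoopHome, "bin", "winutils.exe")))
            {
                Console.WriteLine("winutils.exe was not found under HADOOP_HOME\\bin.");
            }

            if (problems.Count == 0)
            {
                Console.WriteLine("All required environment variables are set.");
            }
        }
    }
}
```

Values written with setx only become visible in command prompts opened afterwards, so run the check from a new window.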
Write a .NET for Apache Spark app
Step 1: Create a console app.
To create a console app, open a command prompt and run the following command:
dotnet new console -o FirstApacheSparkApp
Figure 6 Create an application
Once the application is created, run cd FirstApacheSparkApp to move into the project folder.
Install the Microsoft.Spark package to use .NET for Apache Spark in your application. In the command prompt, run the following command:
dotnet add package Microsoft.Spark
Open the Program.cs file in Visual Studio Code or any other editor and replace its contents with the following code:
Program.cs
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

namespace FirstApacheSparkApp
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create (or reuse) a Spark session for this application.
            SparkSession spark = SparkSession
                .Builder()
                .AppName("word_count_sample")
                .GetOrCreate();

            // Read the text file; each line becomes a row in the "value" column.
            string filePath = "data.txt";
            DataFrame dataFrame = spark.Read().Text(filePath);

            // Split each line into words, count each word, and sort by count.
            DataFrame words = dataFrame
                .Select(Split(Col("value"), " ").Alias("words"))
                .Select(Explode(Col("words")).Alias("word"))
                .GroupBy("word")
                .Count()
                .OrderBy(Col("count").Desc());

            // Print the result table and shut down the session.
            words.Show();
            spark.Stop();
        }
    }
}
The application processes a file containing text. Create a data.txt file in your application folder with the following content.
data.txt
viratkohli is known as the runmachine
rohitsharma is known as the hitman
hardikpandya is Known as kunfupandya
mahendrasinhDhoni is Known as captain cool
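Build your application using the following command:
dotnet build
Run your application using the following command. The netcoreapp2.1 folder in the paths is named after the project's target framework, so adjust it if your project targets a different one: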
%SPARK_HOME%\bin\spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local bin\Debug\netcoreapp2.1\microsoft-spark-3-0_2.12-1.0.0.jar dotnet bin\Debug\netcoreapp2.1\FirstApacheSparkApp.dll
Figure 7 Run your application
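To sanity-check the job's output without a cluster, the same word count can be reproduced in plain C# with LINQ. This is not Spark code, only a local sketch for comparison, and the names LocalWordCount and Count are invented for this example. Like the Spark job, it splits on single spaces and treats words case-sensitively, although rows with equal counts may appear in a different order than in the Spark output.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace WordCountCheck
{
    static class LocalWordCount
    {
        // Mirrors the Spark pipeline: split each line on single spaces,
        // group identical words, count them, and sort by count descending.
        public static List<KeyValuePair<string, int>> Count(IEnumerable<string> lines)
        {
            return lines
                .SelectMany(line => line.Split(' '))
                .GroupBy(word => word)
                .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
                .OrderByDescending(pair => pair.Value)
                .ToList();
        }
    }

    class Program
    {
        static void Main()
        {
            // Reads the same data.txt file that the Spark application processes.
            foreach (var pair in LocalWordCount.Count(File.ReadLines("data.txt")))
            {
                Console.WriteLine($"{pair.Key}\t{pair.Value}");
            }
        }
    }
}
```

With the sample data.txt, "is" and "as" each appear four times, while "known", "Known", and "the" each appear twice, because "known" and "Known" are counted as different words.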
In this blog, we have discussed .NET for Apache Spark and created a console application with it. .NET for Apache Spark gives .NET applications access to the Spark APIs so that they can communicate with Spark. Apache Spark itself can be used from several languages, such as .NET, R, and Python.