Microsoft and the .NET Foundation have released version 1.0 of .NET for Apache Spark, an open source package that brings .NET development to the Spark analytics engine for large-scale data processing.
Announced October 27, .NET for Apache Spark 1.0 supports .NET applications targeting .NET Standard 2.0 or later. Users can access Spark DataFrame APIs, write Spark SQL, and create user-defined functions (UDFs).
The .NET for Apache Spark framework is available on the .NET Foundation’s GitHub page or from NuGet. Other capabilities of .NET for Apache Spark 1.0 include:
- An API extension framework to add support for additional Spark libraries including Linux Foundation Delta Lake, Microsoft OSS Hyperspace, ML.NET, and Apache Spark MLlib functionality.
- .NET for Apache Spark programs that do not use UDFs show the same speed as Scala and PySpark-based non-UDF applications. If applications include UDFs, .NET for Apache Spark programs are at least as fast as PySpark programs and may be faster.
- .NET for Apache Spark is built into Azure Synapse and Azure HDInsight. It can also be used in other Apache Spark cloud offerings, including Azure Databricks.
The first public version of the project was announced in April 2019. Development of .NET for Apache Spark was driven by increased demand for an easier way to build big data applications without having to learn Scala or Python. The project is operated under the .NET Foundation and has been filed as a Spark Project Improvement Proposal to be considered for inclusion in the Apache Spark project directly.
Looking ahead, Microsoft is addressing obstacles such as setting up prerequisites and dependencies and finding quality documentation, with efforts that include community-contributed “ready-to-run” Docker images and updates to the .NET for Apache Spark documentation. Another priority is supporting deployment options, including integration with CI/CD DevOps pipelines and publishing jobs directly from Visual Studio.