As Julia White mentioned in her blog today, we’re pleased to announce the general availability of Azure Data Lake Storage Gen2 and Azure Data Explorer. We also announced the preview of Azure Data Factory Mapping Data Flow. With these updates, Azure continues to be the best cloud for analytics with unmatched price-performance and security. In this blog post we’ll take a closer look at the technical capabilities of these new features.
Azure Data Lake Storage – The no compromise Data Lake
Azure Data Lake Storage (ADLS) combines the scalability, cost effectiveness, security model, and rich capabilities of Azure Blob Storage with a high-performance file system that is built for analytics and is compatible with the Hadoop Distributed File System. Customers no longer have to tradeoff between cost effectiveness and performance when choosing a cloud data lake.
One of our key priorities was to ensure that ADLS is compatible with the Apache ecosystem. We accomplished this by developing the Azure Blob File System (ABFS) driver. The ABFS driver is officially part of Apache Hadoop and Spark and is incorporated in many commercial distributions. The ABFS driver defines a URI scheme that allows files and folders to be distinctly addressed in the following manner:
abfs[s]://file_system@account_name.dfs.core.windows.net/<path>/<path>/<filename>
It is important to note that the file system semantics are implemented server-side. This approach eliminates the need for a complex client-side driver and ensures high fidelity file system transactions.
To further boost analytics performance, we implemented a hierarchical namespace (HNS) which supports atomic file and folder operations. This is important because it reduces the overhead associated with processing big data on blob storage. This speeds up job execution and lowers cost because fewer compute operations are required.
The ABFS driver and HNS significantly improve ADLS’ performance, removing scale and performance bottlenecks. This performance enhancement is now available at the same low cost as Azure Blob Storage.
ADLS offers the same powerful data security capabilities built into Azure Blob Storage, such as:
- Encryption of data in transit and at rest via TLS 1.2
- Storage account firewalls
- Virtual network integration
- Role-based access security
In addition, ADLS’ file system provides support for POSIX compliant access control lists (ACLs). With this approach, you can provide granular security protection that restricts access to only authorized users, groups, or service principals and provides file and object data protection.
ADLS is tightly integrated with Azure Databricks, Azure HDInsight, Azure Data Factory, Azure SQL Data Warehouse, and Power BI, enabling an end-to-end analytics workflow that delivers powerful business insights throughout all levels of your organization. Furthermore, ADLS is supported by a global network of big data analytics ISV’s and system integrators, including Cloudera and Hortonworks.
Next steps
Azure Data Explorer – The fast and highly scalable data analytics service
Azure Data Explorer (ADX) is a fast, fully managed data analytics service for real-time analysis on large volumes of streaming data. ADX is capable of querying 1 billion records in under a second with no modification of the data or metadata required. ADX also includes native connectors to Azure Data Lake Storage, Azure SQL Data Warehouse, and Power BI and comes with an intuitive query language so that customers can get insights in minutes.
Designed for speed and simplicity, ADX is architected with two distinct services that work in tandem: The Engine and Data Management (DM) service. Both services are deployed as clusters of compute nodes (virtual machines) in Azure.
The Data Management (DM) service ingests various types of raw data and manages failure, backpressure, and data grooming tasks when necessary. The DM service also enables fast data ingestion through a unique method of automatic indexing and compression.
The Engine service is responsible for processing the incoming raw data and serving user queries. It uses a combination of auto scaling and data sharding to achieve speed and scale. The read-only query language is designed to make the syntax easy to read, author, and automate. The language provides a natural progression from one-line queries to complex data processing scripts for efficient query execution.
ADX is available in 41 Azure regions and is supported by a growing ecosystem of partners, including ISV’s and system integrators.
Next steps
Azure Data Factory Mapping Data Flow – Visual, zero-code experience for data transformation
Azure Data Factory (ADF) is a hybrid cloud-based data integration service for orchestrating and automating data movement and transformation. ADF provides over 80 built-in connectors to structured, semi-structured, and unstructured data sources.
With Mapping Data Flow in ADF, customers can visually design, build, and manage data transformation processes without learning Spark or having a deep understanding of their distributed infrastructure.
Mapping Data Flow combines a rich expression language with an interactive debugger to easily execute, trigger, and monitor ETL jobs and data integration processes.
Azure Data Factory is available in 21 regions and expanding, and is supported by a broad ecosystem of partners including ISV’s and system integrators.
Next steps
Azure is the best place for data analytics
With these technical innovations announced today, Azure continues to be the best cloud for analytics. Learn more why analytics in Azure is simply unmatched.
Leave a Reply