Essential toolset for analytics

Self-service, fail-safe exploratory environment for collaborative data science workflow

  • Unified single sign-on experience between cloud and on-premise tools
  • Similar user experience across AWS, GCP, and Azure clouds
  • Easy to use Web Interface
  • Leverage the power of analytical tools
  • Project level collaboration environment across multiple Clouds
  • Aggregated billing with cost comparison across cloud providers

Amazon Web Services
Amazon Elastic MapReduce (EMR)
Jupyter Notebook
Apache Zeppelin
Python
MongoDB: cross-platform document-oriented database program
TensorFlow: an open source machine learning framework
RStudio

Problems We Solve

Data access problem

  • Access to full datasets is on a “need to know” basis
  • Access to production data is limited
  • Data access is governed by corporate/BU/department policies

Problems with tools

  • Installing tools is difficult and expensive (teams lack ops and security skills)
  • The latest tool versions aren’t available in the shared environment
  • Lack of collaboration: everyone works in their own silo

Increased IT handholding

  • Data scientists are bound to local machines due to the cost of IT support
  • Lack of self-service leads to a stream of requests to ops teams
  • Shared environments require resource allocation monitoring and security control

Lack of security

  • Local data copies are not secure (data scientists get copies of data on their local machines)
  • Cloud security policies for data protection are difficult to implement
  • Cutting-edge tools require enterprise hardening (tools have to be secured)

Our Solution

Self-service and customization

  • Self-provisioning of the environment
  • Plug in proprietary components; change tool versions
  • No dedicated IT support is involved

Secured and fail-safe

  • Break it and simply re-provision a fresh environment
  • Provides a client’s security perimeter
  • Security policies and access controls

Collaborative workflow

  • Collaboration between data science team members
  • Collaboration between data science and engineering teams
  • Easy path from R&D to Production

Exploratory environment

  • Best open source data tools
  • Horizontally scalable compute
  • Easy access to data and metadata
  • Training / workshops

Flexible deployment architecture

  • Public cloud: AWS, Azure, and GCP (initial version)
  • No platform or vendor lock-in

Features

  • Customizable authentication
  • Similar user experience across AWS, GCP, and Azure clouds
  • Easy to use Web Interface
  • Leverage the power of analytical tools
  • Data visualization with data science libraries
  • Easy Life Cycle management
  • Install libraries and dependencies
  • Multiple Git repositories support
  • Control Cloud Services Usage
  • Manage Cloud services quota limits
  • Role permissions administration
  • Project level collaboration across clouds
  • Track key activities
  • Storage space collaboration

Unified single sign-on experience between cloud and on-premise tools

  • SAML and OAuth 2.0 support

Similar user experience across AWS, GCP, and Azure clouds

Easy to use Web Interface

  • Single-click life-cycle management of analytical tools and computational resources
  • Dashboard and in-grid filtering
  • Built-in “how-to” pages

Leverage the power of analytical tools

  • Apache Zeppelin
  • Deep Learning
  • Jupyter
  • Jupyter with TensorFlow
  • JupyterLab
  • RStudio
  • RStudio with TensorFlow (implemented on AWS)
  • Superset (implemented on GCP)
  • Add computational power to your jobs by deploying a standalone Apache Spark cluster, EMR (AWS), or Dataproc (GCP) on memory-, storage-, compute-, or GPU-optimized instances

Run your analytics effectively

  • Python v2/v3, Scala, Apache SparkR support
  • Data visualization with pre-provisioned data science libraries
  • Cloud storage compatibility: Amazon S3, Azure Blob Storage, Azure Data Lake, and Google Cloud Storage buckets
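
As a rough illustration of what working across these storage services looks like in notebook code, the sketch below maps a storage URI to its service by scheme. The schemes shown are the ones commonly used by Hadoop-compatible connectors; the function name, the mapping, and the sample bucket names are hypothetical and are not part of DataLab.

```python
from urllib.parse import urlparse

# Illustrative mapping from URI scheme to storage service.
# This is an assumption for the example, not a DataLab setting.
SCHEME_TO_SERVICE = {
    "s3": "Amazon S3",
    "s3a": "Amazon S3",
    "wasbs": "Azure Blob Storage",
    "abfss": "Azure Data Lake",
    "gs": "Google Cloud Storage",
}


def storage_service(uri: str) -> str:
    """Return the storage service name for a URI, based on its scheme."""
    scheme = urlparse(uri).scheme.lower()
    if scheme not in SCHEME_TO_SERVICE:
        raise ValueError(f"Unsupported storage scheme: {scheme!r}")
    return SCHEME_TO_SERVICE[scheme]


if __name__ == "__main__":
    # Hypothetical bucket and container names.
    for uri in (
        "s3a://example-bucket/data/events.parquet",
        "gs://example-bucket/data/events.parquet",
        "wasbs://container@account.blob.core.windows.net/data/events.parquet",
    ):
        print(uri, "->", storage_service(uri))
```

A lookup like this keeps provider-specific details in one place, so the same notebook code can be pointed at a different cloud by changing only the URI.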

Easy Life Cycle management

  • Once set up, create a snapshot to save time when setting up a new environment
  • Start / Stop / Terminate notebooks
  • Configure a start/stop schedule for analytical tools and computational resources
  • Add/Remove computational resources (Cloud provider’s Data Engine Service or Apache Spark standalone cluster)
  • Create an AMI based on the required tool to save time on further provisioning and configuration
  • Fine-tune default Spark configuration
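
To make the last item concrete, here is a minimal sketch of merging user overrides into a baseline Spark configuration and rendering it in the `spark-defaults.conf` format (one `key value` pair per line). The property names are standard Spark settings; the baseline values and the function name are assumptions for illustration and do not reflect what DataLab actually ships.

```python
# Example baseline. The values are illustrative assumptions,
# not the configuration DataLab provisions.
BASELINE = {
    "spark.driver.memory": "1g",
    "spark.executor.memory": "1g",
    "spark.executor.cores": "1",
    "spark.sql.shuffle.partitions": "200",
}


def render_spark_defaults(overrides: dict) -> str:
    """Merge overrides into the baseline and render spark-defaults.conf text."""
    merged = {**BASELINE, **overrides}  # overrides win on key collisions
    return "\n".join(f"{key} {value}" for key, value in sorted(merged.items()))


if __name__ == "__main__":
    # Give executors more memory and cores for a heavier job.
    print(render_spark_defaults({
        "spark.executor.memory": "4g",
        "spark.executor.cores": "2",
    }))
```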

Install libraries and dependencies

  • Enrich the analytical experience by installing libraries and dependencies

Collaboration capabilities

  • Shared codebase
  • Shared Storage
  • Shared Wiki
  • DataLab Web UI for repository access management
  • Built-in UnGit for easy-to-use source collaboration

Control Cloud Services Usage

Aggregated billing with cost comparison across cloud providers

  • In-grid billing functionality
  • Detailed billing report
  • Admin mode to see all users
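
The idea of aggregated billing can be sketched as grouping cost records by provider or by user. The record fields, names, and sample figures below are hypothetical and only illustrate the aggregation; they are not DataLab's billing schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class BillingRecord:
    """One cost line item. The field names are assumptions for this example."""
    provider: str
    user: str
    resource: str
    cost_usd: float


def total_by(records, field):
    """Sum cost_usd grouped by the given record field, e.g. 'provider' or 'user'."""
    totals = defaultdict(float)
    for record in records:
        totals[getattr(record, field)] += record.cost_usd
    return dict(totals)


if __name__ == "__main__":
    # Made-up sample data.
    records = [
        BillingRecord("AWS", "alice", "notebook", 12.5),
        BillingRecord("AWS", "bob", "emr-cluster", 7.5),
        BillingRecord("GCP", "alice", "dataproc-cluster", 4.0),
    ]
    print(total_by(records, "provider"))  # cost comparison across clouds
    print(total_by(records, "user"))      # admin view across all users
```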

Manage Cloud services quota limits

  • Set cost limits for DataLab’s infrastructure
  • Set limits per user
  • Alert users on reaching quota
  • Stop/terminate environment when quota gets depleted
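
The alert-then-stop behaviour described above can be sketched as a simple threshold check. The function name, the action names, and the 80% alert threshold are assumptions chosen for the example; the actual thresholds and actions in DataLab may differ.

```python
from enum import Enum


class QuotaAction(Enum):
    NONE = "none"    # spending is well within the limit
    ALERT = "alert"  # notify the user that the quota is nearly used up
    STOP = "stop"    # quota depleted: stop or terminate the environment


def check_quota(spent: float, limit: float, alert_ratio: float = 0.8) -> QuotaAction:
    """Decide what to do given current spend and the configured limit.

    alert_ratio is a hypothetical setting: the fraction of the limit
    at which the user is first alerted.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    if spent >= limit:
        return QuotaAction.STOP
    if spent >= alert_ratio * limit:
        return QuotaAction.ALERT
    return QuotaAction.NONE


if __name__ == "__main__":
    for spent in (50, 90, 120):
        print(spent, check_quota(spent, limit=100).value)
```

The same check can be applied per user or to the infrastructure as a whole by passing the corresponding spend and limit.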

Role permissions administration

  • Assign users to DataLab groups
  • Grant users the ability to create analytical tools and clusters and to use particular instance shapes
  • Admin mode to manage infrastructure of the whole team
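
Group-based permissions of this kind can be modelled as a lookup from user to groups and from group to allowed actions. The group names, permission names, and users below are hypothetical and serve only to illustrate the model; they are not DataLab's actual roles.

```python
# Hypothetical groups and permissions, for illustration only.
GROUP_PERMISSIONS = {
    "data-scientists": {"create_notebook", "create_cluster"},
    "admins": {"create_notebook", "create_cluster", "manage_team_infrastructure"},
}

# Hypothetical user-to-group assignments.
USER_GROUPS = {
    "alice": {"data-scientists"},
    "bob": {"admins"},
}


def is_allowed(user: str, permission: str, user_groups: dict, group_permissions: dict) -> bool:
    """Return True if any of the user's groups grants the permission."""
    return any(
        permission in group_permissions.get(group, set())
        for group in user_groups.get(user, set())
    )


if __name__ == "__main__":
    print(is_allowed("alice", "create_cluster", USER_GROUPS, GROUP_PERMISSIONS))
    print(is_allowed("alice", "manage_team_infrastructure", USER_GROUPS, GROUP_PERMISSIONS))
```

Unknown users and unknown groups fall through to an empty set, so access is denied by default.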

Project level collaboration

Project level collaboration environment across multiple Clouds

  • Connect your analytical environment to multiple cloud endpoints
  • Project level resource management

Audit

Provide a quick overview of the key activities in DataLab

  • Check all user activities on DataLab

Bucket browser

File management in Cloud storage via Bucket browser

  • Upload, download, and delete files in cloud storage
  • Access any bucket from the assigned projects