Data Science Lab at Politecnico di Milano

Data Management for Large-scale Analytics (PhD Course)

This is the official page of the PhD course 055067, “Data Management For Large-scale Analytics”, organized for the PhD program in Data Analytics and Decision Sciences by Prof. Marco Brambilla and Prof. Emanuele Della Valle, in collaboration with Prof. Stefano Ceri and Prof. Danilo Ardagna.

Abstract

Large-scale data analytics is everywhere, and researchers from all disciplines are addressing the topic from their own perspective, producing excellent vertical experiments but often losing the wider picture. This course aims to provide the principles, practices and technologies that enable large-scale data analytics, and thus to foster practice and academic debate around data science.

Contents

  • Part 1: INTRO. Grand challenges of Data Analytics
    • Introduction to large-scale analytics
    • Opportunities for social, environmental and economic problems
    • Problems in current research in big data and data science
    • Data access and quality issues
  • Part 2: DATA. Data models and their implementations
    • Traditional ER and relational data models, SQL
    • Transactional and active databases
    • NoSQL data models: document, graph, column-based and key-value models
    • NoSQL platforms and technologies
    • Main memory large-scale databases
  • Part 3: FEATURES. Taming data volume, velocity, variety, and veracity
    • Volume: Scaling computation and storage horizontally
    • MapReduce from Apache Hadoop to Apache Spark and Apache Flink
    • Velocity: Information flow processing principles, approaches and tools
    • Hands-on Apache Spark and Kafka to tame volume and velocity in data analytics
    • Veracity: data quality and data wrangling
    • Variety: web data extraction and data integration
  • Part 4: Project work

Calendar

Topic | Date | Start Time | End Time | Hours | Instructor | Room

Part 1: INTRO. Grand challenges of Data Analytics
Introduction to large-scale analytics and opportunities for social, environmental and economic problems [slides 1, slides 2] | Feb 7 | 14:30 | 16:30 | 1 | E. Della Valle | PT1 – DEIB – Building 20
Problems in current research & data access and quality issues | Feb 10 | 13:30 | 14:30 | 1 | M. Brambilla | PT1 – DEIB – Building 20

Part 2: DATA. Data models and their implementations
Traditional ER and relational data models and SQL [slides Rel.][slides SQL] | Feb 10 | 14:30 | 16:30 | 2 | M. Brambilla | PT1 – DEIB – Building 20
Architectural and transactional aspects of databases [slides] | Feb 10 | 16:30 | 18:30 | 2 | S. Ceri | PT1 – DEIB – Building 20
NoSQL data models, platforms and technologies: Graph and Main memory (key-value) large-scale databases [slides NoSQL][slides graph] | Feb 11 | 10:00 | 14:00 | 2 | M. Brambilla | BIO1 – Building 21 – First Floor
NoSQL data models, platforms and technologies [slides nosql dbs] | Feb 14 | 15:00 | 17:00 | 2 | M. Brambilla | PT1 – DEIB – Building 20

Part 3: FEATURES. Taming data volume, velocity, variety, and veracity
Volume: Scaling computation and storage horizontally [slides] | Feb 18 | 10:00 | 11:00 | 1 | D. Ardagna | PT1 – DEIB – Building 20
Map Reduce from Apache Hadoop to Apache Spark and Apache Flink [slides] | Feb 18 | 11:00 | 13:00 | 2 | D. Ardagna | PT1 – DEIB – Building 20
Velocity: Information flow processing principle, approaches and tools [slides] | Feb 18 | 14:00 | 15:00 | 1 | E. Della Valle | PT1 – DEIB – Building 20
Hands-on Apache Spark and Kafka: Data engineering & Data science pipeline [slides]; Data ingestion [code][platform]; Data preparation, augmentation & analysis [slides][code][platform]; Kafka [slides] & KSQL [demo] | Feb 21 | 13:00 | 17:00 | 2 | E. Della Valle | PT1 – DEIB – Building 20
Veracity: data quality and data wrangling [slides][slides] | May 29 | 14:00 | 16:00 | 2 | M. Brambilla | Online
Variety: web data extraction and data integration [slides][slides] | June 5 | 16:30 | 18:30 | 1 | M. Brambilla | Online

Part 4: Project Work
Support to project work | Mid-July | TBD | TBD | 3 | M. Brambilla + D. Ardagna | Online
Evaluation of project work | Last week of July | TBD | TBD | 3 | M. Brambilla + E. Della Valle | Online

Exam

Students will be required to build a research case, identifying business value, data and methods, using the tools to analyze and visualize data, critically analyzing pitfalls, and highlighting their contributions.

The evaluation will be based on a concrete implementation of a case proposed by the instructors, where students will be asked to implement the data management phases discussed in class on a practical example, using cloud-based large-scale data management platforms and technologies.

