Introduction
The course provides the foundational concepts and methods for designing, storing, analyzing and managing semi-structured and unstructured data, both in batch and in streaming. The course aims at taming the variety (data in many forms) and velocity (analyzing data streams to enable real-time decisions) dimensions of Big Data, without forgetting the volume dimension.
The variety-oriented part of the course will focus on NoSQL (and not-only-SQL) models and technologies. Students will learn how to select appropriate data management solutions to deal with scalability, availability, consistency, performance and expressiveness requirements.
The course will cover high-level Big Data problems and dimensions, NoSQL data models and technologies (graph, column, document and key-value based storage; persistent and volatile solutions), design techniques for NoSQL, the transition from ACID to BASE transactional properties, the specification of CRUD primitives (create, read, update, delete) implemented at scale, and sharding and replication strategies.
The velocity-oriented part of the course will focus on time series, data streams and events both from a deductive and an inductive perspective. The deductive one focuses on domain-specific languages and knowledge representation techniques. Its main goal is to guide the students in exploring the trade-off between usability and rich formal semantics of query languages. The inductive one examines machine-learning problems focusing on massive online learning and, in particular, on the ability to learn when to forget past information.
Finally, the course will cover the basic aspects of the data analysis pipeline from a data engineering perspective: acquisition, integration, exploration, mining, analytics, visualization, and interpretation.
Teachers
The course is offered by prof. Emanuele Della Valle and prof. Marco Brambilla.
Calendar
IMPORTANT
Below is a first draft of the course calendar. According to the most recent information about COVID-19 regulations for PoliMI’s courses, lectures marked with [L] will take place online, while lectures marked with [E1] and [E2] will be offered in class. [E1] and [E2] will be identical sessions. Students whose ID (codice persona) is even must come to the [E1] classes, while students whose ID is odd must come to the [E2] classes.
- [L] 17/09/2020 (3h) Introduction to the course – on the need for unstructured and streaming Data Engineering – E. Della Valle [VIDEO RECORDING]
- [L] 24/09/2020 (3h) Data architectures and Unstructured Data Models – M. Brambilla [RECORDING][slides]
- [L] 08/10/2020 (3h) Unstructured Data Models: Graph and Key-Value Databases – M. Brambilla [RECORDING] [slides]
- [E] 13/10/2020 (1h) Practice and Q&A Session on Graph DB – Neo4J – M. Brambilla [RECORDING] (no slides)
- [L] 15/10/2020 (3h) Unstructured Data Models: Document and Columnar Databases – M. Brambilla [RECORDING] [slides]
- [L] 22/10/2020 (3h) Streaming Data Engineering – E. Della Valle [RECORDING][slides][drawing]
- [L] 29/10/2020 (3h) EPL – E. Della Valle [RECORDING][slides (without solutions)][try on-line][gitter room][data/queries][code]
- [E] 03/11/2020 (1h) Practice and Q&A Session – E. Della Valle [RECORDING] [slides (with solutions)]
- [L] 05/11/2020 (3h) Kafka – E. Della Valle [RECORDING][slides]
- [E] 10/11/2020 (1h) Practice and Q&A Session – E. Della Valle [RECORDING] [infrastructure as code][notebook on Kafka Basics]
- [L] 12/11/2020 (3h) KSQL – E. Della Valle [RECORDING][slides 1st part][slides 2nd part][code/data]
- [L] 19/11/2020 (3h) Spark Structured Streaming – E. Della Valle [RECORDING][Spark’s slides][Structured Streaming’s slides]
- [L] 26/11/2020 (3h) Data acquisition – M. Brambilla [RECORDING][slides1][slides2][code]
- [L] 03/12/2020 (3h) Flux – E. Della Valle [RECORDING] [slides]
- [L] 10/12/2020 (3h) Data wrangling – M. Brambilla [RECORDING] [slides][code]
- [L] 17/12/2020 (3h) Crowdsourcing – M. Brambilla [RECORDING] [slides1] [slides2]
- Exam Structure and Project Requirements [RECORDING][slides]
Grading
The exam consists of a written test and an optional practical project (max 7 marks).
- The written test contains questions, to be answered in free text, regarding any of the course subjects, and exercises. Students can get up to 30L in the written test.
- The practical project requires using one or more of the technologies presented in the lectures. It is optional. Only students who get at least 24/30 in the written test can opt for it.
- The final grade is computed as follows: written test result + practical project result. E.g., written test 25 + practical project 7 = 30L
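The grading rule above can be sketched in a few lines of Python. This is only an illustration, not an official calculator: the function name `final_grade` and the constants are hypothetical, and the rule that any total above 30 is reported as 30L is an assumption inferred from the example (25 + 7 = 30L). A 30L obtained in the written test alone is not modeled.

```python
from typing import Optional

# Values taken from the grading rules above.
MAX_PROJECT_MARKS = 7    # the project is worth at most 7 marks
PROJECT_THRESHOLD = 24   # minimum written-test result needed to opt for the project


def final_grade(written: int, project: Optional[int] = None) -> str:
    """Return the final grade as a string (hypothetical helper).

    Assumption: any total above 30 is reported as "30L", as suggested by
    the example 25 + 7 = 30L. A written-test 30L is not modeled here.
    """
    if not 0 <= written <= 30:
        raise ValueError("written test result must be between 0 and 30")
    if project is not None:
        if written < PROJECT_THRESHOLD:
            raise ValueError("the project requires at least 24/30 in the written test")
        if not 0 <= project <= MAX_PROJECT_MARKS:
            raise ValueError("project result must be between 0 and 7")
    # Without a project, the final grade is the written test result alone.
    total = written + (project if project is not None else 0)
    return "30L" if total > 30 else str(total)
```

For instance, `final_grade(25, 7)` reproduces the example given in the list, while `final_grade(26)` covers a student who skips the project.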
This is the official page of the USDE 2020-21 course. The page of the 2019-20 edition is archived here.