This is the archived version of the USDE 2019-20 course page. The page of the 2020-21 edition is here.
Introduction
The course provides the foundational concepts and methods for designing, storing, analyzing and managing semi-structured and unstructured data, both in batch and in streaming. The course aims to tame the variety (data in many forms) and velocity (analyzing data streams to enable real-time decisions) dimensions of Big Data, without forgetting the volume dimension.
The variety-oriented part of the course will focus on NoSQL (and not-only-SQL) models and technologies. Students will learn how to select appropriate data management solutions to deal with scalability, availability, consistency, performance and expressiveness requirements.
The course will cover high-level Big Data problems and dimensions, NoSQL data models and technologies (graph, column, document, key-value based storage; persistent and volatile solutions) and design techniques for NoSQL, the transition from ACID to BASE transactional properties, the specification of CRUD primitives (create, read, update, delete) implemented at scale, and sharding and replication strategies.
The velocity-oriented part of the course will focus on time series, data streams and events both from a deductive and an inductive perspective. The deductive one focuses on domain-specific languages and knowledge representation techniques. Its main goal is to guide the students in exploring the trade-off between usability and rich formal semantics of query languages. The inductive one examines machine-learning problems focusing on massive online learning and, in particular, on the ability to learn when to forget past information.
Finally, the course will cover the basic aspects of the data analysis pipeline from a data engineering perspective: acquisition, integration, exploration, mining, analytics, visualization, and interpretation.
Teachers
The course is offered by prof. Marco Brambilla and prof. Emanuele Della Valle. For the current academic year (2019-20), prof. E. Della Valle is responsible for the course and prof. M. Brambilla assists him. Next academic year (2020-21), they will switch roles.
Calendar
- 16-Sep-2019 10:30-12:00: Introduction to the course & Big Data Engineering/Science – Della Valle
- 19-Sep-2019 13:30-15:00: Intro to NoSQL & Data Models – [slides] Brambilla
- 23-Sep-2019 10:30-12:00: Graph (neo4j) – [slides] Brambilla
- 26-Sep-2019 13:30-15:00: Key-value DB (Redis) – [slides] Brambilla
- 30-Sep-2019 10:30-12:00: Columnar DB (Cassandra) – [slides] Brambilla
- 03-Oct-2019 13:30-15:00: SUSPENDED FOR GRADUATION EXAMS
- 07-Oct-2019 10:30-12:00: RDF [slides] + SPARQL [slides] [blackboard] – Della Valle
- 10-Oct-2019 13:30-15:00: Document NoSQL (Mongo) – [slides] Brambilla
- 14-Oct-2019 10:30-12:00: Exercise RDF + SPARQL on DBpedia Knowledge Graph [gitter channel] – Della Valle
- 17-Oct-2019 13:30-15:00: Intro to scalable processing: Hadoop + MapReduce [slides] & Vertical vs Horizontal Scalability [slides] – Della Valle
- 21-Oct-2019 10:30-12:00: Modern scalable processing with Spark [slides][tool:databricks CE][hands-on] – Della Valle
- 24-Oct-2019 13:30-15:00: Taming Data Velocity [slides] – Della Valle
- 28-Oct-2019 10:30-12:00: InfluxDB and Flux [slides] – Della Valle
- 31-Oct-2019 13:30-15:00: EPL for stream transformations and complex event processing [slides][try on-line][data/queries][code] – Della Valle
- 04-Nov-2019 10:30-12:00: SUSPENDED FOR MID-TERM EXAMS (no mid-term for the course)
- 07-Nov-2019 13:30-15:00: SUSPENDED
- 11-Nov-2019 10:30-12:00: Crowdsourcing – Human Computation [slides]- Brambilla
- 14-Nov-2019 13:30-15:00: Kafka [slides] – Della Valle
- 18-Nov-2019 10:30-12:00: Crowdsourcing – Gamification [slides]- Brambilla
- 21-Nov-2019 13:30-15:00: KSQL [demo][system]- Della Valle
- 25-Nov-2019 10:30-12:00: Stream Reasoning and C-SPARQL [slides1,slides2][demo][system] – Della Valle
- 28-Nov-2019 13:30-15:00: Data engineering & Data science pipeline [slides] + Data ingestion [code][platform] – Della Valle
- 02-Dec-2019 10:30-12:00: SUSPENDED
- 05-Dec-2019 13:30-15:00: Data collection (scraping + API) [slides] – Brambilla
- 09-Dec-2019 10:30-12:00: Data wrangling [slides]- Brambilla
- 12-Dec-2019 13:30-15:00: Data preparation, augmentation & analysis [slides][code][platform] – Della Valle
- 16-Dec-2019 10:30-12:00: AWS offering for unstructured and streaming data engineering [slide]- Alex Casalboni
- 19-Dec-2019 13:30-15:00: Exam Preview [slide], Project Requirement and Streaming Machine Learning [slide] – Della Valle, Brambilla, Bernardo
Grading
The exam consists of a written test and an optional practical project (max 7 marks).
- The written test contains free-text questions on any of the course subjects, plus exercises. Students can get up to 30L in the written test.
- The practical project requires using one or more of the technologies presented in the lectures. It is optional. Only students who score at least 24/30 in the written test can opt for it. Students can agree on the project topic directly with one of the two teachers.
- The final grade is computed as follows: written test result + practical project result. E.g., written test 25 + practical project 7 = 30L
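The grading rule above can be sketched in a few lines of Python. This is only an illustration, not an official calculator: the function name `final_grade` and the constants are hypothetical, marks are assumed to be integers, the written mark is taken in the 0-30 range (a 30L on the written test alone is not modelled), and any total above 30 is assumed to be recorded as 30L, which is what the 25 + 7 example suggests.

```python
PROJECT_THRESHOLD = 24  # minimum written mark needed to opt for the project (from the text)
MAX_PROJECT_MARKS = 7   # maximum marks awarded by the project (from the text)


def final_grade(written: int, project: int = 0) -> str:
    """Return the final grade as a string, e.g. "27" or "30L".

    Assumption: totals above 30 are reported as "30L".
    """
    if not 0 <= written <= 30:
        raise ValueError("written mark must be between 0 and 30")
    if not 0 <= project <= MAX_PROJECT_MARKS:
        raise ValueError("project mark must be between 0 and 7")
    if project > 0 and written < PROJECT_THRESHOLD:
        raise ValueError("the project requires at least 24/30 in the written test")

    total = written + project
    return "30L" if total > 30 else str(total)


if __name__ == "__main__":
    print(final_grade(25, 7))  # the example from the text
    print(final_grade(24, 3))  # a total that stays below the cap
```

Under these assumptions, a student with 25 in the written test and a full-marks project ends up with 30L, while a student below 24 keeps the written mark as the final grade.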