Data Science Group – Politecnico di Milano


The Web and Data Science course focuses on the study of large-scale socio-technical systems associated with the World Wide Web. It considers the relationship between people and technology, the ways that society and technology complement one another, and the way they affect society at large. These analyses are inherently associated with Big Data management issues.

The course is taught at the Como Campus by Marco Brambilla and Emanuele Della Valle.

Up-to-date calendar of the course on Google Docs: calendar

Official course page on Polimi site: web page

This is a (possibly partial) list of course materials:

LESSON 1. Introduction. Scenarios (1) and Scenarios (2). Exam and project rules.

LESSON 2. Introduction to Big Data. How Netflix has used Big Data since the mid-2000s. Vertical vs. horizontal scalability.

LESSON 3. Scaling storage horizontally with Key-Value pairs. Scaling processing horizontally with Map-Reduce.

LESSON 4. The logical architecture of a Big Data platform. Hands-on HIVE: context, read me, data (138MB!). Innovating the Hadoop ecosystem: the approach of the Berkeley Data Analytics Stack. Introduction to Spark.

LESSON 5. Let’s try Databricks Community Edition. From RDDs, transformations, and actions to Datasets and SQL queries (Databricks’ notebooks). Notebooks developed in class: basics and word count (using hamlet.txt). Demo of the 100x effect using the same data as the HIVE hands-on of Lesson 4 (readme, 127 MB of data in Parquet).

LESSON 6. Practical statistics for Web Science. Intro to R and related tools. R introductory examples.

LESSON 7. Web APIs, REST APIs, and scraping for Web data collection, including source code of examples (ZIP).

LESSON 8. Clustering and PCA



LESSON . Classification

LESSON . Recommendations

LESSON . Data Wrangling and Data Cleansing

LESSON . Web Search foundations

LESSON . Human Computation and Crowdsourcing

LESSON . Semantic Web and RDF and exercise


LESSON . RDF-S and OWL practical cases. Solutions

LESSON . SPARQL examples, and putting it all together

Additional resources:

Guidelines for the exam


40% of the grade for this course is based on the evaluation of a project (see also the slides presented in the first lesson).

To access the data you will use for the project, you have to sign a non-disclosure agreement (NDA). Please download it, complete it in digital form (so that I can read your email address), print two copies, and bring them to the lecturer. As soon as you submit the signed NDA, you will be able to access the data. Please note that you will also be asked to sign a receipt of delivery.

Use this form to submit your project proposal. The proposal includes the members of your group, a title, a short description of your proposal, and the dataset you intend to use. Sending the form is not enough. Please do not start working on the project straight away; wait for an email from Prof. Emanuele Della Valle confirming the adequacy of your proposal.

The deadline for students who attend the lectures is October 16th, 2017. All others should contact Prof. Emanuele Della Valle well ahead of the exam session they want to attend. You have to submit the project one week before the exam session you want to attend.

Prof. Marco Brambilla obtained several Azure Passes from Microsoft. Please contact him to obtain one. The passes last 3 months, so do not ask for one until you intend to start the project.