Data Science Group – Politecnico di Milano

Course

The Web and Data Science course focuses on the study of large-scale socio-technical systems associated with the World Wide Web. It considers the relationship between people and technology, the ways society and technology complement one another, and their impact on society at large. These analyses are inherently associated with Big Data management issues.

The course is taught at the Como Campus by Marco Brambilla and Emanuele Della Valle.

Up-to-date calendar of the course on Google Docs: calendar

Official course page on Polimi site: web page

This is a (possibly partial) list of the course materials:

LESSON 1. Introduction, Scenarios (1) and Scenarios (2), Exam and project rules

LESSON 2. Introduction to Big Data. How Netflix has used Big Data since the mid-2000s. Vertical vs. horizontal scalability.

LESSON 3. Scaling storage horizontally with Key-Value pairs. Scaling processing horizontally with Map-Reduce.

LESSON 4. The logical architecture of a Big Data platform. Hands-on HIVE: context, read me, data (138 MB!). Innovating the Hadoop ecosystem: the approach of the Berkeley Data Analytics Stack. Introduction to Spark.

LESSON 5. Let’s try Databricks Community Edition. From RDDs, transformations, and actions to Datasets and SQL queries (Databricks’ notebooks). Notebooks developed in class: basics and word count (using hamlet.txt). Demo of the 100x effect using the same data as the HIVE hands-on of lesson 4 (readme, 127 MB of data in Parquet).

LESSON 6. Practical statistics for Web Science. Intro to R and related tools. R introductory examples. Inferential Statistics examples and resources.

LESSON 7. Web APIs, REST APIs and scraping for Web data collection, including source code of the examples (ZIP).

LESSON 8. Clustering and PCA

LESSON 9. Classification and Neural Networks

LESSON 10. Graph Databases – Neo4j

LESSON 11. Web Search foundations

LESSON 12. Convolutional neural networks (by Darian Frajberg) and geometric deep learning (by Federico Monti)

LESSON 13. Data Wrangling

LESSON 14-15. Data Science with Spark 2.x using Databricks notebooks. Please remember to bring your laptops. Power extension cords will be available in class.

LESSON 16. Human Computation and Crowdsourcing

LESSON 17. Semantic Web technologies: the interoperability problem and an introduction to Semantic Web technologies

 

PAST YEARS’ CLASSES NOT INCLUDED IN THE 2017-18 PROGRAM:

LESSON . Recommendations

LESSON . RDF and exercise

LESSON . RDF-S and OWL

LESSON . RDF-S and OWL practical cases. Solutions

LESSON . SPARQL, examples, and putting it all together

Additional resources:

Guidelines for the exam

PROJECT WORK

40% of the course grade is based on the evaluation of a project work (see also the slides presented in the 1st lesson).

To access the data you will use for the project, you have to sign a non-disclosure agreement (NDA). Please download it from http://bit.ly/WebSci2018NDA, complete it in digital form (so that I can read your email address), print two copies and bring them to the lecturer. As soon as you submit the signed NDA, you will be able to access the data. Please note that you will also be asked to sign a receipt of delivery.

Use this form to submit your project work proposal (http://bit.ly/WebSci2018SubmitPrjWork). The proposal includes the members of your group, a title, a short description of your proposal and the dataset you intend to use. Sending the form is not enough. Please do not start working on your proposal straight away; wait for an email from prof. Emanuele Della Valle confirming that your proposal is adequate.

The deadline for those who attend the lectures is October 16th, 2017. All others should contact prof. Emanuele Della Valle well ahead of the exam session they want to attend. You have to submit the project 1 week before the exam session you want to attend.

Prof. Marco Brambilla obtained several Azure Passes from Microsoft. Please contact him to obtain one. The passes last 3 months and must be activated before a deadline.

For those who are working on the project during the term, the 1st mid-term review will take place on November 20th, while the 2nd mid-term review will take place on December 4th. We prepared two Doodle polls where you can book your time slot to talk with me and Marco: one for the 1st date and one for the 2nd date. We reserved 12 slots because so far we have received 12 project proposals. Please enter in the Doodle poll the number you received in the email that approved your project proposal.

Content-wise, we expect you to be able to answer these questions:
1. what is your problem?
2. why are the chosen datasets useful to solve the problem?
3. which methods and technologies are you using?
4. can you show a partial implementation that shows you can solve the problem?
5. how do you intend to proceed to finish by December 15th-22nd?
6. is there anything that is blocking you?

For those who are working on the project during the term, the final evaluation of your project will take place on either December 15th or December 18th. We prepared a Doodle poll with 12 slots. Please enter in the Doodle poll the number you received in the email that approved your project proposal.

Content-wise, we expect you to

  1. introduce your project work with a presentation (not necessarily PowerPoint; text and images in notebooks will also do) that describes
    • your problem
    • the chosen datasets
    • the rationale for choosing those datasets to address your problem
    • the methods and the technologies you used
  2. show practically what you did (executable notebooks are probably the best support, but you are free to use what you prefer)
  3. conclude with a presentation of the results you obtained

We would appreciate it if you could leave all this material with us in digital form.