The Web and Data Science course focuses on the study of large-scale socio-technical systems associated with the World Wide Web. It considers the relationship between people and technology, the ways that society and technology complement one another, and the way they affect broader society. These analyses are inherently associated with Big Data management issues.
The course is given at the Como Campus by Marco Brambilla and Emanuele Della Valle.
Up-to-date calendar of the course on Google Docs: calendar
Official course page on Polimi site: web page
This is a (possibly partial) list of course materials:
LESSON 4. The logical architecture of a Big Data platform. Hands-on HIVE: context, read me, data (138MB!). Innovating the Hadoop ecosystem: the approach of the Berkeley Data Analytics Stack. Introduction to Spark.
LESSON 5. Let’s try Databricks Community Edition. From RDD, transformations, and actions to Datasets and SQL queries (Databricks’ notebooks). Notebooks developed in class: basics and word count (using hamlet.txt). Demo of 100x effect using the same data of the HIVE hands-on of lesson 4 (readme, 127 MB of data in parquet).
LESSON 8. Clustering and PCA
LESSON 9. Classification and Neural Networks
LESSON 10. Graph Databases – Neo4j
LESSON 11. Web Search foundations
LESSON 13. Data Wrangling
LESSON 14-15. Data Science with Spark 2.x using Databricks notebooks. Please remember to bring your laptops. Power extension cords will be available in class.
LESSON 16. Human Computation and Crowdsourcing
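The word count exercise mentioned for Lesson 5 can be sketched in plain Python. This is only an illustration of the idea, not the notebook developed in class: it does not use Spark, the function name `word_count` and the tokenization rule are assumptions made for the example, and the three steps merely mirror the roles that `flatMap`, `map` and `reduceByKey` play in the usual RDD formulation.

```python
from collections import defaultdict
from itertools import chain
import re


def word_count(lines):
    """Count word occurrences in an iterable of text lines (plain Python, no Spark)."""
    # flatMap-like step: split every line into lowercase words.
    # The regular expression is an assumed tokenization rule.
    words = chain.from_iterable(
        re.findall(r"[a-z']+", line.lower()) for line in lines
    )
    # map-like step: pair every word with the count 1.
    pairs = ((word, 1) for word in words)
    # reduceByKey-like step: sum the counts that share the same word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


if __name__ == "__main__":
    # A single hard-coded line stands in for a text file such as hamlet.txt.
    sample = ["To be, or not to be, that is the question:"]
    ranked = sorted(word_count(sample).items(), key=lambda kv: (-kv[1], kv[0]))
    for word, n in ranked:
        print(word, n)
```

To run it on a real file, pass an open file handle instead of the sample list, for example `word_count(open("hamlet.txt", encoding="utf-8"))`, assuming the file is in the working directory.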
PAST YEAR’S CLASSES NOT INCLUDED IN 2017-18 PROGRAM:
LESSON . Recommendations
LESSON . RDF-S and OWL
40% of the grade of this course is granted based on the evaluation of a project work (see also the slides presented in the 1st lesson).
To access the data you will use for the project, you have to sign a non-disclosure agreement (NDA). Please download it from http://bit.ly/WebSci2018NDA, complete it in digital form (so that I can read your email address), print two copies and bring them to the lecturer. As soon as you submit the signed NDA, you will be able to access the data. Please note that you will also be asked to sign a receipt of delivery.
Use this form to submit your project work proposal (http://bit.ly/WebSci2018SubmitPrjWork). The proposal includes the members of your group, a title, a short description of your proposal and the dataset you intend to use. Sending the form is not enough. Please do not start working on your proposal straight away; wait for an email from prof. Emanuele Della Valle confirming the adequacy of your proposal.
The deadline for those who attend the lectures is October 16th, 2017. All others should contact prof. Emanuele Della Valle well ahead of the exam session they want to attend. You have to submit the project 1 week before the exam session you want to attend.
Prof. Marco Brambilla obtained several Azure Passes from Microsoft. Please contact him to obtain one. The passes last 3 months, and they must be activated before a deadline.
For those who are working on the project during the term, the 1st mid-term review will take place on November 20th, while the 2nd mid-term review will take place on December 4th. We prepared two Doodle polls where you can book your time slot to talk with me and Marco: one for the 1st date and one for the 2nd date. We reserved 12 slots because we have received 12 project proposals so far. Please enter in the Doodle poll the number you received in the email that approved your project proposal.
Content-wise, we expect you to be able to answer these questions:
1. what is your problem?
2. why are the chosen datasets useful to solve the problem?
3. which methods and technologies are you using?
4. can you show a partial implementation that shows you can solve the problem?
5. how do you intend to proceed to finish by December 15th-22nd?
6. is there anything that is blocking you?
For those who are working on the project during the term, the final evaluation of your project will take place either on December 15th or on December 18th. We prepared a Doodle poll with 12 slots. Please enter in the Doodle poll the number you received in the email that approved your project proposal.
Content-wise, we expect you to
- introduce your project work with a presentation (not necessarily PowerPoint; text and images in notebooks will also do) that describes
  - your problem
  - the chosen datasets
  - the rationale for choosing those datasets to address your problem
  - the methods and the technologies you used
- show practically what you did (executable notebooks are probably the best support, but you are free to use what you prefer)
- conclude with a presentation of the results you obtained
We would appreciate it if you could leave all this material with us in digital form.