Computer Laboratory – Course pages 2016–17: Machine Learning and Real-world Data

Principal lecturers: Dr Simone Teufel, Prof Ann Copestake
Taken by: Part IA CST 75%
No. of lectures and practical classes: 16
Suggested hours of supervisions: 4
Prerequisite courses: NST Mathematics

Aims

This course introduces students to machine learning algorithms as used in real-world applications, and to the experimental methodology necessary to perform statistical processing of large-scale unpredictable processes such as language, social networks or genetic data. Students will perform 3 extended practicals, as follows:

Statistical classification: Determining a movie review’s sentiment using Naive Bayes (7 sessions)
Sequence Analysis: Detection of proteins in genetic data using Hidden Markov Modelling (4 sessions)
Network analysis of a social network, including detection of cliques and central nodes (5 sessions)

Syllabus

Topic One: Statistical Classification [7 sessions]. Introduction to Sentiment Classification. Naive Bayes Parameter Estimation. Statistical Laws of Language. Smoothing and Statistical Tests. Overtraining. Uncertainty and Human Agreement.

Topic Two: Sequence Analysis [4 sessions]. Simple HMM Parameter Estimation. The Viterbi Algorithm. Random Baselines and Evaluation Metrics.
Application to Protein Detection Data.

Topic Three: Network Analysis [5 sessions]. Degree, Diameter, Visualisation. Random Networks and Small World Property. Betweenness Centrality. Clique Finding.

Objectives

By the end of the course students should be able to
understand and program two simple supervised machine learning algorithms;
use these algorithms in statistically valid experiments, including the design of baselines, evaluation metrics, statistical testing of results, and provision against overtraining;
visualise and interpret examples of statistical laws of language;
visualise the connectivity and centrality in large networks;
use clustering (i.e., a type of unsupervised machine learning) for detection of cliques in unstructured networks.

Recommended reading

Jurafsky, D. & Martin, J. (2008). Speech and language processing. Prentice Hall.
Durbin, R., Eddy, S., Krogh, A. & Mitchison, G. (1998). Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press.
Easley, D. & Kleinberg, J. (2010). Networks, crowds, and markets: reasoning about a highly connected world. Cambridge University Press.

© 2017 Computer Laboratory, University of Cambridge
Information provided by Dr Simone Teufel