Instructions for lab sessions
Assessment is based on one programming assignment and an unseen exam. Most of the syllabus material is in the online lecture notes (below), but note-taking and additional reading are strongly advised.
The first meeting for the course will be the first lecture in week 1.
Your first lab will be your first scheduled lab session after the lecture on k-means clustering.
If you prefer, you can approach me at the end of a lecture.
If that doesn't work, you can talk to me (or a lab tutor) in your next lab.
If all else fails, find me at the end of a lecture or lab and ask to arrange a special meeting.
Don't send me questions by email. Replying takes far too long, and I usually can't answer such questions anyway, for fear of creating unfairness.
A good introduction to the area is included in Russell and Norvig's `Artificial Intelligence: A Modern Approach'. The most up-to-date version of this is the 3rd edition.
Other, book-based introductions are listed in the references at the end.
o http://www.thearling.com/text/dmtechniques/dmtechniques.htm A nice overview of ML techniques and issues. There is also a complete tutorial at http://www.thearling.com/dmintro/dmintro_2.htm
See also online materials for the Tan text.
Sometime in the early 1980s, an animal behaviour researcher studying the social behaviour of primates was killed in a car accident while returning from a field trip. In his belongings, police discovered several listings of data but the notebooks, assumed to contain details about the origins of the data, were all burned. The longest of the listings is reproduced at ex1-data.txt. This is in tabulated form and it is generally believed that each line records the values of several variables, followed by a value which is either 1 or 0. This is believed to be a classification value of some sort but the nature of the class has never been determined. The final 100 lines in the listing are missing this classification value.
The exercise involves (1) analysing the properties of the data, (2) implementing and then using one or more machine learning methods to derive predicted classification values for the final 100 cases, (3) submitting the predictions online and (4) presenting the results of the experiment in a report.
Programs can be implemented in Java or any similar language, but not MATLAB.
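As a starting point for step (1), here is a minimal Java sketch, not a required solution, of reading the listing. It assumes each line of ex1-data.txt holds whitespace-separated numeric values, which matches the tabulated form described above; the exact column layout is an assumption you should check against the real file.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class LoadData {

    // Parse one line of the listing into a numeric row.
    static double[] parseRow(String line) {
        String[] parts = line.trim().split("\\s+");
        double[] row = new double[parts.length];
        for (int i = 0; i < parts.length; i++)
            row[i] = Double.parseDouble(parts[i]);
        return row;
    }

    public static void main(String[] args) throws IOException {
        List<double[]> rows = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader("ex1-data.txt"))) {
            String line;
            while ((line = in.readLine()) != null)
                if (!line.trim().isEmpty()) rows.add(parseRow(line));
        }
        // The final 100 rows are the unclassified test cases.
        int split = rows.size() - 100;
        System.out.println(split + " training rows, 100 test rows");
    }
}
```

Keeping the labelled rows and the final 100 unlabelled rows in separate lists makes the later training and prediction steps straightforward.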
Marks for the assessment will be awarded using the following scheme.
The full report should be submitted as a PDF document via Study Direct by the end of Thursday of week 8. As part of that submission, the predicted output values should be saved as a .txt file whose name is your last name followed by your first name. If it were me, I would save the file as ThorntonChris.txt. The PDF file should have the same name, but with the .pdf extension rather than .txt. The .txt file should contain the predicted classification values for the 100 test cases, in the right order, with each value on a separate line. Viewed using WordPad or Notepad, the file should look something like ThorntonChris.txt.
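The required format can be sketched as follows. This is only an illustration of the one-value-per-line layout: the all-zero predictions array is a placeholder for your classifier's real output, and the filename follows the lastname-firstname rule described above.

```java
import java.io.IOException;
import java.io.PrintWriter;

public class WritePredictions {

    // Format the predicted class values as one value per line.
    static String format(int[] predictions) {
        StringBuilder sb = new StringBuilder();
        for (int p : predictions) sb.append(p).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        int[] predictions = new int[100]; // replace with your classifier's output
        try (PrintWriter out = new PrintWriter("ThorntonChris.txt")) {
            out.print(format(predictions));
        }
    }
}
```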
General guidelines for the writing of program reports are below. However, the section requirements above should be treated as definitive.
Usual departmental penalties apply for late submissions.
Students resitting the course should make sure they implement a different method this time around.
1. INTRODUCTION
This should be a general introduction to the theory, techniques and general area of work that the program relates to.
2. ANALYSIS OF THE TRAINING DATA
There is no `one right answer' for this section. Your analysis could be fairly formal, involving statistical analyses of correlation etc., possibly making use of Excel scattergrams etc. Or it could be more informal and rely on visual inspection of the data and the making of intelligent assumptions about the task from which the data were obtained.
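One simple formal check of the kind mentioned above, sketched here under the assumption that the input variables are numeric, is the Pearson correlation between a single variable and the 0/1 classification value. The sample data below is invented purely for illustration.

```java
public class Correlation {

    // Pearson correlation coefficient between two equal-length samples.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
        }
        double cov = sxy - sx * sy / n;
        return cov / Math.sqrt((sxx - sx * sx / n) * (syy - sy * sy / n));
    }

    public static void main(String[] args) {
        double[] variable = {1, 2, 3, 4, 5, 6}; // one column of the data (invented)
        double[] cls      = {0, 0, 0, 1, 1, 1}; // the classification value
        System.out.printf("r = %.3f%n", pearson(variable, cls));
    }
}
```

A value of r near +1 or -1 for some variable would suggest it carries most of the class information on its own; values near 0 for every variable would suggest the class depends on combinations of variables.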
3. RESULTS
This is where you present the results of your study, e.g., levels of training and cross-validation error achieved by your method(s). You may also want to show some sample predicted outputs. You may want to talk about how your program(s) deal with the anomalous cases in the data. If you've used the kNN method and employed cross-validation to determine the best value of k, this is where you would present a graph relating values of k to cross-validation error.
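To illustrate that last point, here is a minimal, hypothetical sketch of kNN with leave-one-out cross-validation used to compare values of k. The two-cluster data is invented for illustration; your program would use the real training data instead.

```java
import java.util.Arrays;
import java.util.Comparator;

public class KnnCv {

    // Euclidean distance between two feature vectors.
    static double dist(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // Predict the class of query by majority vote among its k nearest
    // neighbours, skipping index `skip` (for leave-one-out; pass -1 to use all).
    static int predict(double[][] data, int[] labels, double[] query, int k, int skip) {
        Integer[] idx = new Integer[data.length];
        for (int i = 0; i < data.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> dist(data[i], query)));
        int votesFor1 = 0, taken = 0;
        for (int i : idx) {
            if (i == skip) continue;
            if (labels[i] == 1) votesFor1++;
            if (++taken == k) break;
        }
        return votesFor1 * 2 > k ? 1 : 0;
    }

    // Leave-one-out cross-validation error rate for a given k.
    static double looError(double[][] data, int[] labels, int k) {
        int wrong = 0;
        for (int i = 0; i < data.length; i++)
            if (predict(data, labels, data[i], k, i) != labels[i]) wrong++;
        return (double) wrong / data.length;
    }

    public static void main(String[] args) {
        // Tiny illustrative training set: two well-separated clusters.
        double[][] data = {{1, 1}, {1, 2}, {2, 1}, {2, 2}, {8, 8}, {8, 9}, {9, 8}, {9, 9}};
        int[] labels    = { 0,      0,      0,      0,      1,      1,      1,      1};
        // Try odd k values and report the cross-validation error for each.
        for (int k = 1; k <= 5; k += 2)
            System.out.println("k=" + k + " LOO error=" + looError(data, labels, k));
    }
}
```

The error rates produced by such a loop, one per candidate k, are exactly the values you would plot in the graph mentioned above.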
4. ILLUSTRATION - Illustrations of your program(s) working. This could be based on an edited output listing, with annotations to show what is going on. If the program uses a GUI, it could be based on a sequence of annotated screenshots. (If necessary, `screenshots' can be drawn by hand.) However, edit out repetitive chunks from any program output and edit in explanatory comments. Annotate the most significant bits of the output to draw attention to them.
5. SYSTEM OVERVIEW - Explain precisely HOW the program works.
You should discuss the main steps the program goes through to achieve its results. You do not need to describe every line of the program. You should give an overview of what sorts of things are represented, how they are represented, where input comes from, how it is transformed, where output goes to, etc. The overview should refer to a copy of the program attached as appendix-1 (see below). In particular, make sure you describe what kinds of objects are represented, and how you represent them (e.g., using lists, or vectors, or numerical values for variables, or whatever). Explain what problems your program had to solve, and how it solved them. Use figures and diagrams productively to back up the text. Do not rely on an automatic documentation facility (e.g., javadoc) to write the system overview for you.
When writing the overview, don't write as if you are communicating with a tutor. Equally don't write as if for a novice. Try to write for an audience consisting of fellow students who may be a week or two behind you in their understanding. When describing what a method does, always adhere to the principles of modularity, i.e., start by stating what sorts of inputs the method takes, and what sorts of output it produces. Give an explanation of the relation between input and output, and one or two simple examples to illustrate. If there are no inputs, or no outputs, then say so.
This is where you should assess what you have achieved and discuss any limitations or problems. It's also the place to talk about potential future developments. Don't be afraid to criticise your own program. If you have read about similar programs, or related programs, it may be appropriate to include some comparisons. What lessons, if any, have been learnt? Were your goals achieved? Is there anything you now think you should have done differently?
This is where you present a formal evaluation of the whole submission in terms of the specified marking criteria. For each criterion, note how well you think you've done (and why) and give yourself a mark.
List books, articles, web pages, files etc. considered to be relevant.
8. CODE APPENDIX - The whole program, with comments explaining what the classes represent, what the methods do, what the global variables (if any) are for, etc.
Hand, D., Mannila, H. and Smyth, P. (2001). PRINCIPLES OF DATA MINING. Cambridge, Mass.: MIT Press.
Tan, P., Steinbach, M. and Kumar, V. (2006). INTRODUCTION TO DATA MINING. Boston: Pearson/Addison Wesley.