CSC8303 - Assignment

CSC8303 -- Bioinformatics Programming in Java

This is the sole assignment for the CSC8303 course. Do not worry if you are unsure of the techniques needed; they will become clear during the lectures and practicals.

You have been asked, as part of a larger bioinformatics project, to design a package that parses EMBL files. As your work will be used by other members of the lab, it is important that you test your API thoroughly. You can obtain EMBL files for testing from the previous link.

It is important that you comment your code and provide suitable documentation (5 marks). Finally, you must follow a suitable style for your code, including appropriate variable names, indenting and spacing. Up to 25% of the available marks will be awarded for clearly written code that conforms to standard coding conventions.

You are required to carry out all of steps 1-3 and one part of step 4.

EMBL files adhere to a clear, well-defined structure: each line starts with a two-character tag, followed by three spaces, then a value. For example, the first line of the file is the ID line, which includes the entry ID as well as the sequence length.

1. Develop an API for representing the information in an EMBL file. This must include the EMBL ID, the species information as a list of taxon classifications, and the DNA sequence. In terms of Java classes, you need to produce at least EMBL.java and Sequence.java, with each EMBL object having a Sequence object as a field. (8 marks)

2. Develop a class with a main method that takes the path of an EMBL file as a command line parameter. This class should use the file path passed to the main method to create a Scanner object, and then parse the EMBL file into the classes you defined in step 1. (4 marks)

3. The Sequence class should implement the java.lang.CharSequence interface. It should store the sequence as a List of java.lang.Character objects. In particular, charAt(int) should extract the appropriate Character from the list and return the equivalent char. (6 marks)

4. Finally, carry out one of the following: (6 marks)

   - Write a method that searches the Sequence for a given DNA string. Write a test method that searches for a Shine-Dalgarno sequence (AGGAGGU in mRNA, which corresponds to AGGAGGT in DNA) in the DNA sequence.

   - Write a class that, given a set of EMBL records, discovers the taxonomic values that all of the species of origin share. This should be done by comparing the internal taxonomic lists in your EMBL objects. For example, given Homo sapiens and Gorilla gorilla, the EMBL organism classification lines would read as follows:
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Catarrhini; Hominidae; Gorilla.
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Catarrhini; Hominidae; Homo.
Hence, the following list should be returned:

"Eukaryota" "Metazoa" "Chordata" "Craniata" "Vertebrata" "Euteleostomi" "Mammalia" "Eutheria" "Euarchontoglires" "Primates" "Catarrhini" "Hominidae"

Files should be uploaded as Java source files. Any files that you wish to include to support the code should be .txt files.
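To make the organism classification example above concrete, here is a minimal sketch of how the shared taxa could be computed. It is an illustration under stated assumptions, not the expected solution. The class name SharedTaxonomy and the methods parseOcLine and sharedTaxa are hypothetical and not part of the required API. The sketch also assumes that the whole classification sits on a single OC line, as in the example, and that "shared" means present in every list, not a common leading prefix.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the required API. */
final class SharedTaxonomy {

    /** Splits the value of a single OC line into its taxon names. */
    static List<String> parseOcLine(String line) {
        if (!line.startsWith("OC")) {
            throw new IllegalArgumentException("Not an OC line: " + line);
        }
        // Everything after the two-character tag is the value; trim the padding.
        String value = line.substring(2).trim();
        // Drop the full stop that ends the classification.
        if (value.endsWith(".")) {
            value = value.substring(0, value.length() - 1);
        }
        List<String> taxa = new ArrayList<>();
        for (String taxon : value.split(";")) {
            String trimmed = taxon.trim();
            if (!trimmed.isEmpty()) {
                taxa.add(trimmed);
            }
        }
        return taxa;
    }

    /**
     * Returns the taxa that occur in every classification,
     * in the order in which they appear in the first one.
     */
    static List<String> sharedTaxa(List<List<String>> classifications) {
        if (classifications.isEmpty()) {
            return new ArrayList<>();
        }
        // Start from a copy of the first list and keep only what the others contain.
        List<String> shared = new ArrayList<>(classifications.get(0));
        for (List<String> other : classifications.subList(1, classifications.size())) {
            shared.retainAll(other);
        }
        return shared;
    }

    public static void main(String[] args) {
        List<String> gorilla = parseOcLine(
            "OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; "
            + "Mammalia; Eutheria; Euarchontoglires; Primates; Catarrhini; Hominidae; Gorilla.");
        List<String> homo = parseOcLine(
            "OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; "
            + "Mammalia; Eutheria; Euarchontoglires; Primates; Catarrhini; Hominidae; Homo.");
        // Prints the twelve taxa from Eukaryota to Hominidae.
        System.out.println(sharedTaxa(List.of(gorilla, homo)));
    }
}
```

For the two example records, an intersection and a common-prefix comparison give the same answer, because the classifications differ only in their final entry. Which interpretation suits your own EMBL class is a design decision worth stating in your documentation.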