BINF3020 - Assignment 1, Part 2

BINF3020 - Semester 2, 2017

Assignment 1 - Protein sequence evolution and alignment

Allocation: assignment 1 is worth 20% of your final assessment.

The aim of this assignment is for you to write software in C, Java, Python (or alternative by arrangement) and run it on some biological sequence data to generate outputs which you will write up in a comprehensive report. The assignment will test your understanding of some key concepts of evolutionary modelling and sequence alignment, and will require you to demonstrate good written communication skills.

The assignment is structured to have three deliverable parts, as follows.

Part 1   implement a protein sequence evolution simulator [35% of total mark]
Part 2   implement alignment by dynamic programming [45% of total mark]
Part 3   run your software to simulate the evolution of a protein sequence over increasing evolutionary time, and align the mutated sequences with the original. Write a report including a plot of percentage identity against evolution time and discuss the results. [20% of total mark]

Each part is built upon the previous part and you should complete each part before moving on to the next. The marking will be based on having a running program for each part and writing a proper report. Marks will also be given for good programming and writing style. Marks will be deducted for program bugs and other errors.

The standard warnings against plagiarism apply and will definitely be enforced. Penalties for late submission of assignment parts are under consideration.

Part 2

Description of the task:

You will implement pairwise global alignment of protein sequences in C or Java. The method will be the Needleman-Wunsch dynamic programming algorithm using an affine gap cost method for scoring gaps. You can base your implementation on the BINF3020 lecture notes on "Pairwise Sequence Alignment 1: Optimal methods".
However, it is best to get started by first implementing and testing the Needleman-Wunsch algorithm using a linear gap penalty d = 8, and only when that is working, refine your program to implement the affine gap penalty method using a gap opening penalty d = 7 and a gap extension penalty e = 1. These penalty values should be set using constants internally in your code.

Your program needs to handle the following task. The output file generated by your program for part 1 contains, in order, your original personal unique amino acid sequence followed by 500 sequences representing its evolutionary divergence over 500 mutation cycles. The alignment program should find the optimal alignment score of the original sequence with itself and with each of the 500 mutated sequences.

The output for each alignment will contain two values. First will be the alignment score returned by the dynamic programming algorithm, and second the percentage of identities in the alignment. Note that you don't have to generate the actual alignment, just these scores.
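The two stages suggested above can be sketched in Python as follows. This is an illustration under stated assumptions, not the lecture-notes implementation: `TOY_SCORES` is a hypothetical three-letter excerpt (A, C, D) of the assignment's matrix standing in for a full BLOSUM62 lookup, the function names are made up for this example, and the affine version assumes that a gap of length g costs d + (g - 1) * e, which may differ from the convention in the lecture notes. The linear version also counts alignment columns and identical pairs during traceback; the affine version returns the score only, since its traceback would have to follow all three tables.

```python
NEG_INF = float("-inf")

# Hypothetical stand-in for a full BLOSUM62 lookup: only the A, C and D
# entries, copied from the assignment's matrix, to keep the sketch short.
TOY_SCORES = {
    ("A", "A"): 4, ("C", "C"): 9, ("D", "D"): 6,
    ("A", "C"): 0, ("A", "D"): -2, ("C", "D"): -3,
}


def sub(a, b):
    """Substitution score for residues a and b (the table is symmetric)."""
    key = (a, b) if (a, b) in TOY_SCORES else (b, a)
    return TOY_SCORES[key]


def align_linear(s, t, d=8):
    """Needleman-Wunsch with linear gap penalty d.

    Returns (score, percentage of identities) for one optimal alignment.
    """
    n, m = len(s), len(t)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -d * i
    for j in range(1, m + 1):
        F[0][j] = -d * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + sub(s[i - 1], t[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    # Traceback: k counts alignment columns, ident counts identical pairs.
    i, j, k, ident = n, m, 0, 0
    while i > 0 or j > 0:
        k += 1
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub(s[i - 1], t[j - 1]):
            if s[i - 1] == t[j - 1]:
                ident += 1
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            i -= 1  # residue of s aligned to a gap
        else:
            j -= 1  # residue of t aligned to a gap
    pct = 100.0 * ident / k if k else 0.0
    return F[n][m], pct


def score_affine(s, t, d=7, e=1):
    """Global alignment score with affine gaps (three-table recurrence).

    Assumption: a gap of length g costs d + (g - 1) * e.
    M ends in a residue pair, X in a gap in t, Y in a gap in s.
    """
    n, m = len(s), len(t)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -d - (i - 1) * e
    for j in range(1, m + 1):
        Y[0][j] = -d - (j - 1) * e
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = sub(s[i - 1], t[j - 1]) + max(
                M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] - d, X[i - 1][j] - e, Y[i - 1][j] - d)
            Y[i][j] = max(M[i][j - 1] - d, Y[i][j - 1] - e, X[i][j - 1] - d)
    return max(M[n][m], X[n][m], Y[n][m])
```

With these toy scores, aligning "ACCD" against "AD" shows the difference between the two schemes: the two-residue gap costs 16 under the linear penalty but only 7 + 1 = 8 under the assumed affine penalty.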
BLOSUM62 amino acid substitution matrix:

    A  B  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  X  Y  Z
A   4 -2  0 -2 -1 -2  0 -2 -1 -1 -1 -1 -2 -1 -1 -1  1  0  0 -3 -1 -2 -1
B  -2  6 -3  6  2 -3 -1 -1 -3 -1 -4 -3  1 -1  0 -2  0 -1 -3 -4 -1 -3  2
C   0 -3  9 -3 -4 -2 -3 -3 -1 -3 -1 -1 -3 -3 -3 -3 -1 -1 -1 -2 -1 -2 -4
D  -2  6 -3  6  2 -3 -1 -1 -3 -1 -4 -3  1 -1  0 -2  0 -1 -3 -4 -1 -3  2
E  -1  2 -4  2  5 -3 -2  0 -3  1 -3 -2  0 -1  2  0  0 -1 -2 -3 -1 -2  5
F  -2 -3 -2 -3 -3  6 -3 -1  0 -3  0  0 -3 -4 -3 -3 -2 -2 -1  1 -1  3 -3
G   0 -1 -3 -1 -2 -3  6 -2 -4 -2 -4 -3  0 -2 -2 -2  0 -2 -3 -2 -1 -3 -2
H  -2 -1 -3 -1  0 -1 -2  8 -3 -1 -3 -2  1 -2  0  0 -1 -2 -3 -2 -1  2  0
I  -1 -3 -1 -3 -3  0 -4 -3  4 -3  2  1 -3 -3 -3 -3 -2 -1  3 -3 -1 -1 -3
K  -1 -1 -3 -1  1 -3 -2 -1 -3  5 -2 -1  0 -1  1  2  0 -1 -2 -3 -1 -2  1
L  -1 -4 -1 -4 -3  0 -4 -3  2 -2  4  2 -3 -3 -2 -2 -2 -1  1 -2 -1 -1 -3
M  -1 -3 -1 -3 -2  0 -3 -2  1 -1  2  5 -2 -2  0 -1 -1 -1  1 -1 -1 -1 -2
N  -2  1 -3  1  0 -3  0  1 -3  0 -3 -2  6 -2  0  0  1  0 -3 -4 -1 -2  0
P  -1 -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2  7 -1 -2 -1 -1 -2 -4 -1 -3 -1
Q  -1  0 -3  0  2 -3 -2  0 -3  1 -2  0  0 -1  5  1  0 -1 -2 -2 -1 -1  2
R  -1 -2 -3 -2  0 -3 -2  0 -3  2 -2 -1  0 -2  1  5 -1 -1 -3 -3 -1 -2  0
S   1  0 -1  0  0 -2  0 -1 -2  0 -2 -1  1 -1  0 -1  4  1 -2 -3 -1 -2  0
T   0 -1 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1  0 -1 -1 -1  1  5  0 -2 -1 -2 -1
V   0 -3 -1 -3 -2 -1 -3 -3  3 -2  1  1 -3 -2 -2 -3 -2  0  4 -3 -1 -1 -2
W  -3 -4 -2 -4 -3  1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -2 -3 11 -1  2 -3
X  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
Y  -2 -3 -2 -3 -2  3 -3  2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1  2 -1  7 -2
Z  -1  2 -4  2  5 -3 -2  0 -3  1 -3 -2  0 -1  2  0  0 -1 -2 -3 -1 -2  5

Be aware that this matrix contains rows and columns for three more amino acid code letters than were in your mutation matrix, namely B, X and Z. This should not affect the alignment of your mutated sequences, but it should make your alignment program more general, i.e. able to handle other sequences.
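One way to hold such a matrix in a program is a dictionary keyed by residue pairs. The sketch below is only an illustration: `parse_matrix` and `EXCERPT` are hypothetical names, the input is assumed to be whitespace-separated text with one matrix row per line and the column letters on the first line, and the excerpt holds just the A, B and C entries of the matrix above.

```python
def parse_matrix(text):
    """Build a {(row_letter, column_letter): score} lookup from matrix text.

    Assumes one row per line, whitespace-separated, with the column
    letters on the first line and a row letter at the start of each row.
    """
    rows = [line.split() for line in text.strip().splitlines()]
    letters = rows[0]
    table = {}
    for row in rows[1:]:
        a, values = row[0], row[1:]
        if len(values) != len(letters):
            raise ValueError(
                f"row {a} has {len(values)} scores, expected {len(letters)}")
        for b, value in zip(letters, values):
            table[a, b] = int(value)
    return table


# Three-letter excerpt of the matrix, for illustration only.
EXCERPT = """
   A  B  C
A  4 -2  0
B -2  6 -3
C  0 -3  9
"""

table = parse_matrix(EXCERPT)
```

A lookup such as `table["B", "C"]` then returns the substitution score for that pair. Because every letter in the header becomes a valid key, sequences containing B, X or Z are handled without any special cases.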
A version of this BLOSUM62 amino acid substitution matrix in comma-separated format for copy-and-pasting is at the end of this page.

The basic steps in your program will be to read in the initial sequence and the 500 mutations of it from input. Then use your global alignment procedures to do the following.

First align the initial sequence S0 with itself, computing the score for the global alignment, and the percentage of identities in the alignment. In this case the length of the alignment is the same as the length of the two identical sequences, and the percentage of identities will clearly be 100%.

Then for each of the remaining 500 mutated sequences Si in turn, align S0 with Si, computing the score for the global alignment, and the percentage of identities in the alignment.

The results for each sequence from 0 to 500 should be saved on separate lines of a single output file. The index number of the sequence should precede the scores on each line.

Computing the percentage of identities requires adding two counters, say k and id, to the traceback part of the global alignment algorithm, as shown in the lecture notes. Before starting each alignment traceback, set k = 0 and id = 0. Then at each step in the traceback you increment k, so k ends up being the length of the alignment. Also, at each step in the traceback where a residue a from sequence S0 is matched with a residue b from sequence Si, the counter id is incremented by 1 if a is IDENTICAL to b. The final percentage of identities in the alignment is 100 * (id / k).

Your submission:

A single source file "align.c", "align.java", or "align.py" which should be adequately commented. However, you must ensure that your submitted version of the program compiles under gcc or javac and runs properly on CSE machines. You can use the give system on CSE machines to submit your assignment.
    $ give bi3020 ass1_2 align.c
or
    $ give bi3020 ass1_2 align.java
or
    $ give bi3020 ass1_2 align.py

Inputs and Outputs:

s501   the input file containing the initial sequence plus all 500 mutated sequences in FASTA format
a501   the output file containing the self-alignment score and percentage of identities for the initial sequence plus the alignment scores and percentage of identities for all 500 mutated sequences

In more detail, the output file should contain, on separate lines, the sequence number and the alignment score and the percentage of identities for each sequence. The initial sequence is numbered zero, and its alignment score is for self-alignment. Call this v0. The percentage of identities in the alignment of this sequence with itself is obviously 100%. Call this p0. All mutated sequences are numbered from 1 to 500, and their alignment scores are with respect to the initial sequence. For sequence Si the score is vi and the percentage of identities is pi. The order is the same as that in which they appear in the input file.

The output should form three columns, like this:

N      Score    Percentage Identities
0      v0       p0
1      v1       p1
2      v2       p2
3      v3       p3
...    ...      ...
500    v500     p500

where the first line should contain the header information as shown.

You should assume that the I/O comes from stdin and goes to stdout. For example, the program would be executed as follows:

    $ align < s501 > a501

where 's501' is the file containing the evolved sequences (plus the original) from your evolution simulator implemented in Part 1, and 'a501' is the corresponding output file in the format described above.

Alternatively, you should be able to run both your programs as a pipeline in a Unix shell, like this:

    $ evolve < s001.fasta | align > a501

where 's001.fasta' is a valid FASTA format file, such as the original sequence file assigned to you.

The only input routines you need are to read sequences in FASTA format. If you made your routines modular for part 1 you can re-use them in part 2.
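The input and output handling described above might look roughly like the following Python sketch. The names `read_fasta` and `format_table` are hypothetical, and several details are assumptions rather than requirements from this page: sequences may span several lines, blank lines are ignored, columns are separated by single spaces, and percentages are printed to two decimal places. The reader also rejects input that is not FASTA or that contains letters outside the 20 standard amino acid codes, which this specification asks the program to treat as an error.

```python
import io

# The 20 standard amino acid code letters (no B, X or Z).
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def read_fasta(handle):
    """Read FASTA records from a text stream as (header, sequence) pairs.

    Raises ValueError if the input is not FASTA or a sequence is empty
    or contains a letter outside the 20 standard amino acids.
    """
    records = []
    header, chunks = None, []
    for raw in handle:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("input is not in FASTA format")
        else:
            chunks.append(line)
    if header is None:
        raise ValueError("input is not in FASTA format")
    records.append((header, "".join(chunks)))
    for name, seq in records:
        bad = set(seq) - AMINO_ACIDS
        if not seq or bad:
            raise ValueError(f"invalid sequence for record '{name}'")
    return records


def format_table(results):
    """Format (score, percentage) pairs as the three-column output text."""
    lines = ["N Score Percentage Identities"]
    for number, (score, pct) in enumerate(results):
        lines.append(f"{number} {score} {pct:.2f}")
    return "\n".join(lines)


# Small in-memory demonstration instead of reading stdin.
demo = read_fasta(io.StringIO(">s0\nACD\nEF\n>s1\nACD\n"))
```

In a real program the caller would pass `sys.stdin` to `read_fasta`, print the result of `format_table` to stdout, and turn a `ValueError` into an error message and a non-zero exit status, for instance with `sys.exit(str(err))`.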
You will need to write an output file containing, on each line, the sequence number, the alignment score and the percentage of identities. Your program should report an error and halt if the input is not in FASTA format or a sequence contains anything other than the original 20 amino acid code letters.

Deadline: You must submit part 2 before 23:59:59 on Wednesday August 23, 2017.

A version of the BLOSUM62 amino acid substitution matrix in comma-separated format for copy-and-pasting:

,A,B,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y,Z
A,4,-2,0,-2,-1,-2,0,-2,-1,-1,-1,-1,-2,-1,-1,-1,1,0,0,-3,-1,-2,-1
B,-2,6,-3,6,2,-3,-1,-1,-3,-1,-4,-3,1,-1,0,-2,0,-1,-3,-4,-1,-3,2
C,0,-3,9,-3,-4,-2,-3,-3,-1,-3,-1,-1,-3,-3,-3,-3,-1,-1,-1,-2,-1,-2,-4
D,-2,6,-3,6,2,-3,-1,-1,-3,-1,-4,-3,1,-1,0,-2,0,-1,-3,-4,-1,-3,2
E,-1,2,-4,2,5,-3,-2,0,-3,1,-3,-2,0,-1,2,0,0,-1,-2,-3,-1,-2,5
F,-2,-3,-2,-3,-3,6,-3,-1,0,-3,0,0,-3,-4,-3,-3,-2,-2,-1,1,-1,3,-3
G,0,-1,-3,-1,-2,-3,6,-2,-4,-2,-4,-3,0,-2,-2,-2,0,-2,-3,-2,-1,-3,-2
H,-2,-1,-3,-1,0,-1,-2,8,-3,-1,-3,-2,1,-2,0,0,-1,-2,-3,-2,-1,2,0
I,-1,-3,-1,-3,-3,0,-4,-3,4,-3,2,1,-3,-3,-3,-3,-2,-1,3,-3,-1,-1,-3
K,-1,-1,-3,-1,1,-3,-2,-1,-3,5,-2,-1,0,-1,1,2,0,-1,-2,-3,-1,-2,1
L,-1,-4,-1,-4,-3,0,-4,-3,2,-2,4,2,-3,-3,-2,-2,-2,-1,1,-2,-1,-1,-3
M,-1,-3,-1,-3,-2,0,-3,-2,1,-1,2,5,-2,-2,0,-1,-1,-1,1,-1,-1,-1,-2
N,-2,1,-3,1,0,-3,0,1,-3,0,-3,-2,6,-2,0,0,1,0,-3,-4,-1,-2,0
P,-1,-1,-3,-1,-1,-4,-2,-2,-3,-1,-3,-2,-2,7,-1,-2,-1,-1,-2,-4,-1,-3,-1
Q,-1,0,-3,0,2,-3,-2,0,-3,1,-2,0,0,-1,5,1,0,-1,-2,-2,-1,-1,2
R,-1,-2,-3,-2,0,-3,-2,0,-3,2,-2,-1,0,-2,1,5,-1,-1,-3,-3,-1,-2,0
S,1,0,-1,0,0,-2,0,-1,-2,0,-2,-1,1,-1,0,-1,4,1,-2,-3,-1,-2,0
T,0,-1,-1,-1,-1,-2,-2,-2,-1,-1,-1,-1,0,-1,-1,-1,1,5,0,-2,-1,-2,-1
V,0,-3,-1,-3,-2,-1,-3,-3,3,-2,1,1,-3,-2,-2,-3,-2,0,4,-3,-1,-1,-2
W,-3,-4,-2,-4,-3,1,-2,-2,-3,-3,-2,-1,-4,-4,-2,-3,-3,-2,-3,11,-1,2,-3
X,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
Y,-2,-3,-2,-3,-2,3,-3,2,-1,-2,-1,-1,-2,-3,-1,-2,-2,-2,-1,2,-1,7,-2
Z,-1,2,-4,2,5,-3,-2,0,-3,1,-3,-2,0,-1,2,0,0,-1,-2,-3,-1,-2,5

Part 3

To be announced

Last modified Thursday 17 August 15:50:26 AEST 2017