Project 1: A Python Proteome Analysis Pipeline
This project is designed to be completed by a group of 4 people, but a group of 3 should still be able to manage it comfortably with some help. The project is divided into a series of Tasks, each of which requires either developing a simple Python program or, for groups smaller than 4, running the code supplied. Each Task leads into, or follows on from, another, so it is essential that you think carefully about the file formats your programs will need to handle, and that you draw up sensible specifications before you all start coding. You are expected to produce standalone Python programs (i.e. not Jupyter notebooks) as *.py files that can be run from the Unix command line. You can develop your code in whatever environment you choose, but CoCalc is a fine choice and is where we will test your programs for assessment purposes.
Overview: You will be given a newly sequenced genome from a bacterial organism. An experimental group working on this organism wishes to carry out some proteomics experiments to identify proteins from the microbe. To do this, they need a set of proteins in a FASTA-formatted file to search their mass spectra against. They then want to know which digestive enzyme is best to use, so that they have the best chance of identifying as many proteins as possible. In other words, they want to find the enzyme which produces the lowest number (on average) of peptides per protein, or indeed one which can unambiguously identify a protein. If a peptide mass is rare, or ideally totally unique across a proteome (given the accuracy of the instrument), it will effectively be more diagnostic of the parent protein. You can assume that their mass spectrometer only measures peptide mass-to-charge values (effectively, their mass) between 1000 and 1500 Da.
Task 1 – An ORF finder
You will need to write a Python program which can read in a FASTA format file containing nucleotide sequence (DNA sequence, containing ACGT) and find all the possible open reading frames (known as ORFs), each beginning with a start codon (ATG) and ending with a stop codon (TGA, TAA or TAG). You can assume that the longest ORF will be used, so you do not have to worry about any internal ones. You do have to consider all 6 reading frames, though, and predict ORFs in each of them. Your program should accept standard FASTA format as input, of the form:
>Genome sequence
AATTCAGTTACTTATTCCCCTTATTGGCAGTATTAACGCATAGGACTGCG
ATAGACTTTACTAAGCTCAGTATCATTCCCTTATTGGCATAGTTACGCAT
CCACCAATAGACTTTACTATTCCCTTCATGGCATATTTACGCAGTAGGAC
ATAGACT
It should output protein ORFs in the following FASTA format:
>orf_name frame length start
MAKSKDFPVKADFAAHVAIQSEFAHGV
>orf_name frame length start
MHILDECCAKSKDFPDFAAHVAYIQSEFAA
LLILPQWNSDAAHGV
orf_name should be a unique name for the ORF, which might include the name of the genome/organism, the frame in which the ORF is found, and a number. For example, if the organism is called I.claudius, you could call your ORFs something like CLAUD_F1_0017 (for the 17th ORF in frame 1 in I.claudius).
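As an illustration of the header layout above, here is a minimal Python sketch that builds a name and a FASTA record. The function names, the four-digit zero padding, the 60-character line width, and the reading of "length" as the number of amino acids are all assumptions, not requirements of the brief.

```python
def orf_name(prefix, frame, index):
    """Build a unique ORF name such as CLAUD_F1_0017 (the padding width is an assumption)."""
    return f"{prefix}_F{frame}_{index:04d}"


def format_orf(name, frame, start, protein, width=60):
    """Return one FASTA record with the header '>orf_name frame length start'.

    'length' is taken here to be the protein length in amino acids (an assumption).
    """
    header = f">{name} {frame} {len(protein)} {start}"
    # Wrap the protein sequence into lines of at most `width` residues.
    lines = [protein[i:i + width] for i in range(0, len(protein), width)]
    return "\n".join([header] + lines)
```

For example, print(format_orf(orf_name("CLAUD", 1, 17), 1, 120, "MAKSKDFPVK")) writes one record to standard output.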
Output should be written to standard output.
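One possible way to scan all six frames is sketched below in Python. It is only a sketch under stated assumptions: the names find_orfs and reverse_complement are illustrative, frames are numbered 1-3 on the given strand and 4-6 on its reverse complement, and the start position is reported 1-based on whichever strand was scanned. None of these conventions is fixed by the brief, so agree them within your group. The first ATG after a stop codon opens the ORF and later in-frame ATGs are ignored, which gives the longest ORF for each stop.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq):
    """Yield (frame, start, orf_dna) for every ATG...stop ORF in all six frames.

    orf_dna runs from the start codon up to, but not including, the stop codon.
    start is 1-based on the strand that was scanned (an assumed convention).
    """
    seq = seq.upper()
    for strand_index, strand in enumerate((seq, reverse_complement(seq))):
        for offset in range(3):
            frame = strand_index * 3 + offset + 1  # 1-3 forward, 4-6 reverse
            start = None
            for pos in range(offset, len(strand) - 2, 3):
                codon = strand[pos:pos + 3]
                if start is None:
                    if codon == "ATG":
                        start = pos  # first ATG after a stop: longest ORF
                elif codon in STOPS:
                    yield frame, start + 1, strand[start:pos]
                    start = None
```

ORFs that run off the end of the sequence without reaching a stop codon are not reported by this sketch; whether to keep them is a design decision for your specification.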
Please note: like most genomes, this one contains a few ambiguously called bases, which are given the letter ‘N’ (or ‘n’) instead of A, C, G or T. Obviously, in most cases these cannot be formally translated into an amino acid. I suggest you ignore all ORFs that contain a codon with an ‘N’ in it.
You should also think carefully about the minimum size of an ORF. I suggest you add an option which controls this, and I would use 50 amino acids (= 150 nucleotides) as the minimum size, or perhaps even 100 amino acids (= 300 nucleotides). You might like to do some research into the typical size of bacterial genes/proteins.
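The suggestion to drop ORFs containing an ambiguous codon could be handled at translation time. The sketch below is one possible approach, with illustrative names: it builds the standard codon table and returns None for any ORF that has a codon outside that table, which includes every codon containing ‘N’. It rejects such ORFs even when the ambiguous base would not change the amino acid.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(orf_dna):
    """Translate an ORF (without its stop codon) into protein.

    Returns None if any codon is not made purely of A, C, G, T,
    so that ORFs with ambiguous bases can be skipped by the caller.
    """
    protein = []
    for pos in range(0, len(orf_dna) - 2, 3):
        codon = orf_dna[pos:pos + 3].upper()
        amino_acid = CODON_TABLE.get(codon)
        if amino_acid is None:
            return None  # ambiguous or unknown base in this codon
        protein.append(amino_acid)
    return "".join(protein)
```

A caller would then simply skip any ORF for which translate returns None.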
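A minimum-size option might look like the following Python sketch using argparse from the standard library. The flag name -m/--min-length, the default of 50, and the helper names are assumptions chosen for illustration; the brief only asks that such an option exists.

```python
import argparse


def build_parser():
    """Create a parser with a hypothetical minimum-ORF-length option."""
    parser = argparse.ArgumentParser(description="ORF finder (sketch)")
    parser.add_argument(
        "-m", "--min-length", type=int, default=50,
        help="minimum ORF length in amino acids (default: 50)",
    )
    return parser


def long_enough(protein, min_length):
    """Return True if the protein has at least min_length amino acids."""
    return len(protein) >= min_length


def min_nucleotides(min_length):
    """Convert a length in amino acids to nucleotides (3 per codon)."""
    return 3 * min_length
```

With this in place, running the program with -m 100 would keep only ORFs of at least 100 amino acids, i.e. 300 nucleotides excluding the stop codon.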
Bonus tasks:
Error checking: spot improper formats, unknown amino acids, and correctly deal with “gaps” (there normally won’t be any of course).
Add some additional command-line options, such as the ability to write to a specified output file instead of standard output with -o filename, or to restrict prediction to just one frame with -f frame_number. Deal with more than one sequence, and with more than one file supplied on the command line as arguments.
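The bonus options could be wired up roughly as in the Python sketch below. The -o and -f flags come from the task description; the long option names, the function names, and the choice to raise ValueError on malformed input are assumptions. The reader yields every record in a file, so multi-sequence files and several input files can be handled by looping.

```python
import argparse


def read_fasta(lines):
    """Yield (header, sequence) for each record in an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def parse_args(argv=None):
    """Parse the hypothetical command line: files plus -o and -f options."""
    parser = argparse.ArgumentParser(description="ORF finder (sketch)")
    parser.add_argument("fasta_files", nargs="+", help="one or more FASTA files")
    parser.add_argument("-o", "--output", default=None,
                        help="write ORFs to this file instead of standard output")
    parser.add_argument("-f", "--frame", type=int, choices=range(1, 7),
                        default=None, help="restrict prediction to one frame (1-6)")
    return parser.parse_args(argv)
```

A main function would then open each path in args.fasta_files, pass the open file to read_fasta, and write results either to sys.stdout or to the file named by args.output.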