Bioinformatics
Code inspired by Bioinformatics Algorithms: an Active Learning Approach and by Rosalind.
NB: functions generally use zero based indexing; Rosalind uses 1-based.
I have written this code in Python 3, generally with the most recent release at the time each module was written-- Python 3.11.1 (tags/v3.11.1:a7a450f, Dec 6 2022, 19:58:39) at the time this file was updated. I have used newly added features whenever they might make me more productive. I've been writing code since 1968, during which time computer have got faster, but my brain hasn't. You are welcome to use this code if it is useful to you, subject to the terms of the Licence, but it may need to be adapted if you use an earlier version of Python.
Bioinformatics Algorithms TextBook Track
1. Where in the Genome does DNA replication begin?
2. Which DNA Patterns Play the Role of Molecular Clocks?
# | Location | Description |
---|---|---|
BA2A | sequence.py | Implement MotifEnumeration |
BA2B | sequence.py | Find a Median String |
BA2C | sequence.py | Find a Profile-most Probable k-mer in a String |
BA2D | sequence.py | Implement GreedyMotifSearch |
BA2E | sequence.py | Implement GreedyMotifSearch with Pseudocounts |
BA2F | sequence.py | Implement RandomizedMotifSearch |
BA2G | sequence.py | Implement GibbsSampler |
BA2H | sequence.py | Implement DistanceBetweenPatternAndStrings |
3. How Do We Assemble Genomes? See also Genome Assembly
# | Location | Description |
---|---|---|
BA3A | assemble.py | Generate the k-mer Composition of a String |
BA3B | assemble.py | Reconstruct a String from its Genome Path |
BA3C | assemble.py | Construct the Overlap Graph of a Collection of k-mers |
BA3D | assemble.py | Construct the De Bruijn Graph of a String |
BA3E | assemble.py | Construct the De Bruijn Graph of a Collection of k-mers |
BA3F | assemble.py | Find an Eulerian Cycle in a Graph |
BA3G | assemble.py | Find an Eulerian Path in a Graph |
BA3H | assemble.py | Reconstruct a String from its k-mer Composition |
BA3I | assemble.py | Find a k-Universal Circular String |
BA3J | assemble.py | Reconstruct a String from its Paired Composition |
asmq | ASMQ.py | Assessing Assembly Quality with N50 and N75 |
corr | CORR.py | Error Correction in Reads |
gasm | GASM.py | Genome Assembly Using Reads |
grep | GREP.py | Genome Assembly with Perfect Coverage and Repeats |
KMER | assemble.py | Generalizing GC-Content |
long | LONG.py | Genome Assembly as Shortest Superstring |
pcov | PCOV.py | Genome Assembly with Perfect Coverage |
4. How Do We Sequence Antibiotics? See also Computational Mass Spectrometry
5. How do we compare Biological Sequences? See also Alignment
6. Are there fragile regions in the human genome?
# | Location | Description |
---|---|---|
BA6A | BA6A.py fragile.py | Implement GreedySorting to Sort a Permutation by Reversals |
BA6B | BA6B.py fragile.py | Compute the Number of Breakpoints in a Permutation |
BA6C | BA6C.py fragile.py | Compute the 2-Break Distance Between a Pair of Genomes |
BA6D | BA6D.py | Find a Shortest Transformation of One Genome into Another by 2-Breaks WIP |
BA6E | fragile.py | Find All Shared k-mers of a Pair of Strings |
BA6F | BA6F.py fragile.py | Implement Chromosome to Cycle |
BA6G | BA6G.py fragile.py | Implement Cycle to Chromosome |
BA6H | BA6H.py fragile.py | Implement Coloured Edges |
BA6I | BA6I.py fragile.py | Implement GraphToGenome |
BA6J | BA6J.py fragile.py | Implement 2-BreakOnGenomeGraph |
BA6K | BA6K.py fragile.py | Implement 2-BreakOnGenome |
7. Which animal gave us SARS? See also Phylogeny
8. How did Yeast become a Winemaker?
# | Location | Description |
---|---|---|
BA8A | BA8A.py | Implement FarthestFirstTraversal |
BA8B | BA8B.py | Compute the Squared Error Distortion |
BA8C | BA8C.py | Implement the Lloyd Algorithm for k-Means Clustering |
BA8D | BA8D.py | Implement the Soft k-Means Clustering Algorithm |
BA8E | BA8E.py | Implement Hierarchical Clustering |
9. How do we locate disease causing mutations?
10. Why have biologists still not developed an HIV vaccine?
# | Location | Description |
---|---|---|
BA10A | BA10A.py hmm.py | Compute the Probability of a Hidden Path |
BA10B | BA10B.py hmm.py | Compute the Probability of an Outcome Given a Hidden Path |
BA10C | BA10C.py hmm.py | Implement the Viterbi Algorithm |
BA10D | BA10D.py hmm.py | Compute the Probability of a String Emitted by an HMM |
BA10E | BA10E.py hmm.py | Construct a Profile HMM |
BA10F | BA10F.py hmm.py | Construct a Profile HMM with Pseudocounts |
BA10G | BA10G.py hmm.py | Perform a Multiple Sequence Alignment with a Profile HMM WIP |
BA10H | BA10H.py hmm.py | Estimate the Parameters of an HMM |
BA10I | BA10I.py hmm.py | Implement Viterbi Learning |
BA10J | BA10J.py hmm.py | Solve the Soft Decoding Problem |
BA10K | BA10K.py hmm.py | Implement Baum-Welch Learning |
11. Was T.rex just a big chicken?
# | Location | Description |
---|---|---|
BA11A | BA11A.py spectrum.py | Spectrum Graph Construction |
BA11B | BA11B.py spectrum.py | Implement DecodingIdealSpectrum |
BA11C | BA11C.py spectrum.py | Convert a Peptide into a Peptide Vector |
BA11D | BA11D.py spectrum.py | Convert a Peptide Vector into a Peptide |
BA11E | BA11E.py spectrum.py | Sequence a Peptide WIP |
BA11F | BA11F.py spectrum.py | Find a Highest-Scoring Peptide in a Proteome against a Spectrum WIP |
BA11G | BA11G.py spectrum.py | Implement PSMSearch WIP |
BA11H | BA11H.py spectrum.py | Compute the Size of a Spectral Dictionary WIP |
BA11I | BA11I.py spectrum.py | Compute the Probability of a Spectral Dictionary WIP |
BA11J | BA11J.py spectrum.py | Find a Highest-Scoring Modified Peptide against a Spectrum WIP |
Combinatorics
# | Location | Description |
---|---|---|
combinatorics.py | Combinatorics | |
aspc | rosalind.py | Introduction to Alternative Splicing |
cat | cat.py combinatorics.py | Catalan Numbers and RNA Secondary Structures |
cntq | cntq.py | Counting Quartets |
cunr | phylogeny.py | Counting Unrooted Binary Trees |
eubt | eubt.py phylogeny.py | Enumerating Unrooted Binary Trees |
fibd | rosalind.py | Mortal Fibonacci Rabbits |
fib | rosalind.py | Rabbits and Recurrence Relations |
motz | motz.py combinatorics.py | Motzkin Numbers and RNA Secondary Structures |
mrep | mrep.py | Identifying Maximal Repeats |
mrna | rosalind.py | Inferring mRNA from Protein |
orf | orf.py | Open Reading Frames |
pmch | pmch.py | Perfect Matchings and RNA Secondary Structures |
pper | rosalind.py | Partial Permutations |
rnas | rnas.py | Wobble Bonding and RNA Secondary Structures WIP |
root | phylogeny.py | Counting Rooted Binary Trees |
Genome Rearrangements
# | Location | Description |
---|---|---|
lgis | lgis.py | Longest Increasing Subsequence |
rear | rear.py fragile.py | Reversal distance |
sort | sort.py fragile.py | Sorting by Reversals |
Graphs & Graph Algorithms
# | Location | Description |
---|---|---|
2sat | sat.py graphs.py | 2-Satisfiability |
bf | bf.py graphs.py | Bellman-Ford Algorithm |
bfs | bfs.py graphs.py | Breadth First search |
bins | BINS.py | Binary Search |
bip | bip.py graphs.py | Testing Bipartiteness |
cc | cc.py graphs.py | Connected Components |
cte | cte.py graphs.py | Shortest Cycle Through a Given Edge |
dag | dag.py graphs.py | Testing Acyclicity |
ddeg | DDEG.py graphs.py | Double-Degree Array |
deg | DEG.py graphs.py | Degree Array |
dfs | graphs.py | Depth First Search ??? |
dij | dij.py graphs.py | Dijkstra's Algorithm |
grph | graphs.py | Overlap Graphs |
gs | gs.py graphs.py | General Sink |
hdag | hdag.py graphs.py | Hamiltonian Path in DAG |
lrep | lrep.py | Finding the Longest Multiple Repeat WIP |
nwc | nwc.py graphs.py | Negative Weight Cycle |
scc | scc.py graphs.py | Strongly Connected Components |
sc | sc.py graphs.py | Semi-Connected Graph |
sdag | sdag.py graphs.py | Shortest Paths in DAG |
sq | sq.py graphs.py | Square in a Graph |
suff | suff.py graphs.py | Creating a Suffix Tree |
tree | tree.py phylogeny.py | Completing a Tree |
trie | trie.py snp.py | Introduction to Pattern Matching |
ts | ts.py align.py | Topological sort |
Heredity
# | Location | Description |
---|---|---|
iev | rosalind.py | Calculating Expected Offspring |
indq | rosalind.py | Independent Segregation of Chromosomes |
iprb | rosalind.py | Mendel's First Law |
lia | rosalind.py | Independent Alleles |
mend | mend.py phylogeny.py | Inferring a Genotype from a Pedigree |
sexl | rosalind.py | Sex-Linked Inheritance |
Population Dynamics
# | Location | Description |
---|---|---|
afrq | rosalind.py | Counting Disease Carriers |
foun | rosalind.py | The Founder Effect and Genetic Drift |
wfmd | rosalind.py | The Wright-Fisher Model of Genetic Drift |
Probability
# | Location | Description |
---|---|---|
ebin | rosalind.py | Wright-Fisher's Expected Behavior |
eval | rosalind.py | Expected Number of Restriction Sites |
mend | mend.py phylogeny.py | Inferring a Genotype from a Pedigree |
rstr | rosalind.py | Matching Random Motifs |
Sequence Analysis
# | Location | Description |
---|---|---|
frmt | FRMT.py | Data Formats |
gbk | GBK.py | GenBank Introduction |
orfr | orfr.py | Finding Genes with Open Reading Frames |
ptra | PTRA.py | Protein Translation |
Sorting
# | Location | Description |
---|---|---|
2sum | 2sum.py | 2SUM |
3sum | 3sum.py | 3SUM |
ins | INS.py | Insertion Sort |
inv | INV.py | Counting Inversions |
med | MED.py | Median |
mer | MER.py | Merge Two Sorted Arrays |
mprt | MPRT.py | Finding a Protein Motif |
ms | MS.py | Mergesort |
par | PAR.py | 2-Way Partition |
par3 | PAR3.py | 3-Way Partition |
ps | ps.py | Partial sort |
qs | QS.py | Quicksort |
Strings & String Algorithms
# | Location | Description |
---|---|---|
cons | rosalind.py | Consensus and Profile |
dna | dna.py rosalind.py | Counting DNA Nucleotides |
gc | rosalind.py | Computing GC Content |
itwv | itwv.py align.py | Finding Disjoint Motifs in a Gene |
kmer | rosalind.py | Generalizing GC-Content |
kmp | kmp.py | Speeding Up Motif Finding |
ksim | ksim.py | Finding All Similar Motifs WIP |
lcsm | rosalind.py | Finding a Shared Motif |
lcsq | lcsq.py | Finding a Shared Spliced Motif |
ling | ling.py | Linguistic Complexity of a Genome |
ukkonen.py | ยตTuX's implementation of Ukkonen's algorithm | |
mmch | mmch.py | Maximum Matchings and RNA Secondary Structures |
prot | spectrum.py | Translating RNA into Protein |
revc | rosalind.py | Complementing a Strand of DNA |
revp | revp.py rosalind.py | Locating Restriction Sites |
rna | rna.py rosalind.py | Transcribing DNA into RNA |
scsp | scsp.py | Interleaving Two Motifs |
splc | spectrum.py | RNA Splicing |
sseq | rosalind.py | Finding a Spliced Motif |
subs | rosalind.py | Finding a Motif in DNA |
Set Theory
# | Location | Description |
---|---|---|
pdpl | pdpl | Creating a Restriction Map |
seto | - | Set operations Solved using Python shell |
sset | - | Counting Subsets Solved with a single line in Python shell: 2**n%1000000 |
Other Problems
# | Location | Description |
---|---|---|
fibo | fibo.py | Fibonacci numbers |
C++ code
Location | Header | Description |
---|---|---|
Makefile | for C++ code | |
newick.cpp | newick.h | Parser for files in Newick format |
qrtd.cpp | qrtd.h | Quartet Distance WIP |
test-newick.cpp | Test Newick Parser | |
test-qrtd.cpp | Test Quartet Distance WIP | |
test-tree.cpp | Tests for tree.cpp | |
tests.cpp | catch.hpp | C++ test framework |
tree.cpp | tree.h | Code representing a phylogeny |
Support
Location | Description |
---|---|
bioinformatics.bib | Bibliography |
findtests.py | Generates shell script to execute all tests |
helpers.py | Utilities for formatting output, parsing input, etc |
LICENSE | License Agreement |
newick.py | Parser for files in Newick format |
README.md | This file |
rosalind.py | Shared code |
rosalind.wpr | WingWare Project File |
template.py | Template for getting started on a problem |