A blog about my experiences with bioinformatics, operating systems, and random other technologies and bits.

Wednesday, March 28, 2012

Biological Alignment Software, Used for Finding Common Phrases

Recently I was asked to compare a second manuscript to one we'd previously written on a very similar subject, to make sure there wasn't too much "copy and paste". While this could probably be done by eye, I like to do things the easy (hard? but at least fun) way, so I wondered if diff or some other UNIX utility could help me out. Apparently not, or at least not easily. The next thing to try in my repertoire was something from the massive array of biological sequence alignment software.

While there are probably better (and more expensive) options for finding similarities between documents, built primarily for detecting plagiarism, I wondered if local alignment software could also do the job. Here's an example using fairly lenient gap penalties with the 'water' program from EMBOSS.
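For anyone unfamiliar with local alignment, here's a minimal Python sketch of the Smith-Waterman idea that water is built on. It is not EMBOSS's code: it assumes a single linear gap penalty rather than water's separate gap-open and gap-extend penalties, and the function name and default scores are made up for illustration.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one best local alignment.

    Illustrative only: linear gap penalty, simple match/mismatch scoring.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the scoring table; a cell never drops below zero, which is
    # what makes the alignment local rather than global.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

Calling `local_align("XXHELLOYY", "ZZHELLOWW")` returns `(10, 'HELLO', 'HELLO')`: the shared phrase is picked out and the unrelated flanking text is ignored, which is exactly the behaviour wanted for spotting copied passages. The table needs time and memory proportional to the product of the two lengths, so this toy version is only practical for short inputs.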

Using a simple script to convert two text files to two FASTA files for input into your aligner of choice, I got this output from water:

I also tried using an identity matrix over the English alphabet as the substitution matrix, but it didn't seem to matter much. Because some characters have special meanings and the aligners I've tested so far only accept English alphabet characters, it would be best to edit the source code of the alignment package for more serious use. Also, water was somewhat slow on this alignment, so a heuristic package like SOAP or BLAST might be better for long documents.
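The text-to-FASTA conversion mentioned above, combined with the letters-only restriction, could look something like the sketch below. This is not my original script, just one way to do it: the function name, the 60-character line width, and the choice to drop everything except A-Z (spaces and punctuation included) are all assumptions.

```python
import re


def text_to_fasta(text, name, width=60):
    """Return a FASTA record containing only the letters A-Z from `text`.

    Assumption: digits, whitespace, and punctuation are simply discarded,
    and letters are upper-cased so the aligner sees one plain alphabet.
    """
    letters = re.sub(r"[^A-Za-z]", "", text).upper()
    # Wrap the sequence into fixed-width lines, as FASTA files usually are.
    lines = [letters[i:i + width] for i in range(0, len(letters), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"
```

For example, `text_to_fasta("Hello, world! 42", "doc1")` gives a record with the header `>doc1` and the sequence `HELLOWORLD`. Writing one record per manuscript to its own file produces the two inputs, which can then be passed to water through its `-asequence` and `-bsequence` options. Dropping the spaces loses word boundaries, so matches can start or end mid-word.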