mILD: matrix-based ILD analyses

Paul J. Planet, MD, PhD
Indra Neil Sarkar, PhD

Division of Invertebrate Zoology
American Museum of Natural History

Contact: sarkar@amnh.org

-----------------------------------------------------
BEFORE USING mILD, PLEASE READ THIS ENTIRE DOCUMENT!
ESPECIALLY THE "RUNNING mILD" SECTION, WHICH CONTAINS
PARTICULARS ABOUT HOW TO PROPERLY FORMAT AND ANNOTATE
THE INPUT FILES FOR USE BY mILD.
-----------------------------------------------------

**** REQUIREMENTS

mILD requires the command-line version of PAUP* to be installed on your OS X or UNIX machine. mILD is not designed to work with either the PC or Mac OS 9 versions of PAUP*. For purchase and support information for PAUP*, visit:

http://paup.csit.fsu.edu/

It is STRONGLY recommended that you install the command-line version of PAUP* and create a soft link to it in /usr/local/bin/. By default, mILD expects to find PAUP* in the /usr/local/bin/ directory under the name "paup". To verify that your machine is properly set up, type "/usr/local/bin/paup" at the command line; you should enter the PAUP* interface. If you do not, ask your system administrator how to set this up.

Alternatively, if PAUP* is installed correctly on your machine and you do not wish to create a soft link at /usr/local/bin/paup, you can give the location and name of the PAUP* executable directly in the command-line arguments for mILD. The argument name is "paupCmd". See below for a detailed list of the command-line arguments that you can use with mILD.

**** INSTALLATION

mILD is distributed in two ways: an OS X installer and a UNIX Source Code Distribution. This text file is distributed only in the UNIX Source Code Distribution.

OS X Installer
-- Uncompress the archive (should happen automatically during download)
-- Double click the installer package, "mILD_OSX"
-- The installer will install the mILD script into the /usr/bin directory.
   This is done as the default installation because every OS X machine has this directory, and it is configured to be accessible from the command line.
--- If you would like to install the script into another universally accessible location, you should use the UNIX Source Code Distribution, described below.

UNIX Source Code Distribution
-- After you have uncompressed the archive (e.g., using the command "tar xvfz mILD_UNIX.tgz"), you should see two files: "README" (this file) and "mILD" (the mILD script)
-- Copy the mILD file into a location that is universally accessible (e.g., /usr/local/bin or /usr/bin)

**** RUNNING mILD

>> SETTING UP PARTITION FILES

mILD is designed to take a set of FASTA or PIR formatted files that are all contained in a single directory. Be sure that all the files in the directory are named correctly:

--> Every gene partition for the mILD analysis needs to be contained in its own file. FILES SHOULD BE NAMED WITH ONLY A SINGLE EXTENSION. For example: "parA.txt", "parA.faa", or "parA.pir" are valid file names; "parA.1.txt", "parA.faa.1", or "parA.faa.pir" are NOT valid file names.
--> File names should follow bacterial gene nomenclature rules. That is, lower-case first letters followed by numerals or capital letters (e.g., "parA" is a valid name; "PARA" is not).
--> File names CANNOT begin with a capital letter.
--> File names CANNOT begin with the letter "P" followed by a number; this prefix is reserved by mILD for combined partitions.
--> File names CANNOT contain either '_' or '!'.
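
The naming rules above can be checked before an analysis is started. The following is a minimal sketch in Python and is not part of mILD; the function name "partition_name_problems" is hypothetical, and the regular expression is only one reading of the nomenclature rule (lower-case letters first, then optional capital letters or digits). It also assumes that a name with no extension at all should be flagged.

```python
import re

# Assumed reading of the nomenclature rule: lower-case letters first,
# then optional capital letters or digits (e.g. "parA").
GENE_NAME = re.compile(r"^[a-z]+[A-Z0-9]*$")


def partition_name_problems(filename):
    """Return a list of naming-rule violations for one partition file name.

    An empty list means the name passes every check in this sketch.
    """
    problems = []
    parts = filename.split(".")

    # Exactly one extension: "parA.txt" passes, "parA.1.txt" does not.
    if len(parts) != 2 or not parts[0] or not parts[1]:
        problems.append("must have exactly one extension")

    stem = parts[0]
    if stem[:1].isupper():
        problems.append("must not begin with a capital letter")

    # Already covered by the capital-letter rule, but reported separately
    # because the README lists it as its own rule.
    if re.match(r"P[0-9]", stem):
        problems.append('"P" followed by a number is reserved for combined partitions')

    if "_" in filename or "!" in filename:
        problems.append("must not contain '_' or '!'")

    # Only report the softer nomenclature rule if nothing else failed.
    if not problems and not GENE_NAME.match(stem):
        problems.append('does not follow gene nomenclature (e.g. "parA")')

    return problems
```

Calling the function on every entry returned by os.listdir() for the partition directory and printing any non-empty result is one way to catch naming problems before PAUP* is started.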
The comment lines in either the PIR or FASTA format should be in the following format:

>P1;000_111_Genus_Species_Strain_Comments

-- 'P1;' is PIR specific, and is not required by mILD
-- '000' is a "set number", which is used by other GenomeCurator applications; it is not used by mILD
-- '111' is a "gene number", which is used by other GenomeCurator applications; it is not used by mILD
-- 'Genus' is the genus name associated with the sequence; spaces are not permitted
-- 'Species' is the species name associated with the sequence; spaces are not permitted
-- 'Comments' can be anything that you specify

Take note that all fields in the comment line are separated by underscores ("_"). Fields that are not used by mILD can be left blank. For example:

>__Genus_Species_Strain_

is a valid sequence name. Genus and Species SHOULD NOT BE LEFT BLANK, as this is the information that mILD uses to keep track of sequences throughout the analysis. GENUS AND SPECIES NAMES FOR SEQUENCES ACROSS PARTITIONS MUST BE CONSISTENT, INCLUDING CAPITALIZATION.

Duplicates are permitted within individual partition files; however, only the longest non-gapped sequence will be used by mILD.

You may download a set of correctly annotated example files from the mILD website.

>> USING mILD

mILD runs from the command line and interacts directly with PAUP*. It is set to show the PAUP* interface while processing, so that you can monitor mILD's progress and enter any information required by PAUP* during execution.

To begin an analysis, you must be in the parent directory of the directory that contains the individual partition files. For example, if your files are contained in "~/mILD/analysis1/", you must be in "~/mILD/" before running mILD.
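
To make the comment-line format and the duplicate rule described under SETTING UP PARTITION FILES more concrete, here is a minimal sketch in Python. It does not show how mILD is implemented. The function names are hypothetical, and two assumptions are made: duplicates are identified by the Genus and Species fields together, and "-" is the gap character.

```python
def parse_comment_line(line):
    """Split '>P1;000_111_Genus_Species_Strain_Comments' into named fields."""
    text = line.strip().lstrip(">")
    if text.startswith("P1;"):           # optional PIR-specific prefix
        text = text[len("P1;"):]
    # Comments may contain underscores themselves, so split at most 5 times.
    fields = text.split("_", 5)
    fields += [""] * (6 - len(fields))   # pad any missing trailing fields
    keys = ("set", "gene", "genus", "species", "strain", "comments")
    record = dict(zip(keys, fields))
    if not record["genus"] or not record["species"]:
        raise ValueError("Genus and Species must not be blank: " + line.strip())
    return record


def ungapped_length(sequence, gap_chars="-"):
    """Count the characters that are not gaps (gap character is an assumption)."""
    return sum(1 for ch in sequence if ch not in gap_chars)


def keep_longest(records):
    """Keep one sequence per (genus, species): the longest once gaps are removed.

    'records' is an iterable of (comment_line, sequence) pairs from one
    partition file. Keying on genus and species is an assumption of this sketch.
    """
    best = {}
    for header, seq in records:
        info = parse_comment_line(header)
        key = (info["genus"], info["species"])
        if key not in best or ungapped_length(seq) > ungapped_length(best[key][1]):
            best[key] = (header, seq)
    return best
```

Because the key is case-sensitive, a sequence labelled "escherichia" in one partition and "Escherichia" in another would be treated as two different taxa, which is why capitalization has to be consistent across partitions.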
mILD is currently designed to perform two types of tests:

-- snowball ILD (this can be done with either a partition count or a character count based combination criterion)
-- jackknife ILD, based on LeCointre's taxon jackknife

You need to specify, through command-line arguments, which test you want to run. It is possible to perform both a jackknife and a snowball analysis, in which case the jackknife will be done first, followed by the snowball analysis. Additionally, you need to specify the directory that contains the individual partition files.

The basic syntax to run the mILD jackknife is:

   mILD -dir=[partition file directory] -jk

where,
   [partition file directory] = name of the directory that contains the individual partition files

Similarly, the basic syntax to run the mILD snowballing techniques is:

   1. mILD -dir=[partition file directory] -s=0
   2. mILD -dir=[partition file directory] -s=1

where,
   -s=[0|1] tells mILD to run either the partition count (0) or character count (1) version of the snowballing technique, shown in lines 1 and 2 above, respectively.

Here is a complete set of command line parameters:

mILD -dir=[partition file directory]
     -s=[0]                          ==> run snowball ILD with either partition count criterion (0) or character count criterion (1)
     -jk                             ==> run LeCointre Jackknife
     -cutoff=[0.01]                  ==> congruency P-Value cutoff
     -ildreps=[100]                  ==> number of repetitions per ILD test
     -hreps=[10]                     ==> number of repetitions per hsearch
     -autoinc                        ==> set to auto increase max trees
     -paupCmd=[/usr/local/bin/paup]  ==> the path and name of the PAUP* executable

This list of command line arguments is printed automatically when you type "mILD" with no arguments.

The results from the mILD analysis are stored in a folder that is named after your input directory. mILD WILL OVERWRITE ANY PREVIOUS ANALYSIS THAT MAY EXIST IN A DIRECTORY WITH THE SAME NAME!!!
Therefore, if you wish to protect the results of an analysis, it is strongly recommended that you rename the created directory to something new.

If you run a snowball analysis with the partition count criterion (0) on a directory called analysis1:

   mILD -dir=analysis1 -s=0

mILD will create a directory called "analysis1_mILD". Depending on the analyses that you run, different results files will be present.

-- The matrix built from the input files is stored in a NEXUS file named after the partition file directory, for example "_analysis1.nex"
-- Jackknife results are stored in a file called "_jackknifeResults.txt"
-- Snowball results are stored in a file called "_snowballResults.txt"
-- A log file of the mILD analysis is stored in a file called "_logFile.txt"

The names of the result files all begin with an underscore, so that when you sort them in the Finder or with the 'ls' command, they will always appear together (generally at the top of the list).

Depending on the types of analyses that you run, different intermediate files will be created in individual folders. You can go through these files to examine the processes and steps that mILD follows. They may be used for additional post-processing tasks, but are meant for expert users and for development purposes only.

The resulting NEXUS file combines all the partitions into a single file, and also contains a listing of missing taxa, combined partitions, etc., which are all features required by the mILD application. You may use the NEXUS file for other types of comparative analyses that involve partitioned data. An effort has been made to make the NEXUS file follow a human-readable format.

All the results files are tab-delimited, so they can be imported easily into common applications, such as Microsoft Excel.

**** SUMMARY OF OPERATIONS USING EXAMPLE FILES

If you download the example files, you will have a folder called "analysis1" that can be used for trying out the basic analyses of mILD:

   1. mILD -dir=analysis1 -s=0
   2. mILD -dir=analysis1 -s=1
   3. mILD -dir=analysis1 -jk

   1-- snowball analysis using partition count criterion
   2-- snowball analysis using character count criterion
   3-- jackknife using LeCointre taxon jackknife method

For each of these examples, a directory called "analysis1_mILD" will be created that contains the result files noted in the previous section.
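
Because every run on the same input directory writes to the same "analysis1_mILD" folder, it can help to move an earlier result folder aside before starting a new run. The following is a minimal sketch in Python and is not part of mILD. The helper names and the timestamp suffix are hypothetical. Only flags listed in this README are used, and passing "-jk" and "-s" together on one command line is an assumption based on the statement that both analyses can be performed.

```python
import os
import time


def build_mild_command(directory, snowball=None, jackknife=False,
                       paup_cmd=None, mild="mILD"):
    """Build the argument list for one mILD run.

    snowball: None (no snowball), 0 (partition count) or 1 (character count).
    """
    if snowball is None and not jackknife:
        raise ValueError("choose a snowball criterion (0 or 1) and/or the jackknife")
    if snowball not in (None, 0, 1):
        raise ValueError("snowball criterion must be 0 or 1")
    args = [mild, "-dir=" + directory]
    if snowball is not None:
        args.append("-s=%d" % snowball)
    if jackknife:
        args.append("-jk")
    if paup_cmd:
        args.append("-paupCmd=" + paup_cmd)
    return args


def protect_previous_results(directory, parent="."):
    """Rename an existing <directory>_mILD folder so a new run cannot overwrite it.

    Returns the new folder name, or None if there was nothing to protect.
    """
    results = os.path.join(parent, directory + "_mILD")
    if not os.path.isdir(results):
        return None
    # Hypothetical naming scheme: append a timestamp to the old folder.
    backup = results + "_" + time.strftime("%Y%m%d-%H%M%S")
    os.rename(results, backup)
    return backup
```

From the parent directory of "analysis1", calling protect_previous_results("analysis1") and then passing build_mild_command("analysis1", snowball=0) to subprocess.call() would correspond to example 1 above.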