I've started writing a tutorial on how to use XCluster. Feedback is appreciated.

Introduction:

The new version of the clustering software, XCluster, grew out of the desire to make the software far less memory intensive, faster, and smarter when joining two nodes together, such that most similar outermost expression patterns of said nodes are placed next to each other.

This software, and the underlying source, are freely available to all researchers from non-profit and academic institutions. You may register and download the software from the
Stanford Microarray Database site.

For commercial entities interested in the software, there is a non-exclusive ready-to-sign license agreement, which is available in PDF format here. Further enquiries may be directed to Imelda Oropeza (Tel : 650-725-9039) at the Office of Technology Licensing. If you would like a 30 day free trial, please register and download the software from the Stanford Microarray Database site.

Please email Gavin Sherlock if you have any questions/feature requests etc.

Most of the files that are output by the clustering program are readable by TreeView (.cdt, .gtr, .atr), which allows you to browse clusters. TreeView is available free to academic users here. Commercial users should direct enquiries to Jessica Smith. Some additional files created by XCluster, eg in generating SOMs, and k-means clustering are for informational purposes. The overall SOM and k-means structures are not viewable in TreeView, but the individual clusters, which comprise .cdt, .gtr, and .atr files are. I have written a web based SOM/k-means viewing application in Perl, which is detailed here. However you'll require a reasonable amount of knowledge of both Perl and UNIX to install it in a cgi directory on your Web server, which I can't help you with (though your system administrator might!). In the not too distant future I intend to write a description of the algoritms, and some advice on using them, so watch this space!

Usage:

Mac or PC

Simply double-click the application, and answer the questions as they come up. Your input file must conform to the expected file format, described below.

UNIX

You can either simply type cluster, and answer the questions as they come up, or use the following command line arguments:


 * -f filename
 * -g 0|1|2 ; 0 indicates no gene clustering, 1 indicates non-centered metric, 2 indicates centered metric when clustering genes.  1 is default.
 * -e 0|1|2 ; 0 indicates no experiment clustering, see above for 1 and 2.  0 is the default.
 * -p 0|1 ; whether to use pearson correlation (1), or Euclidean distance (0).  1 is the default.
 * -s 0|1 ; whether to make a SOM (1). 0 is the default. 
 * -x specify x dimension of SOM
 * -y specify y dimension of SOM
 * -r 0|1 ; whether to seed the random number generator with the time when making a SOM.  1 is the default.
 * -k num ; how many k-means clusters to make.  num is an integer, greater than 1, and preferably less than a gazillion. 
 * -l 0|1 ; whether to log transform the data.
 * -u string ; a unique identifier by which to name the output files, instead of basing their names on the input file.  eg -u 888 will generate 888.cdt as an output file.

Known problems/limitations/future plans:

- The file format is very inflexible.  I'm toying with making it allow
  a variable number of columns prior to the experimental data.  Don't hold
  your breath though.  Would really like to transition to a flexible XML
  format, which Paul Spellman is (supposedly) defining.  Will require
  a loty of new code though....

- sometimes it can happen that a k-means clustering goes on forever.
  Obviously not a very useful feature!.  I'll investigate this when I
  have time.

- future plans include:

    1. Making the program automatically optimize the number of
       clusters for both k-means and SOMs.  I've actually implemented
       a couple of methods for this in my development version, but
       neither of them was satisfactory - Tibshirani has come up with
       a new method, that I'll check out implementing as time permits.

    2. Additional optimizations to CalculateCorrelation function.
       I've actually done this, but whilst it does make the code about
       15-20% faster, it also makes it unreadable, so I've backed off
       from this for the moment.

    3. Add in non-parametric distance measure, such as Spearman Rank
       Correlation coefficient.

    4. Add in data filtering, based on Orly's work.
       This has also been done, but is still considered as 'in
       testing', and will have to wait until interpolation of missing
       data is done, so that the full data set can be used.

    5. Split the source up into several source files

    6. Add in the jacknife correlation, and the 'Quality Cluster'
       Method of Heyer et al, which is a pretty cool method.
 
    7. Add in the divisive clustering method of Alon et al.
       Haven't yet investigated thoroughly how easy/hard this is.

    8. Add in autoweighting, as already implemented in Mike Eisen's
       clustering program.


History:

New (as of 9-21-99) - Finished version 2.51 Minor changes to code, to streamline parsing of user input. - Made it so that genes don't have to be clustered. Old (as of 7-19-99) - Finished version 2.5 Has the ability to generate k-means clusters. - Added in code to log transform the data (in base 2) in case it wasn't already log transformed. - Optimized SOM code somewhat. - Changed parsing code so that it will compile and run unmodified on all platforms. Old (as of 5-13-99) - Finished Version 2. Has the ability to generate Self Organizing Maps, which can then be viewed at http://genome-www.stanford.edu/~sherlock/SOMview.html - The input data is filtered when making a SOM, such that all expression profiles that do not vary at least 2 fold from their max to min are removed. These are printed out into a file with a .fil extension - The user has the ability to decide whether to not seed the random number generator with the current time. If they want to compare SOMs with and without clustering by experiment, this is useful, because you can guarantee the same organization if you don't seed the random number generator. - Fixed bug where using relative pathways gave the wrong output filenames. Now can use: 1. /home/users/sherlock/sample.txt OR, 2. ../../sample.txt equally well. - Did some optimisations to the CalculateCorrelation function to speed it up. Old (as of 5-6-99) - Added in the ability to choose between using Euclidean distance, or pearson correlation for use as a similarity metric. Can be set via the command line option -p, where 0 means use Euclidean, and 1 means use pearson. 1 is the default. Old (as of 5-1-99) - Found stupid and nasty bug such that even though the code was correctly switching the node orders to bring together similar genes/experiments, these switches were not reflected in the tree files. This bug was fixed, and the fix tested, and it's now better! Old (as of 4-16-99) - Was storing each correlation in two places, which Paul pointed out was unnecessary, so now only store it in one place. Makes it slightly faster. Old (as of 4-12-99) - Fixed a couple of small bugs eg. forgetting to malloc the extra byte for the '\0' in ** all ** the strings! - made the code somewhat more readable, since: *(darray+x*y) is equivalent to darray[x*y] the second version is a lot more readable I think. - Made linked list implementation more efficient, such that when a correlation is to be inserted into a full list, then the memory of the lowest correlationRec is reused, and the pointers manipulated accordingly. Saves a call to malloc and free. - Added the ability to take command line arguments, so can make automatable: Usage: cluster -f filename /* filename (duh!) */ -g 1|2 /* genes: 2=use centered metric 1 is default */ -e 0|1|2 /* experiments: 0=don't cluster 1=cluster 2=use centered metric 0 is default */ eg. cluster -f sample.txt -g 2 -e 2 cluster -f sample.txt cluster -e 2 -f sample.txt ...you get the idea If you supply any command line arguments, you must supply the filename. Old ! (as of 4-12-99) The clustering program is implemented in C. To use resources more efficiently it only stores the 13 highest correlations for any particular gene (I played around with this number on a set of ~6200 genes by 80 experiments, and 13 was the fastest). Also, since sqrt(a) * sqrt(b)=sqrt(a*b), I was able to save a call to sqrt. The program, was compiled using: gcc -Wall latest_cluster.c -o cluster -lm -O3 -ansi As a couple of benchmarking exercises, cluster takes: 11 seconds: 800 genes x 82 experiments (alberich) 8 minutes: 6200 genes x 82 experiments (alberich) 1 hour: 16384 genes x 250 experiments (daisy) The program prompts the user for an input file, asks whether the data should be clustered by experiments, in addition to genes, and asks whether to cluster the data using a centered correlation metric. The progress is given by indicating every 100 correlations that have been calculated, and every 100 nodes that have been clustered (for genes) or every node that has been clustered (for experiments). The program then outputs 2 or 3 files, identically named as the input file, but with the extensions .cdt, .gtr, and .atr (if experiments were clustered).

File formats:

The program expects a specific input file format. If this is violated the program will give unexpected results. If the input format specified here is unsatisfactory, or too limiting, we can discuss making changes. Line 1: UID\tNAME\tGWEIGHT\tExp1\tExp2...Expn Line 2: EWEIGHT\t\t\t1\t1\t...1 Line 3: ORF1\tNAME1\t1\t0.23\t0.45\t...0.43 Line 4: ORF2\tNAME2\t1\t0.56\t-0.10\t..0.34 ie: The first line contains three columns (they can be named anything), followed by the names of all the experiments (tab delimited). The second lines contains 3 columns (they can be named anything), followed by the EWEIGHTS for each experiment (usually have the value 1, but treated internally as a float). The third, and subsequent lines have the ORF name as the first column, the gene name as the second column, and the GWEIGHT as the third column (again, whose value is usually 1, but is treated internally as a float). The ORF name and GENE name fields can be up to 1024 characters in length, so can be descriptive data instead. The remainder of the line is made up of tab delimited experimental data. Where there is a null data point, there will be two tabs next to each other. an example file is : UID NAME GWEIGHT spo0 spo30 spo2 spo5 spo7 spo9 spo11 EWEIGHT 1 1 1 1 1 1 1 YAL003W EFB1 1 0.23 -1.79 -1.29 -1.56 -0.27 YAL004W YAL004W 1 0.41 -0.38 -0.89 -1.06 -1.6 -1.84 -1.6 YAL005C SSA1 1 0.61 -0.07 -1.29 -1.29 -2 -1.84 -2.25 YAL010C MDM10 1 0.16 -0.15 -0.76 -1.25 -1.89 -1.74 -1.6 YAL012W CYS3 1 0.03 1.39 -0.84 -1.64 -2.84 -2.47 -2.4 YAL015C NTG1 1 -0.18 -0.18 -0.62 -1.32 -1.69 -1.43 -1.79 ...etc Ouput files: The .cdt file looks like this: GID YORF NAME GWEIGHT spo0 spo2 spo30 spo11 spo9 spo7 spo5 AID ARRY0X ARRY2X ARRY1X ARRY6X ARRY5X ARRY4X ARRY3X GENE37X YBR025C YBR025C 1 0.50 -1.94 -2.64 0.73 0.51 -0.18 -1.00 GENE26X YBL042C YBL042C 1 0.26 -1.06 -2.06 1.58 1.33 0.54 -0.43 ...etc where everything is tab delimited. The second line is only printed if experiments, as well as genes were clustered. The ARRY0X, and GENE37X are used by the TreeViewer program to determine how to draw the binary trees, in conjunction with the .gtr and .atr files (below): .gtr: NODE1X GENE51X GENE24X 0.997965 NODE2X GENE41X GENE40X 0.997813 NODE3X GENE90X GENE89X 0.997782 NODE4X GENE84X GENE32X 0.996293 NODE5X GENE8X GENE3X 0.995884 NODE6X GENE91X GENE54X 0.995747 NODE7X GENE80X GENE42X 0.995286 NODE8X GENE70X GENE31X 0.994681 NODE9X GENE63X GENE36X 0.994233 ...etc .atr NODE1X ARRY4X ARRY3X 0.948027 NODE2X ARRY6X ARRY5X 0.940996 NODE3X NODE2X NODE1X 0.902631 NODE4X ARRY2X ARRY1X 0.785001 NODE5X NODE4X NODE3X 0.662225 NODE6X NODE5X ARRY0X -0.082621
If there's any questions over:
  1. File formats
  2. Software
  3. Anything else!
Please ask me.

The program doesn't appear to have any bugs per se, but it probably won't cope well with badly formatted files. It also expects every gene to have an Orf, or UID in the first column. It will error out if this is not the case. For the second (name) column, this is not fatal, as long as there is an empty column. The code itself is pretty well commented, though some of the comments are slightly out of date, as I changed functions during development.

I'm also well aware that each function has probably too many arguments, and that I should probably subsume a lot of things into the clusterRec, to only have to pass one thing around, although the increased over head of having to dereference an extra pointer each time may or may not be trivial. I also realize that the ClusterNodes and ClusterExperiments functions should be unified to make code maintenance easier. I'll try and get round to this. The source code is available and the same code should essentially run unmodified under any system (UNIX, Mac, PC, Linux) with a simple recompilation. Let me know if you run into problems. Gavin