This page last changed on Oct 30, 2006 by maising.

Run-folder Proposal

Clive G. Brown v1.0 Jan 2005
Klaus Maisinger/CGB v1.1 Mar 2005
KM v1.2 Jun 2005
KM v1.3 Nov 2005
KM v1.4 Jan 2006
Tony Cox v1.5 Feb 2006
KM v1.6 May 2006
KM v1.7 Jul 2006
KM v1.8 Oct 2006

The purpose of this document is to propose a standardised 'run-folder' structure, file naming conventions and file formats for all Solexa sequencing instruments. These standards will allow for:

  1. Single point of data storage, logging and analysis output both during and after a machine 'run'.
  2. Encoding enough information for the history of the data within the run folder to be tracked and traced back to a laboratory notebook without confusion between machines, experimenters and sites.
  3. Provide standardised input and output for component software to operate on without error, regardless of the instrument (and variants) generating the data.
  4. Encode and capture enough information for the data processing to be independently re-run on the contents of the folder at any time in such a way that existing 'extractions' of sequence and related data are preserved and any parameters used during any point of the 'extraction' process are captured and related to the subsequent output data.
  5. Subsequent analyses are also stored in the run-folder.
  6. Our software tools and other 'client' software both implement and enforce these structures and standards.

The main benefits of standardising the run-folder will be:

  1. Easier and quicker development of automated pipelines to operate on the data (regardless of instrument type). The functionality of such pipelines will be a subset of the final instrument software.
  2. Easier and quicker migration to a LIMS in future, provided sufficiently unique and descriptive tracking IDs, tags, keys and composite keys are encoded in the run-folder structures and files.
  3. The same run-folder will be used on final sequencing instruments and satisfies a universal requirement for transparent access to self-descriptive data by end users and instrument integrators.

Run-Folder naming

A 'run-folder' captures the output of an instrument run. An experiment is defined as the process of sequencing a specific chip with a constant set of samples through multiple enzymology cycles. Although an experiment will usually correspond to a single instrument run, it may also consist of an arbitrary number of multiple runs, for example if a run had to be interrupted due to instrumental problems, or if it was decided to supplement an existing run by sequencing more bases of the sample on the chip.

It is highly desirable to keep Expt-Ids (or Sample-ID) and instrument names unique within any given enterprise. Ideally one would want a convention under which each machine is able to allocate run-folder names independently of other machines without fear of causing naming conflicts.

Therefore on the current generation of instruments run-folders are named according to the convention

YYMMDD_machinename_XXXX

i.e. consisting of three underscore-separated fields, where:

  1. Field 1 is a 6 digit number specifying the date of the run. The YYMMDD ordering ensures that a numerical sort of run-folders by this field places the names in chronological order.
  2. Field 2 specifies the name of the sequencing machine. It may consist of any combination of upper or lower case letters or digits or hyphens and may NOT contain any other characters (especially not an underscore). Implicitly it is assumed that the sequencing machine is synonymous with the PC controlling it, and that whoever names sequencing machines ensures that the names they choose for each machine are unique across the sequencing facility (which in our case comprises two sites).
  3. Field 3 is a 4 digit code specifying the experiment ID on that machine. Each machine should be capable of supplying a series of consecutively numbered experiment IDs (i.e. it needs to maintain some sort of incremental unique index, probably from the onboard sample tracking database). It might also be desirable to have the option for the machine to obtain the experiment ID to use from a LIMS.

For example, the run-folder '060114_slxa-b1_0147' would be the experiment number 147 run on Beta instrument 1 (network name 'slxa-b1') on the 14th Jan 2006. Whilst the date and machine name are themselves likely to specify a run folder uniquely for any number of instruments in any institution, the addition of an experiment ID ensures both uniqueness and the ability to relate the contents of the run-folder back to any lab notebook or LIMS.

In this example the name can be split into its composite parts using patterns /_/ (and then /_/ on the second element of the subsequent array). The resultant composite key and run-folder name is known as the Run-ID. This Run-ID should also be used in the onboard database (or used as a unique run name and associated with a unique numerical internal primary key).

Additional information can be captured in the run-folder name in additional fields, separated from the compulsory first three fields by additional underscores. For example, it is becoming an accepted standard to capture the flowcell number in the run name:

YYMMDD_machinename_XXXX_FCYYY
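
The naming convention above can be checked and decomposed mechanically. The following is a minimal Python sketch, not part of the proposal: the names RunId and parse_run_folder are hypothetical, and it assumes that any fields after the compulsory three are simply kept as a list of strings.

```python
import re
from datetime import date, datetime
from typing import List, NamedTuple

# Three compulsory underscore-separated fields, then optional extra fields.
# The machine name may contain letters, digits and hyphens only.
RUN_FOLDER_RE = re.compile(
    r"(?P<date>\d{6})_(?P<machine>[A-Za-z0-9-]+)_(?P<expt>\d{4})(?:_(?P<extra>.+))?"
)


class RunId(NamedTuple):
    run_date: date
    machine: str
    experiment: int
    extra: List[str]  # additional fields, e.g. a flowcell number


def parse_run_folder(name: str) -> RunId:
    """Split a run-folder name into its component fields."""
    match = RUN_FOLDER_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid run-folder name: {name!r}")
    # %y%m%d also rejects impossible dates such as month 13.
    run_date = datetime.strptime(match.group("date"), "%y%m%d").date()
    extra = match.group("extra").split("_") if match.group("extra") else []
    return RunId(run_date, match.group("machine"), int(match.group("expt")), extra)
```

For example, parse_run_folder('060114_slxa-b1_0147') yields the date 14 Jan 2006, the machine name 'slxa-b1' and the experiment number 147.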

Note: Historically, the top-level run-folders for UK experiments have been named using an 8-digit unique 'Experiment/Run-ID'. For example, a folder may be named '00002559' for experiment number 2559.

Run Folder Structure

The Run Folder will contain subfolders called 'Images', 'Data', 'Logs' and 'Analysis'.

Image Folders

The Images folder contains a subfolder for each chip lane that has been sequenced. The folder names are of the form <Sample-ID>_L<lane number>, where the lane number is padded to 3 digits with leading 0s. If no sample-ID is known, only the lane number is used, for example 'L001' contains the images taken in the first lane.

Each lane folder contains a subfolder for each cycle of sequencing. Each image-cycle subfolder contains the images for every tile and every colour from a given cycle of sequencing. The image folder naming follows the convention C<cycle number>.<version number>. The cycle number is 1-indexed and represents the nth cycle. The version number allows a cycle to be re-acquired, for example if the image acquisition is performed more than once or if the machine is paused and a cycle repeated. This would yield folders 'C1.1' and 'C1.2' if images are acquired twice on the first cycle. De-blocking cycles are stored in a folder named D<cycle number>.<version number>, for example D2.1 is the first scan of the de-blocking step after the second cycle. (Version numbers are currently not implemented in the instrument software.)
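
As an illustration of the lane and cycle folder names, here is a small Python sketch. The function names are hypothetical. It also accepts a cycle folder without a version suffix and treats it as version 1; that is an assumption made here because the instrument software does not yet write version numbers.

```python
import re
from typing import Optional, Tuple

# 'C' for a sequencing cycle, 'D' for a de-blocking scan.
CYCLE_FOLDER_RE = re.compile(r"(?P<kind>[CD])(?P<cycle>\d+)(?:\.(?P<version>\d+))?")


def lane_folder(lane: int, sample_id: Optional[str] = None) -> str:
    """Build a lane folder name, padding the lane number to 3 digits."""
    base = f"L{lane:03d}"
    return f"{sample_id}_{base}" if sample_id else base


def parse_cycle_folder(name: str) -> Tuple[str, int, int]:
    """Return (kind, cycle number, version number) for a cycle folder."""
    match = CYCLE_FOLDER_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a cycle folder name: {name!r}")
    # Assumption: a missing version suffix means the first acquisition.
    version = int(match.group("version")) if match.group("version") else 1
    return match.group("kind"), int(match.group("cycle")), version
```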

Within each image-cycle subfolder we store a .tif for each tile. These are named as <sample>_<lane>_<pos>_<base>.tif, for example 's_1_67_g.tif', where 's' is the default sample-ID used at Solexa-UK. Sample-IDs must not contain any underscores '_', as these are used as separators between the different components of the filename to allow easy splitting by any software reading these filenames. (NB: The original proposal was to name the images '<sample>_<lane>_<pos>_<cycle>_<filter>_<base>.tif'. There are plans to move to a new scheme '<sample>_<lane>_<row>_<column>_<base>.tif' soon, which takes into account the rows and columns of the scanning arrangement along the lane.)
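
Because underscores are reserved as separators, an image filename in the current scheme can be split without ambiguity. A minimal Python sketch follows; parse_image_name is a hypothetical helper, and it handles only the four-field <sample>_<lane>_<pos>_<base>.tif form, not the earlier or planned schemes.

```python
from typing import Tuple


def parse_image_name(filename: str) -> Tuple[str, int, int, str]:
    """Split '<sample>_<lane>_<pos>_<base>.tif' into its four components."""
    if not filename.endswith(".tif"):
        raise ValueError(f"not a .tif image: {filename!r}")
    stem = filename[: -len(".tif")]
    parts = stem.split("_")
    # Exactly four fields are expected; a sample-ID with an underscore
    # would produce more and is rejected here.
    if len(parts) != 4:
        raise ValueError(f"expected 4 underscore-separated fields: {filename!r}")
    sample, lane, pos, base = parts
    return sample, int(lane), int(pos), base
```

For the example above, parse_image_name('s_1_67_g.tif') gives sample 's', lane 1, tile position 67 and base 'g'.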

Also in this folder is a .params file which describes any parameter values specific to the images in this folder. The format of this type of file is dealt with later. (NB: No such parameter file has been implemented yet. It is unlikely that it will ever follow this proposal in detail, but there will certainly be XML files describing the data. Also TIFF headers may be used to store additional information.)

The Data Folder

The Data folder contains a variable number of subfolders. A new subfolder is generated each time a batch of images is processed by an image analysis package. Generally, the raw intensities are kept in one file per tile, with the extension _int.txt. For each run of an image analysis package, there may be multiple additional subfolders holding the output that a base-caller package generates from this image analysis. The base-calls are likewise kept in one file per tile, with the extension _seq.txt. Apart from this extension, each base-call file has the same name as the intensity file from which it was generated.

Image analysis Folders

The Data folder contains a .params file with multiple records corresponding to each subfolder which has been generated as a result of analysing sets of images. (The detailed information about the image analysis is thus stored one level above the corresponding data in the directory hierarchy. This allows a user to browse the different runs of the image analysis package easily without having to descend into the subfolders.) This .params file explicitly records which cycle-image folders were used to generate a given _int.txt file and any parameters used. It should also record the name of the subfolder and the individual files within it. The format of the .params files is explained in more detail in later sections of this document.

Each of these subfolders is named using C<first cycle>-<last-cycle>_<software><software-version>_<date>_<user> and, if required, a version number. For example, 'C1-27_Firecrest1.8.20_31-07-2006_myuser.2' contains the second version of an analysis of cycles 1-27 performed using version 1.8.20 of the Firecrest software, run by the user 'myuser' on the 31st of July 2006.
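
The folder name can be taken apart again with a regular expression. The Python sketch below is illustrative only: parse_analysis_folder is a hypothetical name, and it assumes that the software name consists of letters only, that user names contain no dots, and that a folder without a version suffix is the first version.

```python
import re
from datetime import datetime

ANALYSIS_FOLDER_RE = re.compile(
    r"C(?P<first>\d+)-(?P<last>\d+)"
    r"_(?P<software>[A-Za-z]+)(?P<sw_version>\d+(?:\.\d+)*)"
    r"_(?P<date>\d{2}-\d{2}-\d{4})"
    r"_(?P<user>[^.]+)"
    r"(?:\.(?P<version>\d+))?"
)


def parse_analysis_folder(name: str) -> dict:
    """Decode an image analysis folder name into a dictionary of fields."""
    match = ANALYSIS_FOLDER_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not an image analysis folder name: {name!r}")
    return {
        "first_cycle": int(match.group("first")),
        "last_cycle": int(match.group("last")),
        "software": match.group("software"),
        "software_version": match.group("sw_version"),
        # Dates in folder names are written day-month-year.
        "date": datetime.strptime(match.group("date"), "%d-%m-%Y").date(),
        "user": match.group("user"),
        # Assumption: no suffix means the first version of the analysis.
        "version": int(match.group("version")) if match.group("version") else 1,
    }
```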

Sequence Folders

Each image analysis folder may hold multiple subfolders with the output of different runs of a base-caller package. Each of these subfolders is named <software><software-version>_<date>_<user>[.<version-number>], for example 'Bustard1.8.8_08-11-2005_myuser.3' for the third run of the Bustard base-caller on 8th November 2005 by the user 'myuser'.
Each image analysis folder also holds a .params file that records any relevant information about the runs of a base-caller and the generation of the _seq.txt file(s) in the subfolders.

The Intensity Files

The prefix of the intensity filenames follows the prefix of the image filenames, but the tile position is zero-padded to 4 characters; e.g. 's_1_0006_int.txt' is the intensity filename corresponding to the image files 's_1_6_a.tif', 's_1_6_c.tif' etc. Each _int.txt file has a set of data for remapped spots on each line, with a standard Unix '\n' at the end of each line. Each row corresponds to the data from one 'spot' and each column is delimited by white-space. Each row has a channel index and a tile index in the first two columns, with the X offset and Y offset of the spot in the third and fourth columns (all coordinates indexed from zero). These fields are TAB-delimited. Note: These values should be enough to uniquely identify any spot for any run.

Following the co-ordinate fields are the data fields. The first value in a data field is the raw intensity for the base A, the second the raw intensity for the base C, then G and T. These four values are separated by spaces and are followed by a TAB to mark the beginning of the next four values. The next four values represent the corresponding intensities in the de-block scan (if such a scan has been performed), then four intensities for the next cycle etc.

The number of fields should equal four (coordinates) plus four (four bases) times the number of cycles of processed data even if spots don't yield data at the end of a set of cycles or part way through. In this case we leave the fields with 0.0 - but preserve delimiters.

A line from a _int.txt file:

<Channel><TAB><Tile><TAB><X><TAB><Y><TAB><A int 1st cycle> <C int> <G int> <T int><TAB>\
<A int 2nd cycle> ... <LF>

e.g. for 3 cycles of data, from the second tile of lane 1, where the spot is at position offset 10,20 within the tile and was missed on the second cycle:

1	2	10	20	112.3 1.1 10.1 0.2	0.0 0.0 0.0 0.0\
10.2 0.01 94.4 21.1
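
A reader for this layout could look like the following Python sketch. The function names are hypothetical. It treats every TAB-separated group of four values as one scan and does not try to tell sequencing scans from de-block scans, and it reads the X and Y offsets as floating-point numbers; both are assumptions of this sketch.

```python
from typing import List, Optional, Tuple

Key = Tuple[int, int, float, float]
Scan = Tuple[float, float, float, float]  # raw intensities for A, C, G, T


def intensity_filename(sample: str, lane: int, tile: int) -> str:
    """Name of the intensity file for a tile (tile zero-padded to 4 characters)."""
    return f"{sample}_{lane}_{tile:04d}_int.txt"


def parse_int_line(line: str, expected_scans: Optional[int] = None) -> Tuple[Key, List[Scan]]:
    """Parse one row of an _int.txt file into a spot key and a list of scans."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError("expected 4 coordinate fields and at least one scan")
    key = (int(fields[0]), int(fields[1]), float(fields[2]), float(fields[3]))
    scans = []
    for group in fields[4:]:
        values = [float(v) for v in group.split(" ")]
        if len(values) != 4:
            raise ValueError(f"expected 4 intensities per scan, got {group!r}")
        scans.append((values[0], values[1], values[2], values[3]))
    # Missed spots keep their delimiters, so the scan count is fixed per file.
    if expected_scans is not None and len(scans) != expected_scans:
        raise ValueError(f"expected {expected_scans} scans, got {len(scans)}")
    return key, scans


def is_missed(scan: Scan) -> bool:
    """A spot that yielded no data in a scan is stored as four zeros."""
    return all(value == 0.0 for value in scan)
```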

A second set of files with identical layout stores the estimates of the noise on the intensity estimates; these files end in _nse.txt.
Note: This intensity file format is not actually used internally by the primary data processing software; see the section on efficiency. It is possible that this format may be phased out in the near future.

The Sequence Files

For a given intensity file, following base-calling, we have a sequence file of the same name. For example, from an intensity file called s_1_0001_int.txt we would get a base-called file called s_1_0001_seq.txt.

Each sequence file has one sequence per row, similar to the intensity files. Each row of the sequence file begins with the same four fields as the corresponding row of the intensity file, with the <lane>,<tile>,<X-offset>,<Y-offset> providing a unique key and a global co-ordinate for the sequence, relating sequences to spots on the images. Following this we have a string with one character for each base call, with the fields as before delimited by TABs. Another file holds the base-caller confidence score.

Sequence format:

<channel><TAB><tile><TAB><x><TAB><Y><TAB><sequence><LF>

e.g. for the 3 cycles of data shown above

1	2	10	20	A.G

Standard IUPAC sequence ambiguity codes will be used where appropriate. Note: Ambiguity codes are not implemented. Missing base-calls are represented by a '.' - at the moment the only reason for this is that a cluster could not be observed because a proper image alignment was not possible or the cluster has shifted off the image in a later cycle.
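
A row of a sequence file can be read in the same way. In this Python sketch the function names are hypothetical and the X and Y offsets are again read as floating-point numbers, which is an assumption; missing_cycles reports the 1-indexed cycles for which no base call was made.

```python
from typing import List, Tuple

Key = Tuple[int, int, float, float]


def parse_seq_line(line: str) -> Tuple[Key, str]:
    """Parse one row of a _seq.txt file into a spot key and its sequence."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 TAB-delimited fields, got {len(fields)}")
    channel, tile, x, y, sequence = fields
    return (int(channel), int(tile), float(x), float(y)), sequence


def missing_cycles(sequence: str) -> List[int]:
    """Return the 1-indexed cycles whose base call is the '.' placeholder."""
    return [cycle for cycle, base in enumerate(sequence, start=1) if base == "."]
```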

The Quality-score Files

Each intensity and sequence file has an associated file containing the quality scores for the base calls. For example, the quality file for the sequence file s_1_0002_seq.txt is called s_1_0002_prb.txt. For each base call in the sequence file, there are 4 quality scores corresponding to the four bases in the order A,C,G,T. These fields are separated by blanks. Each set of four scores for one base call is separated from the next set by TABs.
For example:

  30 -30 -30 -30	-22   20  -27  -30	-17   17  -30  -30

might be the scores for a sequence ACC. The scores are defined as Q=10*log10(p/(1-p)), where p is the probability of a base call corresponding to the base in question. By definition, the 4 values of p add up to 1. (NB: The quality scores produced by programs such as phred/phrap are defined as Q=-10*log10(1-p). The drawback of the phred convention is that the resolution of the log-scale for low-quality base calls is extremely poor. For large values both definitions converge asymptotically.)
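
The relationship between scores and probabilities can be made concrete with a short Python sketch. It is illustrative only: the function names are hypothetical, and it assumes the sign convention in which a positive score means the base is more likely than not, i.e. Q = 10*log10(p/(1-p)). This is the reading under which the example scores above give four probabilities summing to roughly 1, and under which the score approaches the phred value for large p.

```python
import math
from typing import List, Tuple

BASES = "ACGT"  # order of the four scores within each group


def score_to_prob(q: float) -> float:
    """Invert Q = 10*log10(p/(1-p)) to recover the probability p."""
    return 1.0 / (1.0 + 10.0 ** (-q / 10.0))


def prob_to_score(p: float) -> float:
    """Log-odds score for a base with probability p (0 < p < 1)."""
    return 10.0 * math.log10(p / (1.0 - p))


def phred_score(p: float) -> float:
    """phred-style score, where p is the probability that the call is right."""
    return -10.0 * math.log10(1.0 - p)


def parse_prb_line(line: str) -> List[Tuple[float, ...]]:
    """Split a _prb.txt row into one tuple of four scores per base call."""
    groups = line.strip().split("\t")
    return [tuple(float(v) for v in group.split()) for group in groups]


def most_likely_base(scores: Tuple[float, ...]) -> str:
    """Return the base whose score is highest within one group of four."""
    best = max(range(len(BASES)), key=lambda i: scores[i])
    return BASES[best]
```

Since the scores are rounded to integers, the four recovered probabilities will generally sum to 1 only approximately.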

The Log Folder

Note: Log folders have not been implemented. The instrument software currently follows different conventions.

The log folder contains a set of files that record useful information about the progression of the various software packages. There should be an error.log file to which any package writes a single line of text recording, in human readable form, any errors, error codes and other information, say date and time, that a client program may throw.

We suggest the error log follows the standard format of

<level><TAB><Date><TAB><time><TAB><Source/program><TAB><user><TAB><host><TAB><Message><LF>

We also expect a run.log file to record run information, events, messages and other data produced by the instrument or its sub-components and drivers.

The folder can also contain log files specific to each application, in which case the client application needs to generate a unique name for its .log file. Contention and locking should not be a problem with these files provided they are being appended to.

The Analysis Folder

Note: Analysis folders have not been implemented. Most people use the top-level run-folder directory to store additional information.

The intention behind this folder is to allow users of the pipeline software and other tools to write results (e.g. spreadsheets) derived from the RunFolder data back into the RunFolder. It is free-form. However, within this folder is another folder, 'html', which will contain the HTML formatted output from pipeline specific QC and automated analysis tools provided by Solexa.

Permissions

The RunFolder will be readable by anyone but only writeable by pipeline software running under a user alias (probably 'Analysis') or by selected members of the informatics group. The Analysis folder will be read/writeable by anyone (except for the HTML subfolder).

Efficiency

To allow efficient handling by any software packages, there is one intensity file and one sequence file per tile. However, a single file can easily be created by simple concatenation of the individual files. The files will be designed to be 'cat' and 'paste' friendly.
(NB: Currently the image analysis and base-call packages use intermediate file formats with only one set of four intensities per row (int.txt), four noise values per row (nse.txt) or four quality scores per row respectively. Each of these files holds the values obtained in a single cycle, and thus one needs as many such files as there are cycles. The files are named s_1_0002_03_int.txt for the second tile in the first channel, third cycle. This format is for efficient random-access use by the analysis software. These intermediate files are stored in subfolders called 'Firecrest' and 'Bustard' respectively, following the names of the analysis software packages. The idea is that these subfolders can be safely removed if storage space becomes an issue, as the information stored in these folders is mostly redundant with the files in the higher-level folder.)
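
The concatenation of per-tile files amounts to what 'cat' does over a sorted file list. A Python sketch of the same operation is given below; concatenate_tiles is a hypothetical helper. It relies on the zero-padded tile numbers so that a plain lexical sort puts the tiles of a lane in numerical order, and it assumes the output file is written outside the folder being read.

```python
from pathlib import Path
from typing import List, Union

PathLike = Union[str, Path]


def concatenate_tiles(folder: PathLike, kind: str, out_path: PathLike) -> List[Path]:
    """Concatenate all per-tile '*_<kind>.txt' files of a folder into one file.

    kind is the file type suffix without extension, e.g. 'int' or 'seq'.
    Returns the list of tile files in the order they were written.
    """
    # Only the top level of the folder is searched, so per-cycle intermediate
    # files kept in subfolders are not picked up.
    tile_files = sorted(Path(folder).glob(f"*_{kind}.txt"))
    with open(out_path, "w") as out:
        for tile_file in tile_files:
            out.write(tile_file.read_text())
    return tile_files
```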

Parameters

The top level Run Folder, the Data folder, its subfolders and the top-level Image folder can all contain a .params file. This file is intended to contain any meta-data or parameter data specific to the given level of information held in the folder. It is a simple XML file that is compatible with the output produced by the Perl XML::Simple module to allow easy modification from scripting languages. (A primitive Python implementation of XML::Simple is part of the Solexa data analysis pipeline.) However, any XML-aware software should be able to manipulate parameter files easily. XML tags can be decided upon as needed but once implemented need to be fixed and finalised universally. (However, since it is reasonably easy to transform XML, tags may still change during the development phase. In fact, the names of these tags are unlikely to survive forever.) Where a tag has multiple values the order of those values needs to be constant. Any tags that a given piece of analysis software does not understand should be left untouched.

The top level Run Folder contains a .params file which is named using <Run Folder Name>.params. Its format is as follows:

<experiment>
  <run>
     ...
  </run>
  <run>
    ...
  </run>
</experiment>

For each restart of the instrument a new run tag with corresponding parameter tags is added to the parameter file. While multiple runs in the same experiment are still frequent on the development platform, for most real experiments there will only be one run.
The XML tags used in the parameter file are self-explanatory; rather than listing them all, the following shows an example of a real parameter file.

<experiment>
  <run>
    <instrument>slxa-b1</instrument>
  </run>
</experiment>

In each image-cycle subfolder we have a similar .params file recording meta-information specific to the data in that folder. (NB: This is not implemented yet.)

In the top level of the Data folder we have a .params file that records any information specific to the generation of the sub-folders. This would at least contain a tag-value list describing the cycle-image folders used to generate each folder of intensity and sequence files, e.g. for a case where two sets of intensity and sequence subfolders have been generated:

<?xml version="1.0"?>
<ImageAnalysis>
  <Run Name="C1-24_Firecrest1.8.17_30-05-2006_myuser">
    <Cycles First="1" Last="24" Number="24" />
    <Offsets X="0.000000" Y="0.000000" />
    <Offsets X="0.790000" Y="-0.620000" />
    <Offsets X="-0.250000" Y="0.040000" />
    <Offsets X="0.190000" Y="0.800000" />
    <Parameters>12.000000</Parameters>
    <Parameters>3.000000</Parameters>
    <Parameters>1.900000</Parameters>
    <Software Name="Firecrest" Version="1.8.17" />
    <TileSelection>
      <Lane Index="8">
        <Sample>s</Sample>
        <Tile>10</Tile>
        <Tile>20</Tile>
        <Tile>30</Tile>
      </Lane>
    </TileSelection>
    <Time>
      <Start>30-05-06 12:50:45 BST</Start>
    </Time>
    <User Name="myuser" />
  </Run>
  <Run Name="C1-24_Firecrest1.8.17_30-05-2006_myuser.2">
    <Cycles First="1" Last="24" Number="24" />
    <Offsets X="0.000000" Y="0.000000" />
    <Offsets X="0.790000" Y="-0.620000" />
    <Offsets X="-0.250000" Y="0.040000" />
    <Offsets X="0.190000" Y="0.800000" />
    <Parameters>12.000000</Parameters>
    <Parameters>3.000000</Parameters>
    <Parameters>1.900000</Parameters>
    <Software Name="Firecrest" Version="1.8.17" />
    <TileSelection>
      <Lane Index="5">
        <Sample>s</Sample>
        <TileRange Max="70" Min="5" />
      </Lane>
    </TileSelection>
    <Time>
      <Start>30-05-06 18:01:49 BST</Start>
    </Time>
    <User Name="myuser" />
  </Run>
</ImageAnalysis>
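
Any XML-aware tool can read such a file. As an illustration, the following Python sketch uses the standard xml.etree.ElementTree module to list the runs and the tiles selected in each lane. The function name and the shortened SAMPLE document are hypothetical, and treating TileRange as inclusive at both ends is an assumption.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Shortened, hypothetical sample in the layout shown above.
SAMPLE = """<ImageAnalysis>
  <Run Name="C1-24_Firecrest1.8.17_30-05-2006_myuser">
    <Cycles First="1" Last="24" Number="24" />
    <TileSelection>
      <Lane Index="8"><Sample>s</Sample><Tile>10</Tile><Tile>20</Tile></Lane>
    </TileSelection>
  </Run>
  <Run Name="C1-24_Firecrest1.8.17_30-05-2006_myuser.2">
    <Cycles First="1" Last="24" Number="24" />
    <TileSelection>
      <Lane Index="5"><Sample>s</Sample><TileRange Max="7" Min="5" /></Lane>
    </TileSelection>
  </Run>
</ImageAnalysis>"""


def list_image_analysis_runs(xml_text: str) -> List[dict]:
    """Return name, cycle range and selected tiles for every <Run> element."""
    root = ET.fromstring(xml_text)
    runs = []
    for run in root.findall("Run"):
        cycles = run.find("Cycles")
        lanes: Dict[int, List[int]] = {}
        for lane in run.findall("TileSelection/Lane"):
            tiles = [int(tile.text) for tile in lane.findall("Tile")]
            for tile_range in lane.findall("TileRange"):
                # Assumption: Min and Max are both inclusive.
                low, high = int(tile_range.get("Min")), int(tile_range.get("Max"))
                tiles.extend(range(low, high + 1))
            lanes[int(lane.get("Index"))] = tiles
        runs.append({
            "name": run.get("Name"),
            "first_cycle": int(cycles.get("First")),
            "last_cycle": int(cycles.get("Last")),
            "lanes": lanes,
        })
    return runs


if __name__ == "__main__":
    for run_info in list_image_analysis_runs(SAMPLE):
        print(run_info["name"], run_info["lanes"])
```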

In each image analysis folder there is another parameter file containing the meta-information about the base-caller runs.

<?xml version="1.0"?>
<BaseCallAnalysis>
  <Run Name="Bustard1.8.17_30-05-2006_myuser">
    <Cycles First="1" Last="24" Number="24" />
    <Input Path="C1-24_Firecrest1.8.17_30-05-2006_myuser.2" />
    <Matrix Path="/data/matrix.txt">
      <Base>C</Base>
      <Base>A</Base>
      <Base>T</Base>
      <Base>G</Base>
      <Element>1.010000</Element>
      <Element>1.030000</Element>
      <Element>0.000000</Element>
      <Element>0.000000</Element>
      <Element>0.180000</Element>
      <Element>1.000000</Element>
      <Element>0.000000</Element>
      <Element>0.000000</Element>
      <Element>0.000000</Element>
      <Element>0.000000</Element>
      <Element>1.460000</Element>
      <Element>0.920000</Element>
      <Element>0.000000</Element>
      <Element>0.000000</Element>
      <Element>0.000000</Element>
      <Element>0.840000</Element>
    </Matrix>
    <Parameters>0.000000</Parameters>
    <Parameters>0.000000</Parameters>
    <Software Name="Bustard" Version="1.8.17" />
    <TileSelection>
      <Lane Index="5">
        <Sample>s</Sample>
        <TileRange Max="5" Min="5" />
      </Lane>
    </TileSelection>
    <Time>
      <Start>30-05-06 18:01:50 BST</Start>
    </Time>
    <User Name="myuser" />
  </Run>
</BaseCallAnalysis>
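
The base-caller parameter file links each run back to the image analysis folder it used through the Input tag. The Python sketch below extracts that link together with the matrix; the function names are hypothetical. Grouping the Element values into rows of four in file order is purely an assumption of this sketch, since the layout of the matrix is not specified here.

```python
import xml.etree.ElementTree as ET
from typing import List


def read_basecall_run(run: ET.Element) -> dict:
    """Extract the input folder and the matrix from one <Run> element."""
    matrix = run.find("Matrix")
    bases = [base.text for base in matrix.findall("Base")]
    elements = [float(element.text) for element in matrix.findall("Element")]
    size = len(bases)
    if len(elements) != size * size:
        raise ValueError(f"expected {size * size} matrix elements, got {len(elements)}")
    # Assumption: elements are listed row by row, in the order of the Base tags.
    rows = [elements[i * size:(i + 1) * size] for i in range(size)]
    return {
        "name": run.get("Name"),
        "input": run.find("Input").get("Path"),
        "bases": bases,
        "matrix": rows,
    }


def read_basecall_params(xml_text: str) -> List[dict]:
    """Parse a base-caller .params document into one dictionary per run."""
    root = ET.fromstring(xml_text)
    return [read_basecall_run(run) for run in root.findall("Run")]
```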

Summary

In summary, bringing all these elements together, for a typical run we should have something like:


[Figure: runfolder.PNG]