Data analysis - documentation : Pipeline IT requirements
This page last changed on Jun 05, 2007 by maising.
The pipeline does not have a strict set of IT requirements. We have found that many of our early access customers have different opinions on these matters and different existing IT setups, which in turn has led them to different storage and data-management solutions. Therefore we are reluctant to dictate a "correct" solution. Nevertheless, this page attempts to give some guidelines.

Disk space and network infrastructure

The way the Solexa machine works is that images are acquired and stored on the instrument and must then be transferred to an external computer to be analysed by the analysis software, which handles image processing, base-calling and sequence alignment. The main issue to be aware of is that each instrument run generates ~1 TB of data during a full 2-3 day run. However, ~70% of this is TIFF image data that can potentially be sent to tape once a run is finished, you are satisfied with the results and a reanalysis is not required. These large data volumes mean that you will need substantial disk capacity and a fast network link between the instrument and the analysis computer.
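To make the disk-space implication concrete, the following is a minimal back-of-envelope sketch in Python. The ~1 TB per run and ~70% image fraction come from this page; the throughput and retention parameters are hypothetical placeholders, not recommendations, and should be replaced with your own figures.

    # Back-of-envelope storage estimate using the figures quoted above.
    # RUN_SIZE_TB and IMAGE_FRACTION come from this page; every other
    # parameter is a hypothetical placeholder -- adjust to your own
    # throughput and retention policy.

    RUN_SIZE_TB = 1.0      # ~1 TB of data per full 2-3 day run
    IMAGE_FRACTION = 0.70  # ~70% of a run is TIFF image data

    runs_per_month = 8     # hypothetical instrument throughput
    months_retained = 2    # hypothetical: keep non-image data online this long
    runs_in_flight = 2     # hypothetical: runs still being analysed/signed off

    # Images can go to tape once a run is signed off, so only the
    # remaining ~30% needs to stay on disk long-term.
    online_per_run_tb = RUN_SIZE_TB * (1.0 - IMAGE_FRACTION)
    steady_state_tb = runs_per_month * months_retained * online_per_run_tb

    # Peak usage also includes complete run folders (images included)
    # for runs that have not yet been archived to tape.
    peak_tb = steady_state_tb + runs_in_flight * RUN_SIZE_TB

    print("Steady-state online storage: %.1f TB" % steady_state_tb)
    print("Peak with %d runs in flight: %.1f TB" % (runs_in_flight, peak_tb))

With the placeholder values this gives ~4.8 TB steady-state and ~6.8 TB peak; the point is simply that the image data, not the sequence data, dominates the storage budget.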
Analysis computer

The analysis software was designed to run on a range of computer architectures and to fit into existing IT infrastructures as seamlessly as possible. The following guidelines may be helpful:

Operating system

The software should run on all common Unix/Linux variants. It may also run on other platforms, including MacOS X. However, we do not support any platform other than Linux at the moment. For your reference, internally we are running CentOS 4.1, although there is no reason to believe that any other Linux distribution would be less suited.

Hardware

A high-end box (e.g. a dual-processor, dual-core Intel Xeon or AMD Opteron) is the minimum requirement for an analysis computer; on this type of hardware you can expect to turn around the image analysis and base-calling of a full run in ~1 day. Sequence alignment takes additional time; depending on which alignment program you run, probably somewhere between a few hours (using our fast short-read whole-genome alignment program Eland) and days (using more traditional alignment programs).

The pipeline can make use of multi-processor systems (SMP) and clusters. Parallelisation is built around the multi-processor facilities of the make utility and scales well beyond 8 nodes; a substantial speed-up can be expected even when parallelising across several hundred CPUs. Internally we are using Sun Grid Engine as a job management system, and a full 1 GB run-folder is typically analysed in around 6 hours on 7 dual-processor, dual-core Opterons. More information on parallelisation can be found in Pipeline parallelisation.
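To give a feel for how turnaround might relate to CPU count, here is a generic Amdahl's-law sketch, not measured pipeline scaling data. The ~1 day on a 4-CPU box comes from this page; the parallel fraction of 0.95 is an assumption, chosen because it roughly reproduces the ~6 hours on 7 dual-processor, dual-core Opterons (28 CPUs) also quoted above.

    # Generic Amdahl's-law sketch of analysis turnaround vs. CPU count.
    # BASELINE_HOURS / BASELINE_CPUS reflect the ~1 day on a dual-processor,
    # dual-core box quoted above; PARALLEL_FRACTION is an assumption (it
    # roughly reproduces the ~6 h on 28 CPUs also quoted on this page).

    BASELINE_HOURS = 24.0
    BASELINE_CPUS = 4
    PARALLEL_FRACTION = 0.95  # assumed share of work make can run in parallel

    def estimated_hours(cpus):
        """Estimated wall-clock hours for the analysis on `cpus` CPUs."""
        serial = 1.0 - PARALLEL_FRACTION
        # Normalise the baseline back to a single CPU, then rescale.
        t1 = BASELINE_HOURS / (serial + PARALLEL_FRACTION / BASELINE_CPUS)
        return t1 * (serial + PARALLEL_FRACTION / cpus)

    for n in (4, 8, 28, 100):
        print("%3d CPUs: ~%.1f h" % (n, estimated_hours(n)))

Under these assumptions the model predicts ~14 hours at 8 CPUs, ~7 hours at 28 CPUs and ~5 hours at 100 CPUs: diminishing but still real returns, consistent with the scaling claims above.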