Fish the ChIPs


1. How to install the Pipeline

 
1.1 Installing the pre-compiled pipeline on Mac OS X (10.6+)

Video tutorial:


We provide two different packages, one for Snow Leopard (Mac OS X 10.6) and one for Lion (Mac OS X 10.7). Depending on the system running on your machine, you have to download the proper package. Once you have downloaded the installer (dmg file), double click the icon to mount the image.

 

 

A white disk-shaped icon will then appear on your desktop. Double click it, and then double click install.command.

 

 

A shell window will now open. Follow the instructions on the screen to complete the installation. First of all, you must enter your administrator password.

 

The installation will then run for a few minutes depending on the performance of the machine.

 

 

Once all the steps are completed, the installation script will ask you for your favorite genome, in order to download the corresponding bowtie index. This step requires an active Internet connection. Genomic indexes are large files, so the download can take from minutes to hours depending on the bandwidth of your connection.

 

 

If the download completes successfully, you should see a message like the one below.




Please note that FC requires R. R will be copied to your system only if no previous R installation is found; an existing R installation is never overwritten. Please also note that Snow Leopard does not ship with Python 2.6.6 or later, so on these systems FC installs it automatically. Installing this specific Python release does not affect your default Python installation.

1.2 Build the pipeline from the source code

Keep in mind that the g++ compiler, Python 2.6.6 or later, and any version of R and Perl must be installed on your system for the pipeline to work properly. The installation process was tested with g++ version 4.1.2.

After decompressing the tarball, open a shell, enter the folder that has just been created, type ./install.sh and follow the instructions on screen.
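Before running the installer, it can be convenient to confirm that the prerequisites listed above are reachable from the shell. The following is only a minimal sketch, not part of the pipeline: the executable names (in particular "R" and "python") are assumptions about how the tools are exposed on your PATH, and the check does not verify versions.

```python
import shutil

# Tools named in the prerequisites above. The executable names are
# assumptions: your system may expose them under different names.
REQUIRED_TOOLS = ["g++", "python", "R", "perl"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the entries of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]
```

Calling missing_tools() with no arguments returns an empty list when all four tools are found; anything it returns should be installed before launching ./install.sh.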




2. How to configure and run the Pipeline


2.1 Running the pipeline on Mac OS X

Video tutorial:


Once you have installed the application, you will find three new icons on your desktop:

1. FCGUI: this opens the FC graphic user interface (see below);

2. FC_working_dir: this is the place where you should copy your raw data and where FC will find the results of the analyses;

3. download_indexes: this opens an interface through which you can download further bowtie indexes for short-read alignment.

All the parameters can be set directly in the FCGUI, so in this case the user does not need to generate the configuration file using the web interface.

The image below shows the main FCGUI form.



In the upper part, the user can modify the program and the working directories. On the right panel, species must be specified as well as the steps that must be included in the pipeline run.

In the middle panel the user can enter the samples. For each sample, a name must be specified. To avoid unexpected errors when running the pipeline, do not use full stops or special characters in the sample names.




Then by clicking on the + button the user can select the corresponding raw file.

 

 

In case of replicates the user can add more raw files simply by selecting the sample and clicking on “Add replicate to sample”.
Once all the samples have been added, the user can queue all the comparisons needed.





 

At this point all the compulsory parameters have been entered, and you can run the analysis by clicking the “Launch the analysis” button. The interface will then switch to another sheet named “Results”, in which you can monitor the progress of your analyses.
 


 

Every time a partial step is completed, you will be informed through the left panel.

 

 


2.2 Automatic generation of the configuration file

For all the users that will not take advantage of the Mac GUI, a user-friendly public tool to generate the configuration text file is available here.

The user just has to fill in the form with all the parameters and then press the Download button to get the corresponding text file. An explanation of each parameter is provided beside the form.

In the following paragraphs we provide some screenshots and details of the compulsory parameters the user must specify in order to generate a suitable configuration file.

First of all, the user must specify which steps of the pipeline will be run, the FC directory (the folder where the pipeline has been installed) and the working directory (the folder where the user stores the raw data).

 

Species is another compulsory parameter. FC already includes all the information for a complete analysis of human and mouse data. The next step consists of specifying the sample filenames and assigning a name to each of them.


Then the user must specify all the pairwise comparisons to be performed.

 

The last compulsory parameter you have to specify is the tag size.

 

Clicking the “Download configuration file” button starts the download of the config.txt file. This file must then be moved to your working folder.






2.3 Detailed description of configuration file fields

 

This section is aimed at explaining the details of every single parameter that must be specified in the pipeline configuration file.

@debug_mode=
This is a flag to turn on the debug mode. It is a Boolean (yes/no).

@working_dir=
This is the path of the folder you are working in. The sample files must be there. Remember to add the final slash to this path.

@FC_dir=
This is the path of the folder where you installed FC.

@do_raw_conv=
This can be set to:
- from_srf, in case you need to perform srf->fastq conversion;
- from_sra, in case you need to perform sra->fastq conversion;
- no, if no conversion is needed.

@do_fastqc=
@do_alignment=
@do_peak_finding=
@do_peaks_annot=
@do_bigwig=
These are flags (yes/no) that specify whether each single step of the pipeline must be run or not.

@do_copy_tracks=no
@do_tracks_to_db=no
These are flags that must always be set to no. They are reserved for functionality that will be included in future versions of the pipeline.
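The allowed values for the step flags described above can be checked before launching a run. The sketch below is an illustration only, not code from the pipeline: the function name check_step_flags is hypothetical, and it assumes the settings have already been read into a dictionary keyed by field name without the leading @.

```python
# Allowed values taken from the field descriptions above.
RAW_CONV_VALUES = ("from_srf", "from_sra", "no")
YES_NO_FLAGS = ["do_fastqc", "do_alignment", "do_peak_finding",
                "do_peaks_annot", "do_bigwig"]
RESERVED_FLAGS = ["do_copy_tracks", "do_tracks_to_db"]


def check_step_flags(settings):
    """Return a list of problems found in the step flags of `settings`."""
    problems = []
    if settings.get("do_raw_conv") not in RAW_CONV_VALUES:
        problems.append("do_raw_conv must be one of: " + ", ".join(RAW_CONV_VALUES))
    for flag in YES_NO_FLAGS:
        if settings.get(flag) not in ("yes", "no"):
            problems.append(flag + " must be yes or no")
    for flag in RESERVED_FLAGS:
        # Reserved for future versions: anything other than "no" is an error.
        if settings.get(flag) != "no":
            problems.append(flag + " must be set to no")
    return problems
```

An empty list means that the step flags are consistent with the descriptions above; it says nothing about the remaining fields.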

@srf2fastq=
This is the full path of the tool that converts the raw files (e.g. srf) to fastq format.

@fastq_dump=
This is the full path of the tool that converts the sra files to fastq format.

@filter_low_quality_reads=
This is a flag that specifies if the low quality reads must be discarded during the conversion of the raw data to fastq.

@specie_name=
This specifies the release of the genome for the organism the data is coming from (e.g. mm9, hg18, hg19).

@bowtie_alignment_mode=
You can either choose to run bowtie in -v or -n mode (this parameter can indeed have values v or n). From the bowtie manual, “In -v mode, alignments may have no more than V mismatches, where V may be a number from 0 through 3 set using the -v option. Quality values are ignored. The -v option is mutually exclusive with the -n option. […] In -n mode, N mismatches are permitted in the "seed", i.e. the first L base pairs of the read (where L is set with -l).”

@bowtie_v=
This is the maximum number of mismatches allowed in -v mode. It can have values from 0 to 3.

@bowtie_m=
In -v mode, bowtie suppresses all alignments for a particular read if a number of reportable alignments higher than the value specified for this parameter exist for it.

@bowtie_n=
In -n mode, this specifies the maximum number of mismatches permitted in the "seed".

@bowtie_l=
In -n mode, this is the "seed length"; i.e., the number of bases on the high-quality end of the read to which the -n ceiling applies.

@bowtie_e=
In -n mode, this represents the maximum permitted total of quality values at all mismatched read positions throughout the entire alignment, not just in the "seed".

@bowtie_trim3=
Number of bases to be trimmed from the 3' end of the short reads before alignment is performed.

@bowtie_quals=
This parameter specifies the kind of input quality scores in your input files. It can have value phred33, phred64, solexa (Illumina Genome Analyzer with a pipeline version <1.3), solexa1.3 (Illumina Genome Analyzer with a pipeline version >=1.3) or integer.

@bowtie_color=
This flag (yes/no) indicates if alignment must be performed in colorspace (in case you are using data coming from SOLiD). Read characters are interpreted as colors. The specified index must be a colorspace index.

@bowtie_cpu=
This parameter specifies the number of parallel threads.

@bowtie_best=
This flag (yes/no) makes bowtie guarantee that reported singleton alignments are "best" in terms of stratum (i.e. number of mismatches, or mismatches in the seed in the case of -n mode) and in terms of the quality values at the mismatched position(s).

@bowtie_output_format=SAM
At this stage of development of the pipeline, this parameter is constant and it is set to the SAM format.

@bowtie_chr_dir=
This parameter contains the full path of the bowtie index for your reference genome.

@bowtie=
This must be set to the full path of the bowtie executable on your machine.

@bowtie_custom=
In this variable you can specify custom parameters for the bowtie command line that are not yet included in the configuration file (e.g. to add two custom parameters p1 and p2 with values v1 and v2, write @bowtie_custom=-p1 v1 -p2 v2 in the configuration file).
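To make the role of the alignment fields more concrete, the sketch below shows one plausible way they could be translated into a bowtie (version 1) command line. It is an illustration under stated assumptions, not the pipeline's actual code: the function bowtie_command is hypothetical, the settings are assumed to be available as a dictionary of strings keyed by field name without the leading @, and empty fields are assumed to mean "use the bowtie default". The option names themselves (-v, -n, -l, -e, -m, --trim3, -p, -C, --best, --sam and the quality flags) are taken from the bowtie 1 manual.

```python
import shlex

# bowtie 1 quality-encoding options, keyed by the @bowtie_quals values above.
QUALS_FLAGS = {
    "phred33": "--phred33-quals",
    "phred64": "--phred64-quals",
    "solexa": "--solexa-quals",
    "solexa1.3": "--solexa1.3-quals",
    "integer": "--integer-quals",
}


def bowtie_command(cfg, fastq, sam):
    """Build a bowtie 1 argument list from a dict of configuration values."""
    cmd = [cfg["bowtie"]]

    def add(flag, key):
        # Append "flag value" only when the field is present and not empty.
        if cfg.get(key):
            cmd.extend([flag, cfg[key]])

    if cfg.get("bowtie_alignment_mode") == "v":
        add("-v", "bowtie_v")
    else:
        add("-n", "bowtie_n")
        add("-l", "bowtie_l")
        add("-e", "bowtie_e")
    add("-m", "bowtie_m")
    add("--trim3", "bowtie_trim3")
    add("-p", "bowtie_cpu")
    if cfg.get("bowtie_quals"):
        cmd.append(QUALS_FLAGS[cfg["bowtie_quals"]])
    if cfg.get("bowtie_color") == "yes":
        cmd.append("-C")
    if cfg.get("bowtie_best") == "yes":
        cmd.append("--best")
    # Custom options are passed through unchanged.
    cmd.extend(shlex.split(cfg.get("bowtie_custom") or ""))
    # bowtie 1 expects: [options] <index> <reads> [<output>]
    cmd.extend(["--sam", cfg["bowtie_chr_dir"], fastq, sam])
    return cmd
```

The returned list can be handed to subprocess.call, or joined with spaces to inspect the command before running it.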

@macs_tsize=
Length of your short reads.

@macs_gsize=
Effective size of the reference genome you used for the alignment.

@macs_mfold=
MACS selects the regions within the mfold range of high-confidence enrichment ratio against background to build the model. The regions must have a fold enrichment lower than the upper limit and higher than the lower limit (e.g. @macs_mfold=10,30).

@macs_pvalue=
This specifies the p-value cutoff for peak detection.

@macs_bw=
This is the value of the bandwidth used to scan the genome for building the model. You can set this parameter as the sonication fragment size expected from your wet experiment.

@macs_shiftsize=
When macs_nomodel is set to yes, MACS uses this parameter to shift tags to their midpoint.

@macs_format=
This specifies the format of the input files (bed, sam, bam, bowtie, eland, elandmulti, elandexport).

@macs_nolamba=
Set this parameter to dynamic if you want to enable local lambda or to fixed if you want to disable it.

@macs_nomodel=
This flag (yes/no) specifies whether MACS must skip building the model from strand-specific information (yes) or build it (no).

@macs_create_wig=
This flag (yes/no) specifies whether MACS must generate wiggle tracks for the UCSC genome browser.

@macs=
This must be set to the full path of the MACS executable on your machine.

@macs_custom=
In this variable you can specify custom parameters for the MACS command line that are not yet included in the configuration file (e.g. to add two custom parameters p1 and p2 with values v1 and v2, write @macs_custom=-p1 v1 -p2 v2 in the configuration file).
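In the same spirit, the peak-finding fields above can be read as arguments of a MACS 1.4 command line. The sketch below is only one plausible mapping, not the pipeline's actual code: the function macs_command is hypothetical, the settings are assumed to be a dictionary of strings keyed by field name without the leading @, and treating the value fixed of @macs_nolamba as the trigger for --nolambda is an assumption based on the description of that field. The option names are those documented for MACS 1.4.

```python
import shlex


def macs_command(cfg, treated_sam, control_sam, name):
    """Build a MACS 1.4 argument list from a dict of configuration values."""
    cmd = [cfg["macs"], "-t", treated_sam, "-c", control_sam, "--name", name]

    def add(flag, key, transform=str):
        # Append "flag value" only when the field is present and not empty.
        if cfg.get(key):
            cmd.extend([flag, transform(cfg[key])])

    # MACS 1.4 documents its format names in upper case (e.g. SAM, AUTO).
    add("--format", "macs_format", str.upper)
    add("--tsize", "macs_tsize")
    add("--gsize", "macs_gsize")
    add("--pvalue", "macs_pvalue")
    add("--bw", "macs_bw")
    add("--mfold", "macs_mfold")
    if cfg.get("macs_nomodel") == "yes":
        cmd.append("--nomodel")
        # The shift size is only used when no model is built.
        add("--shiftsize", "macs_shiftsize")
    if cfg.get("macs_nolamba") == "fixed":  # assumption, see the lead-in
        cmd.append("--nolambda")
    if cfg.get("macs_create_wig") == "yes":
        cmd.append("--wig")
    # Custom options are passed through unchanged.
    cmd.extend(shlex.split(cfg.get("macs_custom") or ""))
    return cmd
```

For a comparison of PU1_UT against input, passing PU1_UT_input as the name would give MACS output files prefixed with PU1_UT_input.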

@fastqc=
This must be set to the full path of the FastQC executable on your machine.

@gin_priority=
This parameter must be set to gene or promoter (in case of possible ambiguities/overlaps, this parameter sets the priority for the annotation).

@gin_distance=
If a peak is found upstream of a TSS within the number of base pairs specified by this parameter, it is annotated to that TSS. This parameter must be set to a number preceded by a minus sign (e.g. -20000).

@gin_genes_file=
This specifies the path of the UCSC table GIN is using for the annotation.

@gin=
@merge=
@pie_peaks_gd=
@tss_3utr_dist=
@wigtobigwig=
@igvtools=
These parameters specify the full path for each one of the scripts indicated.

@tss_3utr_dist_genes_file=
This specifies the path of the UCSC bed table tss_3utr_dist is using for the annotation.

@wigtobigwig_chr_file=
This specifies the path of the file containing the lengths of the chromosomes.

@igvtools_genome_file=
This specifies the path of the species-specific file needed by IGVTools in order to generate tracks (.tdf files) for the Integrative Genomics Viewer (IGV).


Samples labels

For each sample you must specify a name. Any name is allowed, as long as it contains no full stops (.) or special characters. Beside the name, you must specify the path of the corresponding SRF raw file(s). If you have more than one srf file for a sample, list all of them separated by commas.

If you are skipping some steps, you must still specify here the names of your fastq files or even of the processed files; just remember to replace any file extension with .srf to make the pipeline work properly.

The nomenclature is $name=sample.srf (e.g. $input=s_0.input.srf in case you have one srf for the sample, $input=input.rep1.srf,input.rep2.srf in case you have more than one).
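The sample nomenclature above can be produced and checked programmatically. The following is a minimal sketch, not part of the pipeline: the function sample_line is hypothetical, and "special chars" is interpreted here, as an assumption, as anything other than ASCII letters, digits and underscores.

```python
import re

# Assumption: only ASCII letters, digits and underscores are considered safe.
VALID_NAME = re.compile(r"[A-Za-z0-9_]+")


def sample_line(name, raw_files):
    """Format one sample entry as $name=file1,file2,..."""
    if not VALID_NAME.fullmatch(name):
        raise ValueError("invalid sample name: %r" % name)
    if not raw_files:
        raise ValueError("sample %r has no raw files" % name)
    return "$%s=%s" % (name, ",".join(raw_files))
```

For example, sample_line("input", ["input.rep1.srf", "input.rep2.srf"]) gives the two-replicate line shown above, while a name such as PU1.UT is rejected because of the full stop.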

 
Comparisons

For each pairwise comparison you must specify the “treated” sample first and the “control” sample second. The comparison is always performed in the direction treated versus control. If you want to perform the comparison both ways, you must specify it as two different comparisons.

The nomenclature is ^treated_sample_name:control_sample_name. In case you want to do the comparison both ways, also add a row that specifies it, i.e. ^control_sample_name:treated_sample_name.
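The comparison rows can be generated in the same way. This is again only a sketch: the function comparison_lines is hypothetical, and the check that both names belong to the declared samples is an added safeguard rather than a documented behaviour of the pipeline.

```python
def comparison_lines(treated, control, sample_names, both_ways=False):
    """Return the ^treated:control row, plus the reverse row if requested."""
    for name in (treated, control):
        if name not in sample_names:
            raise ValueError("unknown sample: %r" % name)
    lines = ["^%s:%s" % (treated, control)]
    if both_ways:
        # The reverse direction must be listed as a separate comparison.
        lines.append("^%s:%s" % (control, treated))
    return lines
```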





3. Pipeline example


3.1 Configuration file: an example

You can find below an example configuration file. The “treated” sample corresponds to GEO accession GSM487450, while the “control” corresponds to GSM487453.

########################################
@debug_mode=yes
@working_dir=/scratch/tmp_bench/
@FC_dir=/home/gbarozzi/

[ANALYSIS]
@specie_name=mm9
@do_raw_conv=from_sra
@do_fastqc=yes
@do_alignment=yes
@do_peak_finding=yes
@do_peaks_annot=yes
@do_bigwig=yes
@do_tdf=yes
@do_copy_tracks=no
@do_tracks_to_db=no

[SAMPLES]
$PU1_UT=PU1_UT.srf
$input=input.srf

[COMPARISONS]
^PU1_UT:input

[RAW FILES CONVERSION & QC]
@srf2fastq=/home/gbarozzi/srf2fastq/srf2fastq
@fastqdump=/home/gbarozzi/sratoolkit/fastq-dump
@fastqc=/home/gbarozzi/FastQC/fastqc
@filter_low_quality_reads=no

[ALIGNMENT]
@bowtie_alignment_mode=v
@bowtie_v=2
@bowtie_m=1
@bowtie_n=
@bowtie_l=
@bowtie_e=
@bowtie_trim3=
@bowtie_quals=
@bowtie_color=
@bowtie_seed=
@bowtie_cpu=4
@bowtie_best=no
@bowtie_output_format=SAM
@bowtie_chr_dir=/db/bowtie/mm9/mm9
@bowtie=/home/gbarozzi/bowtie/bowtie

[PEAKS FINDING]
@macs_tsize=36
@macs_gsize=1.87e9
@macs_pvalue=1e-5
@macs_bw=300
@macs_shiftsize=
@macs_format=auto
@macs_nolamba=no
@macs_nomodel=no
@macs_create_wig=yes
@macs_space=10
@macs_mfold=10,30
@macs=/home/gbarozzi/MACS/bin/macs14

[ANNOTATIONS]
@gin_priority=gene
@gin_distance=-2500
@gin_genes_file=/home/gbarozzi/data/mm9_ucsc_genes.tab
@gin=/home/gbarozzi/GIN/GIN.pl
@merge=/home/gbarozzi/MERGE/merge.py
@pie_peaks_gd_distance=2500
@pie_peaks_gd=/home/gbarozzi/PIE_PEAKS_GD/pie_peaks_gd.sh
@tss_3utr_dist_genes_file=/home/gbarozzi/data/mm9_ucsc_genes.bed
@tss_3utr_dist=/home/gbarozzi/TSS_3UTR_dist/TSS_3UTR_dist
@wigtobigwig_chr_file=/home/gbarozzi/data/mm9_chr_len_plus_random.txt
@wigtobigwig=/home/gbarozzi/wigToBigWig/wigToBigWig
@igvtools_genome_file=/home/gbarozzi/igvtools/genomes/mm9.genome
@igvtools=/home/gbarozzi/igvtool/igvtools
########################################
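A configuration file in this format is simple to read back for inspection. The parser below is a minimal sketch and not the pipeline's own reader: the function parse_config is hypothetical, and treating lines that start with # as separators and lines in square brackets as section headers to be skipped is an assumption inferred from the example above.

```python
def parse_config(text):
    """Split configuration text into settings, samples and comparisons."""
    settings, samples, comparisons = {}, {}, []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, # separator lines and [SECTION] headers.
        if not line or line.startswith("#") or line.startswith("["):
            continue
        if line.startswith("@"):
            key, _, value = line[1:].partition("=")
            settings[key] = value
        elif line.startswith("$"):
            name, _, files = line[1:].partition("=")
            samples[name] = files.split(",")
        elif line.startswith("^"):
            treated, _, control = line[1:].partition(":")
            comparisons.append((treated, control))
    return settings, samples, comparisons
```

Applied to the example above, the samples would come back as PU1_UT and input with one srf file each, and the comparisons as the single pair (PU1_UT, input). Fields left empty in the file, such as @bowtie_n=, are returned as empty strings.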



3.2 Pipeline output files: an example

Here we simply list the content of the working folder after running the pipeline with the configuration file above. Each file or group of files is preceded by a comment line explaining its content.


# The configuration file
config.bench.txt

# The srf files
input.srf
PU1_UT.srf

# The fastq files
input.fastq
PU1_UT.fastq
 
# Fastqc results (folders)
input_fastqc/
PU1_UT_fastqc/
 
# The SAM files (alignment results)
input.SAM
PU1_UT.SAM
 
# MACS results
PU1_UT_input_peaks.bed
PU1_UT_input_peaks.xls
PU1_UT_input_summits.bed
PU1_UT_input_negative_peaks.xls
PU1_UT_input_treat_afterfiting_all.wig.gz
PU1_UT_input_control_afterfiting_all.wig.gz
PU1_UT_input_model.r
 
# GIN results
PU1_UT_input.gff
PU1_UT_input_peaks.gin
 
# Merge results
PU1_UT_input_peaks.merge
 
# PIE_PEAKS_GD results
PU1_UT_input.genomic_dist.pdf

# TSS_3UTR_DIST results
PU1_UT_input_TSS_distances.tab
PU1_UT_input_TSS.pdf
PU1_UT_input_TSS_5kbp.pdf
PU1_UT_input_3UTR_distances.tab
PU1_UT_input_3UTR.pdf
PU1_UT_input_3UTR_5kbp.pdf
 
# bigWig files
input.bigWig
PU1_UT.bigWig 
 
# tdf files
input.tdf
PU1_UT.tdf
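The listing above suggests a naming pattern: some outputs are named after a single sample, others after the pair treated_control. The sketch below uses that pattern to report which outputs are still missing from a working folder. The pattern is inferred from this single example and may not hold for every configuration; the function names are hypothetical.

```python
import os

# Suffixes observed in the listing above, one set per sample...
PER_SAMPLE = [".fastq", "_fastqc", ".SAM", ".bigWig", ".tdf"]
# ...and one set per comparison (prefix is treated_control).
PER_COMPARISON = [
    "_peaks.bed", "_peaks.xls", "_summits.bed", "_negative_peaks.xls",
    "_treat_afterfiting_all.wig.gz", "_control_afterfiting_all.wig.gz",
    "_model.r", ".gff", "_peaks.gin", "_peaks.merge", ".genomic_dist.pdf",
    "_TSS_distances.tab", "_TSS.pdf", "_TSS_5kbp.pdf",
    "_3UTR_distances.tab", "_3UTR.pdf", "_3UTR_5kbp.pdf",
]


def expected_outputs(sample_names, comparisons):
    """List the output names expected for the given samples and comparisons."""
    names = [s + suffix for s in sample_names for suffix in PER_SAMPLE]
    for treated, control in comparisons:
        prefix = "%s_%s" % (treated, control)
        names.extend(prefix + suffix for suffix in PER_COMPARISON)
    return names


def missing_outputs(working_dir, sample_names, comparisons):
    """Return the expected outputs that do not exist in `working_dir`."""
    return [name for name in expected_outputs(sample_names, comparisons)
            if not os.path.exists(os.path.join(working_dir, name))]
```

With the samples input and PU1_UT and the comparison (PU1_UT, input), expected_outputs returns the names of the result files and folders listed above, excluding the configuration file and the srf files.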


 

4. Pipeline workflow scheme

The following scheme summarizes the workflow followed by the pipeline for a single pairwise comparison.







5. Tools details


 
5.1 srf2fastq

Brief description: short read format (srf) files to fastq converter
Download from: http://sourceforge.net/projects/staden/files/io_lib/
Version tested: io_lib 1.12.4
Notes: -


5.2 SRA Toolkit

Brief description: short read archive format (sra) files to fastq converter
Download from: http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software
Version tested: 2.1.6
Notes: -


5.3 FastQC

Brief description: quality control
Download from: http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
Version tested: 0.9.0
Notes: -


5.4 Bowtie

Brief description: short reads aligner
Download from: http://bowtie-bio.sourceforge.net/
Version tested: 0.12.6
Notes: -


5.5 MACS


Brief description: peak calling tool
Download from: http://liulab.dfci.harvard.edu/MACS/
Version tested: 1.4.0beta (patched)
Notes: we patched the 1.4.0beta version of MACS to correct some errors affecting the reading of the input files. To install it properly, Python 2.6.6 or later must be present on your system.


5.6 GIN

Brief description: Gene Interval Notator annotates a list of genomic regions to a table of genomic features downloadable from the UCSC genome browser
Download from: included in the pipeline, original ref. [PMID 18945685]
Version tested: -
Notes: it requires Perl to be installed on your system


5.7 PIE_PEAKS_GD

Brief description: it draws pie charts according to the genomic annotations assigned by GIN
Download from: included in the pipeline
Version tested: -
Notes: it requires R and awk to be installed on your system
 

5.8 Merge

Brief description: -
Download from: included in the pipeline
Version tested: -
Notes: it requires Python to be installed on your system


5.9 TSS_3UTR_dist

Brief description: each region is re-annotated to the nearest TSS as well as to the nearest 3’ UTR; this information is then plotted in histograms
Download from: included in the pipeline
Version tested: -
Notes: it requires R and g++ compiler to be properly installed on your system.


5.10 wigToBigWig

Brief description: conversion of wiggle to BigWig files
Download from: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/
Version tested: downloaded December 7th, 2010
Notes: -

5.11 IGV Tools

Brief description: conversion of wiggle to tdf files
Download from: http://www.broadinstitute.org/igv/igvtools
Version tested: 2.0.9
Notes: -