The main procedures using Juicer are in here, but we need to write some linux codes to modify the data format that satisfy the one in Juicer. Of course, the best way is just to directly learn shell coding, but if you only need some comments, these codes below might satisfy your needs. This post will be updated from time to time since there are a lot of codes useful in this procedure.

awk '{print 1, $1, $2, 0, 1, $3, $4, 1}' file > temp && mv temp file
awk '{if ($2 > $6){ print $5, $6, $7, $8, $1, $2, $3, $4}else {print}}' file > output
sort -k2,2d -k6,6d output > temp && mv temp output

“awk” is used to change the column of the dataset, and “sort” is used if the order of chromosomes are different. “rm” means remove the file. “mv” means copy the file to another one.

The first two lines are used because sometimes files obtained doesn’t contain strands and restriction site fragments.

The last three line is used because the same chromosomes should be in the same block, and the order of columns should be corresponding to what is shown in Juicer. The if condition is used because the first chromosome should be “smaller” than the second, that is chr1 should be before chr2.

zcat ~/*.gz > ~/merged.txt

#below, the first and second line is to select specific files to merge, and the third line is to aggregate bin-pairs.

#$9 is the cell type, $1 is the name of files.
#awk -F '\t' '{if($9 == "Neuron"){print $1}}' inputsummaryfile > outputlist

#xargs -d'\n' cat<outputlist >>merged_neuron.txt

awk -F '\t' '{a[$1"\t"$2"\t"$3"\t"$4"\t"]+= $5} END{for (i in a) print i, a[i]}' merged_neuron.txt > temp

rm -rf merged_neuron.txt
mv temp merged_neuron.txt

“zcat” can directly obtain gz files and concatenate them in one. It is only occasionally used in Hi-C processing, but in general, it is quite useful.

hicConvertFormat -m * --inputFormat * --outputFormat * -o * --resolutions *

hicConvertFormat is a useful tool in HiCExplorer where we can change the format of the contact matrix. In Hi-C analysis, there isn’t a universal format, which is quite problematic.

-m represents the input matrix, –inputFormat can be cool, hic, homer, h5, HiCPro, and –outputFormat can be mcool, ginteractions, cool, h5. -o is the output matrix, –resolutions can be 20000, 40000, 70000, 120000, 500000, etc. If you only have a .txt file, you can also use HiCPeaks to transform it to cooler format.

toCooler -O K562-MboI-parts.cool -d datasets --assembly hg38 --nproc 1

Extract specific files in tar file:

tar -xvf test.tar.gz -C ./test2/ --wildcards --no-anchored '*.hic'

Where -C means the output folder.

Do parallel running using shell:

N=20 #Number of processes
for i in `seq 1 100000`
do
  ((k=k%N)); ((k++==0)) && wait
  Rscript runsomething.R &
 done

Put data from one server to another (quite fast!):

rsync -av --progress --partial-dir=/home/sshen82 --files-from=filelist.txt / sshen82@submit3.chtc.wisc.edu:/home/sshen82/input/

In R, a fast way to aggregate the matrix by a certain vector is as: sapply(split.data.frame(m, v),colMeans), where m is the matrix, and v is the vector.

In windows, the step below is needed for some reason.

sed -i -e 's/\r$//' xxx.sh