tsRFun

Users should firstly pre-process the raw sequencing data, including trimming the adapters and low quality reads, and change the data into collapsed fasta format following these steps.

Firstly, users can use Trim_galore or some other software to trim the adapters and low quality reads.

Then, users should collapse the clean fastq reads to fasta reads. The conversion can be done by the following code in Linux.

The collapsed fasta reads are required as input format, and the reads should be named by unique tags (e.g. “>seq_1_x1”. The name includes three parts, first part is “>seq”, the second part “1” represent the unique id, the last “x1” represent the count number of read, and “_” acts as a connector in the tag).

tRFfinder supports only inputs in FASTA format. FASTA is a plain-text format for representing DNA, RNA or protein sequences. Every nucleotide or amino acid is represented by single-letter. For details on FASTA format, please see FASTA Format in Wikipedia.

For tRFfinder, only FASTA fromat for RNA sequences is supported. There are mainly two ways to input your sequence:

It might take some time for tRFfinder to predict the tRFs. Please wait for a minute. On the "Predict Results" page, there are several important components.

tsRTarget will show the potential canonical and non-canonical tsRNA-mRNA interactions based on CLIP-Seq and CLASH data.

For CLIP-Seq data, we clustered the unique CLIP reads overlapped in genome reference, and ranked them by peak high. The observed peak height can reflect the binding affinity and RNA abundance.

For CLASH/CLEAR data, we used duplex reads mapping for ultra-short RNA strands to multi references, and allow the 1-2 mismatches because the potential contaminants during the crosslinking.

tRFinCancer provides expression landscapes of tsRNAs across mutiple types of cancer tissues. The usage is desribed below.

tsRFunciton helps users predict the functions of tsRNAs by performing GO enrichment analysis on their potential targets. The enrichment analysis will be determined by using a hypergeometric test and FDR correction.

RPM

Different sources of samples could vary in size and depth of sequencing. In order to normalize the samples for direct comparisons, the RPM method is adopted:

where C is the sum of reads mapped onto one particular position in the tRF, and \( N \) is the sum of reads having been mapped onto the tRNA genes.

p-value

To perform the screening, it is first assumed that fragmented RNAs are randomly distributed along the entire pre-tRNA; that is, the RNAs are uniformly distributed across the entire length of pre-tRNA. According to this assumption, we could conclude that, of the entire length of tRNA, the probability of one small-RNA fragment mapped onto one particular position in the tRNA is

where L is the length (in nucleotide, nt) of the tRNA, and l is the length (in nucleotide, nt) of the small RNA fragment being mapped onto the tRNA.

Therefore, the probability of more than k (inclusive) small RNA fragments mapped onto the same position in the tRNA follows the Binomial distribution, and the probability for this event is

where k is the observed counts of small RNA fragments mapped onto that particular position in the tRNA, and n is the total number of fragments mapped onto the entire tRNA.

If there are more than k (inclusive) small RNA fragments mapped onto one particular position in the tRNA, but the probability of this event occurring by chance (Eq. 2) is less than 1% (referred to the p-value, and can be adjusted to your satisfaction), then we could conclude that the assumption above is false (with 99% confidence, by default); i.e., this event does not occur by chance.

Note, however, that generally tRFs is of more than 16 nt in length. To take this fact into consideration, we should ensure that the tRFs candidate matches with tRNA sequence for consecutive more than 16 nt. (Mismatches, indels are allowed.) This, in turn, corresponds to the requirements that there are at least 16 nt bases in consecutive position in tRNA (which matches with the candidate tRFs sequences) should have a p-values less than 1%. (Figure 1)

Figure 1. Schematic demonstration of how the core algorithm works.

n is the sum of the reads mapped to the tRNA, k is the sum of the reads that are mapped to the particular position in the tRNA. l is the length of the reads, and L is the length of the tRNA. By default, a tRF candidate corresponds to more than 16 contiguous nucleotides with p-value(s) less than 0.01.

AlignScore

tRFfinder use a scoring scheme to handle different types of mismatches /indels. When the reads are mapped to the tRNA gene sequences and a tRF region is obtained, tRFfinder scores each site of the tRF region according to the following rules (Table 1)

Table 1. The scoring scheme

Cases on this site of a read	Score for this site of a read
Cases on this site of a read	Score for this site of a read
Perfect match	+1
mismatch or indel of modification sites	-0.5
mismatch or indel on other sites	-1

After each site of the tRF region is scored, tRFfinder sums up the scores to get a total score. To eliminate the effect of length on the total score (the longer the region, usually the higher the total score), the total score is divided by the length of the tRF region to get the alignment score for this region. By default, tRFfinder outputs only regions with alignment scores greater than 100 (this threshold can be set by users in the parameter option lists).

Figure 2. An example of the scoring scheme. The dotted line indicates the potential chemical modification site. When the reads are mapped to the tRNA gene sequences, each site of the tRF region is scored based on the mapped reads. If a mismatch or indel occurs on this site within some reads, a score of -0.5 per read is added to the total score of this site; if a perfect match occurs on this site within other reads, a score of +1 per read is added to the total score.

Manual Panel

Pre-process your data

RPM, P-Value, and AlignScore

tsRFun Manual