Package 'LncFinder' reference manual

Title:	LncRNA Identification and Analysis Using Heterologous Features
Description:	Long non-coding RNAs identification and analysis. Default models are trained with human, mouse and wheat datasets by employing SVM. Features are based on intrinsic composition of sequence, EIIP value (electron-ion interaction pseudopotential), and secondary structure. This package can also extract other classic features and build new classifiers. Reference: Han S., et al. (2019) <doi:10.1093/bib/bby065>.
Authors:	Siyu Han [aut, cre], Ying Li [aut], Yanchun Liang [aut]
Maintainer:	Siyu Han <[email protected]>
License:	GPL-3
Version:	1.1.6
Built:	2025-01-28 03:54:09 UTC
Source:	https://github.com/cran/LncFinder

Build Users' Own Model

Description

This function is used to build new models with users' own data.

Usage

build_model(
  lncRNA.seq,
  mRNA.seq,
  frequencies.file,
  SS.features = FALSE,
  lncRNA.format = "DNA",
  mRNA.format = "DNA",
  parallel.cores = 2,
  folds.num = 10,
  seed = 1,
  gamma.range = (2^seq(-5, 0, 1)),
  cost.range = c(1, 4, 8, 16, 24, 32),
  ...
)
build_model(
  lncRNA.seq,
  mRNA.seq,
  frequencies.file,
  SS.features = FALSE,
  lncRNA.format = "DNA",
  mRNA.format = "DNA",
  parallel.cores = 2,
  folds.num = 10,
  seed = 1,
  gamma.range = (2^seq(-5, 0, 1)),
  cost.range = c(1, 4, 8, 16, 24, 32),
  ...
)

Arguments

`lncRNA.seq`	Long non-coding sequences. Can be a FASTA file loaded by `seqinr-package` or secondary structure sequences file (Dot-Bracket Notation) obtained from function `run_RNAfold`. If `lncRNA.seq` is secondary structure sequences file, parameter `lncRNA.format` should be defined as `"SS"`.
`mRNA.seq`	mRNA sequences. FASTA file loaded by `read.fasta` or secondary structure sequences (Dot-Bracket Notation) obtained from function `run_RNAfold`. If `mRNA.seq` is secondary structure sequences file, parameter `mRNA.format` should be defined as `"SS"`.
`frequencies.file`	String or a list obtained from function `make_frequencies`. Input species name `"human"`, `"mouse"` or `"wheat"` to use pre-build frequencies files. Or assign a users' own frequencies file (Please refer to function `make_frequencies` for more information).
`SS.features`	Logical. If `SS.features = TRUE`, secondary structure features will be used to build the model. In this case, `lncRNA.seq` and `mRNA.seq` should be secondary structure sequences (Dot-Bracket Notation) obtained from function `run_RNAfold` and parameter `lncRNA.format` and `mRNA.format` should be set as `"SS"`.
`lncRNA.format`	String. Define the format of `lncRNA.seq`. `"DNA"` for DNA sequences and `"SS"` for secondary structure sequences. Only when both `mRNA.format` and `lncRNA.format` are set as `"SS"`, can the model with secondary structure features be built (`SS.features = TRUE`).
`mRNA.format`	String. Define the format of `mRNA.seq`. Can be `"DNA"` or `"SS"`. `"DNA"` for DNA sequences and `"SS"` for secondary structure sequences. When this parameter is defined as `"DNA"`, only the model without secondary structure features can be built. In this case, parameter `SS.features` should be set as `FALSE`.
`parallel.cores`	Integer. The number of cores for parallel computation. By default the number of cores is `2`, users can set as `-1` to run this function with all cores. During the process of svm tuning, if the number of `parallel.cores` is more than the `folds.num` (number of the folds for cross-validation), the number of `parallel.cores` will be set as `folds.num` automatically.
`folds.num`	Integer. Specify the number of folds for cross-validation. (Default: `10`)
`seed`	Integer. Used to set the seed for cross-validation. (Default: `1`)
`gamma.range`	The range of gamma. (Default: `2 ^ seq(-5, 0, 1)`)
`cost.range`	The range of cost. (Default: `c(1, 4, 8, 16, 24, 32)`)
`...`	Additional arguments passed to function `svm_tune` for customised SVM model training.

Details

This function is used to build a new model with users' own sequences. Users can use function lnc_finder to predict the sequences with new models.

For the details of frequencies.file, please refer to function make_frequencies.

For the details of the features, please refer to function extract_features.

For the details of svm tuning, please refer to function svm_tune.

Value

Returns a svm model.

References

Siyu Han, Yanchun Liang, Qin Ma, Yangyi Xu, Yu Zhang, Wei Du, Cankun Wang & Ying Li. LncFinder: an integrated platform for long non-coding RNA identification utilizing sequence intrinsic composition, structural information, and physicochemical property. Briefings in Bioinformatics, 2019, 20(6):2009-2027.

Author(s)

HAN Siyu

Examples

## Not run: 
data(demo_DNA.seq)
Seqs <- demo_DNA.seq

### Build the model with pre-build frequencies.file:
my_model <- build_model(lncRNA.seq = Seqs[1:5], mRNA.seq = Seqs[6:10],
                        frequencies.file = "human", SS.features = FALSE,
                        lncRNA.format = "DNA", mRNA.format = "DNA",
                        parallel.cores = 2, folds.num = 2, seed = 1,
                        gamma.range = (2 ^ seq(-5, -1, 2)),
                        cost.range = c(2, 6, 12, 20))

### Users can use default values of gamma.range and cost.range to find the
### best parameters.
### Use your own frequencies file by assigning frequencies list to parameter
### "frequencies.file".

## End(Not run)
## Not run: 
data(demo_DNA.seq)
Seqs <- demo_DNA.seq

### Build the model with pre-build frequencies.file:
my_model <- build_model(lncRNA.seq = Seqs[1:5], mRNA.seq = Seqs[6:10],
                        frequencies.file = "human", SS.features = FALSE,
                        lncRNA.format = "DNA", mRNA.format = "DNA",
                        parallel.cores = 2, folds.num = 2, seed = 1,
                        gamma.range = (2 ^ seq(-5, -1, 2)),
                        cost.range = c(2, 6, 12, 20))

### Users can use default values of gamma.range and cost.range to find the
### best parameters.
### Use your own frequencies file by assigning frequencies list to parameter
### "frequencies.file".

## End(Not run)

Extract the EIIP-derived features

Description

This function can extract EIIP-derived features proposed by Han et al (2018).

Usage

compute_EIIP(
  Sequences,
  label = NULL,
  spectrum.percent = 0.1,
  quantile.probs = seq(0, 1, 0.25)
)
compute_EIIP(
  Sequences,
  label = NULL,
  spectrum.percent = 0.1,
  quantile.probs = seq(0, 1, 0.25)
)

Arguments

`Sequences`	A FASTA file loaded by function `read.fasta` of `seqinr-package`.
`label`	Optional. String. Indicate the label of the sequences such as "NonCoding", "Coding".
`spectrum.percent`	Numeric specifying the percentage of the sorted power spectrum that be used to calculate the quantile-based features. For example, if `spectrum.percent = 0.1`, the top 10% percent of the sorted power spectrum will be used to compute the quantiles.
`quantile.probs`	Numeric. The probabilities with values in [0,1].

Details

The function compute_EIIP can extract EIIP (electron-ion interaction pseudo-potential) features including: signal at 1/3 position (Signal.Peak), average power (Average.Power), signal to noise ratio (SNR), and quantile-based features of one specified percentage of the sorted power spectrum (e.g. 0%, 20%, 40%, 60%, 70%, 100% when quantile.probs = seq(0, 1, 0.2) and spectrum.percent = 0.1).

In method LncFinder, EIIP features includes Signal.Peak, SNR, 0% (Signal.Min), 25% (Singal.Q1, 50% Signal.Q2), and 75% (Signal.Max) of the top 10% sorted power spectrum, i.e. quantile.prob = seq(0, 1, 0.25) and spectrum.percent = 0.1.

Value

A dataframe including the EIIP-derived features.

References

Please cite the following paper, which originally proposed EIIP:

Lalović, Dragutin, and Veljko Veljković. The global average DNA base composition of coding regions may be determined by the electron-ion interaction potential. Biosystems, 1990, 23(4):311-316.

Author(s)

HAN Siyu

Examples

data(demo_DNA.seq)
Seqs <- demo_DNA.seq

EIIP_res <- compute_EIIP(Seqs, label = "NonCoding", spectrum.percent = 0.25,
                         quantile.probs = seq(0, 1, 0.25))

data(demo_DNA.seq)
Seqs <- demo_DNA.seq

EIIP_res <- compute_EIIP(Seqs, label = "NonCoding", spectrum.percent = 0.25,
                         quantile.probs = seq(0, 1, 0.25))

Compute Euclidean Distance

Description

This function can compute Euclidean Distance proposed by method LncFinder (Han et al. 2018). Euclidean Distance can be calculated on full sequence or the longest ORF region. The step and k of the sliding window can also be customized.

Usage

compute_EucDistance(
  Sequences,
  label = NULL,
  referFreq,
  k = 6,
  step = 1,
  alphabet = c("a", "c", "g", "t"),
  on.ORF = FALSE,
  auto.full = FALSE,
  parallel.cores = 2
)
compute_EucDistance(
  Sequences,
  label = NULL,
  referFreq,
  k = 6,
  step = 1,
  alphabet = c("a", "c", "g", "t"),
  on.ORF = FALSE,
  auto.full = FALSE,
  parallel.cores = 2
)

Arguments

`Sequences`	A FASTA file loaded by function `read.fasta` of `seqinr-package`.
`label`	Optional. String. Indicate the label of the sequences such as "NonCoding", "Coding".
`referFreq`	a list obtained from function `make_referFreq`.
`k`	An integer that indicates the sliding window size. (Default: `6`)
`step`	Integer defaulting to `1` for the window step.
`alphabet`	A vector of single characters that specify the different character of the sequence. (Default: `alphabet = c("a", "c", "g", "t")`)
`on.ORF`	Logical. If `TRUE`, Euclidean Distance will be calculated on the longest ORF region. NOTE: If `TRUE`, the input has to be DNA sequences. (Default: `FALSE`)
`auto.full`	Logical. When `on.ORF = TRUE` but no ORF can be found, if `auto.full = TRUE`, Euclidean Distance will be calculated on full sequences automatically; if `auto.full` is `FALSE`, the sequences that have no ORF will be discarded. Ignored when `on.ORF = FALSE`. (Default: `FALSE`)
`parallel.cores`	Integer. The number of cores for parallel computation. By default the number of cores is `2`. Users can set as `-1` to run this function with all cores.

Details

This function can compute Euclidean Distance proposed by LncFinder (HAN et al. 2018). In LncFinder, two schemes are provided to calculate Euclidean Distance: 1) step = 3 and k = 6 on the longest ORF region; 2) step = 1 and k = 6 on full sequence. Using this function compute_EucDistance, both step, k, and calculated region (full sequence or ORF) can be customized to maximize its availability.

Value

A dataframe.

References

Author(s)

HAN Siyu

Examples

## Not run: 
Seqs <- seqinr::read.fasta(file =
"http://www.ncbi.nlm.nih.gov/WebSub/html/help/sample_files/nucleotide-sample.txt")

referFreq <- make_referFreq(cds.seq = Seqs, lncRNA.seq = Seqs, k = 6, step = 3,
                            alphabet = c("a", "c", "g", "t"), on.orf = TRUE,
                            ignore.illegal = TRUE)

data(demo_DNA.seq)
Sequences <- demo_DNA.seq

EucDistance <- compute_EucDistance(Sequences, label = "NonCoding", referFreq = referFreq,
                                   k = 6, step = 3, alphabet = c("a", "c", "g", "t"),
                                   on.ORF = TRUE, auto.full = TRUE, parallel.cores = 2)

## End(Not run)

## Not run: 
Seqs <- seqinr::read.fasta(file =
"http://www.ncbi.nlm.nih.gov/WebSub/html/help/sample_files/nucleotide-sample.txt")

referFreq <- make_referFreq(cds.seq = Seqs, lncRNA.seq = Seqs, k = 6, step = 3,
                            alphabet = c("a", "c", "g", "t"), on.orf = TRUE,
                            ignore.illegal = TRUE)

data(demo_DNA.seq)
Sequences <- demo_DNA.seq

EucDistance <- compute_EucDistance(Sequences, label = "NonCoding", referFreq = referFreq,
                                   k = 6, step = 3, alphabet = c("a", "c", "g", "t"),
                                   on.ORF = TRUE, auto.full = TRUE, parallel.cores = 2)

## End(Not run)

Compute Fickett TESTCODE Score

Description

This function can compute Fickett TESTCODE score of DNA sequences proposed by James W.Fickett (Fickett JW. 1982). Fickett TESTCODE score can be calculated on full sequence or the longest ORF region.

Usage

compute_FickettScore(
  Sequences,
  label = NULL,
  on.ORF = FALSE,
  auto.full = FALSE,
  parallel.cores = 2
)
compute_FickettScore(
  Sequences,
  label = NULL,
  on.ORF = FALSE,
  auto.full = FALSE,
  parallel.cores = 2
)

Arguments

`Sequences`	A FASTA file loaded by function `read.fasta` of `seqinr-package`.
`label`	Optional. String. Indicate the label of the sequences such as "NonCoding", "Coding".
`on.ORF`	Logical. If `TRUE`, Fickett TESTCODE score will be calculated on the longest ORF region.
`auto.full`	Logical. When `on.ORF = TRUE` but no ORF can be found, if `auto.full = TRUE`, Fickett TESTCODE score will be calculated on full sequences automatically; if `auto.full` is `FALSE`, the sequences that have no ORF will be discarded. Ignored when `on.ORF = FALSE`. (Default: `FALSE`)
`parallel.cores`	Integer. The number of cores for parallel computation. By default the number of cores is `2`. Users can set as `-1` to run this function with all cores.

Details

This function can compute Fickett TESTCODE score proposed by James W.Fickett (Fickett JW. 1982). Fickett TESTCODE score is selected as feature by method CPAT (Wang et al. 2013) and CPC2 (Kang et al. 2017). In CPAT, Fickett TESTCODE score is calculated on the longest ORF region, but CPC2 calculates the score on full sequence. This function compute_FickettScore improves the CPAT's code and is capable of computing the score on the longest ORF region as well as full sequence.

Value

A dataframe.

References

James W.Fickett. Recognition of protein coding regions in DNA sequences. Nucleic Acids Research, 1982, 10(17):5303-5318.

Liguo Wang, Hyun Jung Park, Surendra Dasari, Shengqin Wang, JeanPierre Kocher & Wei Li. CPAT: coding-potential assessment tool using an alignment-free logistic regression model. Nucleic Acids Research, 2013, 41(6):e74-e74.

Yu-Jian Kang, De-Chang Yang, Lei Kong, Mei Hou, Yu-Qi Meng, Liping Wei & Ge Gao. CPC2: a fast and accurate coding potential calculator based on sequence intrinsic features. Nucleic Acids Research, 2017, 45(W1):W12-W16.

Author(s)

HAN Siyu

Examples

## Not run: 
data(demo_DNA.seq)
Seqs <- demo_DNA.seq

FickettScore <- compute_FickettScore(Seqs, label = NULL, on.ORF = TRUE,
                                     auto.full = TRUE, parallel.cores = 2)

## End(Not run)
## Not run: 
data(demo_DNA.seq)
Seqs <- demo_DNA.seq

FickettScore <- compute_FickettScore(Seqs, label = NULL, on.ORF = TRUE,
                                     auto.full = TRUE, parallel.cores = 2)

## End(Not run)

Calculate GC content

Description

This function can GC content of the input sequences.

Usage

compute_GC(
  Sequences,
  label = NULL,
  on.ORF = FALSE,
  auto.full = FALSE,
  parallel.cores = 2
)
compute_GC(
  Sequences,
  label = NULL,
  on.ORF = FALSE,
  auto.full = FALSE,
  parallel.cores = 2
)

Arguments

`Sequences`	A FASTA file loaded by function `read.fasta` of `seqinr-package`.
`label`	Optional. String. Indicate the label of the sequences such as "NonCoding", "Coding".
`on.ORF`	Logical. If `TRUE`, GC content will be calculated on the longest ORF region.
`auto.full`	Logical. When `on.ORF = TRUE` but no ORF can be found, if `auto.full = TRUE`, GC content will be calculated on full sequences automatically; if `auto.full` is `FALSE`, the sequences that have no ORF will be discarded. Ignored when `on.ORF = FALSE`. (Default: `FALSE`)
`parallel.cores`	Integer. The number of cores for parallel computation. By default the number of cores is `2`. Users can set as `-1` to run this function with all cores.

Details

This function can basically compute GC content of DNA sequences: GC content = (nc + ng) / (na + nc + ng + nt). The function will ignored the ambiguous bases.

Value

A dataframe.

Author(s)

HAN Siyu

Examples

## Not run: 
data(demo_DNA.seq)
Seqs <- demo_DNA.seq

gcContent <- compute_GC(Seqs, label = "NonCoding",on.ORF = TRUE,
                        auto.full = TRUE, parallel.cores = 2)

## End(Not run)
## Not run: 
data(demo_DNA.seq)
Seqs <- demo_DNA.seq

gcContent <- compute_GC(Seqs, label = "NonCoding",on.ORF = TRUE,
                        auto.full = TRUE, parallel.cores = 2)

## End(Not run)

Compute Hexamer Score

Description

This function can compute hexamer score proposed by method CPAT (Wang et al. 2013). Hexamer score can be calculated on full sequence or the longest ORF region. The step and k of the sliding window can also be customized.

Usage

compute_hexamerScore(
  Sequences,
  label = NULL,
  referFreq,
  k = 6,
  step = 1,
  alphabet = c("a", "c", "g", "t"),
  on.ORF = FALSE,
  auto.full = FALSE,
  parallel.cores = 2
)
compute_hexamerScore(
  Sequences,
  label = NULL,
  referFreq,
  k = 6,
  step = 1,
  alphabet = c("a", "c", "g", "t"),
  on.ORF = FALSE,
  auto.full = FALSE,
  parallel.cores = 2
)

Arguments

`Sequences`	A FASTA file loaded by function `read.fasta` of `seqinr-package`.
`label`	Optional. String. Indicate the label of the sequences such as "NonCoding", "Coding".
`referFreq`	A list obtained from function `make_referFreq`.
`k`	An integer that indicates the sliding window size. (Default: `6`)
`step`	Integer defaulting to `1` for the window step.
`alphabet`	A vector of single characters that specify the different character of the sequence. (Default: `alphabet = c("a", "c", "g", "t")`)
`on.ORF`	Logical. If `TRUE`, hexamer score will be calculated on the longest ORF region. NOTE: If `TRUE`, the input has to be DNA sequences. (Default: `FALSE`)
`auto.full`	Logical. When `on.ORF = TRUE` but no ORF can be found, if `auto.full = TRUE`, hexamer score will be calculated on full sequences automatically; if `auto.full` is `FALSE`, the sequences that have no ORF will be discarded. Ignored when `on.ORF = FALSE`. (Default: `FALSE`)
`parallel.cores`	Integer. The number of cores for parallel computation. By default the number of cores is `2`. Users can set as `-1` to run this function with all cores.

Details

This function can compute hexamer score proposed by CPAT (Wang et al. 2013). In CPAT, hexamer score is calculated on the longest ORF region, and the step of the sliding window is 3 (i.e. step = 3). Hexamer means six adjoining bases, thus k = 6. But in function compute_hexamerScore, both step, k, and calculated region (full sequence or ORF) can be customized to maximize its availability.

Value

A dataframe.

References

Liguo Wang, Hyun Jung Park, Surendra Dasari, Shengqin Wang, JeanPierre Kocher, & Wei Li. CPAT: coding-potential assessment tool using an alignment-free logistic regression model. Nucleic Acids Research, 2013, 41(6):e74-e74.

Author(s)

HAN Siyu

Examples

## Not run: 
Seqs <- seqinr::read.fasta(file =
"http://www.ncbi.nlm.nih.gov/WebSub/html/help/sample_files/nucleotide-sample.txt")

referFreq <- make_referFreq(cds.seq = Seqs, lncRNA.seq = Seqs, k = 6, step = 1,
                            alphabet = c("a", "c", "g", "t"), on.orf = TRUE,
                            ignore.illegal = TRUE)

data(demo_DNA.seq)
Sequences <- demo_DNA.seq

hexamerScore <- compute_hexamerScore(Sequences, label = "NonCoding", referFreq = referFreq,
                                     k = 6, step = 1, alphabet = c("a", "c", "g", "t"),
                                     on.ORF = TRUE, auto.full = TRUE, parallel.cores = 2)

## End(Not run)

## Not run: 
Seqs <- seqinr::read.fasta(file =
"http://www.ncbi.nlm.nih.gov/WebSub/html/help/sample_files/nucleotide-sample.txt")

referFreq <- make_referFreq(cds.seq = Seqs, lncRNA.seq = Seqs, k = 6, step = 1,
                            alphabet = c("a", "c", "g", "t"), on.orf = TRUE,
                            ignore.illegal = TRUE)

data(demo_DNA.seq)
Sequences <- demo_DNA.seq

hexamerScore <- compute_hexamerScore(Sequences, label = "NonCoding", referFreq = referFreq,
                                     k = 6, step = 1, alphabet = c("a", "c", "g", "t"),
                                     on.ORF = TRUE, auto.full = TRUE, parallel.cores = 2)

## End(Not run)

Compute k-mer Features

Description

This function can calculate the k-mer frequencies of the sequences.

Usage

compute_kmer(
  Sequences,
  label = NULL,
  k = 1:5,
  step = 1,
  freq = TRUE,
  improved.mode = FALSE,
  alphabet = c("a", "c", "g", "t"),
  on.ORF = FALSE,
  auto.full = FALSE,
  parallel.cores = 2
)
compute_kmer(
  Sequences,
  label = NULL,
  k = 1:5,
  step = 1,
  freq = TRUE,
  improved.mode = FALSE,
  alphabet = c("a", "c", "g", "t"),
  on.ORF = FALSE,
  auto.full = FALSE,
  parallel.cores = 2
)

Arguments

`Sequences`	A FASTA file loaded by function `read.fasta` of `seqinr-package`.
`label`	Optional. String. Indicate the label of the sequences such as "NonCoding", "Coding".
`k`	An integer that indicates the sliding window size. (Default: `1:5`)
`step`	Integer defaulting to `1` for the window step.
`freq`	Logical. If TRUE, the frequencies of different patterns are returned instead of counts. (Default: `TRUE`)
`improved.mode`	Logical. If TRUE, the frequencies will be normalized using the method proposed by PLEK (Li et al. 2014). Ignored if `freq = FALSE`. (Default: `FALSE`)
`alphabet`	A vector of single characters that specify the different character of the sequence. (Default: `alphabet = c("a", "c", "g", "t")`)
`on.ORF`	Logical. If `TRUE`, the k-mer frequencies will be calculated on the longest ORF region. NOTE: If `TRUE`, the sequences have to be DNA. (Default: `FALSE`)
`auto.full`	Logical. When `on.ORF = TRUE` but no ORF can be found, if `auto.full = TRUE`, the k-mer frequencies will be calculated on the full sequence automatically; if `auto.full` is `FALSE`, the sequences that have no ORF will be discarded. Ignored when `on.ORF = FALSE`. (Default: `FALSE`)
`parallel.cores`	Integer. The number of cores for parallel computation. By default the number of cores is `2`. Users can set as `-1` to run this function with all cores.

Details

This function can extract k-mer features. k and step can be customized. The count (freq = FALSE) or frequencies (freq = TRUE) of different patterns can be returned. If freq = TRUE, improved.mode is available. The improved mode is proposed by method PLEK. (Ref: Li et al. 2014)

Value

A dataframe.

Author(s)

HAN Siyu

Examples

## Not run: 
data(demo_DNA.seq)
Seqs <- demo_DNA.seq

kmer_res1 <- compute_kmer(Seqs, k = 1:5, step = 1, freq = TRUE, improved.mode = FALSE)

kmer_res2 <- compute_kmer(Seqs, k = 1:5, step = 3, freq = TRUE,
                          improved.mode = TRUE, on.ORF = TRUE, auto.full = TRUE)

## End(Not run)

## Not run: 
data(demo_DNA.seq)
Seqs <- demo_DNA.seq

kmer_res1 <- compute_kmer(Seqs, k = 1:5, step = 1, freq = TRUE, improved.mode = FALSE)

kmer_res2 <- compute_kmer(Seqs, k = 1:5, step = 3, freq = TRUE,
                          improved.mode = TRUE, on.ORF = TRUE, auto.full = TRUE)

## End(Not run)

Compute Logarithm Distance

Description

This function can compute Logarithm Distance proposed by method LncFinder (Han et al. 2018). Logarithm Distance can be calculated on full sequence or the longest ORF region. The step and k of the sliding window can also be customized.

Usage

compute_LogDistance(
  Sequences,
  label = NULL,
  referFreq,
  k = 6,
  step = 1,
  alphabet = c("a", "c", "g", "t"),
  on.ORF = FALSE,
  auto.full = FALSE,
  parallel.cores = 2
)
compute_LogDistance(
  Sequences,
  label = NULL,
  referFreq,
  k = 6,
  step = 1,
  alphabet = c("a", "c", "g", "t"),
  on.ORF = FALSE,
  auto.full = FALSE,
  parallel.cores = 2
)

Arguments

`Sequences`	A FASTA file loaded by function `read.fasta` of `seqinr-package`.
`label`	Optional. String. Indicate the label of the sequences such as "NonCoding", "Coding".
`referFreq`	a list obtained from function `make_referFreq`.
`k`	An integer that indicates the sliding window size. (Default: `6`)
`step`	Integer defaulting to `1` for the window step.
`alphabet`	A vector of single characters that specify the different character of the sequence. (Default: `alphabet = c("a", "c", "g", "t")`)
`on.ORF`	Logical. If `TRUE`, Logarithm Distance will be calculated on the longest ORF region. NOTE: If `TRUE`, the input has to be DNA sequences. (Default: `FALSE`)
`auto.full`	Logical. When `on.ORF = TRUE` but no ORF can be found, if `auto.full = TRUE`, Logarithm Distance will be calculated on full sequences automatically; if `auto.full` is `FALSE`, the sequences that have no ORF will be discarded. Ignored when `on.ORF = FALSE`. (Default: `FALSE`)
`parallel.cores`	Integer. The number of cores for parallel computation. By default the number of cores is `2`. Users can set as `-1` to run this function with all cores.

Details

This function can compute Logarithm Distance proposed by LncFinder (HAN et al. 2018). In LncFinder, two schemes are provided to calculate Logarithm Distance: 1) step = 3 and k = 6 on the longest ORF region; 2) step = 1 and k = 6 on full sequence. Method LncFinder uses scheme 1 to extract Logarithm Distance features. Using this function compute_EucDistance, both step, k, and calculated region (full sequence or ORF) can be customized to maximize its availability.

Value

A dataframe.

References

Author(s)

HAN Siyu

Examples

## Not run: 
Seqs <- seqinr::read.fasta(file =
"http://www.ncbi.nlm.nih.gov/WebSub/html/help/sample_files/nucleotide-sample.txt")

referFreq <- make_referFreq(cds.seq = Seqs, lncRNA.seq = Seqs, k = 6, step = 3,
                            alphabet = c("a", "c", "g", "t"), on.orf = TRUE,
                            ignore.illegal = TRUE)

data(demo_DNA.seq)
Sequences <- demo_DNA.seq

LogDistance <- compute_LogDistance(Sequences, label = "NonCoding", referFreq = referFreq,
                                   k = 6, step = 3, alphabet = c("a", "c", "g", "t"),
                                   on.ORF = TRUE, auto.full = TRUE, parallel.cores = 2)

## End(Not run)

## Not run: 
Seqs <- seqinr::read.fasta(file =
"http://www.ncbi.nlm.nih.gov/WebSub/html/help/sample_files/nucleotide-sample.txt")

referFreq <- make_referFreq(cds.seq = Seqs, lncRNA.seq = Seqs, k = 6, step = 3,
                            alphabet = c("a", "c", "g", "t"), on.orf = TRUE,
                            ignore.illegal = TRUE)

data(demo_DNA.seq)
Sequences <- demo_DNA.seq

LogDistance <- compute_LogDistance(Sequences, label = "NonCoding", referFreq = referFreq,
                                   k = 6, step = 3, alphabet = c("a", "c", "g", "t"),
                                   on.ORF = TRUE, auto.full = TRUE, parallel.cores = 2)

## End(Not run)

Compute Theoretical Isoelectric Point

Description

This function is basically a wrapper for function computePI. This function translate DNA sequence into protein, and compute the theoretical isoelectric point (pI) of this protein.

Usage

compute_pI(
  Sequences,
  label = NULL,
  on.ORF = FALSE,
  auto.full = FALSE,
  ambiguous.base = FALSE,
  parallel.cores = 2
)
compute_pI(
  Sequences,
  label = NULL,
  on.ORF = FALSE,
  auto.full = FALSE,
  ambiguous.base = FALSE,
  parallel.cores = 2
)

Arguments

`Sequences`	A FASTA file loaded by function `read.fasta` of `seqinr-package`.
`label`	Optional. String. Indicate the label of the sequences such as "NonCoding", "Coding".
`on.ORF`	Logical. If `TRUE`, pI will be calculated on the longest ORF region. NOTE: If `TRUE`, the input has to be DNA sequences. (Default: `FALSE`)
`auto.full`	Logical. When `on.ORF = TRUE` but no ORF can be found, if `auto.full = TRUE`, pI will be calculated on full sequences automatically; if `auto.full` is `FALSE`, the sequences that have no ORF will be discarded. Ignored when `on.ORF = FALSE`. (Default: `FALSE`)
`ambiguous.base`	If `TRUE`, ambiguous bases are taken into account when translating DNA sequences into proteins.
`parallel.cores`	Integer. The number of cores for parallel computation. By default the number of cores is `2`. Users can set as `-1` to run this function with all cores.

Details

This function can compute the pI of DNA sequences. Method CPC2 (Kang et al. 2017) uses this feature to identify lncRNAs, and this feature is evaluated in the article LncFinder (Han et al. 2018).

Using this function, the theoretical pI can be computed on full sequence or the longest ORF region. In CPC2, pI is calculated on ORF region.

Value

A dataframe.

Author(s)

HAN Siyu

Examples

## Not run: 
data(demo_DNA.seq)
Sequences <- demo_DNA.seq

pI_res <- compute_pI(Sequences, on.ORF = TRUE, auto.full = FALSE, ambiguous.base = FALSE)

## End(Not run)
## Not run: 
data(demo_DNA.seq)
Sequences <- demo_DNA.seq

pI_res <- compute_pI(Sequences, on.ORF = TRUE, auto.full = FALSE, ambiguous.base = FALSE)

## End(Not run)

A demo of dataset

Description

This dataset contains the features of 20 lncRNA sequences and 20 protein-coding sequences.

Usage

data(demo_dataset)
data(demo_dataset)

Format

A data frame with 40 rows and 20 variables:

Label: the class of the sequences
ORF.Max.Len: the length of the longest ORF
ORF.Max.Cov: the coverage of the longest ORF
Seq.lnc.Dist: Log-Distance.lncRNA
Seq.pct.Dist: Log-Distance.protein-coding transcripts
Seq.Dist.Ratio: Distance-Ratio.sequence
Signal.Peak: Signal as 1/3 position
SNR: Signal to noise ratio
Signal.Min: the minimum value of the top 10% power spectrum
Signal.Q1: the quantile Q1 of the top 10% power spectrum
Signal.Q2: the quantile Q2 of the top 10% power spectrum
Signal.Max: the maximum value of the top 10% power spectrum
Dot_lnc.dist: Log-Distance.acguD.lncRNA
Dot_pct.dist: Log-Distance.acguD.protein-coding transcripts
Dot_Dist.Ratio: Distance-Ratio.acguD
SS.lnc.dist: Log-Distance.acgu-ACGU.lncRNA
SS.pct.dist: Log-Distance.acgu-ACGU.protein-coding transcripts
SS.Dist.Ratio: Distance-Ratio.acgu-ACGU
MFE: Minimum free energy
UP.PCT: Percentage of Unpair-Pair

Source

Sequences are selected from GENCODE.

A demo of DNA sequences

Description

This file contains 10 DNA sequences.

Usage

data(demo_DNA.seq)
data(demo_DNA.seq)

Format

A list contains 10 DNA sequences.

The sequences are loaded by function read.fasta.

Source

DNA sequences are selected from GENCODE.

A demo of secondary structure sequences

Description

This file contains 10 SS (Secondary Structure) sequences.

Usage

data(demo_SS.seq)
data(demo_SS.seq)

Format

A data frame with 3 rows and 10 variables:

The first row is RNA sequence; the second row is Dot-Bracket Notation of secondary structure sequences; the last row is minimum free energy (MFE).

Source

DNA sequences are selected from GENCODE. Secondary structure of each sequence is obtained from program "RNAfold".

Extract the Features

Description

This function can construct the dataset. This function is only used to extract the features, please use function build_model to build new models.

Usage

extract_features(
  Sequences,
  label = NULL,
  SS.features = FALSE,
  format = "DNA",
  frequencies.file = "human",
  parallel.cores = 2
)
extract_features(
  Sequences,
  label = NULL,
  SS.features = FALSE,
  format = "DNA",
  frequencies.file = "human",
  parallel.cores = 2
)

Arguments

`Sequences`	mRNA sequences or long non-coding sequences. Can be a FASTA file loaded by `seqinr-package` or secondary structure sequences (Dot-Bracket Notation) obtained from function `run_RNAfold`. If `Sequences` are secondary structure sequences file, parameter `format` should be defined as `"SS"`.
`label`	Optional. String. Indicate the label of the sequences such as "NonCoding", "Coding".
`SS.features`	Logical. If `SS.features = TRUE`, secondary structure features will be extracted. In this case, `Sequences` should be secondary structure sequences (Dot-Bracket Notation) obtained from function `run_RNAfold` and parameter `format` should be set as `"SS"`.
`format`	String. Can be `"DNA"` or `"SS"`. Define the format of `Sequences`. `"DNA"` for DNA sequences and `"SS"` for secondary structure sequences. This parameter must be set as `"SS"` when `SS.features = TURE`.
`frequencies.file`	String or a list obtained from function `make_frequencies`. Input species name `"human"`, `"mouse"` or `"wheat"` to use pre-build frequencies files. Or assign a users' own frequencies file (See function `make_frequencies`).
`parallel.cores`	Integer. The number of cores for parallel computation. By default the number of cores is `2`. Users can set as `-1` to run this function with all cores.

Details

This function extracts the features and constructs the dataset.

Considering that it is time consuming to obtain secondary structure sequences, users can build the model only with features of sequence and EIIP (SS.features = FALSE). When SS.features = TRUE, Sequences should be secondary structure sequences (Dot-Bracket Notation) obtained from function run_RNAfold and parameter format should be set as "SS".

Please note that:

Secondary structure features (SS.features) can improve the performance when the species of unevaluated sequences is identical to the species of the sequences that used to build the model.

However, if users are trying to predict sequences with the model trained on other species, SS.features as TRUE may lead to low accuracy.

Value

Returns a data.frame. 11 features when SS.features is FALSE, and 19 features when SS.features is TRUE.

Features

1. Features based on sequence:

The length and coverage of the longest ORF (ORF.Max.Len and ORF.Max.Cov);

Log-Distance.lncRNA (Seq.lnc.Dist);

Log-Distance.protein-coding transcripts (Seq.pct.Dist);

Distance-Ratio.sequence (Seq.Dist.Ratio).

2. Features based on EIIP (electron-ion interaction pseudopotential) value:

Signal at 1/3 position (Signal.Peak);

Signal to noise ratio (SNR);

the minimum value of the top 10% power spectrum (Signal.Min);

the quantile Q1 and Q2 of the top 10% power spectrum (Singal.Q1 and Signal.Q2)

the maximum value of the top 10% power spectrum (Signal.Max).

3. Features based on secondary structure sequence:

Log-Distance.acguD.lncRNA (Dot_lnc.dist);

Log-Distance.acguD.protein-coding transcripts (Dot_pct.dist);

Distance-Ratio.acguD (Dot_Dist.Ratio);

Log-Distance.acgu-ACGU.lncRNA (SS.lnc.dist);

Log-Distance.acgu-ACGU.protein-coding transcripts (SS.pct.dist);

Distance-Ratio.acgu-ACGU (SS.Dist.Ratio);

Minimum free energy (MFE);

Percentage of Unpair-Pair (UP.PCT)

References

Author(s)

HAN Siyu

Examples

## Not run: 
data(demo_DNA.seq)
Seqs <- demo_DNA.seq

### Extract features with pre-build frequencies.file:
my_features <- extract_features(Seqs, label = "Class.of.the.Sequences",
                                SS.features = FALSE, format = "DNA",
                                frequencies.file = "mouse",
                                parallel.cores = 2)

### Use your own frequencies file by assign frequencies list to parameter
### "frequencies.file".

## End(Not run)
## Not run: 
data(demo_DNA.seq)
Seqs <- demo_DNA.seq

### Extract features with pre-build frequencies.file:
my_features <- extract_features(Seqs, label = "Class.of.the.Sequences",
                                SS.features = FALSE, format = "DNA",
                                frequencies.file = "mouse",
                                parallel.cores = 2)

### Use your own frequencies file by assign frequencies list to parameter
### "frequencies.file".

## End(Not run)

Find ORFs

Description

This function can find all the ORFs in one sequence.

Usage

find_orfs(OneSeq, reverse.strand = FALSE, max.only = TRUE)
find_orfs(OneSeq, reverse.strand = FALSE, max.only = TRUE)

Arguments

`OneSeq`	Is one sequence. Can be a FASTA file read by package "seqinr" `seqinr-package` or just a string.
`reverse.strand`	Logical. Whether find ORF on the reverse strand. Default: `FALSE`
`max.only`	Logical. If `TRUE`, only the longest ORF will be returned. Default: `TRUE`

Details

This function can extract ORFs of one sequence. It returns ORF region, length and coverage of the longest ORF when max.only = TRUE or ORF region, start position, end position, length and coverage of all the ORFs when max.only = FALSE. Coverage is the the ratio of the ORF to transcript length. If reverse.strand = TRUE, ORF will also be found on reverse strand.

Value

If max.only = TRUE, the function returns a list which consists the ORF region (ORF.Max.Seq), length (ORF.Max.Len) and coverage (ORF.Max.Cov) of the longest ORF. If max.only = FALSE, the function returns a dataframe which consists all the ORF sequences.

Author(s)

HAN Siyu

Examples

### For one sequence:
OneSeq <- c("cccatgcccagctagtaagcttagcc")
orf.info_1 <- find_orfs(OneSeq, reverse.strand = TRUE, max.only = FALSE)

### For a FASTA file contains several sequences:
## Not run: 
### Use "read.fasta" function of package "seqinr" to read a FASTA file:
Seqs <- seqinr::read.fasta(file =
"http://www.ncbi.nlm.nih.gov/WebSub/html/help/sample_files/nucleotide-sample.txt")

## End(Not run)

### Or just try to use our data "demo_DNA.seq"
data(demo_DNA.seq)
Seqs <- demo_DNA.seq

### Use apply function to find the longest ORF:
orf.info_2 <- sapply(Seqs, find_orfs, reverse.strand = FALSE, max.only = FALSE)
### For one sequence:
OneSeq <- c("cccatgcccagctagtaagcttagcc")
orf.info_1 <- find_orfs(OneSeq, reverse.strand = TRUE, max.only = FALSE)

### For a FASTA file contains several sequences:
## Not run: 
### Use "read.fasta" function of package "seqinr" to read a FASTA file:
Seqs <- seqinr::read.fasta(file =
"http://www.ncbi.nlm.nih.gov/WebSub/html/help/sample_files/nucleotide-sample.txt")

## End(Not run)

### Or just try to use our data "demo_DNA.seq"
data(demo_DNA.seq)
Seqs <- demo_DNA.seq

### Use apply function to find the longest ORF:
orf.info_2 <- sapply(Seqs, find_orfs, reverse.strand = FALSE, max.only = FALSE)

Long Non-coding RNA Identification

Description

This function is used to predict sequences are non-coding transcripts or protein-coding transcripts.

Usage

lnc_finder(
  Sequences,
  SS.features = FALSE,
  format = "DNA",
  frequencies.file = "human",
  svm.model = "human",
  parallel.cores = 2
)
lnc_finder(
  Sequences,
  SS.features = FALSE,
  format = "DNA",
  frequencies.file = "human",
  svm.model = "human",
  parallel.cores = 2
)

Arguments

`Sequences`	Unevaluated sequences. Can be a FASTA file loaded by `seqinr-package` or secondary structure sequences (Dot-Bracket Notation) obtained from function `run_RNAfold`. If `Sequences` is secondary structure sequences file, parameter `format` should be defined as `"SS"`.
`SS.features`	Logical. If `SS.features = TRUE`, secondary structure features will be used.
`format`	String. Define the format of the `Sequences`. Can be `"DNA"` or `"SS"`. `"DNA"` for DNA sequences and `"SS"` for secondary structure sequences.
`frequencies.file`	String or a list obtained from function `make_frequencies`. Input species name `"human"`, `"mouse"` or `"wheat"` to use pre-build frequencies files. Or assign a users' own frequencies file (See function `make_frequencies`).
`svm.model`	String or a svm model obtained from function `build_model` or `svm_tune`. Input species name `"human"`, `"mouse"` or `"wheat"` to use pre-build models. Or assign a users' own model (See function `build_model`).
`parallel.cores`	Integer. The number of cores for parallel computation. By default the number of cores is `2`. Users can set as `-1` to run this function with all cores.

Details

Considering that it is time consuming to obtain secondary structure sequences, users can input nucleotide sequences and predict these sequences without secondary structure features (Set SS.features as FALSE).

Please note that:

SS.features can improve the performance when the species of unevaluated sequences is identical to the species of the sequences that used to build the model.

However, if users are trying to predict sequences with the model trained on other species, SS.features may lead to low accuracy.

For the details of frequencies.file, please refer to function make_frequencies.

For the details of the features, please refer to function extract_features.

Value

Returns a data.frame. Including the results of prediction (Pred); coding potential (Coding.Potential) and the features. For the details of the features, please refer to function extract_features.

References

Author(s)

HAN Siyu

Examples

## Not run: 
data(demo_DNA.seq)
Seqs <- demo_DNA.seq

### Input one sequence:
OneSeq <- Seqs[1]
result_1 <- lnc_finder(OneSeq, SS.features = FALSE, format = "DNA",
                       frequencies.file = "human", svm.model = "human",
                       parallel.cores = 2)

### Or several sequences:
data(demo_SS.seq)
Seqs <- demo_SS.seq
result_2 <- lnc_finder(Seqs, SS.features = TRUE, format = "SS",
                       frequencies.file = "mouse", svm.model = "mouse",
                       parallel.cores = 2)

### A complete work flow:
### Calculate second structure on Windows OS,
RNAfold.path <- '"E:/Program Files/ViennaRNA/RNAfold.exe"'
SS.seq <- run_RNAfold(Seqs, RNAfold.path = RNAfold.path, parallel.cores = 2)

### Predict the sequences with secondary structure features,
result_2 <- lnc_finder(SS.seq, SS.features = TRUE, format = "SS",
                       frequencies.file = "mouse", svm.model = "mouse",
                       parallel.cores = 2)

### Predict sequences with your own model by assigning a new svm.model and
### frequencies.file to parameters "svm.model" and "frequencies.file"

## End(Not run)
## Not run: 
data(demo_DNA.seq)
Seqs <- demo_DNA.seq

### Input one sequence:
OneSeq <- Seqs[1]
result_1 <- lnc_finder(OneSeq, SS.features = FALSE, format = "DNA",
                       frequencies.file = "human", svm.model = "human",
                       parallel.cores = 2)

### Or several sequences:
data(demo_SS.seq)
Seqs <- demo_SS.seq
result_2 <- lnc_finder(Seqs, SS.features = TRUE, format = "SS",
                       frequencies.file = "mouse", svm.model = "mouse",
                       parallel.cores = 2)

### A complete work flow:
### Calculate second structure on Windows OS,
RNAfold.path <- '"E:/Program Files/ViennaRNA/RNAfold.exe"'
SS.seq <- run_RNAfold(Seqs, RNAfold.path = RNAfold.path, parallel.cores = 2)

### Predict the sequences with secondary structure features,
result_2 <- lnc_finder(SS.seq, SS.features = TRUE, format = "SS",
                       frequencies.file = "mouse", svm.model = "mouse",
                       parallel.cores = 2)

### Predict sequences with your own model by assigning a new svm.model and
### frequencies.file to parameters "svm.model" and "frequencies.file"

## End(Not run)

Make the frequencies file for new classifier construction

Description

This function is used to calculate the frequencies of lncRNAs, CDs, and secondary structure sequences. The frequencies file can be used to build the classifier using function extract_features. Functions make_frequencies and extract_features are useful when users are trying to build their own model.

NOTE: Function make_frequencies makes the frequencies file for building the classifiers of LncFinder method. If users need to calculate Logarithm-Distance, Euclidean-Distance, and hexamer score, the frequencies file need to be computed using function make_referFreq.

Usage

make_frequencies(
  cds.seq,
  mRNA.seq,
  lncRNA.seq,
  SS.features = FALSE,
  cds.format = "DNA",
  lnc.format = "DNA",
  check.cds = TRUE,
  ignore.illegal = TRUE
)
make_frequencies(
  cds.seq,
  mRNA.seq,
  lncRNA.seq,
  SS.features = FALSE,
  cds.format = "DNA",
  lnc.format = "DNA",
  check.cds = TRUE,
  ignore.illegal = TRUE
)

Arguments

`cds.seq`	Coding sequences (mRNA without UTRs). Can be a FASTA file loaded by `seqinr-package` or secondary structure sequences (Dot-Bracket Notation) obtained form function `run_RNAfold`. CDs are used to calculate hexamer frequencies of nucleotide sequences,thus secondary structure is not needed. Parameter `cds.format` should be `"SS"` when input is secondary structure sequences. (See details for more information.)
`mRNA.seq`	mRNA sequences with Dot-Bracket Notation. The secondary structure sequences can be obtained from function `run_RNAfold`. mRNA sequences are used to calculate the frequencies of acgu-ACGU and a acguD (see details), thus, mRNA sequences are required only when `SS.features = TRUE`.
`lncRNA.seq`	Long non-coding RNA sequences. Can be a FASTA file loaded by `seqinr-package` or secondary structure sequences (Dot-Bracket Notation) obtained from function `run_RNAfold`. If `SS.features = TRUE`, `lncRNA.seq` must be RNA sequences with secondary structure sequences and parameter `lnc.format` should be defined as `"SS"`.
`SS.features`	Logical. If `SS.features = TRUE`, frequencies of secondary structure will also be calculated and the model can be built with secondary structure features. In this case, `mRNA.seq` and `lncRNA.seq` should be secondary structure sequences.
`cds.format`	String. Define the format of the sequences of `cds.seq`. Can be `"DNA"` or `"SS"`. `"DNA"` for DNA sequences and `"SS"` for secondary structure sequences.
`lnc.format`	String. Define the format of lncRNAs (`lncRNA.seq`). Can be `"DNA"` or `"SS"`. `"DNA"` for DNA sequences and `"SS"` for secondary structure sequences. This parameter must be defined as `"SS"` when `SS.features = TURE`.
`check.cds`	Logical. Incomplete CDs can lead to a false shift and a inaccurate hexamer frequencies. When `check.cds = TRUE`, hexamer frequencies will be calculated on the longest ORF. This parameter is strongly recommended to set as `TRUE` when mRNA is used as CDs.
`ignore.illegal`	Logical. If `TRUE`, the sequences with non-nucleotide characters (nucleotide characters: "a", "c", "g", "t") will be ignored when calculating hexamer frequencies.

Details

This function is used to make frequencies file for LncFinder method. This file is needed when users are trying to build their own model.

In order to achieve high accuracy, mRNA should not be regarded as CDs and assigned to parameter cds.seq. However, CDs of some species may be insufficient for calculating frequencies, and mRNAs can be regarded as CDs with parameter check.cds = TRUE. In this case, hexamer frequencies will be calculated on ORF region.

Considering that it is time consuming to obtain secondary structure sequences, users can only provide nucleotide sequences and build a model without secondary structure features (SS.features = FALSE). If users want to build a model with secondary structure features, parameter SS.features should be set as TRUE. At the same time, the format of the sequences of mRNA.seq and lnc.seq should be secondary structure sequences (Dot-Bracket Notation). Secondary structure sequences can be obtained by function run_RNAfold.

Please note that:

SS.features can improve the performance when the species of unevaluated sequences is identical to the species of the sequences that used to build the model.

However, if users are trying to predict sequences with the model trained on other species, SS.features may lead to low accuracy.

The frequencies file consists three groups: Hexamer Frequencies; acgu-ACGU Frequencies and acguD Frequencies.

Hexamer Frequencies are calculated on the original nucleotide sequences by employing k-mer scheme (k = 6), and the sliding window will slide 3 nt each step.

For any secondary structure sequences (Dot-Bracket Notation), if one position is a dot, the corresponding nucleotide of the RNA sequence will be replaced with character "D". acguD Frequencies are the k-mer frequencies (k = 4) calculated on this new sequences.

Similarly, for any secondary structure sequences (Dot-Bracket Notation), if one position is "(" or ")", the corresponding nucleotide of the RNA sequence will be replaced with upper case ("A", "C", "G", "U").

A brief example,

DNA Sequence: 5'- t a c a g t t a t g -3'

RNA Sequence: 5'- u a c a g u u a u g -3'

Dot-Bracket Sequence: 5'- . . . . ( ( ( ( ( ( -3'

acguD Sequence: { D, D, D, D, g, u, u, a, u, g }

acgu-ACGU Sequence: { u, a, c, a, G, U, U, A, U, G }

Value

Returns a list which consists the frequencies of protein-coding sequences and non-coding sequences.

References

Author(s)

HAN Siyu

Examples

### Only for examples:
data(demo_DNA.seq)
Seqs <- demo_DNA.seq

## Not run: 
### Obtain the secondary structure sequences (Windows OS):
RNAfold.path <- '"E:/Program Files/ViennaRNA/RNAfold.exe"'
SS.seq <- run_RNAfold(Seqs, RNAfold.path = RNAfold.path, parallel.cores = 2)

### Make frequencies file with secondary strucutre features,
my_file_1 <- make_frequencies(cds.seq = SS.seq, mRNA.seq = SS.seq,
                              lncRNA.seq = SS.seq, SS.features = TRUE,
                              cds.format = "SS", lnc.format = "SS",
                              check.cds = TRUE, ignore.illegal = FALSE)

## End(Not run)

### Make frequencies file without secondary strucutre features,
my_file_2 <- make_frequencies(cds.seq = Seqs, lncRNA.seq = Seqs,
                              SS.features = FALSE, cds.format = "DNA",
                              lnc.format = "DNA", check.cds = TRUE,
                              ignore.illegal = FALSE)

### The input of cds.seq and lncRNA.seq can also be secondary structure
### sequences when SS.features = FALSE, such as,
data(demp_SS.seq)
SS.seq <- demo_SS.seq
my_file_3 <- make_frequencies(cds.seq = SS.seq, lncRNA.seq = Seqs,
                              SS.features = FALSE, cds.format = "SS",
                              lnc.format = "DNA", check.cds = TRUE,
                              ignore.illegal = FALSE)
### Only for examples:
data(demo_DNA.seq)
Seqs <- demo_DNA.seq

## Not run: 
### Obtain the secondary structure sequences (Windows OS):
RNAfold.path <- '"E:/Program Files/ViennaRNA/RNAfold.exe"'
SS.seq <- run_RNAfold(Seqs, RNAfold.path = RNAfold.path, parallel.cores = 2)

### Make frequencies file with secondary strucutre features,
my_file_1 <- make_frequencies(cds.seq = SS.seq, mRNA.seq = SS.seq,
                              lncRNA.seq = SS.seq, SS.features = TRUE,
                              cds.format = "SS", lnc.format = "SS",
                              check.cds = TRUE, ignore.illegal = FALSE)

## End(Not run)

### Make frequencies file without secondary strucutre features,
my_file_2 <- make_frequencies(cds.seq = Seqs, lncRNA.seq = Seqs,
                              SS.features = FALSE, cds.format = "DNA",
                              lnc.format = "DNA", check.cds = TRUE,
                              ignore.illegal = FALSE)

### The input of cds.seq and lncRNA.seq can also be secondary structure
### sequences when SS.features = FALSE, such as,
data(demp_SS.seq)
SS.seq <- demo_SS.seq
my_file_3 <- make_frequencies(cds.seq = SS.seq, lncRNA.seq = Seqs,
                              SS.features = FALSE, cds.format = "SS",
                              lnc.format = "DNA", check.cds = TRUE,
                              ignore.illegal = FALSE)

Make Frequencies File for Log.Dist, Euc.Dist, and hexamer score

Description

This function is used to calculate the frequencies of lncRNAs and CDs. The Frequencies file can be used to calculate Logarithm-Distance (compute_LogDistance), Euclidean-Distance (compute_EucDistance), and hexamer score (compute_hexamerScore).

NOTE: If users need to make frequencies file to build new LncFinder classifier using function extract_features, please refer to function make_frequencies.

Usage

make_referFreq(
  cds.seq,
  lncRNA.seq,
  k = 6,
  step = 1,
  alphabet = c("a", "c", "g", "t"),
  on.orf = TRUE,
  ignore.illegal = TRUE
)
make_referFreq(
  cds.seq,
  lncRNA.seq,
  k = 6,
  step = 1,
  alphabet = c("a", "c", "g", "t"),
  on.orf = TRUE,
  ignore.illegal = TRUE
)

Arguments

`cds.seq`	Coding sequences (mRNA without UTRs). Can be a FASTA file loaded by `seqinr-package`.
`lncRNA.seq`	Long non-coding RNA sequences. Can be a FASTA file loaded by `seqinr-package`.
`k`	An integer that indicates the sliding window size. (Default: `6`)
`step`	Integer defaulting to `1` for the window step.
`alphabet`	A vector of single characters that specify the different character of the sequence. (Default: `alphabet = c("a", "c", "g", "t")`)
`on.orf`	Logical. Incomplete CDs can lead to a false shift and a inaccurate hexamer frequencies. When `on.orf = TRUE`, the frequencies will be calculated on the longest ORF. This parameter is strongly recommended to set as `TRUE` when mRNA is used as CDs. Only available when `alphabet = c("a", "c", "g", "t")`. (Default: `TRUE`)
`ignore.illegal`	Logical. If `TRUE`, the sequences with non-nucleotide characters (nucleotide characters: "a", "c", "g", "t") will be ignored when calculating the frequencies. Only available when `alphabet = c("a", "c", "g", "t")`. (Default: `TRUE`)

Details

This function is used to make frequencies file for the computation of Logarithm-Distance (compute_LogDistance), Euclidean-Distance (compute_EucDistance), and hexamer score (compute_hexamerScore).

In order to achieve high accuracy, mRNA should not be regarded as CDs and assigned to parameter cds.seq. However, CDs of some species may be insufficient for calculating frequencies. In that case, mRNAs can be regarded as CDs with parameter on.orf = TRUE, and the frequencies will be calculated on ORF region. If on.orf = TRUE, users can set step = 3 to simulate the translation process.

Value

Returns a list which consists the frequencies of protein-coding sequences and non-coding sequences.

References

Author(s)

HAN Siyu

Examples

## Not run: 
Seqs <- seqinr::read.fasta(file =
"http://www.ncbi.nlm.nih.gov/WebSub/html/help/sample_files/nucleotide-sample.txt")

referFreq <- make_referFreq(cds.seq = Seqs, lncRNA.seq = Seqs, k = 6, step = 1,
                            alphabet = c("a", "c", "g", "t"), on.orf = TRUE,
                            ignore.illegal = TRUE)

## End(Not run)

## Not run: 
Seqs <- seqinr::read.fasta(file =
"http://www.ncbi.nlm.nih.gov/WebSub/html/help/sample_files/nucleotide-sample.txt")

referFreq <- make_referFreq(cds.seq = Seqs, lncRNA.seq = Seqs, k = 6, step = 1,
                            alphabet = c("a", "c", "g", "t"), on.orf = TRUE,
                            ignore.illegal = TRUE)

## End(Not run)

Read Secondary Structure Information

Description

This function can read secondary structure information from your own file instead of obtaining from function run_RNAfold. This function will be useful if users have had secondary structure sequences (Dot-Bracket Notation).

Usage

read_SS(
  oneFile.loc,
  seqRNA.loc,
  seqSS.loc,
  separateFile = TRUE,
  withMFE = TRUE
)
read_SS(
  oneFile.loc,
  seqRNA.loc,
  seqSS.loc,
  separateFile = TRUE,
  withMFE = TRUE
)

Arguments

`oneFile.loc`	String. The location of your sequence file. This file should contains one (and only one) RNA sequence and its secondary structure sequence in Dot-Bracket Notation. This parameter needs to be defined only when `separateFile = FALSE`. See Details for more information.
`seqRNA.loc`	String. The location of your RNA sequences file (FASTA format). If your RNA sequences and secondary structure sequences are in two files, you need to define the locations of two files respectively. And the files with multiple sequences are supported for this option. This parameter needs to be defined only when `separateFile` is `TRUE`. Location of secondary structure sequences file is also needed (parameter `seqSS.loc`). See Details for more information.
`seqSS.loc`	String. The location of your secondary structure sequences file (FASTA format).
`separateFile`	Logical. Your RNA sequence(s) and secondary structure sequence(s) are in separate files? If `separateFile = FALSE`, your file should have one (and only one) RNA sequence and its secondary structure sequence. No limit when `separateFile = TRUE`.
`withMFE`	Logical. Whether MFE is provided at the end of secondary structure sequence. If `withMFE = TRUE`, MFE will be extracted. The format should be in accordance with the output format of RNAfold.

Details

When users want to predict sequences with secondary structure features, users may have had their own secondary structure sequences. With this function, users can read SS information from their files. Two kind of files are supported: RNA sequence and SS sequence in one file separateFile is FALSE or in separate files separateFile = TRUE.

separateFile = FALSE is used for secondary structure that obtained from some popular programs, such as RNAfold. In this case, the output file only contains one RNA sequence and its SS. Besides, this file only have two rows: RNA sequence and its SS sequences. Thus, this option is more favorable when the file only have one sequence and the sequence are in accordance with the output format of RNAfold.

If users obtained the SS sequence from experiments, RNA sequence and SS sequence may be in two files. In this case, users can select separateFile = TRUE. Two files should be in FASTA format and one file can have multiple sequences. The sequences in two files should have the same order. If your data are obtained from experiments or other sources, it is highly recommended that users should build new model with this data, since the SS sequences of pre-built model are obtained for RNAfold and may have many differences with experimental data.

Value

A dataframe. The first row is RNA sequence, the second row is Dot-Bracket Notation of secondary structure sequence, the third row is MFE (if MFE is provided).

Author(s)

HAN Siyu

Examples

## Not run: 
### Load sequence data
data("demo_DNA.seq")
Seqs <- demo_DNA.seq[1:4]
### Convert sequences from vector to string.
Seqs <- sapply(Seqs, seqinr::getSequence, as.string = TRUE)
### Write a fasta file.
seqinr::write.fasta(Seqs, names = names(Seqs), file.out = "tmp.RNA.fa", as.string = TRUE)

### For Windows system: (Your path of RNAfold.)
RNAfold.path <- '"E:/Program Files/ViennaRNA/RNAfold.exe"'
### Define the parameters of RNAfold. See documents of RNAfold for more information.
RNAfold.command <- paste(RNAfold.path, " --noPS -i tmp.RNA.fa -o output", sep = "")
### Run RNAfold and output four result files.
system(RNAfold.command)

### Read secondary structure information for one file.
result_1 <- read_SS(oneFile.loc = "output_ENST00000510062.1.fold",
                    separateFile = FALSE, withMFE = TRUE)
### Read secondary sturcture sequences for multiple files.
filePath <- dir(pattern = ".fold")
result_2 <- sapply(filePath, read_SS, separateFile = FALSE, withMFE = TRUE)
result_2 <- as.data.frame(result_2)

## End(Not run)

## Not run: 
### Load sequence data
data("demo_DNA.seq")
Seqs <- demo_DNA.seq[1:4]
### Convert sequences from vector to string.
Seqs <- sapply(Seqs, seqinr::getSequence, as.string = TRUE)
### Write a fasta file.
seqinr::write.fasta(Seqs, names = names(Seqs), file.out = "tmp.RNA.fa", as.string = TRUE)

### For Windows system: (Your path of RNAfold.)
RNAfold.path <- '"E:/Program Files/ViennaRNA/RNAfold.exe"'
### Define the parameters of RNAfold. See documents of RNAfold for more information.
RNAfold.command <- paste(RNAfold.path, " --noPS -i tmp.RNA.fa -o output", sep = "")
### Run RNAfold and output four result files.
system(RNAfold.command)

### Read secondary structure information for one file.
result_1 <- read_SS(oneFile.loc = "output_ENST00000510062.1.fold",
                    separateFile = FALSE, withMFE = TRUE)
### Read secondary sturcture sequences for multiple files.
filePath <- dir(pattern = ".fold")
result_2 <- sapply(filePath, read_SS, separateFile = FALSE, withMFE = TRUE)
result_2 <- as.data.frame(result_2)

## End(Not run)

Obtain the Secondary Structure Sequences Using RNAfold

Description

This function can compute secondary structure sequences. The tool "RNAfold" of software "ViennaRNA" is required for this function.

Usage

run_RNAfold(Sequences, RNAfold.path = "RNAfold", parallel.cores = 2)
run_RNAfold(Sequences, RNAfold.path = "RNAfold", parallel.cores = 2)

Arguments

`Sequences`	A FASTA file loaded by function `read.fasta` of `seqinr-package`.
`RNAfold.path`	String. Indicate the path of the program "RNAfold". By default is `"RNAfold"` for UNIX/Linux system. (See details.)
`parallel.cores`	Integer. The number of cores for parallel computation. By default the number of cores is `2`, users can set as `-1` to run this function with all cores.

Details

This function is used to compute secondary structure. The output of this function can be used in function make_frequencies, extract_features, build_model and lnc_finder when parameter SS.features is set as TRUE.

This function depends on the program "RNAfold" of software "ViennaRNA". (http://www.tbi.univie.ac.at/RNA/index.html)

Parameter RNAfold.path can be simply defined as "RNAfold" as default when the OS is UNIX/Linux. However, for some OS, such as Windows, users need to specify the RNAfold.path if the path of "RNAfold" haven't been added in environment variables.

This function can print the related information when the OS is UNIX/Linux, such as:

"25 of 100, length: 695 nt",

which means around 100 sequences are assigned to this node and the program is computing the 25th sequence. The length of this sequence is 695nt.

If users have their own SS data, users can use function read_SS to load them, instead of obtaining from RNAfold.

Value

Returns data.frame. The first row is RNA sequence; the second row is Dot-Bracket Notation of secondary structure sequences; the last row is minimum free energy (MFE).

Author(s)

HAN Siyu

Examples

## Not run: 
### For a FASTA file contains several sequences,
### Use "read.fasta" function of package "seqinr" to read a FASTA file:
Seqs <- read.fasta(file =
"http://www.ncbi.nlm.nih.gov/WebSub/html/help/sample_files/nucleotide-sample.txt")

### Or just try to use our data "demo_DNA.seq"
data(demo_DNA.seq)
Seqs <- demo_DNA.seq

### Windows:
RNAfold.path <- '"E:/Program Files/ViennaRNA/RNAfold.exe"'
SS.seq_1 <- run_RNAfold(Seqs[1:2], RNAfold.path = RNAfold.path, parallel.cores = 2)

### For UNIX/Linux, "RNAfold.path" can be just defined as "RNAfold" as default:
SS.seq_2 <- run_RNAfold(Seqs, RNAfold.path = "RNAfold", parallel.cores = 2)

## End(Not run)
## Not run: 
### For a FASTA file contains several sequences,
### Use "read.fasta" function of package "seqinr" to read a FASTA file:
Seqs <- read.fasta(file =
"http://www.ncbi.nlm.nih.gov/WebSub/html/help/sample_files/nucleotide-sample.txt")

### Or just try to use our data "demo_DNA.seq"
data(demo_DNA.seq)
Seqs <- demo_DNA.seq

### Windows:
RNAfold.path <- '"E:/Program Files/ViennaRNA/RNAfold.exe"'
SS.seq_1 <- run_RNAfold(Seqs[1:2], RNAfold.path = RNAfold.path, parallel.cores = 2)

### For UNIX/Linux, "RNAfold.path" can be just defined as "RNAfold" as default:
SS.seq_2 <- run_RNAfold(Seqs, RNAfold.path = "RNAfold", parallel.cores = 2)

## End(Not run)

k-fold Cross Validation for SVM

Description

This function conduct k-fold Cross Validation for SVM.

Usage

svm_cv(
  dataset,
  label.col = 1,
  positive.class = NULL,
  folds.num = 10,
  seed = 1,
  parallel.cores = 2,
  ...
)
svm_cv(
  dataset,
  label.col = 1,
  positive.class = NULL,
  folds.num = 10,
  seed = 1,
  parallel.cores = 2,
  ...
)

Arguments

`dataset`	The dataset obtained from function `extract_features`. Or datasets used to build the classifier.
`label.col`	integer specifying the column number of the label. (Default: `1`)
`positive.class`	Character. Indicate the positive class of the dataset. (Default: `NonCoding`) The value of this parameter should be identical to one of the classes of the response vectors.
`folds.num`	Integer. Specify the number of folds for cross-validation. (Default: `10`)
`seed`	Integer. Used to set the seed for cross-validation. (Default: `1`)
`parallel.cores`	Integer. The number of cores for parallel computation. By default the number of cores is `2`, users can set as `-1` to run this function with all cores. If the number of `parallel.cores` is more than the `folds.num` (number of the folds for cross-validation), the number of `parallel.cores` will be set as `folds.num` automatically.
`...`	additional parameters for function `svm`.

Details

During the model tuning, the performance of each combination of parameters will output. Sensitivity, Specificity, Accuracy, F-Measure and Kappa Value are used to evaluate the performances. The best gamma and cost (or best model) are selected based on Accuracy.

For the details of parameter gamma and cost, please refer to function svm of package "e1071".

For the details of metrics, please refer to function confusionMatrix of package "caret".

Value

Returns the optimal parameters when return.model = FALSE. Or returns the best model when return.model = TRUE.

Author(s)

HAN Siyu

Examples

## Not run: 
data(demo_dataset)
my_dataset <- demo_dataset

cv_res <- svm_cv(my_dataset, folds.num = 4, seed = 1,
                 parallel.core = 2, cost = 3, kernel = "radial", gamma = 0.5)

### Users can set return.model = TRUE to return the best model.

## End(Not run)
## Not run: 
data(demo_dataset)
my_dataset <- demo_dataset

cv_res <- svm_cv(my_dataset, folds.num = 4, seed = 1,
                 parallel.core = 2, cost = 3, kernel = "radial", gamma = 0.5)

### Users can set return.model = TRUE to return the best model.

## End(Not run)

Parameter Tuning of SVM

Description

This function conduct the parameter tuning of SVM. Parameters gamma and cost can be tuned using grid search.

Usage

svm_tune(
  dataset,
  label.col = 1,
  positive.class = "NonCoding",
  folds.num = 10,
  seed = 1,
  gamma.range = (2^seq(-5, 0, 1)),
  cost.range = c(1, 4, 8, 16, 24, 32),
  return.model = TRUE,
  parallel.cores = 2,
  ...
)
svm_tune(
  dataset,
  label.col = 1,
  positive.class = "NonCoding",
  folds.num = 10,
  seed = 1,
  gamma.range = (2^seq(-5, 0, 1)),
  cost.range = c(1, 4, 8, 16, 24, 32),
  return.model = TRUE,
  parallel.cores = 2,
  ...
)

Arguments

`dataset`	The dataset obtained from function `extract_features`. Or datasets used to build the classifier.
`label.col`	integer specifying the column number of the label. (Default: `1`)
`positive.class`	Character. Indicate the positive class of the dataset. (Default: `NonCoding`) The value of this parameter should be identical to one of the classes of the response vectors.
`folds.num`	Integer. Specify the number of folds for cross-validation. (Default: `10`)
`seed`	Integer. Used to set the seed for cross-validation. (Default: `1`)
`gamma.range`	The range of gamma. (Default: `2 ^ seq(-5, 0, 1)`)
`cost.range`	The range of cost. (Default: `c(1, 4, 8, 16, 24, 32)`)
`return.model`	Logical. If `TRUE`, the function will return the best model trained on the full dataset. If `FALSE`, this function will return the optimal parameters.
`parallel.cores`	Integer. The number of cores for parallel computation. By default the number of cores is `2`, users can set as `-1` to run this function with all cores. If the number of `parallel.cores` is more than the `folds.num` (number of the folds for cross-validation), the number of `parallel.cores` will be set as `folds.num` automatically.
`...`	Additional arguments for function `svm`, except `scale`, `probability`, `kernel`, `gamma` and `cost`.

Details

For the details of parameter gamma and cost, please refer to function svm of package "e1071".

For the details of metrics, please refer to function confusionMatrix of package "caret".

Value

Returns the optimal parameters when return.model = FALSE. Or returns the best model when return.model = TRUE.

Author(s)

HAN Siyu

Examples

## Not run: 
data(demo_DNA.seq)
Seqs <- demo_DNA.seq

positive_data <- extract_features(Seqs[1:5], label = "NonCoding",
                                  SS.features = FALSE, format = "DNA",
                                  frequencies.file = "human",
                                  parallel.cores = 2)

negative_data <- extract_features(Seqs[6:10], label = "Coding",
                                  SS.features = FALSE, format = "DNA",
                                  frequencies.file = "human",
                                  parallel.cores = 2)

my_dataset <- rbind(positive_data, negative_data)

### Or use our data "demo_dataset"
data(demo_dataset)
my_dataset <- demo_dataset

optimal_parameter <- svm_tune(my_dataset, positive.class = "NonCoding",
                              folds.num = 2, seed = 1,
                              gamma.range = (2 ^ seq(-5, 0, 2)),
                              cost.range = c(1, 8, 16),
                              return.model = FALSE, parallel.core = 2)

### Users can set return.model = TRUE to return the best model.

## End(Not run)
## Not run: 
data(demo_DNA.seq)
Seqs <- demo_DNA.seq

positive_data <- extract_features(Seqs[1:5], label = "NonCoding",
                                  SS.features = FALSE, format = "DNA",
                                  frequencies.file = "human",
                                  parallel.cores = 2)

negative_data <- extract_features(Seqs[6:10], label = "Coding",
                                  SS.features = FALSE, format = "DNA",
                                  frequencies.file = "human",
                                  parallel.cores = 2)

my_dataset <- rbind(positive_data, negative_data)

### Or use our data "demo_dataset"
data(demo_dataset)
my_dataset <- demo_dataset

optimal_parameter <- svm_tune(my_dataset, positive.class = "NonCoding",
                              folds.num = 2, seed = 1,
                              gamma.range = (2 ^ seq(-5, 0, 2)),
                              cost.range = c(1, 8, 16),
                              return.model = FALSE, parallel.core = 2)

### Users can set return.model = TRUE to return the best model.

## End(Not run)

Package 'LncFinder'

Help Index

Build Users' Own Model

Description

Usage

Arguments

Details

Value

References

Author(s)

See Also

Examples

Extract the EIIP-derived features

Description

Usage

Arguments

Details

Value

References

Author(s)

See Also

Examples

Compute Euclidean Distance

Description

Usage

Arguments

Details

Value

References

Author(s)

See Also

Examples

Compute Fickett TESTCODE Score

Description

Usage

Arguments

Details

Value

References

Author(s)

Examples

Calculate GC content

Description

Usage

Arguments

Details

Value

Author(s)

See Also

Examples

Compute Hexamer Score

Description

Usage

Arguments

Details

Value

References

Author(s)

See Also

Examples

Compute k-mer Features

Description

Usage

Arguments

Details

Value

Author(s)

Examples

Compute Logarithm Distance

Description

Usage

Arguments

Details

Value

References

Author(s)

See Also

Examples

Compute Theoretical Isoelectric Point

Description