Having a fasta file containing protein sequences like these two showing below, I would like to count how many times the amino acid A appears in each sequence.
>sp|P01920|DQB1_HUMAN HLA class II histocompatibility antigen, DQ beta 1 chain OS=Homo sapiens OX=9606 GN=HLA-DQB1 PE=1 SV=2
MSWKKALRIPGGLRAATVTLMLAMLSTPVAEGRDSPEDFVYQFKAMCYFTNGTERVRYVT
RYIYNREEYARFDSDVEVYRAVTPLGPPDAEYWNSQKEVLERTRAELDTVCRHNYQLELR
TTLQRRVEPTVTISPSRTEALNHHNLLVCSVTDFYPAQIKVRWFRNDQEETTGVVSTPLI
RNGDWTFQILVMLEMTPQHGDVYTCHVEHPSLQNPITVEWRAQSESAQSKMLSGIGGFVL
GLIFLGLGLIIHHRSQKGLLH
>sp|P18440|ARY1_HUMAN Arylamine N-acetyltransferase 1 OS=Homo sapiens OX=9606 GN=NAT1 PE=1 SV=2
MDIEAYLERIGYKKSRNKLDLETLTDILQHQIRAVPFENLNIHCGDAMDLGLEAIFDQVV
RRNRGGWCLQVNHLLYWALTTIGFETTMLGGYVYSTPAKKYSTGMIHLLLQVTIDGRNYI
VDAGFGRSYQMWQPLELISGKDQPQVPCVFRLTEENGFWYLDQIRREQYIPNEEFLHSDL
LEDSKYRKIYSFTLKPRTIEDFESMNTYLQTSPSSVFTSKSFCSLQTPDGVHCLVGFTLT
HRRFNYKDNTDLIEFKTLSEEEIEKVLKNIFNISLQRKLVPKHGDRFFTI
This code
library(seqinr)
data <- read.fasta(file = "yourlist.fasta", as.string = TRUE)
library(stringr)
ACount <- stri_count_regex("A",data)
gives the result showing on the picture.
Although the character A excists in both sequences they are not counted. Any ideas on why is this happening? Thank you for your interest.
There seem to be some mistakes on your code. Following your procedure, this worked fine by me:
library(seqinr)
data <- read.fasta(file = "yourlist.fasta", seqtype = "AA", as.string = TRUE, set.attributes = FALSE)
library(stringi)
ACount <- stri_count_regex(data, "A")
You have to specify with the seqtype
argument the type of sequence, being "DNA" the default. You have to change it to "AA" (protein).
The stri_count_regex
function is part of the stringi
base R package.
I get now:
> str(ACount)
int [1:2] 14 7