N-gram: a simple and effective statistical language model

Contents:

1 N-gram statistical language model
2 Supplement

1 N-gram statistical language model

In practice n usually takes a value from 1 to 3; the corresponding models are called unigram, bigram, and trigram.

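Worked example: suppose the training corpus consists of the following three documents:

D1: John read Moby Dick
D2: Mary read a different book
D3: She read a book by Cher

Using a bigram model, what is the approximate probability of the sentence "John read a book"?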

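Bigram formula: P(s1, s2, s3, ...) = P(s1) * P(s2|s1) * P(s3|s2) * ...
Here P(s1) is the probability that s1 begins a sentence, and a final factor P(end | last word) accounts for the sentence ending after the last word.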
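
Each factor in this formula is typically estimated from the corpus by relative frequency (the maximum-likelihood estimate). A sketch of that estimate is given here, assuming boundary markers s_0 (sentence start) and s_{n+1} (sentence end); these markers and the count(...) notation are introduced for illustration and do not appear in the original text.

```latex
P(s_1, \dots, s_n) \approx \prod_{i=1}^{n+1} P(s_i \mid s_{i-1}),
\qquad
P(s_i \mid s_{i-1}) = \frac{\operatorname{count}(s_{i-1}\, s_i)}{\operatorname{count}(s_{i-1})}
```

With this convention, P(s_1 | s_0) is the probability that s_1 begins a sentence, and P(s_{n+1} | s_n) is the probability that the sentence ends after s_n.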
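Answer:

Probability that John begins a document: P(John) = 1/3 (one of the three documents starts with John)
P(read | John) = 1
P(a | read) = 2/3
P(book | a) = 1/2
P(end | book) = 1/2, because book occurs twice and one of those occurrences is at the end of a sentence
P("John read a book") = 1/3 * 1 * 2/3 * 1/2 * 1/2 = 1/18 ≈ 0.06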
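
The calculation above can be checked with a short script. This is a minimal sketch, not a production language model: it assumes whitespace tokenization, unsmoothed relative-frequency estimates, and hypothetical boundary markers `<s>` and `</s>`; the function names `train_bigram` and `sentence_probability` are made up for this illustration.

```python
from collections import Counter
from fractions import Fraction

# Training corpus from the worked example (punctuation dropped).
corpus = [
    "John read Moby Dick",
    "Mary read a different book",
    "She read a book by Cher",
]

START, END = "<s>", "</s>"  # hypothetical sentence-boundary markers


def train_bigram(sentences):
    """Count context words and adjacent word pairs, including boundaries."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = [START] + sentence.split() + [END]
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts


def sentence_probability(sentence, context_counts, bigram_counts):
    """Multiply the relative-frequency bigram estimates along the sentence."""
    tokens = [START] + sentence.split() + [END]
    prob = Fraction(1)
    for prev, cur in zip(tokens, tokens[1:]):
        if context_counts[prev] == 0:
            return Fraction(0)  # unseen context; this sketch has no smoothing
        prob *= Fraction(bigram_counts[(prev, cur)], context_counts[prev])
    return prob


context_counts, bigram_counts = train_bigram(corpus)
p = sentence_probability("John read a book", context_counts, bigram_counts)
print(p)  # 1/18
```

Using `Fraction` keeps the result exact, so it can be compared directly with the hand-computed 1/18. Any sentence containing a word pair that never occurs in the corpus gets probability 0 here, which is why practical models add smoothing.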

2 Supplement

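unigram, bigram, and trigram are terms from natural language processing (NLP).

unigram: a single word
bigram: two consecutive words
trigram: three consecutive words

For example, splitting 西安交通大学 (Xi'an Jiaotong University) with each character treated as one unit:

unigram form: 西/安/交/通/大/学
bigram form: 西安/安交/交通/通大/大学
trigram form: 西安交/安交通/交通大/通大学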
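
The three splits above follow one sliding-window rule, which can be sketched in a few lines. This assumes each character counts as one unit, as in the example; the helper name `char_ngrams` is hypothetical.

```python
def char_ngrams(text, n):
    """Return every contiguous run of n characters in text, left to right."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


phrase = "西安交通大学"
for n, name in [(1, "unigram"), (2, "bigram"), (3, "trigram")]:
    print(name, "/".join(char_ngrams(phrase, n)))
```

A string of length L yields L - n + 1 n-grams, so the six-character phrase gives 6 unigrams, 5 bigrams, and 4 trigrams. For word-level n-grams the same slicing can be applied to a list of words instead of a string.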