Re: [Question] Finding adverbs with regular expressions (banyhong, PTT Bulletin Board System)

Author: banyhong (=_=) 2015-12-08 22:34:28

※ Quoting yuseke (yuseke):
: as title
: I have been reading about regular expressions (RE) for the past two days.
: According to the Python Software Foundation's site:
: https://docs.python.org/2/library/re.html
: 7.2.5.7. Finding all Adverbs and their Positions
: If one wants more information about all matches of a pattern than the matched
: text, finditer() is useful as it provides instances of MatchObject instead of
: strings. Continuing with the previous example, if one was a writer who wanted
: to find all of the adverbs and their positions in some text, he or she would
: use finditer() in the following manner:
: ... print '%02d-%02d: %s' % (m.start(), m.end(), m.group(0))
: 07-16: carefully
: 40-47: quickly
: About the line "for m in re.finditer(r"\w+ly", text):"
: I have a question about this part.
: Not every adverb ends in "ly"...
: How should that case be handled?
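In natural language processing, a common approach is to build a probabilistic model:
count how often sequences of two or three consecutive POS (part-of-speech) tags occur.
Suppose that in the training data, the tag most likely to follow (AT, NN) is IN.
Then when you see (AT, NN, ??), you fill in IN.
Usually a trigram tagger is used, based on sequences of three POS tags.
If the test data contains a POS combination that was never seen, fall back to a bigram tagger.
If that lookup fails as well, just tag the word as NN.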
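The backoff chain described above (trigram, then bigram, then a default of NN) can be sketched in plain Python without NLTK. This is only a rough illustration, not how any library actually implements it: the three training sentences are made up, train_ngram and tag are hypothetical helper names, and the context used here is the previous n-1 tags plus the current word.

```python
from collections import Counter, defaultdict

# Tiny hand-made training set (hypothetical data, for illustration only).
TRAIN = [
    [("the", "AT"), ("dog", "NN"), ("in", "IN"), ("the", "AT"), ("park", "NN")],
    [("a", "AT"), ("cat", "NN"), ("on", "IN"), ("the", "AT"), ("mat", "NN")],
    [("the", "AT"), ("dog", "NN"), ("runs", "VBZ"), ("quickly", "RB")],
]


def make_context(history, word, n):
    """Context = (up to n-1 previous tags, current word)."""
    previous = tuple(history[-(n - 1):]) if n > 1 else ()
    return (previous, word)


def train_ngram(sents, n):
    """Map each context to the tag seen most often with it in training."""
    counts = defaultdict(Counter)
    for sent in sents:
        history = []
        for word, tag in sent:
            counts[make_context(history, word, n)][tag] += 1
            history.append(tag)
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}


def tag(words, models, default="NN"):
    """Try the highest-order model first, back off to lower ones, then default."""
    history, result = [], []
    for word in words:
        chosen = default
        for n, model in models:  # ordered from largest n to smallest
            context = make_context(history, word, n)
            if context in model:
                chosen = model[context]
                break
        history.append(chosen)
        result.append((word, chosen))
    return result


MODELS = [(3, train_ngram(TRAIN, 3)), (2, train_ngram(TRAIN, 2))]
print(tag("cat runs quickly".split(), MODELS))
```

In this toy run, "cat" is unseen at the start of a sentence and gets the default NN, "runs" is resolved by the bigram model, and "quickly" by the trigram model.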
Below is a simple tagger that uses the Brown corpus as training data.
You can also refer to the following page:
http://www.nltk.org/book/ch05.html
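import nltk
# The Brown corpus has to be available locally; if it is not, run nltk.download('brown') once.
train = nltk.corpus.brown.tagged_sents()
# The first tagged sentence, as a list of (word, tag) pairs
train[0]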
# [(u'The', u'AT'),
# (u'Fulton', u'NP-TL'),
# (u'County', u'NN-TL'),
# (u'Grand', u'JJ-TL'),
# (u'Jury', u'NN-TL'),
# (u'said', u'VBD'),
# (u'Friday', u'NR'),
# (u'an', u'AT'),
# (u'investigation', u'NN'),
# (u'of', u'IN'),
# (u"Atlanta's", u'NP$'),
# (u'recent', u'JJ'),
# (u'primary', u'NN'),
# (u'election', u'NN'),
# (u'produced', u'VBD'),
# (u'``', u'``'),
# (u'no', u'AT'),
# (u'evidence', u'NN'),
# (u"''", u"''"),
# (u'that', u'CS'),
# (u'any', u'DTI'),
# (u'irregularities', u'NNS'),
# (u'took', u'VBD'),
# (u'place', u'NN'),
# (u'.', u'.')]
# backoff is the tagger to fall back on when a combination was never seen in training
default_tagger = nltk.DefaultTagger('NN')
bi_tagger = nltk.BigramTagger(train, backoff=default_tagger)
tri_tagger = nltk.TrigramTagger(train, backoff=bi_tagger)
# Try out the trained tagger
tri_tagger.tag('John is a dog'.split())
# [('John', u'NP'), ('is', u'BEZ'), ('a', u'AT'), ('dog', 'NN')]
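To tie this back to the original question: once every word has a POS tag, adverbs can be picked out by tag instead of by the "ly" suffix. Here is a minimal sketch, assuming the adverb tags start with "RB" (RB, RBR, RBT in the Brown tagset). The sentence and its tags are written by hand for illustration and are not real tagger output, and find_adverbs is a hypothetical helper name.

```python
def find_adverbs(text, tagged, adverb_prefixes=("RB",)):
    """Return (start, end, word) for each token whose tag marks an adverb."""
    hits, cursor = [], 0
    for word, tag in tagged:
        start = text.index(word, cursor)  # locate the token in the original text
        end = start + len(word)
        cursor = end
        if tag.startswith(adverb_prefixes):
            hits.append((start, end, word))
    return hits


text = "He ran fast and spoke very softly"
# Hand-written tags (assumed, for illustration only).
tagged = [("He", "PPS"), ("ran", "VBD"), ("fast", "RB"), ("and", "CC"),
          ("spoke", "VBD"), ("very", "QL"), ("softly", "RB")]

for start, end, word in find_adverbs(text, tagged):
    print("%02d-%02d: %s" % (start, end, word))
# 07-11: fast
# 27-33: softly
```

"fast" is found here even though r"\w+ly" would miss it. If qualifiers such as "very" (QL) or wh-adverbs (WRB) should count too, pass adverb_prefixes=("RB", "QL", "WRB").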

Author: CaptainH (Cannon) 2015-12-09 03:06:00

Wouldn't a CRF or MaxEnt model work better for this? Or just use the Stanford NLP one directly.

Author: banyhong (=_=) 2015-12-09 12:23:00

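https://goo.gl/KC3Bak explains several different taggers.
In fact, an N-gram tagger can already reach 90% accuracy.
N-gram taggers also have the advantage of being easy to implement and intuitive.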
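For reference, an accuracy figure like this is usually the fraction of tokens whose predicted tag matches a hand-labeled gold tag. A rough sketch of that calculation follows, assuming token-level accuracy is what is meant. tagging_accuracy is a hypothetical helper and the two sentences are made-up data.

```python
def tagging_accuracy(gold_sents, predicted_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for gold, pred in zip(gold_sents, predicted_sents):
        for (_, gold_tag), (_, pred_tag) in zip(gold, pred):
            correct += (gold_tag == pred_tag)
            total += 1
    return float(correct) / total if total else 0.0


# Made-up example: one of the four tags is wrong.
gold = [[("John", "NP"), ("is", "BEZ"), ("a", "AT"), ("dog", "NN")]]
pred = [[("John", "NN"), ("is", "BEZ"), ("a", "AT"), ("dog", "NN")]]
print(tagging_accuracy(gold, pred))  # 0.75
```

The gold and predicted sentences should come from data the tagger was not trained on, otherwise the number will look better than it really is.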

Author: bibo9901 (function(){})() 2015-12-10 02:48:00

As I recall, English POS tagging has already reached 95%~97% accuracy.
