Re: [Question] Regular expressions (regex) in R

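Author: cywhale (cywhale)   2016-04-30 23:51:48
※ Quoting celestialgod (天):
: ※ Quoting cywhale (cywhale):
: : [Question type]:
: : Programming help (I want to do something in R, but I don't know how to write it in R)
: : To keep only English letters at the start and end of a string, I wrote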
: : gsub("^[^a-zA-Z]+|[^a-zA-Z]+$", "", x)
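: : But if the string ends with "sp." or "spp.", I want to keep the "." so that the expression above does not strip it.
: : For example, "aaa bbb sp." should stay unchanged,
: : but in other cases the "." should be stripped, e.g. "aaa bbb22." -> "aaa bbb".
: : I tried constructs such as ?: and ?! but couldn't get it to work, so I'm asking everyone for help~~ Thanks~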
: str <- c("aaa bbb sp.", "aaa bbb sp2.")
: gsub("[^a-zA-Z]*([a-zA-Z. ]+).*", "\\1", str)
: ^ Keep the space inside [a-zA-Z. ], otherwise things break XD
: # [1] "aaa bbb sp." "aaa bbb sp"
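: I forgot to ask: can there be cases like "aa2 bb3 cc." that should become "aa bb cc."?
: If so, I'd suggest using regmatches to extract "aa", "bb", and "cc." first and then process them QQ
: Roughly like this (it may not cover every case):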
: str <- c("aaa bbb sp.", "aaa bbb sp2.", "aa2 bb3 cc.")
: sapply(regmatches(str, gregexpr("[a-zA-Z. ]+", str)), function(x){
: paste0(x[x != "."], collapse = "")
: })
: # [1] "aaa bbb sp." "aaa bbb sp" "aa bb cc."
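To see why the quoted snippet filters with x != ".", it may help to look at the intermediate pieces that regmatches returns. This is only a small sketch on two of the sample strings; the variable name pieces is made up for illustration.

```r
str <- c("aaa bbb sp2.", "aa2 bb3 cc.")

# Each list element holds the runs of letters, dots, and spaces found in one string.
pieces <- regmatches(str, gregexpr("[a-zA-Z. ]+", str))

pieces[[1]]  # "aaa bbb sp" "."   (the digit splits off a lone ".")
pieces[[2]]  # "aa" " bb" " cc."  (the dot stays attached to " cc")

# Dropping the lone "." pieces and gluing the rest back together:
cleaned <- sapply(pieces, function(x) paste0(x[x != "."], collapse = ""))
cleaned      # "aaa bbb sp" "aa bb cc."
```

A "." that directly follows a digit ends up as a piece of its own and is dropped, while a "." that directly follows letters stays attached to them.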
From the previous post (thanks, celestialgod), I learned about "\\1" and got some ideas,
so I experimented and came up with the following code.
The results are close to what I want, which is to simplify some scientific names collected
from the web. Their formats were a real mess. ><
After these trials, I learned a lot about handling regex... ^_^
# The pattern is split with paste0() only to fit the line width; it must not contain a newline.
gsub(paste0("^[^a-zA-Z]+|(?!\\.)[^a-zA-Z]+$|",
            "\\b((sp\\.)+$)|\\b((spp\\.)+$)|((\\w{0,})\\.+$)"), "\\2\\4\\6",
     c("33aaa sp.", "aaa sp.bb33", "aaasp.bb 33 de", "aaa w2sp.",
       "aaa www spp. ", "spp.", "bb.", "XXX sp. ",
       "YYY spp.()", "ZZZZ.."), perl = TRUE)
[1] "aaa sp." "aaa sp.bb" "aaasp.bb 33 de" "aaa w2sp" "aaa www spp."
[6] "spp." "bb" "XXX sp." "YYY spp." "ZZZZ"
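For comparison, here is a rough sketch of the same cleanup split into separate steps, which may be easier to read than one long alternation. The helper name clean_name is hypothetical and not from this thread. It gives the same results for the ten sample strings above, but it may behave differently on other inputs (for example, strings ending in "sp..").

```r
# Hypothetical helper (not from the thread): clean a character vector step by step.
clean_name <- function(x) {
  # 1. Drop leading non-letters.
  x <- sub("^[^a-zA-Z]+", "", x)
  # 2. Flag strings that end in a standalone "sp." or "spp." plus optional junk.
  keep_dot <- grepl("\\bspp?\\.[^a-zA-Z]*$", x, perl = TRUE)
  # 3. Keep the dot where flagged; otherwise strip all trailing non-letters.
  ifelse(keep_dot,
         sub("(\\bspp?\\.)[^a-zA-Z]*$", "\\1", x, perl = TRUE),
         sub("[^a-zA-Z]+$", "", x))
}

clean_name(c("33aaa sp.", "aaa w2sp.", "XXX sp. ", "ZZZZ.."))
# [1] "aaa sp."  "aaa w2sp" "XXX sp."  "ZZZZ"
```

The \\b in step 2 is what keeps "w2sp." from being treated as "sp.": there is no word boundary between "2" and "s".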
If you have any comments or find any bugs, just tell me! Thanks for the help~
Author: celestialgod (天)   2016-04-30 23:55:00
This regex is really ugly XDD
Author: cywhale (cywhale)   2016-05-01 00:01:00
haha.. really.. @@
