Basic NLP Business Areas
admin
2024-03-04 02:15:44

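1. Text correction (query correction): it can preprocess crawled news articles by removing typos, correct query terms in search, and drive smart correction in dialogue. Related: papers on Chinese text correction.

Two metrics: over-correction rate (FAR, i.e. the false alarm rate) and recall.

Over-correction rate: the share of correct sentences that get wrongly changed (FAR = number of correct sentences wrongly corrected / number of correct sentences). Recall: the share of erroneous sentences that are fully corrected. A high over-correction rate hurts both the system and the user experience, so the number of sentences fixed properly must far exceed the number of sentences broken. If the probability that a sentence contains an error is K, then K*RECALL >> (1-K)*FAR.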
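The two metrics and the inequality above can be checked with a few lines of arithmetic. This is only a minimal sketch: the function names and all the counts are hypothetical, chosen for illustration, and are not taken from any real evaluation.

```python
def over_correction_rate(correct_total, correct_wrongly_changed):
    """FAR: share of originally correct sentences that the corrector changed wrongly."""
    return correct_wrongly_changed / correct_total


def recall(error_total, error_fully_fixed):
    """Share of erroneous sentences that the corrector fixed completely."""
    return error_fully_fixed / error_total


# Hypothetical evaluation set: 9000 correct sentences, 1000 erroneous ones.
correct_total, error_total = 9000, 1000
far_value = over_correction_rate(correct_total, correct_wrongly_changed=45)
recall_value = recall(error_total, error_fully_fixed=700)

# K is the probability that an input sentence contains an error.
k = error_total / (correct_total + error_total)

fixed_share = k * recall_value        # share of all inputs that get fixed
broken_share = (1 - k) * far_value    # share of all inputs that get broken

print(f"FAR={far_value:.4f} recall={recall_value:.2f} K={k:.2f}")
print(f"fixed share={fixed_share:.4f} broken share={broken_share:.4f}")
print("worth deploying:", fixed_share > 10 * broken_share)  # '>>' read here as at least 10x
```

Reading ">>" as "at least ten times larger" is an arbitrary threshold for this sketch; the right margin depends on how costly a wrongly changed sentence is in the product.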

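github1. Having huggingface available does not mean a model can simply be used off the shelf. The key is to modify and design models on top of it and to fine-tune them; that is where skill and ability show.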

2. Masked language model

from transformers import pipeline
unmasker = pipeline('fill-mask', model='bert-base-uncased')
unmasker("what a [MASK] blog.")
[{'score': 0.32585206627845764, 'token': 2307, 'token_str': 'great', 'sequence': 'what a great blog.'}, {'score': 0.04082264378666878, 'token': 3376, 'token_str': 'beautiful', 'sequence': 'what a beautiful blog.'}, {'score': 0.040087465196847916, 'token': 8403, 'token_str': 'lovely', 'sequence': 'what a lovely blog.'}, {'score': 0.03804076090455055, 'token': 6919, 'token_str': 'wonderful', 'sequence': 'what a wonderful blog.'}, {'score': 0.028535649180412292, 'token': 2204, 'token_str': 'good', 'sequence': 'what a good blog.'}]

Getting a sentence embedding. The sentence-transformers library also exists for this, but the plain model works here too.

from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")
text = "what a great blog I have seen."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
output.pooler_output  # shape: (1, 768)
output.last_hidden_state.shape
torch.Size([1, 12, 768])

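3. Named entity recognition (NER)

First, tokenization, which is the basic operation. The model is the uncased one from above, and it tokenizes at the WordPiece level.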

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sentence = "what a great blog I have seen in China!"
tokens = tokenizer.tokenize(sentence)
>>> tokens
['what', 'a', 'great', 'blog', 'i', 'have', 'seen', 'in', 'china', '!']
# everything has been lowercased
ids=tokenizer.convert_tokens_to_ids(tokens)
>>> ids
[2054, 1037, 2307, 9927, 1045, 2031, 2464, 1999, 2859, 999]
strings=tokenizer.decode(ids)
>>> strings
'what a great blog i have seen in china!'

With the cased model the ids differ as well: the training data is different, so the vocabulary is different, and therefore the ids are different.

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
...
>>> tokens
['what', 'a', 'great', 'blog', 'I', 'have', 'seen', 'in', 'China', '!']
>>> ids
[1184, 170, 1632, 10679, 146, 1138, 1562, 1107, 1975, 106]
>>> strings
'what a great blog I have seen in China!'
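WordPiece segmentation can be illustrated without downloading a model. The sketch below is a simplified greedy longest-match-first splitter; the tiny vocabulary `TOY_VOCAB` and the helper names are hypothetical, and the punctuation splitting that a real BERT tokenizer performs first is left out.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Split one word into WordPiece units by greedy longest-match-first search."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def toy_tokenize(sentence, vocab, lowercase=True):
    """Whitespace-split a sentence, optionally lowercase it, then apply WordPiece."""
    if lowercase:  # mirrors what an 'uncased' model does
        sentence = sentence.lower()
    tokens = []
    for word in sentence.split():
        tokens.extend(wordpiece_split(word, vocab))
    return tokens


# Hypothetical toy vocabulary, far smaller than a real BERT vocabulary.
TOY_VOCAB = {"what", "a", "great", "blog", "xiao", "##ming", "##s"}

tokens = toy_tokenize("What a great blog Xiaoming", TOY_VOCAB)
print(tokens)
print(toy_tokenize("blogs China", TOY_VOCAB))
```

A word missing from the vocabulary is broken into known sub-pieces where possible and mapped to `[UNK]` otherwise, which is why a cased and an uncased vocabulary produce different tokens and ids for the same sentence.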

NER is entity classification (entity token classification).

Types of entities: location (LOC), organizations (ORG), person (PER) and Miscellaneous (MISC).

Locations, organizations, person names, and miscellaneous: these four are the focus of entity recognition, and they are further divided into:

O       Outside of a named entity
B-MISC  Beginning of a miscellaneous entity right after another miscellaneous entity
I-MISC  Miscellaneous entity
B-PER   Beginning of a person's name right after another person's name
I-PER   Person's name
B-ORG   Beginning of an organization right after another organization
I-ORG   Organization
B-LOC   Beginning of a location right after another location
I-LOC   Location

In short, a B prefix means Beginning, I marks the body of the entity (the I can be dropped), and O means nothing at all, a non-entity.

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is Xiaoming and I live in Beijing, I love you forver !"
ner_results = nlp(example)
print(ner_results)
[{'entity': 'B-PER', 'score': 0.9986733, 'index': 4, 'word': 'Xiao', 'start': 11, 'end': 15}, {'entity': 'I-PER', 'score': 0.9260187, 'index': 5, 'word': '##ming', 'start': 15, 'end': 19}, {'entity': 'B-LOC', 'score': 0.99963117, 'index': 10, 'word': 'Beijing', 'start': 34, 'end': 41}]
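The pipeline reports one prediction per WordPiece, so "Xiaoming" arrives as 'Xiao' plus '##ming'. One way to stitch the pieces back into whole entities is sketched below, using the output printed above as input data. The helper `merge_entities` is hypothetical and is not part of transformers, and keeping the lowest piece score per entity is an arbitrary choice made for this sketch.

```python
def merge_entities(ner_results, text):
    """Group token-level NER predictions into whole-entity spans."""
    spans = []
    for item in ner_results:
        prefix, _, label = item["entity"].partition("-")
        if not label:  # 'O' has no entity type, so skip it
            continue
        last = spans[-1] if spans else None
        is_continuation = (
            last is not None
            and last["label"] == label
            and item["start"] - last["end"] <= 1  # adjacent, or separated by one space
            and (prefix == "I" or item["word"].startswith("##"))
        )
        if is_continuation:
            last["end"] = item["end"]
            last["score"] = min(last["score"], item["score"])  # keep the weakest piece score
        else:
            spans.append({"label": label, "start": item["start"],
                          "end": item["end"], "score": item["score"]})
    # Recover the surface text from the character offsets.
    return [{"label": s["label"], "text": text[s["start"]:s["end"]], "score": s["score"]}
            for s in spans]


example = "My name is Xiaoming and I live in Beijing, I love you forver !"
ner_results = [
    {"entity": "B-PER", "score": 0.9986733, "index": 4, "word": "Xiao", "start": 11, "end": 15},
    {"entity": "I-PER", "score": 0.9260187, "index": 5, "word": "##ming", "start": 15, "end": 19},
    {"entity": "B-LOC", "score": 0.99963117, "index": 10, "word": "Beijing", "start": 34, "end": 41},
]

merged = merge_entities(ner_results, example)
for entity in merged:
    print(entity["label"], entity["text"], entity["score"])
```

The transformers pipeline can also do this grouping itself when it is created with the `aggregation_strategy` argument, for example `aggregation_strategy="simple"`; the manual version is shown only to make the B-/I- logic explicit.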

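Switching to the bert-base-cased model, every result is LABEL_0 or LABEL_1. That checkpoint has not been fine-tuned for NER, so the token classification head is newly initialized with two generic labels; this 0/1 classification is far too inaccurate to be useful.

There is also a more basic default model that can be used directly (if the text contains no entities, the result is []).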

from transformers import pipeline
ner_pipe = pipeline("ner")
sequence = """I'm not the best but the great in recommondation system at Beijing, now I'm looking forward to your kind reply about the offer, if you have any question about job or my work, please contact me without hesitation, I will give you answer in time. 
Any other problem you can join in the QQ group 277356808 !"""
>>> ner_pipe(sequence)
[{'entity': 'I-LOC', 'score': 0.99978286, 'index': 17, 'word': 'Beijing', 'start': 59, 'end': 66}]

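4. Sentiment analysis. It turns out to be nothing more than a classification problem: after tokenization, the words that carry positive or negative polarity are marked and classified. That is sentiment analysis, nothing more.

Stanford NLP 2013, NLTK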

from transformers import pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("I love you forver !")
>>> result
[{'label': 'POSITIVE', 'score': 0.9998611211776733}]

Slightly more involved code; the classifier also gains one more class, neutral.

from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer, AutoConfig
import numpy as np
from scipy.special import softmax
# Replace user mentions and links with placeholder tokens
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)
MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
config = AutoConfig.from_pretrained(MODEL)
# PT
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
#model.save_pretrained(MODEL)
text = "what a great blog I have ever seen in China!"
text = preprocess(text)
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
scores = output[0][0].detach().numpy()
scores = softmax(scores)
ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    l = config.id2label[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")
1) positive 0.9859
2) neutral 0.0116
3) negative 0.0025
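The ranking step is ordinary classification post-processing: the model emits one raw score (logit) per class, softmax turns the scores into probabilities, and sorting yields the ranked labels. Here is a dependency-free sketch of that step; the logits are hypothetical values picked for illustration, not real model output, and the label order in `ID2LABEL` is an assumption.

```python
import math


def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed label order for a three-class sentiment model.
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}

logits = [-2.0, -0.5, 4.0]  # hypothetical raw scores for one sentence
probs = softmax(logits)

# Class indices sorted from most to least probable.
ranking = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
for rank, idx in enumerate(ranking, start=1):
    print(f"{rank}) {ID2LABEL[idx]} {probs[idx]:.4f}")
```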

5. Text summarization; see reference 1.

# AutoModelWithLMHead is deprecated; AutoModelForSeq2SeqLM is the matching class for T5 models
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-base-finetuned-summarize-news")
model = AutoModelForSeq2SeqLM.from_pretrained("mrm8488/t5-base-finetuned-summarize-news")
def summarize(text, max_length=150):
    input_ids = tokenizer.encode(text, return_tensors="pt", add_special_tokens=True)
    generated_ids = model.generate(input_ids=input_ids, num_beams=2, max_length=max_length,  repetition_penalty=2.5, length_penalty=1.0, early_stopping=True)
    preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]
    return preds[0]
>>> summarize('After the sound and the fury, weeks of demonstrations and anguished calls for racial justice, the man whose death gave rise to an international movement, and whose last words — “I can’t breathe” — have been a rallying cry, will be laid to rest on Tuesday at a private funeral in Houston.George Floyd, who was 46, will then be buried in a grave next to his mother’s.The service, scheduled to begin at 11 a.m. at the Fountain of Praise church, comes after five days of public memorials in Minneapolis, North Carolina and Houston and two weeks after a Minneapolis police officer was caught on video pressing his knee into Mr. Floyd’s neck for nearly nine minutes before Mr. Floyd died. That officer, Derek Chauvin, has been charged with second-degree murder and second-degree manslaughter. His bail was set at $1.25 million in a court appearance on Monday. The outpouring of anger and outrage after Mr. Floyd’s death — and the speed at which protests spread from tense, chaotic demonstrations in the city where he died to an international movement from Rome to Rio de Janeiro — has reflected the depth of frustration borne of years of watching black people die at the hands of the police or vigilantes while calls for change went unmet.', 80)
"at a private funeral in Houston on Tuesday. Floyd, who was 46, died of multiple organ failure last month. A Minnesota police officer was caught on video pressing his knee into Mr. Floyd’s neck for nearly nine minutes before he died. A Minneapolis police officer has been charged with second-degree murder and manslaughter. Floyd's bail was set at $1"

May we meet again someday, and may you still remember the topics we once discussed.
