Introduction
One of the most useful applications of NLP technology is information extraction from unstructured text (contracts, financial documents, healthcare records, etc.), which enables automated data querying for new insights. Traditionally, named entity recognition (NER) has been widely used to identify entities in text and store the data for advanced querying and filtering. However, if we want to understand unstructured text semantically, NER alone is not enough, since we don't know how the entities relate to each other. Performing joint NER and relation extraction opens up a whole new way of retrieving information through a knowledge graph, where you can navigate across nodes to discover hidden relationships. Performing these tasks jointly is therefore beneficial.
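As a toy illustration of the idea (the relation names mirror this tutorial's schema, but the data below is made up), text reduced to (head, relation, tail) triples can be queried like a small knowledge graph rather than only filtered on isolated entities:

```python
# Hypothetical triples as they might come out of a joint NER + relation
# extraction pipeline; with NER alone we would only have the nodes.
triples = [
    ("2+ years", "EXPERIENCE_IN", "Java"),
    ("Bachelor", "DEGREE_IN", "Computer Science"),
]

def query(relation):
    """Return all (head, tail) pairs connected by the given relation."""
    return [(h, t) for h, rel, t in triples if rel == relation]

print(query("EXPERIENCE_IN"))  # [('2+ years', 'Java')]
```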
Building on my previous article, in which we fine-tuned a BERT model for NER using spaCy 3, we will now add relation extraction to the pipeline using spaCy's new Thinc library. We train the relation extraction model following the steps outlined in the spaCy documentation. We will compare the performance of the relation classifier using transformers against the tok2vec algorithm. Finally, we will test the model on a job description found online.
Relation Classification
At its core, a relation extraction model is a classifier that predicts a relation r for a given entity pair {e1, e2}. In the case of transformers, this classifier is added on top of the output hidden states.
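The idea can be sketched in plain NumPy (an illustration of the concept, not spaCy's actual architecture; the weights and dimensions here are placeholders): pool the transformer hidden states for each entity, concatenate the pair, and apply a linear layer with a sigmoid so each relation label gets an independent probability:

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, LABELS = 768, 2  # hidden-state width; labels: EXPERIENCE_IN, DEGREE_IN

W = rng.normal(size=(2 * HIDDEN, LABELS)) * 0.01  # hypothetical learned weights
b = np.zeros(LABELS)

def score_pair(e1_vec, e2_vec):
    """Score one candidate entity pair {e1, e2}: probability per relation label."""
    pair = np.concatenate([e1_vec, e2_vec])   # (2 * HIDDEN,)
    logits = pair @ W + b
    return 1.0 / (1.0 + np.exp(-logits))      # sigmoid: labels are not mutually exclusive

e1 = rng.normal(size=HIDDEN)  # stand-in for the pooled hidden state of entity 1
e2 = rng.normal(size=HIDDEN)  # stand-in for the pooled hidden state of entity 2
probs = score_pair(e1, e2)
print(probs.shape)  # (2,)
```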
The pre-trained model we are going to fine-tune is the roberta-base model, but you can use any pre-trained model available in the Hugging Face library by simply entering its name in the config file (see below).
In this tutorial, we will extract the relation between the two entities {Experience, Skills} as Experience_in and between {Diploma, Diploma_major} as degree_in. The goal is to extract the years of experience required for specific skills and the diploma major associated with the required diploma. You can, of course, train your own relation classifier for your own use case, such as finding the cause/effect of symptoms in health records or company acquisitions in financial documents. The possibilities are limitless...
In this tutorial, we will only cover the entity relation extraction part. For fine-tuning BERT NER with spaCy 3, please refer to my previous article.
Data Annotation
Here we use the UBIAI text annotation tool to perform joint entity and relation annotation, since its versatile interface allows us to switch easily between entity and relation annotation (see below):
UBIAI's joint entity and relation annotation interface. For this tutorial, I annotated only about 100 documents containing entities and relations. For production, we would certainly need more annotated data.
Data Preparation
Before training the model, we need to convert the annotated data into binary spaCy files. We first split the annotation generated by UBIAI into training/dev/test sets and save them separately. We modify the code provided in spaCy's tutorial repository to create the binary files for our own annotation (conversion code).
We repeat this step for the training, dev, and test datasets to generate the three binary spaCy files (files available on GitHub).
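The split itself is straightforward. As a minimal sketch (the 70/15/15 ratio, the fixed seed, and the toy documents below are placeholders, not values from this tutorial):

```python
import random

# Stand-in for the real annotated export from UBIAI.
docs = [{"id": i, "text": f"doc {i}"} for i in range(100)]

# Shuffle deterministically, then partition into train/dev/test.
random.Random(42).shuffle(docs)
n_train = int(0.7 * len(docs))
n_dev = int(0.15 * len(docs))
train = docs[:n_train]
dev = docs[n_train:n_train + n_dev]
test = docs[n_train + n_dev:]

print(len(train), len(dev), len(test))  # 70 15 15
```

Each portion is then run through the conversion code to produce its own binary .spacy file.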
Relation Extraction Model Training
For training, we will provide the entities from our gold corpus and train the classifier on those entities.
- Open a new Google Colab project and make sure to select GPU as the hardware accelerator in the notebook settings. Verify that the GPU is enabled by running: !nvidia-smi.
- Install spacy-nightly:
!pip install -U spacy-nightly --pre
- Install the wheel package and clone spaCy's relation extraction repo:
!pip install -U pip setuptools wheel
!python -m spacy project clone tutorials/rel_component
- Install the transformer pipeline and the spacy-transformers library:
!python -m spacy download en_core_web_trf
!pip install -U spacy transformers
- Change the directory to the rel_component folder: cd rel_component.
- Create a folder named "data" inside rel_component and upload the training, dev, and test binary files into it:
- Open the project.yml file and update the training, dev, and test paths:
train_file: "data/relations_training.spacy"
dev_file: "data/relations_dev.spacy"
test_file: "data/relations_test.spacy"
- You can change the pre-trained transformer model (for example, if you want to use a different language) by going to configs/rel_trf.cfg and entering the model name:
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "roberta-base" # Transformer model from huggingface
tokenizer_config = {"use_fast": true}
- Before we start training, we will reduce the max_length in configs/rel_trf.cfg from the default 100 tokens to 20 to increase the efficiency of our model. max_length corresponds to the maximum distance between two entities, above which they will not be considered for relation classification. As a result, two entities from the same document will be classified only as long as they are within this maximum distance (in number of tokens) of each other.
[components.relation_extractor.model.create_instance_tensor.get_instances]
@misc = "rel_instance_generator.v1"
max_length = 20
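To make the effect of max_length concrete, here is a toy stand-in for the instance generator (an illustration, not spaCy's actual code): only entity pairs whose start tokens are within max_length tokens of each other become candidate instances for the classifier:

```python
def candidate_pairs(entity_starts, max_length=20):
    """Return ordered entity pairs whose start tokens are within max_length tokens."""
    pairs = []
    for e1 in entity_starts:
        for e2 in entity_starts:
            if e1 != e2 and abs(e2 - e1) <= max_length:
                pairs.append((e1, e2))
    return pairs

# Entities starting at tokens 0, 7, and 64: only the nearby pair qualifies.
print(candidate_pairs([0, 7, 64]))  # [(0, 7), (7, 0)]
```

A larger max_length admits more distant pairs at the cost of many more candidate instances per document.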
- We are finally ready to train and evaluate the relation extraction model; just run the commands below:
!spacy project run train_gpu # command to train transformers
!spacy project run evaluate # command to evaluate on test dataset
You should start seeing the P, R, and F scores being updated:
After the model finishes training, the evaluation on the test dataset starts right away and displays the predictions versus the gold labels. The model will be saved, along with its scores, in a folder named "training".
To train the non-transformer tok2vec model, run the following commands instead:
!spacy project run train_cpu # command to train tok2vec
!spacy project run evaluate
We can compare the performance of the two models:
# Transformer model
"performance": {
  "rel_micro_p": 0.8476190476,
  "rel_micro_r": 0.9468085106,
  "rel_micro_f": 0.8944723618
}
# Tok2vec model
"performance": {
  "rel_micro_p": 0.8604651163,
  "rel_micro_r": 0.7872340426,
  "rel_micro_f": 0.8222222222
}
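For reference, the rel_micro_* values are micro-averaged scores: true/false positives and false negatives are pooled across both relation labels before computing precision and recall, and F1 is their harmonic mean. A quick sketch (the counts below are hypothetical, chosen to reproduce the transformer row):

```python
def micro_prf(tp, fp, fn):
    """Micro-averaged precision, recall, and F1 from pooled counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = 2 * p * r / (p + r)
    return p, r, f

p, r, f = micro_prf(tp=89, fp=16, fn=5)
print(round(p, 4), round(r, 4), round(f, 4))  # 0.8476 0.9468 0.8945
```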
The precision and recall scores of the transformer-based model are significantly better than tok2vec's, demonstrating the usefulness of transformers when dealing with small amounts of annotated data.
Joint Entity and Relation Extraction Pipeline
Assuming we have already trained a transformer NER model, as in my previous post, we will extract entities from a job description found online (which was part of neither the training nor the dev set) and feed them to the relation extraction model to classify the relations.
- Install spacy-transformers and the transformer pipeline.
- Load the NER model and extract entities:
import spacy

nlp = spacy.load("NER Model Repo/model-best")

text = ['''2+ years of non-internship professional software development experience
Programming experience with at least one modern language such as Java, C++, or C# including object-oriented design.
1+ years of experience contributing to the architecture and design (architecture, design patterns, reliability and scaling) of new and current systems.
Bachelor / MS Degree in Computer Science. Preferably a PhD in data science.
8+ years of professional experience in software development. 2+ years of experience in project management.
Experience in mentoring junior software engineers to improve their skills, and make them more effective, product software engineers.
Experience in data structures, algorithm design, complexity analysis, object-oriented design.
3+ years experience in at least one modern programming language such as Java, Scala, Python, C++, C#
Experience in professional software engineering practices & best practices for the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
Experience in communicating with users, other technical teams, and management to collect requirements, describe software product features, and technical designs.
Experience with building complex software systems that have been successfully delivered to customers
Proven ability to take a project from scoping requirements through actual launch of the project, with experience in the subsequent operation of the system in production''']

for doc in nlp.pipe(text, disable=["tagger"]):
    print(f"spans: {[(e.start, e.text, e.label_) for e in doc.ents]}")
- We print the extracted entities:
spans: [(0, '2+ years', 'EXPERIENCE'), (7, 'professional software development', 'SKILLS'), (12, 'Programming', 'SKILLS'), (22, 'Java', 'SKILLS'), (24, 'C++', 'SKILLS'), (27, 'C#', 'SKILLS'), (30, 'object-oriented design', 'SKILLS'), (36, '1+ years', 'EXPERIENCE'), (41, 'contributing to the', 'SKILLS'), (46, 'design', 'SKILLS'), (48, 'architecture', 'SKILLS'), (50, 'design patterns', 'SKILLS'), (55, 'scaling', 'SKILLS'), (60, 'current systems', 'SKILLS'), (64, 'Bachelor', 'DIPLOMA'), (68, 'Computer Science', 'DIPLOMA_MAJOR'), (75, '8+ years', 'EXPERIENCE'), (82, 'software development', 'SKILLS'), (88, 'mentoring junior software engineers', 'SKILLS'), (103, 'product software engineers', 'SKILLS'), (110, 'data structures', 'SKILLS'), (113, 'algorithm design', 'SKILLS'), (116, 'complexity analysis', 'SKILLS'), (119, 'object-oriented design', 'SKILLS'), (135, 'Java', 'SKILLS'), (137, 'Scala', 'SKILLS'), (139, 'Python', 'SKILLS'), (141, 'C++', 'SKILLS'), (143, 'C#', 'SKILLS'), (148, 'professional software engineering', 'SKILLS'), (151, 'practices', 'SKILLS'), (153, 'best practices', 'SKILLS'), (158, 'software development', 'SKILLS'), (164, 'coding', 'SKILLS'), (167, 'code reviews', 'SKILLS'), (170, 'source control management', 'SKILLS'), (174, 'build processes', 'SKILLS'), (177, 'testing', 'SKILLS'), (180, 'operations', 'SKILLS'), (184, 'communicating', 'SKILLS'), (193, 'management', 'SKILLS'), (199, 'software product', 'SKILLS'), (204, 'technical designs', 'SKILLS'), (210, 'building complex software systems', 'SKILLS'), (229, 'scoping requirements', 'SKILLS')]
We have successfully extracted all the skills, years of experience, diplomas, and diploma majors from the text! Next, we load the relation extraction model and classify the relations between the entities.
Note: Make sure to copy rel_pipe and rel_model from the scripts folder into your main folder:
Scripts folder
import random
import typer
from pathlib import Path
import spacy
from spacy.tokens import DocBin, Doc
from spacy.training.example import Example
from rel_pipe import make_relation_extractor, score_relations
from rel_model import create_relation_model, create_classification_layer, create_instances, create_tensors

# We load the relation extraction (REL) model
nlp2 = spacy.load("training/model-best")

# We take the entities generated from the NER pipeline and input them to the REL pipeline
for name, proc in nlp2.pipeline:
    doc = proc(doc)

# Here, we split the paragraph into sentences and apply the relation extraction
# for each pair of entities found in each sentence.
for value, rel_dict in doc._.rel.items():
    for sent in doc.sents:
        for e in sent.ents:
            for b in sent.ents:
                if e.start == value[0] and b.start == value[1]:
                    if rel_dict['EXPERIENCE_IN'] >= 0.9:
                        print(f" entities: {e.text, b.text} --> predicted relation: {rel_dict}")
Here, we show all the entities having the Experience_in relation with a confidence score higher than 90%:
entities: ('2+ years', 'professional software development') --> predicted relation:
{'DEGREE_IN': 1.2778723e-07, 'EXPERIENCE_IN': 0.9694631}
entities: ('1+ years', 'contributing to the') --> predicted relation:
{'DEGREE_IN': 1.4581254e-07, 'EXPERIENCE_IN': 0.9205434}
entities: ('1+ years', 'design') --> predicted relation:
{'DEGREE_IN': 1.8895419e-07, 'EXPERIENCE_IN': 0.94121873}
entities: ('1+ years', 'architecture') --> predicted relation:
{'DEGREE_IN': 1.9635708e-07, 'EXPERIENCE_IN': 0.9399484}
entities: ('1+ years', 'design patterns') --> predicted relation:
{'DEGREE_IN': 1.9823732e-07, 'EXPERIENCE_IN': 0.9423302}
entities: ('1+ years', 'scaling') --> predicted relation:
{'DEGREE_IN': 1.892173e-07, 'EXPERIENCE_IN': 0.96628445}
entities: ('2+ years', 'project management') --> predicted relation:
{'DEGREE_IN': 5.175297e-07, 'EXPERIENCE_IN': 0.9911635}
entities: ('8+ years', 'software development') --> predicted relation:
{'DEGREE_IN': 4.914319e-08, 'EXPERIENCE_IN': 0.994812}
entities: ('3+ years', 'Java') --> predicted relation:
{'DEGREE_IN': 9.288566e-08, 'EXPERIENCE_IN': 0.99975795}
entities: ('3+ years', 'Scala') --> predicted relation:
{'DEGREE_IN': 2.8477e-07, 'EXPERIENCE_IN': 0.99982494}
entities: ('3+ years', 'Python') --> predicted relation:
{'DEGREE_IN': 3.3149718e-07, 'EXPERIENCE_IN': 0.9998517}
entities: ('3+ years', 'C++') --> predicted relation:
{'DEGREE_IN': 2.2569053e-07, 'EXPERIENCE_IN': 0.99986637}
Remarkably, we were able to correctly extract almost all the years of experience along with their respective skills, with no false positives or false negatives!
Let's look at the entities having the relation degree_in:
entities: ('Bachelor / MS', 'Computer Science') --> predicted relation:
{'DEGREE_IN': 0.9943974, 'EXPERIENCE_IN': 1.8361954e-09}
entities: ('PhD', 'data science') --> predicted relation:
{'DEGREE_IN': 0.98883855, 'EXPERIENCE_IN': 5.2092592e-09}
Again, we successfully extracted all the relations between diplomas and diploma majors!
This demonstrates once more how easy it is to fine-tune transformer models to your own domain-specific case with a small amount of annotated data, whether for NER or relation extraction.
With only about a hundred annotated documents, we were able to train a relation classifier with good performance. Furthermore, we can use this initial model to auto-annotate hundreds more unlabeled examples with minimal correction. This can significantly speed up the annotation process and improve model performance.
Conclusion
Transformers have truly transformed the field of NLP, and I am particularly excited about their application to information extraction. I would like to thank Explosion AI (the spaCy developers) and Hugging Face for providing the open-source solutions that facilitate the adoption of transformers.
If you need data annotation for your project, don't hesitate to try the UBIAI annotation tool. We provide numerous programmable labeling solutions (such as ML auto-annotation, regular expressions, dictionaries, etc.) to minimize manual annotation.