From "Text" to "Knowledge": Information Extraction

This is the era of big data. As the sun rises and sets, vast amounts of data are generated every single day. People are generally better at handling structured data, such as numbers. In reality, however, there is far more unstructured data than structured data.

When we collect large amounts of unstructured data such as text from the internet, how do we analyze it effectively so that it helps us make better decisions? That is the question this article sets out to answer.

Information extraction is the task of extracting structured information from unstructured data such as text. I break this process into the following four steps.

1. Coreference Resolution

Coreference resolution means finding all expressions in a text that refer to a particular entity; put simply, it resolves references such as pronouns. For example, in "Mary has a dog. She loves him.", we know that "she" refers to Mary and "him" refers to the dog, but a computer does not.

Here I use the neuralcoref model to solve this problem; it runs on top of the spaCy framework. Note that the model may not work well with location pronouns. The code is as follows:

import spacy
import neuralcoref

# Load SpaCy's English model
nlp = spacy.load('en')
# Add neural coref to SpaCy's pipe
neuralcoref.add_to_pipe(nlp)

def coref_resolution(text):
    """Function that executes coreference resolution on a given text"""
    doc = nlp(text)
    # fetches tokens with whitespaces from spacy document
    tok_list = list(token.text_with_ws for token in doc)
    for cluster in doc._.coref_clusters:
        # get tokens from representative cluster name
        cluster_main_words = set(cluster.main.text.split(' '))
        for coref in cluster:
            if coref != cluster.main:  # if coreference element is not the representative element of that cluster
                if coref.text != cluster.main.text and not set(coref.text.split(' ')).intersection(cluster_main_words):
                    # if coreference element text and representative element text are not equal
                    # and they share no words (handles nested coreference scenarios),
                    # swap the coreference element for the representative text
                    tok_list[coref.start] = cluster.main.text + \
                        doc[coref.end - 1].whitespace_
                    for i in range(coref.start + 1, coref.end):
                        tok_list[i] = ""
    return "".join(tok_list)

If we take the following passage as input:

Elon Musk is a business magnate, industrial designer, and engineer. He is the founder, CEO, CTO, and chief designer of SpaceX. He is also early investor, CEO, and product architect of Tesla, Inc. He is also the founder of The Boring Company and the co-founder of Neuralink. A centibillionaire, Musk became the richest person in the world in January 2021, with an estimated net worth of $185 billion at the time, surpassing Jeff Bezos. Musk was born to a Canadian mother and South African father and raised in Pretoria, South Africa. He briefly attended the University of Pretoria before moving to Canada aged 17 to attend Queen's University. He transferred to the University of Pennsylvania two years later, where he received dual bachelor's degrees in economics and physics. He moved to California in 1995 to attend Stanford University, but decided to pursue a business career instead. He went on co-founding a web software company Zip2 with his brother Kimbal Musk.

then the output is:

Elon Musk is a business magnate, industrial designer, and engineer. Elon Musk is the founder, CEO, CTO, and chief designer of SpaceX. Elon Musk is also early investor, CEO, and product architect of Tesla, Inc. Elon Musk is also the founder of The Boring Company and the co-founder of Neuralink. A centibillionaire, Musk became the richest person in the world in January 2021, with an estimated net worth of $185 billion at the time, surpassing Jeff Bezos. Musk was born to a Canadian mother and South African father and raised in Pretoria, South Africa.

Elon Musk briefly attended the University of Pretoria before moving to Canada aged 17 to attend Queen's University. Elon Musk transferred to the University of Pennsylvania two years later, where Elon Musk received dual bachelor's degrees in economics and physics. Elon Musk moved to California in 1995 to attend Stanford University, but decided to pursue a business career instead. Elon Musk went on co-founding a web software company Zip2 with Elon Musk brother Kimbal Musk.
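For reference, the output above is produced simply by passing the passage to the function defined earlier (the variable name here is mine):

input_text = "Elon Musk is a business magnate, industrial designer, and engineer. ..."  # full passage above
print(coref_resolution(input_text))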

2. Named Entity Linking

For this part, I use the Wikifier API; you can try it out yourself at http://wikifier.org.

Before running the input text through the Wikifier API, we split the text into sentences and remove punctuation. The code is as follows:

import json
import urllib.parse
import urllib.request
from string import punctuation
import nltk
ENTITY_TYPES = ["human", "person", "company", "enterprise", "business", "geographic region",
                "human settlement", "geographic entity", "territorial entity type", "organization"]
def wikifier(text, lang="en", threshold=0.8):
    """Function that fetches entity linking results from wikifier.com API"""
    # Prepare the URL.
    data = urllib.parse.urlencode([
        ("text", text), ("lang", lang),
        ("userKey", "tgbdmkpmkluegqfbawcwjywieevmza"),
        ("pageRankSqThreshold", "%g" %
         threshold), ("applyPageRankSqThreshold", "true"),
        ("nTopDfValuesToIgnore", "100"), ("nWordsToIgnoreFromList", "100"),
        ("wikiDataClasses", "true"), ("wikiDataClassIds", "false"),
        ("support", "true"), ("ranges", "false"), ("minLinkFrequency", "2"),
        ("includeCosines", "false"), ("maxMentionEntropy", "3")
    ])
    url = "http://www.wikifier.org/annotate-article"
    # Call the Wikifier and read the response.
    req = urllib.request.Request(url, data=data.encode("utf8"), method="POST")
    with urllib.request.urlopen(req, timeout=60) as f:
        response = f.read()
        response = json.loads(response.decode("utf8"))
    # Output the annotations.
    results = list()
    for annotation in response["annotations"]:
        # Filter out desired entity classes
        if ('wikiDataClasses' in annotation) and (any([el['enLabel'] in ENTITY_TYPES for el in annotation['wikiDataClasses']])):
            # Specify entity label
            if any([el['enLabel'] in ["human", "person"] for el in annotation['wikiDataClasses']]):
                label = 'Person'
            elif any([el['enLabel'] in ["company", "enterprise", "business", "organization"] for el in annotation['wikiDataClasses']]):
                label = 'Organization'
            elif any([el['enLabel'] in ["geographic region", "human settlement", "geographic entity", "territorial entity type"] for el in annotation['wikiDataClasses']]):
                label = 'Location'
            else:
                label = None
            results.append({'title': annotation['title'], 'wikiId': annotation['wikiDataItemId'], 'label': label,
                            'characters': [(el['chFrom'], el['chTo']) for el in annotation['support']]})
    return results
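The sentence splitting and punctuation removal mentioned above are not part of this snippet. A minimal sketch of that preprocessing, using the nltk and string.punctuation imports already present (the helper name is mine), might look like this:

def split_and_clean(text):
    """Split text into sentences and strip punctuation before entity linking."""
    # nltk's Punkt sentence tokenizer; requires nltk.download('punkt') on first use
    sentences = nltk.tokenize.sent_tokenize(text)
    return [s.translate(str.maketrans('', '', punctuation)) for s in sentences]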

Using the coreference resolution output from the previous step as input, we get results like the following.
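The original results were presented as a table. To give a sense of their shape, a single entry returned by wikifier() looks roughly like this (the values are illustrative, following the fields built in the function above):

{'title': 'Elon Musk', 'wikiId': 'Q317521', 'label': 'Person', 'characters': [(0, 8)]}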

As you can see, we also get each entity's WikiData ID along with its label, which disambiguates entities that share the same name. However, while WikiData contains more than 100 million entities, an entity that does not exist in WikiData still cannot be recognized.

3. Relation Extraction

I use the OpenNRE project for relation extraction. It offers five open-source relation extraction models trained on either the Wiki80 or the TACRED dataset. Models trained on the Wiki80 dataset can infer 80 relation types.

Here, I use one of the Wiki80-trained models (it requires GPU support).

If we look at the relation extraction example in the OpenNRE library, we notice that it only infers relations and does not try to extract named entities. So we must supply a pair of entities via the h and t parameters, and the model then tries to infer a relation.
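The snippet below sketches how such a model is loaded before calling infer; the exact model name is my assumption, based on the 80 relation types mentioned above:

import opennre

# Load a pretrained relation extraction model. 'wiki80_bert_softmax' is an
# assumed choice among the Wiki80-trained models; 'wiki80_cnn_softmax' is a
# lighter alternative that does not need a GPU.
model = opennre.get_model('wiki80_bert_softmax')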

# Example
model.infer({'text': 'He was the son of Máel Dúin mac Máele Fithrich, and grandson of the high king Áed Uaridnach (died 612).',
             'h': {'pos': (18, 46)},
             't': {'pos': (78, 91)}})

# Result
('father', 0.5108704566955566)

The implementation code is shown below:

import itertools

# Assumes `sentence`, `relation_model`, `entities_threshold`, and
# `relation_threshold` are already defined; inferred triples collect here
relations_list = []
# First get all the entities in the sentence
entities = wikifier(sentence, threshold=entities_threshold)
# Iterate over every permutation pair of entities
for permutation in itertools.permutations(entities, 2):
    for source in permutation[0]['characters']:
        for target in permutation[1]['characters']:
            # Relationship extraction with OpenNRE
            data = relation_model.infer(
                {'text': sentence, 'h': {'pos': [source[0], source[1] + 1]}, 't': {'pos': [target[0], target[1] + 1]}})
            if data[1] > relation_threshold:
                relations_list.append(
                    {'source': permutation[0]['title'], 'target': permutation[1]['title'], 'type': data[0]})

The output of named entity linking is used as the input to relation extraction. We iterate over every permutation of pairs of entities and try to infer a relation for each. The relation_threshold parameter is used to omit relations with low confidence. In the next step, I will explain why we use all permutations of entities here rather than just combinations.

Running this produces a table of inferred (source, target, relation) triples.

Relation extraction is a challenging problem, and for now it is hard to achieve perfect results.

4. Knowledge Graph

Since we are dealing with entities and their relationships, it only makes sense to store the results in a graph database. Here I use Neo4j as the graph database.
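The storage code is not shown at this point. A minimal sketch using the official neo4j Python driver (the connection details and the MERGE schema are placeholders of mine) could look like this:

from neo4j import GraphDatabase

# Placeholder connection details for a local Neo4j instance
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def store_relations(relations_list):
    """Persist the extracted (source, type, target) triples as a graph."""
    with driver.session() as session:
        for rel in relations_list:
            # MERGE keeps entities and relationships unique across sentences
            session.run(
                "MERGE (s:Entity {name: $source}) "
                "MERGE (t:Entity {name: $target}) "
                "MERGE (s)-[:RELATION {type: $type}]->(t)",
                source=rel['source'], target=rel['target'], type=rel['type'])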

I try to infer relationships between all permutations of entity pairs. Looking at the table of results from the previous step, it would be hard to explain why. But in a graph visualization, it is easy to observe that although most relationships are inferred in both directions, this is not the case in every instance.
