
I have a data frame with very long sentences in the df.sentence column. I am trying to use semantic role labeling to extract arg0 and save it in a separate column.

I keep getting this error:

RuntimeError: The size of tensor a (1212) must match the size of tensor b (512) at non-singleton dimension 1

Here is my code:

!pip install allennlp==2.1.0 allennlp-models==2.1.0
from allennlp.predictors.predictor import Predictor
import allennlp_models.tagging
import pandas as pd, csv
 

def extract_arg0(sentence):
  result = []
  output = predictor.predict(sentence)
  for verb in output['verbs']:
    desc = verb['description']
    arg0_start = desc.find('ARG0: ')
    if arg0_start > -1:
      arg0_end = arg0_start + len('ARG0: ')
      # Search for the closing bracket after the ARG0 tag, not from the start of the string
      arg0 = desc[arg0_end: desc.find(']', arg0_end)]
      result.append((verb['verb'], arg0))
  return result

# How to loop over all sentences
from tqdm.notebook import tqdm
tqdm.pandas()

df['Arg0'] = df.sentence.progress_apply(extract_arg0)

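I think I should add a line here that skips the failing row instead of raising an error, and writes 'failed' to df.arg0 for that row. Is my approach right? If so, any ideas on how to add that line to my code? If not, any suggestions would be appreciated.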

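Note: I think the most appropriate approach would be to use Longformer. I looked for a way to do this with Longformer and could not find one. I would appreciate any suggestions on that as well.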

I also tried:

'!pip install allennlp==2.1.0 allennlp-models==2.9.0'

A sample of my data:

import pandas as pd

data = {'sentence': ['in addition, our regulatory posture and related expenses have been and will continue to be affected by changes in regulatory expectations for global systemically important financial institutions applicable to, among other things, risk management, liquidity and capital planning and compliance programs, and changes in governmental enforcement approaches to perceived failures to comply with regulatory or legal obligations;•adverse changes in the regulatory ratios that we are required or will be required to meet, whether arising under the dodd-frank act or the basel iii final rule, or due to changes in regulatory positions, practices or regulations in jurisdictions in which we engage in banking activities, including changes in internal or external data, formulae, models, assumptions or other advanced systems used in the calculation of our capital ratios that cause changes in those ratios as they are measured from period to period;•increasing requirements to obtain the prior approval of the federal reserve or our other u.s. and non-u.s. 
regulators for the use, allocation or distribution of our capital or other specific capital actions or programs, including acquisitions, dividends and stock purchases, without which our growth plans, distributions to shareholders, share repurchase programs or other capital initiatives may be restricted;•changes in law or regulation, or the enforcement of law or regulation, that may adversely affect our business activities or those of our clients or our counterparties, and the products or services that we sell, including additional or increased taxes or assessments thereon, capital adequacy requirements, margin requirements and changes that expose us to risks related to the adequacy of our controls or compliance programs;•financial market disruptions or economic recession, whether in the u.s., europe, asia or other regions;•our ability to develop and execute state street beacon, our multi-year program to create cost efficiencies through changes to our operations and to further digitize our service delivery to our clients, any failure of which, in whole or in part, may among other things, reduce our competitive position, diminish the cost-effectiveness of our systems and processes or provide an  insufficient return on our associated investment;•our ability to promote a strong culture of risk management, operating controls, compliance oversight and governance that meet our expectations and those of our clients and our regulators;•the results of our review of the manner in which we invoiced certain client expenses, including the amount of expenses determined to be reimbursable, as well as potential consequences of such review including with respect to our client relationships and potential investigations by regulators;•the results of, and costs associated with, governmental or regulatory inquiries and investigations, litigation and similar claims, disputes, or proceedings;•the potential for losses arising from our investments in sponsored investment funds;•the 
possibility that our clients will incur substantial losses in investment pools for which we act as agent, and the possibility of significant reductions in the liquidity or valuation of assets underlying those pools;•our ability to anticipate and manage the level and timing of redemptions and withdrawals from our collateral pools and other collective investment products;•the credit agency ratings of our debt and depository obligations and investor and client perceptions of our financial strength;•adverse publicity, whether specific to state street or regarding other industry participants or industry-wide factors, or other reputational harm;•our ability to control operational risks, data security breach risks and outsourcing risks, our ability to protect our intellectual property rights, the possibility of errors in the quantitative models we use to manage our business and the possibility that our controls will prove insufficient, fail or be circumvented;•our ability to expand our use of technology to enhance the efficiency, accuracy and reliability of our operations and our dependencies on information technology and our ability to control related risks, including cyber-crime and other threats to our information technology infrastructure and systems and their effective operation both independently and with external systems, and complexities and costs of protecting the security of our systems and data;18 •our ability to grow revenue, manage expenses, attract and retain highly skilled people and raise the capital necessary to achieve our business goals and comply with regulatory requirements and expectations;•changes or potential changes to the competitive environment, including changes due to regulatory and technological changes, the effects of industry consolidation and perceptions of state street as a suitable service provider or counterparty;•changes or potential changes in the amount of compensation we receive from clients for our services, and the mix of services 
provided by us that clients choose;•our ability to complete acquisitions, joint ventures and divestitures, including the ability to obtain regulatory approvals, the ability to arrange financing as required and the ability to satisfy closing conditions;•the risks that our acquired businesses and joint ventures will not achieve their anticipated financial and operational benefits or will not be integrated successfully, or that the integration will take longer than anticipated, that expected synergies will not be achieved or unexpected negative synergies or liabilities will be experienced, that client and deposit retention goals will not be met, that other regulatory or operational challenges will be experienced, and that disruptions from the transaction will harm our relationships with our clients, our employees or regulators;•our ability to recognize emerging needs of our clients and to develop products that are responsive to such trends and profitable to us, the performance of and demand for the products and services we offer, and the potential for new products and services to impose additional costs on us and expose us to increased operational risk;•changes in accounting standards and practices; and•changes in tax legislation and in the interpretation of existing tax laws by u.s. and non-u.s. tax authorities that affect the amount of taxes due.actual outcomes and results may differ materially from what is expressed in our forward-looking statements and from our historical financial results due to the factors discussed in this section and elsewhere in this form 10-k or disclosed in our other sec filings.', 'we have many risks to deal with', ' financial risk causes us to lose billions of dollars'],
      'second': ['a1', 'a1', 'a3']}
df = pd.DataFrame(data)

1 Answer

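You didn't show which kind of predictor you are loading, but I suspect the model can only handle 512 tokens, which would match the 512 in your error message. Maybe Longformer would be a solution, but you would first have to train an SRL model with Longformer.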
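As for the question of skipping failing rows instead of crashing, a minimal sketch follows. It assumes the failure always surfaces as a RuntimeError, as in the traceback above; `safe_extract` and `failed_marker` are hypothetical names introduced here, not part of AllenNLP.

```python
def safe_extract(extract_fn, sentence, failed_marker="failed"):
    """Call extract_fn(sentence); return failed_marker if it raises RuntimeError."""
    try:
        return extract_fn(sentence)
    except RuntimeError:
        # Assumption: the only RuntimeError we expect is the tensor size
        # mismatch caused by input that is too long for the model.
        return failed_marker
```

With the question's code this could be applied as `df['Arg0'] = df.sentence.progress_apply(lambda s: safe_extract(extract_arg0, s))`. Note that catching RuntimeError this broadly can also hide unrelated failures, so it is worth inspecting the rows marked 'failed'.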

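Think about what you really want to accomplish. Your example sentence is actually multiple sentences, ending with a bulleted list that contains even more sentences. The AllenNLP SRL model was never trained on this kind of input and would not perform well on it in any case. I suggest you split the input into sentences and feed them in one at a time. That is much closer to the kind of data the model saw during training, so you will get better results from it.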
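A rough sketch of that splitting step, under stated assumptions: the splitter below is deliberately naive (it breaks on the "•" bullet character seen in the sample data and after ., ! or ? followed by whitespace), so abbreviations such as "u.s." will be split incorrectly. A dedicated sentence splitter would likely do better. `split_into_sentences` and `extract_arg0_per_sentence` are hypothetical helper names, and `extract_fn` is assumed to return a list of (verb, arg0) tuples like `extract_arg0` in the question.

```python
import re


def split_into_sentences(text):
    """Naively split text on bullets and on sentence-ending punctuation."""
    sentences = []
    # First split on the bullet character used in the sample data.
    for chunk in text.split("•"):
        # Then split each chunk after ., ! or ? when followed by whitespace.
        for part in re.split(r"(?<=[.!?])\s+", chunk):
            part = part.strip()
            if part:
                sentences.append(part)
    return sentences


def extract_arg0_per_sentence(extract_fn, text):
    """Run extract_fn on each sentence and collect all (verb, arg0) pairs."""
    results = []
    for sentence in split_into_sentences(text):
        results.extend(extract_fn(sentence))
    return results
```

It could then be applied per row, for example `df['Arg0'] = df.sentence.progress_apply(lambda s: extract_arg0_per_sentence(extract_arg0, s))`. Individual pieces may still exceed the model's limit, so length errors are still possible on very long items.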

Answered 2022-02-24T01:58:29.923