python - How do I skip titles that return too many search results (or take too long to retrieve from Scopus)?

Question

I want to access the ScopusSearch API and get the EIDs for a list of 1,400 article titles saved in an Excel spreadsheet. I tried to retrieve the EIDs with the following code:

import numpy as np
import pandas as pd
from pybliometrics.scopus import ScopusSearch
nan = pd.read_excel(r'C:\Users\Apples\Desktop\test\titles_nan.xlsx', sheet_name='nan')
error_index = {}

for i in range(0,len(nan)):
   scopus_title = nan.loc[i ,'Title']
   s = ScopusSearch('TITLE("{0}")'.format(scopus_title))
   print('TITLE("{0}")'.format(scopus_title))
   try:
      s = ScopusSearch(scopus_title)
      nan.at[i,'EID'] = s.results[0].eid
      print(str(i) + ' ' + s.results[0].eid)
   except:
      nan.loc[i,'EID'] = np.nan
      error_index[i] = scopus_title
      print(str(i) + 'error' )

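However, I can never retrieve EIDs for more than about 100 titles, because some titles return too many search results, which stalls the whole process.

So I would like to skip titles that return too many results and move on to the next title, while keeping a record of the titles that were skipped.

I am just starting out with Python, so I am not sure how to go about it. I have the following sequence in mind:

• If a title yields exactly 1 result, retrieve the EID and record it under the "EID" column of the file "nan".

• If a title yields more than 1 result, record the title in the error index, print 'Too many searches' and move on to the next search.

• If a title yields no results, record the title in the error index, print 'Error' and move on to the next search.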

Attempt 1
for i in range(0,len(nan)):
   scopus_title = nan.at[i ,'Title']
   print('TITLE("{0}")'.format(scopus_title))
s = ScopusSearch('TITLE("{0}")'.format(scopus_title))
print(type(s))

if(s.count()== 1):
    nan.at[i,"EID"] = s.results[0].eid
    print(str(i) + "   " + s.results[0].eid)
elif(s.count()>1):
    continue
    print(str(i) + "  " + "Too many searches")
else:
    error_index[i] = scopus_title
    print(str(i) + "error")

Attempt 2
for i in range(0,len(nan)):
    scopus_title = nan.at[i ,'Title']
    print('TITLE("{0}")'.format(scopus_title))
    s = ScopusSearch('TITLE("{0}")'.format(scopus_title))
    if len(s.results)== 1:
        nan.at[i,"EID"] = s.results[0].eid
        print(str(i) + "   " + s.results[0].eid)
    elif len(s.results)>1:  
        continue
        print(str(i) + "  " + "Too many searches")
    else:
        continue
        print(str(i) + "  " + "Error")

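I get errors saying that an object of type 'ScopusSearch' has no len() or count(), or that the search is not itself a list. I cannot get any further from here. Also, I am not sure this is the right approach: skipping titles based on too many results. Is there a more efficient way (for example a timeout, skipping a title once the search has taken a certain amount of time)?

Any help with this would be much appreciated. Thank you!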

score 0 · Accepted Answer

Combine .get_results_size() with download=False:

from pybliometrics.scopus import ScopusSearch

scopus_title = "Editorial"
q = f'TITLE("{scopus_title}")'  # this is f-string notation, btw
s = ScopusSearch(q, download=False)
s.get_results_size()
# 243142

If this number is below a certain threshold, simply run s = ScopusSearch(q) and continue as in "Attempt 2":

for i, row in nan.iterrows():
    q = f'TITLE("{row["Title"]}")'  # inner quotes must differ from the outer ones before Python 3.12
    print(q)
    s = ScopusSearch(q, download=False)
    n = s.get_results_size()
    if n == 1:
        s = ScopusSearch(q)
        nan.at[i,"EID"] = s.results[0].eid
        print(f"{i} {s.results[0].eid}")
    elif n > 1:
        print(f"{i} Too many results")
        continue  # must come last
    else:
        print(f"{i} Error")
        continue  # must come last

(I use .iterrows() here to get rid of the indexing. But i will be incorrect if the index is not a range sequence; in that case, wrap everything in enumerate().)
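
The question also asks to keep a record of the skipped titles, which the loop above does not do. A minimal sketch of one way to add that, under some assumptions: the function name `fetch_eids`, the `skipped` dictionary and the `search_cls` parameter are hypothetical names introduced here. `search_cls` is assumed to behave like `ScopusSearch` (accepting `download=False` and exposing `get_results_size()` and `results`), so the logic can be tried without network access. The sketch uses the index label with `.at`, which is label-based, so it does not rely on the index being a range.

```python
import pandas as pd


def fetch_eids(df, search_cls):
    """Fill an 'EID' column for titles with exactly one match.

    search_cls is assumed to behave like pybliometrics' ScopusSearch:
    search_cls(query, download=False).get_results_size() returns the hit
    count, and search_cls(query).results holds the downloaded records.
    """
    df = df.copy()
    df["EID"] = None
    skipped = {}  # index label -> reason the title was skipped

    for label, title in df["Title"].items():
        query = f'TITLE("{title}")'
        # Ask only for the number of hits; nothing is downloaded yet.
        n = search_cls(query, download=False).get_results_size()
        if n == 1:
            # Exactly one hit: now download it and store its EID.
            df.at[label, "EID"] = search_cls(query).results[0].eid
        elif n > 1:
            skipped[label] = "Too many results"
        else:
            skipped[label] = "Error"
    return df, skipped


# Possible usage with the real class (requires a configured API key):
# from pybliometrics.scopus import ScopusSearch
# nan, skipped = fetch_eids(nan, ScopusSearch)
```

The returned `skipped` dictionary plays the role of `error_index` from the question, with the reason stored instead of the title; the title can be looked up again from the DataFrame by its label.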
