I have a list of book titles:

  • "The Hobbit: 70th Anniversary Edition"
  • "The Hobbit"
  • "The Hobbit (Illustrated/Collector Edition)[There and Back Again]"
  • "The Hobbit: or, There and Back Again"
  • "The Hobbit: Gift Pack"

and so on...


I thought that if I normalised the titles somehow, it would be easier to work out automatically which book each edition refers to.

import string
# Keep only ASCII letters and digits (this also drops spaces and punctuation).
normalised = ''.join(char for char in title
                     if char in string.ascii_letters + string.digits)

or

def normalise(title):
    # Keep everything before the first separator character.
    normalised = ''
    for char in title:
        if char in ':/()|':
            break
        normalised += char
    return normalised

Neither works as intended, though: titles can contain special characters, and different editions can lay out their titles very differently.


Help would be very much appreciated! Thanks :)

2 Answers

1

This depends entirely on your data. For the examples you provided, a simple normalisation could be:

import re

book_normalized = re.sub(r':.*|\[.*?\]|\(.*?\)|\{.*?\}', '', book_name).strip()

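This returns "The Hobbit" for all of your examples. It removes everything from the first colon onwards (including the colon itself), anything inside brackets (round, square, or curly), and any leading and trailing whitespace.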
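To check that claim against the titles in the question, here is a minimal sketch that wraps the same regular expression in a function and runs it over the five examples. The names normalise_title, PATTERN, and titles are made up for illustration and are not part of the original answer.

```python
import re

# Same pattern as above: everything from the first colon onwards,
# or any span enclosed in square, round, or curly brackets.
PATTERN = re.compile(r':.*|\[.*?\]|\(.*?\)|\{.*?\}')


def normalise_title(book_name):
    """Hypothetical helper: drop subtitle and bracketed parts, then trim whitespace."""
    return PATTERN.sub('', book_name).strip()


# The example titles from the question.
titles = [
    "The Hobbit: 70th Anniversary Edition",
    "The Hobbit",
    "The Hobbit (Illustrated/Collector Edition)[There and Back Again]",
    "The Hobbit: or, There and Back Again",
    "The Hobbit: Gift Pack",
]

for title in titles:
    print(repr(title), '->', repr(normalise_title(title)))
```

Each of the five lines printed should end in 'The Hobbit'.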

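However, this is not a good solution in the general case, because some books have a colon or a bracketed part in their actual title: for example, the name of a series, followed by a colon, followed by the name of the specific entry in that series.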
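A rough sketch of that failure mode, assuming the same regular expression as above: two different books from one series collapse to the same normalised key. The series titles are chosen only for illustration, and normalise_title and group_by_normalised are hypothetical helper names.

```python
import re
from collections import defaultdict

# Same pattern as in the snippet above.
PATTERN = re.compile(r':.*|\[.*?\]|\(.*?\)|\{.*?\}')


def normalise_title(book_name):
    """Hypothetical helper: drop subtitle and bracketed parts, then trim whitespace."""
    return PATTERN.sub('', book_name).strip()


def group_by_normalised(titles):
    """Group the original titles under their normalised form."""
    groups = defaultdict(list)
    for title in titles:
        groups[normalise_title(title)].append(title)
    return dict(groups)


# Two distinct books whose real titles follow the "Series: Entry" layout.
series_titles = [
    "The Lord of the Rings: The Fellowship of the Ring",
    "The Lord of the Rings: The Two Towers",
]

groups = group_by_normalised(series_titles)
for key, members in groups.items():
    print(repr(key), '<-', members)
```

Both titles end up under the single key 'The Lord of the Rings', so the normalisation alone cannot tell them apart.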

answered 2010-03-16T22:59:09.643
1

I suggest using a third-party web service such as LibraryThing, which I believe can do what you are asking. As a starting point, see their documentation:

http://www.librarything.com/services/rest/documentation/1.0/librarything.ck.getwork.php

answered 2010-03-17T02:50:31.223