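How can I detect (preferably with Python) duplicate MP3 files that may be encoded at different bitrates (but are the same song) and may have incorrect ID3 tags?
I know I can take an MD5 checksum of the file contents, but that won't work across different bitrates, and I don't know whether ID3 tags affect the resulting MD5 checksum. Should I re-encode the MP3 files that have a different bitrate and then compute the checksum? What would you recommend?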
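The people at the old AudioScrobbler, and now at MusicBrainz, have been working on exactly this problem for a long time. For the time being, the Python project that can help with your task is Picard, which will tag audio files (not only MPEG 1 Layer 3 files) with a GUID (actually, several of them), and from then on, matching the tags is quite simple.
If you prefer to do it as a project of your own, libofa might be of help.
Like the others said, simple checksums won't detect duplicates with different bitrates or ID3 tags. What you need is an audio fingerprint algorithm. The Python Audioprocessing Suite has such an algorithm, but I can't say how reliable it is.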
For tag issues, Picard may indeed be a very good bet. If, having identified two potentially duplicate files, what you want is to extract bitrate information from them, have a look at mp3guessenc.
The Dejavu project is written in Python and does exactly what you are looking for.
https://github.com/worldveil/dejavu
It also supports many common formats (.wav, .mp3, etc.) as well as finding the time offset of a clip within the original audio track.
I don't think simple checksums will ever work:
I think you'll have to compare ID3 tags, song length, and filenames.
Re-encoding at the same bit rate won't work. In fact, it may make things worse: transcoding (that is what re-encoding at different bitrates is called) changes the nature of the compression, and recompressing an already compressed file is going to lead to a significantly different file.
This is a little out of my league, but I would approach the problem by looking at the wave pattern of the MP3, either by converting the MP3 to an uncompressed .wav or maybe by just running the analysis on the MP3 file itself. There should be a library out there for this. Just a word of warning: this is an expensive operation.
Another idea: use ReplayGain to scan the files. If they are the same song, they should be tagged with the same gain. This will only work on the exact same song from the exact same album. I know of several cases where reissues are remastered at a higher volume, thus changing the ReplayGain.
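To make the "wave pattern" idea a bit more concrete, here is a minimal sketch that compares the loudness contour of two tracks. It assumes both files have already been decoded to lists of PCM sample values at the same sample rate (the decoding step is not shown), and the function names and window size are made up for this example. It is a crude stand-in for a real fingerprint and will not cope with tracks that start at different offsets.

```python
import math


def energy_envelope(samples, window=1024):
    """RMS energy of consecutive, non-overlapping windows of PCM samples."""
    return [
        math.sqrt(sum(s * s for s in samples[i:i + window]) / window)
        for i in range(0, len(samples) - window + 1, window)
    ]


def correlation(env_a, env_b):
    """Pearson correlation over the overlapping part of two envelopes."""
    n = min(len(env_a), len(env_b))
    if n < 2:
        raise ValueError("need at least two windows to correlate")
    a, b = env_a[:n], env_b[:n]
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat envelope carries no shape to compare
    return cov / math.sqrt(var_a * var_b)
```

Because the correlation is normalized, an overall volume difference between the two copies should not matter much; a value close to 1.0 only marks a pair as worth a closer look.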
EDIT:
You might want to check out http://www.speech.kth.se/snack/, which apparently can do spectrogram visualization. I imagine any library that can visualize a spectrogram can help you compare them.
This link from the official python page may also be helpful.
I was looking for something similar and I found this:
http://www.lastfm.es/user/nova77LF/journal/2007/10/12/4kaf_fingerprint_(command_line)_client
Hope it helps.
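I'd use length as my primary heuristic. That's what iTunes does when it tries to identify a CD using the Gracenote database. Measure the lengths in milliseconds rather than seconds. Remember, this is only a heuristic: you should definitely listen to any detected duplicates before deleting them.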
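A small sketch of the length heuristic, assuming you already have each file's duration in milliseconds (for example from a tag library such as mutagen, whose `MP3(path).info.length` reports seconds). The function name and the 500 ms tolerance are arbitrary choices for illustration, not recommended values.

```python
def group_by_duration(durations_ms, tolerance_ms=500):
    """Group paths whose durations lie within tolerance_ms of their neighbour.

    durations_ms maps a file path to its length in milliseconds.
    Only groups with at least two members are returned.
    """
    groups, current, previous = [], [], None
    for path, length in sorted(durations_ms.items(), key=lambda item: item[1]):
        # A gap larger than the tolerance closes the current group.
        if previous is not None and length - previous > tolerance_ms:
            if len(current) > 1:
                groups.append(current)
            current = []
        current.append(path)
        previous = length
    if len(current) > 1:
        groups.append(current)
    return groups


candidates = group_by_duration(
    {"a.mp3": 215000, "b.mp3": 215300, "c.mp3": 187000, "d.mp3": 301000}
)
print(candidates)  # [['a.mp3', 'b.mp3']]
```

Neighbours are chained, so a long run of tracks with closely spaced lengths ends up in one group; the result is a list of candidates to listen to, not a list of files to delete.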
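You can use the successor to PUID and MusicBrainz, called AcoustID:
AcoustID is an open source project that aims to create a free database of audio fingerprints with a mapping to the MusicBrainz metadata database, and to provide a web service for audio file identification using this database...
...fingerprints along with some metadata necessary to identify the songs to the AcoustID database...
You will find various client libraries and examples for the web service at https://acoustid.org/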
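If you would rather compare two local files directly than query the web service, one possible approach is to compare raw Chromaprint fingerprints (the fingerprint format AcoustID uses), which are sequences of 32-bit integers; the `fpcalc` tool can print them with `-raw`. The sketch below assumes you already have two such integer lists. The function names and the 0.15 threshold are my own placeholders, and it does not try to align fingerprints that start at different offsets.

```python
def bit_error_rate(fp_a, fp_b):
    """Fraction of differing bits over the overlapping part of two raw fingerprints."""
    n = min(len(fp_a), len(fp_b))
    if n == 0:
        raise ValueError("both fingerprints must be non-empty")
    # XOR leaves a 1 wherever the two 32-bit values disagree.
    differing = sum(
        bin((a ^ b) & 0xFFFFFFFF).count("1") for a, b in zip(fp_a, fp_b)
    )
    return differing / (32 * n)


def looks_like_same_recording(fp_a, fp_b, threshold=0.15):
    """True if the bit error rate is at or below the (hypothetical) threshold."""
    return bit_error_rate(fp_a, fp_b) <= threshold
```

Unrelated recordings should sit near a bit error rate of 0.5, so the threshold would need tuning against a few known duplicates in your own collection.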