
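I would like to implement a basic text similarity routine based on semantic distance, using WordNet and NLTK in Python. The idea is this: expand two concepts/phrases/categories A and B with their synonyms, hyponyms, hypernyms, meronyms, and metonyms, then compute the distance between the two resulting vectors a and b. I am not sure how I would compute that distance; perhaps cosine distance.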
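As a rough illustration of the cosine idea, the sketch below treats each expanded concept as a bag of terms and compares the two bags. The term lists are made up for the example and the function name `cosine_similarity` is hypothetical; nothing here comes from WordNet itself.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two bags of terms (lists of hashable items)."""
    va, vb = Counter(a), Counter(b)
    # Only terms present in both bags contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty bag shares nothing with anything
    return dot / (norm_a * norm_b)

# Hypothetical expanded term lists, invented for illustration.
a = ["caviar", "roe", "delicacy", "food"]
b = ["gourmet", "delicacy", "food", "dish"]

similarity = cosine_similarity(a, b)
distance = 1.0 - similarity  # cosine distance
```

One limitation of this choice: cosine over bags only rewards exact overlap between the two expansions, so two lists that contain closely related but different synsets would still score zero.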

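Most of the time my input data does not consist of phrases but of proper nouns or nouns (product names with brands, or product categories). For example, I want to determine that "resort" is a "luxury hotel", or that "black caviar" is "gourmet food": A is "black caviar", B is "gourmet food".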

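How well can this work, and how can I move up and down in WordNet to make it a bit more sophisticated, for example going one level up and down using hyponyms and hypernyms?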
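A minimal sketch of walking upward through is-a links, limited to a chosen number of levels. The `HYPERNYMS` dictionary is a hypothetical toy hierarchy standing in for WordNet, and `ancestors` and `is_a` are illustrative names. With NLTK the upward step would come from `Synset.hypernyms()` (and `Synset.hyponyms()` for the downward direction).

```python
# Toy is-a hierarchy (child -> direct parents); hypothetical data, not real WordNet.
HYPERNYMS = {
    "black_caviar": ["caviar"],
    "caviar": ["delicacy"],
    "delicacy": ["food"],
    "resort": ["hotel"],
    "hotel": ["building"],
    "food": [],
    "building": [],
}

def ancestors(concept, max_levels=None):
    """Return {ancestor: levels_up} found by walking hypernym links upward."""
    found = {}
    frontier = [concept]
    level = 0
    while frontier and (max_levels is None or level < max_levels):
        level += 1
        next_frontier = []
        for node in frontier:
            for parent in HYPERNYMS.get(node, []):
                if parent not in found:
                    found[parent] = level
                    next_frontier.append(parent)
        frontier = next_frontier
    return found

def is_a(concept, category, max_levels=None):
    """True if category is the concept itself or one of its ancestors."""
    return concept == category or category in ancestors(concept, max_levels)
```

For example, `is_a("black_caviar", "delicacy")` holds in this toy data, while `is_a("black_caviar", "delicacy", max_levels=1)` does not, because one level up only reaches "caviar".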

I am looking for a simple, basic solution that works well, rather than something complex such as Whoosh.

Should I use something better than WordNet?


Update:

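I process each noun phrase as follows (using NLTK and WordNet):

1. For each word in the phrase I collect its synsets (nouns only), then supplement them with the hypernyms and hyponyms of every synset collected. For now I put all the synsets into one flat list, ignoring the hierarchy.
2. I repeat the process for the keywords describing each of my categories.
3. Now I have a synset list for each category and for the target. I just compute the distance to each one (cosine, or Wu and Palmer's measure): I collect the pairwise distances between my two vectors, sum them, and normalize by the number of keywords describing the category or the target. Then I pick the minimum distance.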
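The expansion in steps 1 and 2 can be sketched on a toy taxonomy as follows. `PARENTS`, `children_of`, and `expand` are hypothetical names and the data is invented. In NLTK the corresponding calls would be `wn.synsets(word, pos=wn.NOUN)`, `Synset.hypernyms()`, and `Synset.hyponyms()`, and each word would map to several synsets rather than to one node as it does here.

```python
# Toy taxonomy: concept -> direct hypernyms (parents). Hypothetical data.
PARENTS = {
    "school": ["educational_institution"],
    "university": ["educational_institution"],
    "day_school": ["school"],
    "night_school": ["school"],
    "educational_institution": ["institution"],
    "teacher": ["educator"],
}

def children_of(concept):
    """Direct hyponyms: every concept that lists `concept` as a parent."""
    return [child for child, parents in PARENTS.items() if concept in parents]

def expand(keywords):
    """Each keyword plus its direct hypernyms and hyponyms.

    Order is preserved and duplicates are dropped.
    """
    seen = {}
    for word in keywords:
        for concept in [word, *PARENTS.get(word, []), *children_of(word)]:
            seen.setdefault(concept, None)
    return list(seen)

expanded = expand(["school", "university"])
```

Dropping duplicates is a deliberate difference from a plain concatenation: a shared hypernym such as `educational_institution` would otherwise be counted once per keyword and weigh more heavily in the pairwise sums.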

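This sounds basic and inefficient. What would be the next step to make it better?

I am interested in building this from scratch, which is also the best exercise for understanding how things work and how they need to be done.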


Example: word_list, the target: ['school', 'kids', 'teacher']

Categories: [['business', 'organization', 'company'], ['education', 'school', 'university']]

Expanded list for the target concept 'education', 3 keywords: [Synset('school.n.01'), Synset('school.n.02'), Synset('school.n.03'), Synset('school.n.04'), Synset('school.n.05'), Synset('school.n.06'), Synset('school.n.07'), Synset('child.n.01'), Synset('kid.n.02'), Synset('kyd.n.01'), Synset('child.n.02'), Synset('kid.n.05'), Synset('teacher.n.01'), Synset('teacher.n.02'), Synset('educational_institution.n.01'), Synset('building.n.01'), Synset('education.n.03'), Synset('body.n.02'), Synset('time_period.n.01'), Synset('educational_institution.n.01'), Synset('animal_group.n.01'), Synset('academy.n.03'), Synset('alma_mater.n.01'), Synset('conservatory.n.01'), Synset('correspondence_school.n.01'), Synset('crammer.n.03'), Synset('dance_school.n.01'), Synset('dancing_school.n.01'), Synset('day_school.n.02'), Synset('direct-grant_school.n.01'), Synset('driving_school.n.01'), Synset('finishing_school.n.01'), Synset('flying_school.n.01'), Synset('grade_school.n.01'), Synset('graduate_school.n.01'), Synset('language_school.n.01'), Synset('night_school.n.01'), Synset('nursing_school.n.01'), Synset('private_school.n.01'), Synset('public_school.n.01'), Synset('religious_school.n.01'), Synset('riding_school.n.01'), Synset('secondary_school.n.01'), Synset('secretarial_school.n.01'), Synset('sunday_school.n.01'), Synset('technical_school.n.01'), Synset('training_school.n.01'), Synset('veterinary_school.n.01'), Synset('conservatory.n.02'), Synset('day_school.n.03'), Synset('art_nouveau.n.01'), Synset('ashcan_school.n.01'), Synset('deconstructivism.n.01'), Synset('historical_school.n.01'), Synset('lake_poets.n.01'), Synset('pointillism.n.01'), Synset('secession.n.01')]

Expanded list for the category concept 'business', 3 keywords, 223 entries in the expanded list: [Synset('business.n.01'), Synset('commercial_enterprise.n.02'), Synset('occupation.n.01'), Synset('business.n.04'), Synset('business.n.05'), Synset('business.n.06'), Synset('business.n.07'), Synset('clientele.n.01'), Synset('business.n.09'), Synset('organization.n.01'), Synset('arrangement.n.03'), Synset('administration.n.02'), Synset('organization.n.04'), Synset('organization.n.05'), Synset('organization.n.06'), Synset('constitution.n.02'), Synset('company.n.01'), Synset('company.n.02'), Synset('company.n.03'), Synset('company.n.04'), Synset('caller.n.01'), Synset('company.n.06'), Synset('party.n.03'), Synset('ship's_company.n.01'), Synset('company.n.09'), Synset('enterprise.n.02'), Synset('commerce.n.01'), Synset('activity.n.01'), Synset('concern.n.04'), Synset('aim.n.02'), Synset('business_activity.n.01'), Synset('sector.n.02'), Synset('people.n.01'), Synset('acting.n.01'), Synset('social_group.n.01'), Synset('structure.n.03'), Synset('body.n.02'), Synset('administration.n.01'), Synset('orderliness.n.01'), Synset('activity.n.01'), Synset('beginning.n.05'), Synset('institution.n.01'), Synset('army_unit.n.01'), Synset('friendship.n.01'), Synset('organization.n.01'), Synset('visitor.n.01'), Synset('social_gathering.n.01'), Synset('set.n.05'), Synset('complement.n.03'), Synset('unit.n.03'), Synset('agency.n.02'), Synset('brokerage.n.02'), Synset('carrier.n.05'), Synset('chain.n.04'), Synset('firm.n.01'), Synset('franchise.n.02'), Synset('manufacturer.n.01'), Synset('partnership.n.01'), Synset('processor.n.01'), Synset('shipbuilder.n.03'), Synset('underperformer.n.02'), Synset('advertising.n.02'), Synset('agribusiness.n.01'), Synset('butchery.n.02'), Synset('construction.n.07'), Synset('discount_business.n.01'), Synset('employee-owned_enterprise.n.01'), Synset('field.n.06'), Synset('finance.n.01'), Synset('fishing.n.02'), Synset('industry.n.02'), Synset('packaging.n.01'), Synset('printing.n.02'), Synset('publication.n.04'), Synset('real-estate_business.n.01'), Synset('storage.n.03'), Synset('tourism.n.01'), Synset('transportation.n.05'), Synset('venture.n.03'), Synset('accountancy.n.01'), Synset('appointment.n.05'), Synset('career.n.01'), Synset('catering.n.01'), Synset('confectionery.n.03'), Synset('employment.n.02'), Synset('farming.n.02'), Synset('game.n.10'), Synset('metier.n.02'), Synset('photography.n.03'), Synset('position.n.06'), Synset('profession.n.02'), Synset('sport.n.02'), Synset('trade.n.02'), Synset('treadmill.n.03'), Synset('occasions.n.01'), Synset('land-office_business.n.01'), Synset('trade.n.03'), Synset('big_business.n.01'), Synset('shtik.n.02'), Synset('adhocracy.n.01'), Synset('affiliate.n.02'), Synset('alliance.n.03'), Synset('association.n.01'), Synset('blue.n.03'), Synset('bureaucracy.n.03'), Synset('company.n.04'), Synset('defense.n.09'), Synset('deputation.n.01'), Synset('enterprise.n.02'), Synset('establishment.n.05'), Synset('federation.n.01'), Synset('fiefdom.n.02'), Synset('fire_brigade.n.01'), Synset('force.n.04'), Synset('girl_scouts.n.01'), Synset('grey.n.04'), Synset('hierarchy.n.02'), Synset('host.n.06'), Synset('institution.n.01'), Synset('line_of_defense.n.01'), Synset('line_organization.n.01'), Synset('machine.n.03'), Synset('machine.n.05'), Synset('musical_organization.n.01'), Synset('nongovernmental_organization.n.01'), Synset('party.n.01'), Synset('peace_corps.n.01'), Synset('polity.n.02'), Synset('pool.n.03'), Synset('professional_organization.n.01'), Synset('quango.n.01'), Synset('tammany_hall.n.01'), Synset('union.n.01'), Synset('unit.n.03'), Synset('calendar.n.01'), Synset('classification_system.n.01'), Synset('contrivance.n.04'), Synset('coordinate_system.n.01'), Synset('data_structure.n.01'), Synset('design.n.02'), Synset('distribution.n.01'), Synset('genetic_map.n.01'), Synset('kinship_system.n.01'), Synset('lattice.n.01'), Synset('living_arrangement.n.01'), Synset('ontology.n.01'), Synset('county_council.n.01'), Synset('curia.n.01'), Synset('executive.n.02'), Synset('government_officials.n.01'), Synset('judiciary.n.01'), Synset('management.n.02'), Synset('top_brass.n.01'), Synset('nonprofit_organization.n.01'), Synset('rationalization.n.04'), Synset('reorganization.n.01'), Synset('self-organization.n.01'), Synset('syndication.n.01'), Synset('listing.n.02'), Synset('order.n.15'), Synset('randomization.n.01'), Synset('systematization.n.01'), Synset('territorialization.n.01'), Synset('collectivization.n.01'), Synset('colonization.n.01'), Synset('communization.n.02'), Synset('federation.n.03'), Synset('unionization.n.01'), Synset('broadcasting_company.n.01'), Synset('bureau_de_change.n.01'), Synset('car_company.n.01'), Synset('closed_shop.n.01'), Synset('corporate_investor.n.01'), Synset('distributor.n.03'), Synset('dot-com.n.01'), Synset('drug_company.n.01'), Synset('east_india_company.n.01'), Synset('electronics_company.n.01'), Synset('film_company.n.01'), Synset('food_company.n.01'), Synset('furniture_company.n.01'), Synset('holding_company.n.01'), Synset('joint-stock_company.n.01'), Synset('limited_company.n.01'), Synset('livery_company.n.01'), Synset('mining_company.n.01'), Synset('mover.n.04'), Synset('oil_company.n.01'), Synset('open_shop.n.01'), Synset('packaging_company.n.01'), Synset('pipeline_company.n.01'), Synset('printing_concern.n.01'), Synset('record_company.n.01'), Synset('service.n.04'), Synset('shipper.n.02'), Synset('shipping_company.n.01'), Synset('steel_company.n.01'), Synset('stock_company.n.01'), Synset('subsidiary_company.n.01'), Synset('target_company.n.01'), Synset('think_tank.n.01'), Synset('transportation_company.n.01'), Synset('union_shop.n.01'), Synset('white_knight.n.01'), Synset('trainband.n.01'), Synset('freemasonry.n.01'), Synset('ballet_company.n.01'), Synset('chorus.n.05'), Synset('circus.n.01'), Synset('minstrel_show.n.01'), Synset('minstrelsy.n.01'), Synset('opera_company.n.01'), Synset('theater_company.n.01'), Synset('attendance.n.03'), Synset('cohort.n.01'), Synset('number.n.07'), Synset('fatigue_party.n.01'), Synset('landing_party.n.01'), Synset('party_to_the_action.n.01'), Synset('rescue_party.n.01'), Synset('search_party.n.01'), Synset('stretcher_party.n.01'), Synset('war_party.n.01')]

Expanded list for the category concept 'education', 97 synsets: [Synset('education.n.01'), Synset('education.n.02'), Synset('education.n.03'), Synset('education.n.04'), Synset('education.n.05'), Synset('department_of_education.n.01'), Synset('school.n.01'), Synset('school.n.02'), Synset('school.n.03'), Synset('school.n.04'), Synset('school.n.05'), Synset('school.n.06'), Synset('school.n.07'), Synset('university.n.01'), Synset('university.n.02'), Synset('university.n.03'), Synset('activity.n.01'), Synset('content.n.05'), Synset('learning.n.01'), Synset('profession.n.02'), Synset('upbringing.n.01'), Synset('executive_department.n.01'), Synset('educational_institution.n.01'), Synset('building.n.01'), Synset('education.n.03'), Synset('body.n.02'), Synset('time_period.n.01'), Synset('educational_institution.n.01'), Synset('animal_group.n.01'), Synset('body.n.02'), Synset('establishment.n.04'), Synset('educational_institution.n.01'), Synset('coeducation.n.01'), Synset('continuing_education.n.01'), Synset('course.n.01'), Synset('elementary_education.n.01'), Synset('extension.n.04'), Synset('extracurricular_activity.n.01'), Synset('higher_education.n.01'), Synset('secondary_education.n.01'), Synset('team_teaching.n.01'), Synset('work-study_program.n.01'), Synset('enlightenment.n.01'), Synset('eruditeness.n.01'), Synset('experience.n.01'), Synset('foundation.n.04'), Synset('physical_education.n.01'), Synset('acculturation.n.03'), Synset('mastering.n.01'), Synset('school.n.03'), Synset('self-education.n.01'), Synset('special_education.n.01'), Synset('vocational_training.n.01'), Synset('teaching.n.01'), Synset('academy.n.03'), Synset('alma_mater.n.01'), Synset('conservatory.n.01'), Synset('correspondence_school.n.01'), Synset('crammer.n.03'), Synset('dance_school.n.01'), Synset('dancing_school.n.01'), Synset('day_school.n.02'), Synset('direct-grant_school.n.01'), Synset('driving_school.n.01'), Synset('finishing_school.n.01'), Synset('flying_school.n.01'), Synset('grade_school.n.01'), Synset('graduate_school.n.01'), Synset('language_school.n.01'), Synset('night_school.n.01'), Synset('nursing_school.n.01'), Synset('private_school.n.01'), Synset('public_school.n.01'), Synset('religious_school.n.01'), Synset('riding_school.n.01'), Synset('secondary_school.n.01'), Synset('secretarial_school.n.01'), Synset('sunday_school.n.01'), Synset('technical_school.n.01'), Synset('training_school.n.01'), Synset('veterinary_school.n.01'), Synset('conservatory.n.02'), Synset('day_school.n.03'), Synset('art_nouveau.n.01'), Synset('ashcan_school.n.01'), Synset('deconstructivism.n.01'), Synset('historical_school.n.01'), Synset('lake_poets.n.01'), Synset('pointillism.n.01'), Synset('secession.n.01'), Synset('gown.n.02'), Synset('varsity.n.01'), Synset('city_university.n.01'), Synset('oxbridge.n.01'), Synset('redbrick_university.n.01'), Synset('multiversity.n.01'), Synset('open_university.n.01')]

Expanded list for my target, 57 synsets: [Synset('school.n.01'), Synset('school.n.02'), Synset('school.n.03'), Synset('school.n.04'), Synset('school.n.05'), Synset('school.n.06'), Synset('school.n.07'), Synset('child.n.01'), Synset('kid.n.02'), Synset('kyd.n.01'), Synset('child.n.02'), Synset('kid.n.05'), Synset('teacher.n.01'), Synset('teacher.n.02'), Synset('educational_institution.n.01'), Synset('building.n.01'), Synset('education.n.03'), Synset('body.n.02'), Synset('time_period.n.01'), Synset('educational_institution.n.01'), Synset('animal_group.n.01'), Synset('academy.n.03'), Synset('alma_mater.n.01'), Synset('conservatory.n.01'), Synset('correspondence_school.n.01'), Synset('crammer.n.03'), Synset('dance_school.n.01'), Synset('dancing_school.n.01'), Synset('day_school.n.02'), Synset('direct-grant_school.n.01'), Synset('driving_school.n.01'), Synset('finishing_school.n.01'), Synset('flying_school.n.01'), Synset('grade_school.n.01'), Synset('graduate_school.n.01'), Synset('language_school.n.01'), Synset('night_school.n.01'), Synset('nursing_school.n.01'), Synset('private_school.n.01'), Synset('public_school.n.01'), Synset('religious_school.n.01'), Synset('riding_school.n.01'), Synset('secondary_school.n.01'), Synset('secretarial_school.n.01'), Synset('sunday_school.n.01'), Synset('technical_school.n.01'), Synset('training_school.n.01'), Synset('veterinary_school.n.01'), Synset('conservatory.n.02'), Synset('day_school.n.03'), Synset('art_nouveau.n.01'), Synset('ashcan_school.n.01'), Synset('deconstructivism.n.01'), Synset('historical_school.n.01'), Synset('lake_poets.n.01'), Synset('pointillism.n.01'), Synset('secession.n.01')]


I now have 3 vectors: target with 57 entries, business with 223, and education with 97.

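Now I compute the pairwise Wu and Palmer distances between target and business and divide the sum by 57 x 223 = 12711; then between target and education, dividing by 57 x 97 = 5529.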
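A minimal sketch of this averaged pairwise score, using a small single-parent toy taxonomy rather than WordNet. All names and data are hypothetical. The formula used is the usual Wu-Palmer form, 2 * depth(lcs) / (depth(a) + depth(b)), with the root at depth 1. This is a similarity, so larger means closer. NLTK's `Synset.wup_similarity()` also returns a similarity, so a distance would be something like 1 minus that value.

```python
from itertools import product

# Toy single-parent taxonomy (child -> parent); hypothetical data, root has parent None.
PARENT = {
    "entity": None,
    "organization": "entity",
    "company": "organization",
    "school": "organization",
    "university": "organization",
    "day_school": "school",
}

def path_to_root(node):
    """List of nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def depth(node):
    return len(path_to_root(node))  # the root has depth 1

def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_a = set(path_to_root(a))
    # First node on b's upward path that is also above (or equal to) a.
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))

def mean_pairwise(xs, ys, sim=wu_palmer):
    """Sum of sim over all pairs, divided by the number of pairs."""
    pairs = list(product(xs, ys))
    if not pairs:
        return 0.0
    return sum(sim(x, y) for x, y in pairs) / len(pairs)

target = ["school", "day_school"]
score_education = mean_pairwise(target, ["school", "university"])
score_business = mean_pairwise(target, ["company"])
```

With a similarity like this, the best category is the one with the highest mean, not the lowest. Dividing by the pair count keeps large expanded lists from winning simply because they contain more entries.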

Target to business distance: 2305.709117171037 / 5529 = 0.9125370052417936. Target to education distance: 5045.417101981877 / 12711 = 0.39693313680921066.
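The normalisation can be re-checked directly from the figures quoted above. The short script below only recomputes the pair counts and divides each posted sum by each pair count; the variable names are illustrative. Run this way, both posted results (0.9125... and 0.3969...) come out of the second sum, 5045.417101981877, divided by 5529 and by 12711 respectively, so it may be worth double-checking which sum and which pair count belong to which category.

```python
# Sizes of the three expanded lists, as stated in the post.
target, business, education = 57, 223, 97

pairs_business = target * business      # 57 x 223 = 12711
pairs_education = target * education    # 57 x 97 = 5529

# The two pairwise sums exactly as posted.
sums = [2305.709117171037, 5045.417101981877]

# Print every sum / pair-count combination for comparison with the posted values.
for total in sums:
    for pairs in (pairs_business, pairs_education):
        print(f"{total} / {pairs} = {total / pairs:.4f}")
```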

The minimum distance is to education, which is the correct answer.


2 Answers


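WordNet plus some similarity measure could be a solution. You could also use Word2Vec to determine the semantic distance between the words you get from the WordNet synset/*nyms lookup.

Maybe someone else can point to a specific library (at the moment I can't think of one that you could use directly).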
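A rough sketch of the embedding idea: average the vectors of the words in each list, then pick the category whose centroid is closest to the target's. The 3-dimensional vectors below are made up and merely stand in for trained Word2Vec embeddings; `VECTORS`, `centroid`, and `closest_category` are hypothetical names.

```python
import math

# Made-up 3-dimensional vectors standing in for trained Word2Vec embeddings.
VECTORS = {
    "school": [0.9, 0.1, 0.0],
    "teacher": [0.8, 0.2, 0.1],
    "education": [0.85, 0.15, 0.05],
    "company": [0.1, 0.9, 0.2],
    "business": [0.2, 0.8, 0.1],
}

def centroid(words):
    """Component-wise mean of the vectors of the known words."""
    vecs = [VECTORS[w] for w in words if w in VECTORS]
    if not vecs:
        raise ValueError("none of the words has a vector")
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def closest_category(target_words, categories):
    """Name of the category whose centroid has the highest cosine with the target."""
    t = centroid(target_words)
    return max(categories, key=lambda name: cosine(t, centroid(categories[name])))

best = closest_category(
    ["school", "teacher"],
    {"business": ["business", "company"], "education": ["education"]},
)
```

Words without a vector are skipped here, which matters for brand and product names that a pretrained model may not contain.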

Answered 2016-07-18T13:38:16.690

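Word2Vec for the semantic representation, combined with the maximum-likelihood approach described in the following paper, would be a good way to merge two taxonomies: http://www.ideal.ece.utexas.edu/papers/rajan05aaai.pdf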

Answered 2017-04-06T19:56:18.790