
I'm having trouble matching a string with regexp (I'm not that experienced with regexp). I have a string in which each word is followed by a forward slash and a tag. An example:

led/O by/O Timothy/PERSON R./PERSON Geithner/PERSON ,/O the/O president/O of/O the/O New/ORGANIZATION

In strings like this, I am only interested in the words that precede /PERSON. Here's the regexp pattern that I came up with:

(\w)*\/PERSON

And my code:

match = re.findall(r'(\w)*\/PERSON', string)

Basically, I am matching any word that comes before /PERSON. The output:

>>> match
['Timothy', '', 'Geithner']

My problem is the second match: for R./PERSON it is an empty string, because the dot is not a word character. I changed my regexp to:

match = re.findall(r'(\w|.*?)\/PERSON', string)

But the match now is:

['led/O by/O Timothy', ' R.', ' Geithner']

It is taking everything prior to the first /PERSON, which includes led/O by/O, instead of just matching Timothy. Could someone please help me do this matching so that the full stop of an abbreviation is included? Or at least so that I don't get an empty string as a match?

Thanks,


2 Answers


Match any run of characters other than a space ([^ ]*). You also need the star (*) inside the capture; with the star outside, the group is repeated and only keeps its last iteration:

match = re.findall(r'([^ ]*)\/PERSON', string)
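
As a rough sketch of the difference, assuming the sample sentence from the question is stored in a variable called text (my own name; the question calls it string), the three placements of the star can be compared side by side:

```python
import re

# Sample sentence from the question; the variable name `text` is my own choice.
text = ("led/O by/O Timothy/PERSON R./PERSON Geithner/PERSON ,/O "
        "the/O president/O of/O the/O New/ORGANIZATION")

# Star outside the group: the group is repeated, so only its last
# iteration (a single character) is kept.
star_outside = re.findall(r'(\w)*/PERSON', text)

# Star inside the group: whole words are captured, but "R." still yields
# an empty string because "." is not a word character.
star_inside = re.findall(r'(\w*)/PERSON', text)

# Any run of non-space characters: the dot in "R." is included.
non_space = re.findall(r'([^ ]*)/PERSON', text)

print(star_outside)
print(star_inside)
print(non_space)  # ['Timothy', 'R.', 'Geithner']
```

The forward slash does not need a backslash in a Python pattern, so /PERSON and \/PERSON behave the same here.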
answered 2013-03-31T03:17:53.783

First, (\w|.) matches "a word character, or any character". An unescaped dot matches any character (except a newline, by default), which is why you're getting those spaces.

Escaping the dot with a backslash will do the trick: (\w|\.)
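
To see the effect of the escape, here is a small sketch run against a shortened copy of the question's sentence. The variable names are mine, and putting the repetition inside the capturing group (with a non-capturing inner group) is my assumption about how the escaped pattern would be used with re.findall:

```python
import re

# Shortened copy of the sentence from the question (variable name is mine).
text = "led/O by/O Timothy/PERSON R./PERSON Geithner/PERSON ,/O the/O"

# Unescaped dot: ".*?" may consume any characters, spaces and slashes included.
unescaped = re.findall(r'(\w|.*?)/PERSON', text)

# Escaped dot, with the repetition inside the capturing group.
escaped = re.findall(r'((?:\w|\.)+)/PERSON', text)

# The same idea written as a character class.
char_class = re.findall(r'([\w.]+)/PERSON', text)

print(unescaped)   # ['led/O by/O Timothy', ' R.', ' Geithner']
print(escaped)     # ['Timothy', 'R.', 'Geithner']
print(char_class)  # ['Timothy', 'R.', 'Geithner']
```

Inside a character class the dot is literal, so [\w.] needs no backslash.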

Second, as @Ionut Hulub points out, you may want to use + instead of * to ensure you match at least one character. Either way, quantifiers are greedy by default, so the pattern will always try to match the longest run it can before the slash.

If you want to match any non-whitespace character you can use \S instead of (\w|\.), which may actually be what you want.
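
A minimal sketch of the \S variant, wrapped in a helper whose name (person_tokens) is made up for illustration. The second string is an invented edge case, not from the question, meant only to show what + changes compared with *:

```python
import re

def person_tokens(text):
    """Return every run of non-whitespace characters tagged /PERSON."""
    return re.findall(r'(\S+)/PERSON', text)

# Shortened copy of the sentence from the question.
sample = "led/O by/O Timothy/PERSON R./PERSON Geithner/PERSON ,/O"
print(person_tokens(sample))  # ['Timothy', 'R.', 'Geithner']

# Invented edge case: a bare tag with no word in front of it.
bare = "x/O /PERSON"
with_star = re.findall(r'(\S*)/PERSON', bare)  # '*' allows an empty capture
with_plus = re.findall(r'(\S+)/PERSON', bare)  # '+' needs at least one character
print(with_star)  # ['']
print(with_plus)  # []
```

Note that \S also matches a slash, so a token such as a/b/PERSON would be captured as a/b; whether that is acceptable depends on the data.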

answered 2013-03-31T03:28:21.333