I have a CSV file like this:

"","LESCHELLES","","LESCHELLES"
"","SAINTE CROIX DE VERDON","","SAINTE CROIX DE VERDON"
"","SERRE CHEVALIER","","SERRE CHEVALIER"
"","SAINT JUST D'ARDECHE","","SAINT JUST D'ARDECHE"
"","NEUVILLE SUR VANNES","","NEUVILLE SUR VANNES"
"","ESCUEILLENS ET SAINT JUST","","ESCUEILLENS ET SAINT JUST"
"","PAS DES LANCIERS","","PAS DES LANCIERS"
"","PLAN DE CAMPAGNE","","PLAN DE CAMPAGNE"

I would like to convert it like this:

"","Leschelles","","LESCHELLES"
"","Sainte Croix De Verdon","","SAINTE CROIX DE VERDON","STE CROIX DE VERDON","93"
"","Serre Chevalier","","SERRE CHEVALIER","SERRE CHEVALIER","93"
"","Saint Just D'Ardeche","","SAINT JUST D'ARDECHE"
"","Neuville Sur Vannes","","NEUVILLE SUR VANNES"
"","Escueillens Et Saint Just","","ESCUEILLENS ET SAINT JUST","ESCUEILLENS ET ST JUST","91"
"","Luc","","LUC"
"","Pas Des Lanciers","","PAS DES LANCIERS","PAS DES LANCIERS","93"
"","Plan De Campagne","","PLAN DE CAMPAGNE","PLAN DE CAMPAGNE","93"

That would be nice. Even better: lowercase all "whole" words such as de, d', et, sur and des. This would give:

"","Leschelles","","LESCHELLES"
"","Sainte Croix de Verdon","","SAINTE CROIX DE VERDON","STE CROIX DE VERDON","93"
"","Serre Chevalier","","SERRE CHEVALIER","SERRE CHEVALIER","93"
"","Saint Just d'Ardeche","","SAINT JUST D'ARDECHE"
"","Neuville sur Vannes","","NEUVILLE SUR VANNES"
"","Escueillens et Saint Just","","ESCUEILLENS ET SAINT JUST","ESCUEILLENS ET ST JUST","91"
"","Luc","","LUC"
"","Pas des Lanciers","","PAS DES LANCIERS","PAS DES LANCIERS","93"
"","Plan de Campagne","","PLAN DE CAMPAGNE","PLAN DE CAMPAGNE","93"

4 Answers

3

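Python has title():

Return a titlecased version of the string where words start with an uppercase character and the remaining characters are lowercase.

The algorithm uses a simple language-independent definition of a word as groups of consecutive letters. The definition works in many contexts, but it means that apostrophes in contractions and possessives form word boundaries, which may not be the desired result:

>>> "they're bill's friends from the UK".title()
"They'Re Bill'S Friends From The Uk"

A workaround for apostrophes can be constructed using regular expressions: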

 import re
 def titlecase(s):
     return re.sub(r"[A-Za-z]+('[A-Za-z]+)?",
                   lambda mo: mo.group(0)[0].upper() +
                              mo.group(0)[1:].lower(),
                   s)

 titlecase("they're bill's friends.")  # returns "They're Bill's Friends."

Update: here is a solution for the French problem:

import re, sys

def titlecase(s):
    # Capitalize each word, keeping an apostrophe inside the word.
    return re.sub(r"[A-Za-z]+('[A-Za-z]+)?",
        lambda mo: mo.group(0)[0].upper() +
                   mo.group(0)[1:].lower(),
        s)

def french_parse(s):
    # Lowercase French particles surrounded by spaces, and the elided
    # articles d' and l' in front of a word (the word itself stays capitalized).
    p = re.compile(
        r"( de la | sur | sous | la | de | les | du | le | au | aux | en | des | et )|(( d'| l')([a-z]+))",
        re.IGNORECASE)
    return p.sub(
        lambda mo: mo.group().find("'")>0
                   and mo.group()[:mo.group().find("'")+1].lower() +
                       titlecase(mo.group()[mo.group().find("'")+1:])
                   or (mo.group(0)[0].upper() + mo.group(0)[1:].lower()),
        s)

for line in sys.stdin:
    # The slice assumes the name starts at character offset 20 of each line;
    # adjust it for other layouts.
    s = line[20:len(line)-1]
    p = s.find('"')
    t = s[:p]
    # Just output to show which names have been modified:
    if french_parse( titlecase(t) ) != titlecase(t):
        print('"' + french_parse( titlecase(t) ) + '"')

Launch it like this:

python thepythonscript.py < file.csv

The output will then be:

"Grenand les Sombernon"
"Touville sur Montfort"
"Fontenay en Vexin"
"Durfort Saint Martin de Sossenac"
"Monclar d'Armagnac"
"Ports sur Vienne"
"Saint Barthelemy de Beaurepaire"
"Saint Bernard du Touvet"
"Rosoy le Vieil"
answered 2012-09-17T07:21:02.190
1

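While you could solve this with some vim regex magic, I think it is easier to solve the problem in your favorite scripting language and pipe the selected text through it from within vim using the ! command. Here is an (untested) example in PHP: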

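#!/usr/bin/env php
<?php
$specialWords = array('de', 'd\'', 'et', 'du', /* etc. */ );
foreach (file('php://stdin') as $line) {
    // Lowercase first: ucwords() only changes the first letter of each word.
    // The extra delimiters make it capitalize after a quote or apostrophe too.
    $line = ucwords(strtolower($line), " \"'");
    foreach ($specialWords as $w) {
        // Put each special word back into its lowercase form.
        $line = preg_replace("/\\b$w\\b/i", $w, $line);
    }
    echo $line;
}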

Make the script executable and store it somewhere in your PATH; then select some text in vim and use :'<,'>! yourscript.php to convert it (or just :%! yourscript.php for the whole buffer).

answered 2012-09-17T07:04:36.460
0

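The csv.vim ftplugin helps with handling CSV files. Although it does not directly offer a "substitute in column N" feature, it may get you close. At the very least you can arrange the columns into neat blocks and then apply a simple regex or a visual-block selection to them.

But I think that using a different toolchain better suited to manipulating CSV files may be preferable to doing this entirely in Vim. It also depends on whether this is a one-off task or something you do often.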
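
As one illustration of such a toolchain, here is a minimal sketch using Python's standard csv module to rewrite only the second column and leave the others untouched. The names convert_name and convert_csv and the particle set are assumptions made for this example; they are not part of csv.vim or of the question.

```python
import csv
import io

# Particles kept in lowercase; this set is an assumption for the example.
PARTICLES = {"de", "du", "des", "et", "sur", "sous", "en",
             "le", "la", "les", "au", "aux"}

def convert_name(name):
    """Title-case a name, keeping particles (except the first word) lowercase."""
    words = []
    for i, word in enumerate(name.lower().split(" ")):
        if word[:2] in ("d'", "l'") and len(word) > 2:
            # Elided article: lowercase d'/l' unless it starts the name,
            # and capitalize the word that follows the apostrophe.
            head = word[:2] if i > 0 else word[:2].upper()
            word = head + word[2:].capitalize()
        elif i == 0 or word not in PARTICLES:
            word = word.capitalize()
        words.append(word)
    return " ".join(words)

def convert_csv(src, dst, column=1):
    """Copy CSV rows from src to dst, converting one column (0-based index)."""
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in csv.reader(src):
        if len(row) > column:
            row[column] = convert_name(row[column])
        writer.writerow(row)

if __name__ == "__main__":
    sample = io.StringIO(
        '"","SAINT JUST D\'ARDECHE","","SAINT JUST D\'ARDECHE"\n'
        '"","PAS DES LANCIERS","","PAS DES LANCIERS"\n')
    out = io.StringIO()
    convert_csv(sample, out)
    print(out.getvalue(), end="")
```

Letting the csv module do the parsing means quotes or commas inside a field do not need to be handled by a hand-written regex. To process a real file, pass open file objects (opened with newline="") instead of the io.StringIO buffers.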

answered 2012-09-17T08:04:21.353
0

Here is a one-line vim command.

%s/"[^"]*",\zs\("[^"]*"\)/\=substitute(substitute(submatch(0), '\<\(\a\)\(\a*\)\>', '\u\1\L\2', 'g'), '\c\<\(de\|d\|l\|sur\|le\|la\|en\|et\)\>', '\L&', 'g')

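I am hoping there are no double quotes inside the first two fields.

The idea behind this solution is to rely on :h :s\= to run a chain of functions on the second field once it has been found. The chain is: first change every word to TitleCase, then lowercase the listed particles (de, d, l, sur, le, la, en, et).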
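
For readers who do not use Vim, the same two-pass chain can be sketched in Python with re.sub. This only illustrates the idea and is not something the answer itself provides; the function name two_pass is made up for the example, and the particle list is copied from the command above.

```python
import re

# Particle list copied from the Vim command above.
PARTICLES = r"\b(de|d|l|sur|le|la|en|et)\b"

def two_pass(field):
    # Pass 1: title-case every run of ASCII letters (the \u\1\L\2 step).
    titled = re.sub(
        r"[A-Za-z]+",
        lambda mo: mo.group(0)[0].upper() + mo.group(0)[1:].lower(),
        field)
    # Pass 2: lowercase the particles, matched case-insensitively
    # (the \c ... \L& step).
    return re.sub(PARTICLES, lambda mo: mo.group(0).lower(), titled,
                  flags=re.IGNORECASE)

print(two_pass("SAINT JUST D'ARDECHE"))
print(two_pass("NEUVILLE SUR VANNES"))
```

Like the Vim command, this also lowercases a particle at the very start of a name (LE HAVRE becomes le Havre), so names beginning with an article may need a follow-up fix.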

answered 2012-09-17T10:06:30.767