parsing - Rust - How do you parse UTF-8 alphabetic characters in nom?

Question

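I am trying to parse sequences of alphabetic characters, including German umlauts (ä ö ü) and other alphabetic characters from the UTF-8 charset. This is the parser I tried first: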

named!(
    parse(&'a str) -> Self,
    map!(
        alpha1,
        |s| Self { chars: s.into() }
    )
);

But it only works for ASCII alphabetic characters (a-zA-Z). I then tried parsing char by char:

named!(
    parse(&str) -> Self,
    map!(
        take_while1!(nom::AsChar::is_alpha),
        |s| Self { chars: s.into() }
    )
);

But this does not even parse "hello"; it fails with an Incomplete(Size(1)) error.

How do you parse UTF-8 alphabetic characters in nom? Here is a snippet of my code:

extern crate nom;

#[derive(PartialEq, Debug, Eq, Clone, Hash, Ord, PartialOrd)]
pub struct Word {
    chars: String,
}

impl From<&str> for Word {
    fn from(s: &str) -> Self {
        Self {
            chars: s.into(),
        }
    }
}

use nom::*;
impl Word {
    named!(
        parse(&str) -> Self,
        map!(
            take_while1!(nom::AsChar::is_alpha),
            |s| Self { chars: s.into() }
        )
    );
}


#[test]
fn parse_word() {
    let words = vec![
        "hello",
        "Hi",
        "aha",
        "Mathematik",
        "mathematical",
        "erfüllen"
    ];
    for word in words {
        assert_eq!(Word::parse(word).unwrap().1, Word::from(word));
    }
}

When I run this test with

cargo test parse_word

I get:

thread panicked at 'called `Result::unwrap()` on an `Err` value: Incomplete(Size(1))', ...

I know that chars in Rust are already UTF-8 encoded (thank God almighty), but the nom library does not seem to behave as I expected. I am using nom 5.1.0.

score 2 · Accepted Answer

First of all, nom 5 uses functions for parsing. I recommend that form because the error messages are better and the code is cleaner.

Your requirement is odd: you could simply turn the entire input into a string and be done:

impl Word {
    fn parse(input: &str) -> IResult<&str, Self> {
        Ok((
            &input[input.len()..],
            Self {
                chars: input.to_string(),
            },
        ))
    }
}

But I guess your goal is to parse a single word, so here is an example of what you could do:

#[derive(PartialEq, Debug, Eq, Clone, Hash, Ord, PartialOrd)]
pub struct Word {
    chars: String,
}

impl From<&str> for Word {
    fn from(s: &str) -> Self {
        Self { chars: s.into() }
    }
}

use nom::{character::complete::*, combinator::*, multi::*, sequence::*, IResult};

impl Word {
    fn parse(input: &str) -> IResult<&str, Self> {
        let (input, word) =
            delimited(space0, recognize(many1_count(none_of(" \t"))), space0)(input)?;
        Ok((
            input,
            Self {
                chars: word.to_string(),
            },
        ))
    }
}

#[test]
fn parse_word() {
    let words = vec![
        "hello",
        " Hi",
        "aha ",
        " Mathematik ",
        "  mathematical",
        "erfüllen ",
    ];
    for word in words {
        assert_eq!(Word::parse(word).unwrap().1, Word::from(word.trim()));
    }
}

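You could also write a custom function that uses is_alphabetic() instead of none_of(" \t"), but that requires producing a custom error for nom, which I currently find annoying to do.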
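For illustration, the is_alphabetic() idea can be sketched with the standard library alone. The helper name split_alphabetic is hypothetical and not part of nom; it only mirrors the (rest, matched) order that nom parsers return, and reports "no match" as None instead of a nom error.

```rust
/// Hypothetical helper (not part of nom): splits `input` into the remaining
/// text and its leading run of Unicode alphabetic characters, in the same
/// (rest, matched) order that nom parsers return.
/// Returns `None` when the input does not start with an alphabetic character.
fn split_alphabetic(input: &str) -> Option<(&str, &str)> {
    // Byte offset of the first non-alphabetic char, or the end of the input.
    // `char::is_alphabetic` is Unicode-aware, unlike `is_ascii_alphabetic`.
    let end = input
        .char_indices()
        .find(|&(_, c)| !c.is_alphabetic())
        .map(|(i, _)| i)
        .unwrap_or(input.len());
    if end == 0 {
        None
    } else {
        Some((&input[end..], &input[..end]))
    }
}
```

Because the split offset comes from char_indices(), it always falls on a character boundary, so multi-byte letters such as ü are never cut in half. In nom 5's function style, the same predicate can presumably be passed to nom::bytes::complete::take_while1 as a closure over char, which would avoid writing a custom error by hand.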

score 0 · Accepted Answer

In this GitHub issue, a contributor quickly created a library (nom-unicode) that handles this nicely:

use nom_unicode::complete::{alphanumeric1};

impl Word {
    named!(
        parse(&'a str) -> Self,
        map!(
            alphanumeric1,
            |w| Self::new(w)
        )
    );
}
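The snippet above calls Self::new, which the Word type in the question does not define. A minimal constructor, assumed here purely for illustration, might look like this:

```rust
#[derive(PartialEq, Debug, Eq, Clone, Hash, Ord, PartialOrd)]
pub struct Word {
    chars: String,
}

impl Word {
    /// Assumed constructor (not shown in the thread): copies the matched
    /// slice into an owned `String`.
    pub fn new(s: &str) -> Self {
        Self {
            chars: s.to_string(),
        }
    }
}
```

The existing From<&str> impl from the question would serve the same purpose, so Self::from(w) is an equally plausible reading of what the author meant.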
