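I am trying to write an XML parser that parses all the courses a university offers for a given calendar year and semester. In particular, I am trying to get the department's acronym (e.g., FIN for Finance), the course number (e.g., for Math 415, the number is 415), the course name, and the number of credit hours the course is worth.
The file I am trying to parse can be found here.
Edit and update
After reading more about XML parsing and the best ways to optimize it, I came across this blog post.
Assuming the test results in that article are honest and accurate, XmlReader appears to far outperform both XDocument and XmlDocument, which confirms what the excellent answers below say. With that in mind, I rewrote my parser class using XmlReader, while limiting the number of readers used in any single method.
Here is the new parser class: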
public void ParseDepartments()
{
    // Create a reader for the given calendar year and semester XML file
    using (XmlReader reader = XmlReader.Create(xmlPath)) {
        reader.ReadToFollowing("subjects"); // Navigate to the element 'subjects'
        while (!reader.EOF) {
            string pth = reader.GetAttribute("href"); // Get the department's XML path
            string acro = reader.GetAttribute("id"); // Get the department's acronym
            reader.Read(); // Advance to the next node so that every element is visited
            if (!string.IsNullOrEmpty(acro)) { // If the acronym is valid, add it to the department list
                deps.AddDepartment(acro, pth);
            }
        }
    }
}
public void ParseDepCourses()
{
    // Loop through all the departments and visit their respective XML files
    foreach (KeyValuePair<string, string> department in deps.DepartmentPaths) {
        try {
            using (XmlReader reader = XmlReader.Create(department.Value)) {
                reader.ReadToFollowing("courses"); // Navigate to the element 'courses'
                while (!reader.EOF) {
                    string pth = reader.GetAttribute("href"); // Get the course's XML path
                    string num = reader.GetAttribute("id"); // Get the course number
                    reader.Read(); // Advance to the next node (the text node, for a course element)
                    if (!string.IsNullOrEmpty(num)) {
                        string crseName = reader.Value; // reader.Value is now the element's text, i.e. <elementTag>Value</elementTag>
                        deps[department.Key].Add(new CourseObject(num, crseName, termID, pth)); // Add the course to the department's course list
                    }
                }
            }
        } catch (WebException) { } // A WebException (error 404) is thrown when no XML file is found, i.e. the department has no courses
    }
}
public void ParseCourseInformation()
{
    // Regular expression that checks whether a section is a 'Lecture' section;
    // if it is, that section's XML file is visited and its instructor is added
    Regex expr = new Regex(@"^\S(L*)\d\b|^\S(L*)\b|^\S\d\b|^\S\b", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
    foreach (KeyValuePair<string, Collection<CourseObject>> pair in deps) {
        foreach (CourseObject crse in pair.Value) {
            try {
                using (XmlReader reader = XmlReader.Create(crse.XmlPath)) {
                    reader.ReadToFollowing("creditHours"); // Navigate to the course's credit hours
                    reader.Read(); // Move onto the text node; reader.Value is empty while positioned on the element itself
                    crse.ParseCreditHours(reader.Value); // Class method that parses the string and grabs the correct integer values
                    reader.ReadToFollowing("sections"); // Navigate to the element 'sections'
                    while (!reader.EOF) {
                        string pth = reader.GetAttribute("href"); // Get the section's XML path
                        string crn = reader.GetAttribute("id"); // Get the section's CRN
                        reader.Read();
                        if (!string.IsNullOrEmpty(crn)) {
                            string sction = reader.Value;
                            if (expr.IsMatch(sction)) { // Check whether sction is a 'Lecture' section
                                using (XmlReader reader2 = XmlReader.Create(pth)) { // Open the section's XML file
                                    reader2.ReadToFollowing("instructors"); // Navigate to the element 'instructors'
                                    while (!reader2.EOF) {
                                        string firstName = reader2.GetAttribute("firstName");
                                        string lastName = reader2.GetAttribute("lastName");
                                        reader2.Read();
                                        if (!string.IsNullOrEmpty(firstName) && !string.IsNullOrEmpty(lastName)) { // Make sure it is a valid name
                                            string instr = firstName + ". " + lastName; // Concatenate into the full name
                                            crse.AddSection(pth, sction, crn, instr); // Add the section to the course
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            } catch (WebException) { } // No course/section information found
        }
    }
}
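For anyone who wants to try the reading pattern without hitting the network, here is a minimal sketch that runs it against an in-memory string. The sample XML, the class name XmlReaderSketch and both helper methods are made up for illustration; the real feed's layout may well differ. It shows two things the methods above rely on: GetAttribute works while the reader is positioned on an element, and the element's text only becomes available through reader.Value after Read() has moved onto the text node.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class XmlReaderSketch
{
    // Hypothetical sample data; NOT the real feed, whose layout may differ.
    public const string SampleXml =
        "<term>" +
        "<creditHours>3 hours.</creditHours>" +
        "<subjects>" +
        "<subject id=\"FIN\" href=\"http://example.invalid/FIN.xml\">Finance</subject>" +
        "<subject id=\"MATH\" href=\"http://example.invalid/MATH.xml\">Mathematics</subject>" +
        "</subjects>" +
        "</term>";

    // Returns the text of the first <creditHours> element, or null if there is none.
    public static string ReadCreditHours(string xml)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xml))) {
            if (!reader.ReadToFollowing("creditHours")) {
                return null;
            }
            // On the element itself reader.Value is empty; Read() moves onto the text node.
            reader.Read();
            return reader.NodeType == XmlNodeType.Text ? reader.Value : string.Empty;
        }
    }

    // Collects (id, href) pairs from every <subject> element after <subjects>.
    public static List<KeyValuePair<string, string>> ReadSubjects(string xml)
    {
        var result = new List<KeyValuePair<string, string>>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml))) {
            if (!reader.ReadToFollowing("subjects")) {
                return result;
            }
            while (reader.Read()) {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "subject") {
                    string id = reader.GetAttribute("id");
                    string href = reader.GetAttribute("href");
                    if (!string.IsNullOrEmpty(id)) {
                        result.Add(new KeyValuePair<string, string>(id, href));
                    }
                }
            }
        }
        return result;
    }
}
```

Calling XmlReaderSketch.ReadCreditHours(XmlReaderSketch.SampleXml) returns "3 hours.", and ReadSubjects returns the FIN and MATH entries with their href values. Checking NodeType and Name inside the loop is a slightly stricter variant of the attribute-null check used in the parser above; either approach visits every node once.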
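Although this code takes quite a long time to run (anywhere from 10 to 30 minutes), that is to be expected given the large amount of data to parse. Thanks to everyone who posted an answer; it is much appreciated. I hope this helps anyone else who has a similar problem or question.
Thanks,
David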