7

We are building tools for mining information from the web. We have several pieces, for example:

  • Crawl data from the web
  • Extract information based on templates & business rules
  • Parse results into a database
  • Apply normalization & filtering rules
  • Etc., etc.

The problem lies in troubleshooting issues and getting a good "high-level picture" of what is happening at each stage.

What techniques help you understand and manage complex processes?

  • Use workflow tools, like Windows Workflow Foundation
  • Wrap the individual functions into command-line tools and chain them together with scripting tools
  • Write a domain-specific language (DSL) to specify, at a higher level, the order in which things should happen.

Just curious how you handle systems with many interacting components. We would like to document/understand how the system works at a level higher than tracing through the source code.
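For illustration, the stages above could be chained in-process before reaching for a full workflow engine or DSL; everything below (the stage names and their toy bodies) is a hypothetical sketch:

```python
# Minimal pipeline sketch: each stage is a plain function, and the
# "specification" is just an ordered list of stages. All stage
# implementations here are hypothetical placeholders.

def crawl(urls):
    # Stand-in for fetching each URL and returning raw pages.
    return [f"<html>{u}</html>" for u in urls]

def extract(pages):
    # Stand-in for template/business-rule extraction.
    return [p[len("<html>"):-len("</html>")] for p in pages]

def normalize(records):
    # Stand-in for normalization and filtering rules.
    return [r.lower() for r in records]

PIPELINE = [crawl, extract, normalize]

def run(pipeline, data):
    for stage in pipeline:
        print("stage:", stage.__name__)  # high-level trace of the flow
        data = stage(data)
    return data

print(run(PIPELINE, ["http://Example.com"]))
```

Reading `PIPELINE` gives the "big picture" in one place, which is much of what a DSL would buy.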

8 Answers

3

I use AT&T's famous Graphviz; it's simple and it works well. It's also the same library Doxygen uses.

Also, with a little effort you can get very nice-looking diagrams.

Forgot to mention: the way I use it is as follows (since Graphviz parses Graphviz scripts). I have the system log events in Graphviz format, so I just parse the log files and get a nice graph.
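A sketch of that log-to-Graphviz idea, assuming an invented `source -> target` log format:

```python
# Turn event-log lines of the form "source -> target" into a Graphviz
# DOT script. The log format is an invented example; real logs would
# need their own parsing.

def logs_to_dot(lines):
    edges = []
    for line in lines:
        if "->" in line:
            src, dst = (part.strip() for part in line.split("->", 1))
            edges.append(f'    "{src}" -> "{dst}";')
    return "digraph pipeline {\n" + "\n".join(edges) + "\n}"

log = [
    "crawler -> extractor",
    "extractor -> parser",
    "parser -> database",
]
print(logs_to_dot(log))
```

Save the output and render it with `dot -Tpng pipeline.dot -o pipeline.png`.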

Answered 2008-11-20T01:41:31.613
2

The code spells out what happens at each stage. Using a DSL would be a boon, but probably not if it comes at the cost of writing your own scripting language and/or compiler.

Higher level documentation should not include details of what happens at each step; it should provide an overview of the steps and how they relate together.

Good tips:

  • Visualize your database schema relations.
  • Use Visio or other tools (like the one you mentioned - haven't used it) for process overviews (imho this belongs in the specification of your project).
  • Make sure your code is properly structured / compartmentalized / etc.
  • Make sure you have some sort of project specification (or some other "general" documentation that explains what the system does on an abstract level).

I wouldn't recommend building command-line tools unless you actually have a use for them. There's no need to maintain tools you don't use. (That's not the same as saying they can't be useful; but most of what you do sounds more like it belongs in a library than in external processes.)

Answered 2008-11-20T01:46:43.317
1

My company writes functional specifications for each major component. Each spec follows a common format, and uses various diagrams and pictures as appropriate. Our specs have a functional part and a technical part. The functional part describes what the component does at a high level (why, what goals it solves, what it does not do, what it interacts with, external documents that are related, etc.). The technical part describes the most important classes in the component and any high-level design patterns.

We prefer text because it is the most versatile and easiest to update. This is a big deal -- not everyone is an expert (or even decent) at Visio or Dia, and that can be an obstacle to keeping the documents up to date. We write the specs on a wiki so that we can easily link between specifications (as well as track changes), which allows for a non-linear walk through the system.

For an argument from authority, Joel recommends Functional Specs here and here.

Answered 2008-11-20T02:12:00.710
1

I find a dependency structure matrix a helpful way to analyze the structure of an application. A tool like Lattix could help.

Depending on your platform and toolchain, there are many really useful static analysis packages that could help you document the relationships between subsystems or components of your application. For the .NET platform, NDepend is a good example. There are many others for other platforms, though.

Having a good design or model before building the system is the best way to give the team an understanding of how the application should be structured, but tools like those I mentioned can help enforce architectural rules and will often give you insights into the design that just trawling through the code cannot.
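To make the idea concrete, a dependency structure matrix is just a square table of who depends on whom; the module names and dependency map below are made up:

```python
# Tiny dependency structure matrix (DSM) built from a hand-written
# dependency map. Module names are hypothetical.

deps = {
    "crawler":   set(),
    "extractor": {"crawler"},
    "parser":    {"extractor"},
    "db":        {"parser", "extractor"},
}

modules = sorted(deps)

def dsm(deps, modules):
    # Cell [i][j] is 1 when modules[i] depends on modules[j].
    return [[1 if col in deps[row] else 0 for col in modules]
            for row in modules]

for name, row in zip(modules, dsm(deps, modules)):
    print(f"{name:10}", row)
```

Tools like Lattix automate this over a real code base and can flag dependency cycles, which are hard to see when reading one file at a time.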

Answered 2008-11-21T18:30:59.190
1

I wouldn't use any of the tools you mentioned.

You need to draw a high-level diagram (I like pencil and paper).

I would design a system with different modules doing different things; it would be worthwhile to design it so that you can have many instances of every module running in parallel.

I would think about using multiple queues for

  • URLs to Crawl
  • Crawled pages from the web
  • Extracted information based on templates & business rules
  • Parsed results
  • Normalized & filtered results

You would have simple programs (probably command-line, with no UI) that read data from the queues and insert data into one or more queues (the crawler would feed both the "URLs to Crawl" and "Crawled pages from the web" queues). You could use:

  • A web crawler
  • A data extractor
  • A parser
  • A normalizer and filterer

These would fit between the queues, and you could run many copies of these on separate PCs, allowing this to scale.

The last queue could be fed to another program that actually posts everything into a database for actual use.
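The queue-and-worker layout above can be sketched in-process with the standard library; a real deployment would put a message broker between the PCs, and the stage bodies here are stand-ins:

```python
import queue
import threading

def worker(stage, inbox, outbox):
    # Drain the inbox, transform each item, and feed the next queue.
    # A None sentinel shuts the worker down and is propagated onward.
    while True:
        item = inbox.get()
        if item is None:
            outbox.put(None)
            return
        outbox.put(stage(item))

# Hypothetical stage implementations.
def crawl(url):     return f"page({url})"
def extract(page):  return f"record({page})"
def normalize(r):   return r.upper()

stages = [crawl, extract, normalize]
queues = [queue.Queue() for _ in range(len(stages) + 1)]

threads = [threading.Thread(target=worker, args=(s, queues[i], queues[i + 1]))
           for i, s in enumerate(stages)]
for t in threads:
    t.start()

queues[0].put("http://example.com")
queues[0].put(None)
for t in threads:
    t.join()

result = queues[-1].get()
print(result)
```

Because each worker only touches its two queues, running extra copies of a slow stage (or moving one to another PC behind a broker) doesn't change the design.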

Answered 2008-11-21T20:15:32.270
0

Top-down design helps a lot. One mistake I see is treating the top-down design as sacred. Your top-level design needs to be reviewed and updated just like any other section of code.

Answered 2008-11-21T18:34:38.107
0

It's important to partition these components throughout your software development life cycle - design time, development time, testing, release and runtime. Just drawing a diagram isn't enough.

I have found that adopting a microkernel architecture can really help "divide and conquer" this complexity. The essence of the microkernel architecture is:

  • Processes (each component runs in an isolated memory space)
  • Threads (each component runs on a separate thread)
  • Communication (components communicate through a single, simple message passing channel)
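A toy version of that single message-passing channel, with each component isolated on its own thread (the component names are invented):

```python
import queue
import threading

# Toy microkernel: a router thread delivers (target, payload) messages
# from one shared bus to per-component inboxes. Names are invented.

bus = queue.Queue()
inboxes = {"doubler": queue.Queue()}
out = queue.Queue()

def router():
    while True:
        msg = bus.get()
        if msg is None:                      # shutdown sentinel
            for box in inboxes.values():
                box.put(None)
            return
        target, payload = msg
        inboxes[target].put(payload)

def doubler():
    # One isolated component: it only ever sees its own inbox.
    while True:
        payload = inboxes["doubler"].get()
        if payload is None:
            return
        out.put(payload * 2)

threads = [threading.Thread(target=router), threading.Thread(target=doubler)]
for t in threads:
    t.start()
bus.put(("doubler", 21))
bus.put(None)
for t in threads:
    t.join()

result = out.get()
print(result)
```

Swapping the threads for processes and the bus for a real transport (as with Autosys and TIBCO below) keeps the same shape at deployment scale.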

I have written a fairly complex batch processing system, which sounds similar to yours, using:

  • Each component maps to a .NET executable
  • Executable lifetimes are managed through Autosys (all on the same machine)
  • Communication takes place through TIBCO Rendezvous

If you can use a toolkit that provides some runtime introspection, even better. For example, Autosys lets me see what processes are running and what errors have occurred, while TIBCO lets me inspect message queues at runtime.

Answered 2008-11-22T16:30:45.453
0

I like to use NDepend to reverse engineer a complex .NET code base. The tool comes with several great visualization features, like:

  • Dependency Graph
  • Dependency Matrix
  • Code metric visualization through treemapping

Answered 2010-10-18T17:57:37.780