I wouldn't use any of the tools you mentioned.
You need to draw a high-level diagram (I like pencil and paper).
I would design a system with separate modules doing different things, and it would be worthwhile to design it so that you can have many instances of every module running in parallel.
I would think about using multiple queues for:
- URLs to Crawl
- Crawled pages from the web
- Extracted information based on templates & business rules
- Parsed results
- Normalized & filtered results
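
As a minimal sketch of what those queues might look like, here is one way to set them up, assuming Redis lists as the transport (any message broker would work just as well); the queue names and helper functions are my own, purely for illustration:

```python
import json
import redis

# One queue per stage of the pipeline (illustrative names).
URLS_TO_CRAWL  = "urls_to_crawl"
CRAWLED_PAGES  = "crawled_pages"
EXTRACTED_DATA = "extracted_data"
PARSED_RESULTS = "parsed_results"
FINAL_RESULTS  = "normalized_filtered_results"

r = redis.Redis()  # every worker on every PC points at the same Redis host

def push(queue_name, item):
    """Append a work item (serialized as JSON) to the tail of a queue."""
    r.rpush(queue_name, json.dumps(item))

def pop(queue_name):
    """Block until an item is available, then take it from the head of a queue."""
    _key, raw = r.blpop(queue_name)
    return json.loads(raw)
```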
You would have simple programs (probably command-line with no UI) that read data from the queues and insert data into one or more queues (the crawler would feed both the "URLs to Crawl" and "Crawled pages from the web" queues). You could use:
- A web crawler
- A data extractor
- A parser
- A normalizer and filterer
These would fit between the queues, and you could run many copies of them on separate PCs, allowing the whole thing to scale.
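
Each of these programs is just a loop: pop from its input queue, do its one job, push to its output queue(s). A rough sketch of the crawler, reusing the `push`/`pop` helpers and queue names from the sketch above (the HTTP client and the `extract_links` helper are placeholders for whatever fetching and link extraction you actually use):

```python
import requests  # stand-in for whatever HTTP client you prefer

def crawler_worker():
    """Pop a URL, fetch it, then feed both the page queue and the URL queue."""
    while True:
        job = pop(URLS_TO_CRAWL)
        response = requests.get(job["url"], timeout=30)

        # Hand the raw page off to the extractor stage.
        push(CRAWLED_PAGES, {"url": job["url"], "html": response.text})

        # Newly discovered links go back onto the URL queue
        # for this copy or any other crawler copy to pick up.
        for link in extract_links(response.text):  # extract_links is hypothetical
            push(URLS_TO_CRAWL, {"url": link})
```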
The last queue could be fed to another program that posts everything into a database for actual use.
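
That final consumer is just as simple: drain the last queue and insert rows. A sketch using SQLite and the same `pop` helper (substitute whatever database you actually post to; the table and columns here are made up for illustration):

```python
import json
import sqlite3

def db_writer():
    """Pop normalized & filtered results and insert them into the database."""
    conn = sqlite3.connect("results.db")
    conn.execute("CREATE TABLE IF NOT EXISTS results (url TEXT, data TEXT)")
    while True:
        item = pop(FINAL_RESULTS)
        conn.execute("INSERT INTO results (url, data) VALUES (?, ?)",
                     (item["url"], json.dumps(item["data"])))
        conn.commit()
```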