Consider the following "tweets" (left) and "retweets" (right) tables:

  +----------+-----------------+     +----------+----+
  | tweet_id |  text           |     | tweet_id | rt |
  +----------+-----------------+     +----------+----+
  |  1       | foo {RT|123} bar|     |  1       | 123|
  |  2       | foobar          |     |  3       | 456|
  |  3       | {RT|456} baz    |     |  4       | 789|
  |  4       | bazbar {RT|789} |     +----------+----+
  |  5       | bar baz         |
  +----------+-----------------+

The tweets table contains millions of preprocessed tweets. Some tweets carry a custom label of the form {RT|xx}, where xx is a 17- to 20-digit number. The retweets table is currently empty and needs to be filled as shown above: tweets.text should be scanned for {RT|xx} labels, and wherever one is found, the number should be extracted from the label and inserted into the retweets table together with the tweet_id.

To do this, I started by selecting all tweets that have {RT|xx} labels:

SELECT * FROM tweets WHERE `text` LIKE '%{RT|%'

The second step would be to loop through the result set in PHP, extract the number from each label with a regular expression, and then run an INSERT INTO for every match. That would take a lot of time, though, which makes me wonder whether a single SQL query would be faster. If so, what would that query look like? I have never used regular expressions in SQL statements before.
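
For reference, the client-side approach I have in mind looks roughly like the sketch below. It is written in Python rather than PHP, and an in-memory SQLite database stands in for the real one. Both are assumptions for illustration only, as are the names `RT_LABEL` and `extract_retweets`. The numbers are stored as text here because a 20-digit value may not fit a signed 64-bit integer.

```python
import re
import sqlite3

# Captures the number inside a {RT|xx} label. Use \d{17,20} instead of \d+
# to enforce the real length; \d+ is used so the short sample values match.
RT_LABEL = re.compile(r"\{RT\|(\d+)\}")


def extract_retweets(rows):
    """Return (tweet_id, rt) pairs for rows whose text has an {RT|xx} label."""
    pairs = []
    for tweet_id, text in rows:
        match = RT_LABEL.search(text)
        if match:
            pairs.append((tweet_id, match.group(1)))
    return pairs


# In-memory stand-in for the real database, loaded with the sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (tweet_id INTEGER PRIMARY KEY, text TEXT)")
conn.execute("CREATE TABLE retweets (tweet_id INTEGER, rt TEXT)")
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?)",
    [(1, "foo {RT|123} bar"), (2, "foobar"), (3, "{RT|456} baz"),
     (4, "bazbar {RT|789}"), (5, "bar baz")],
)

# Step 1: preselect candidate rows. Step 2: extract and insert in one batch.
rows = conn.execute(
    "SELECT tweet_id, text FROM tweets WHERE text LIKE '%{RT|%' ORDER BY tweet_id"
).fetchall()
pairs = extract_retweets(rows)
conn.executemany("INSERT INTO retweets (tweet_id, rt) VALUES (?, ?)", pairs)
conn.commit()
```

Batching the inserts with `executemany` avoids one round trip per row, but every candidate tweet still has to travel to the client and back, which is the cost I would like to avoid.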

3 Answers

Maybe something like this (untested):

SELECT SUBSTR(
    `text`,
    LOCATE('{RT|', `text`) + 4,
    LOCATE('}', `text`, LOCATE('{RT|', `text`)) - LOCATE('{RT|', `text`) - 4
)
FROM `tweets`
WHERE `text` LIKE '%{RT|%';
answered 2012-04-06T14:01:17.373

If your database is MySQL, you can do it with a single query:

INSERT INTO `retweets` (`tweet_id`, `rt`)
SELECT `tweet_id`,
       SUBSTR(`text`, LOCATE('{RT|', `text`) + 4,
              LOCATE('}', `text`, LOCATE('{RT|', `text`)) - LOCATE('{RT|', `text`) - 4) AS `num`
FROM `tweets`
WHERE `text` LIKE '%{RT|%'
HAVING `num` REGEXP '^[0-9]+$';
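
To try the same single-statement idea without a MySQL server, here is a rough equivalent run through Python's `sqlite3` module. This is only a sketch under assumptions: SQLite has no `LOCATE` or built-in `REGEXP`, so `instr`/`substr` and a `GLOB` digit check are swapped in, the function name `fill_retweets` is made up, and rows 6 and 7 are hypothetical edge cases that are not in the question.

```python
import sqlite3

FILL_SQL = """
INSERT INTO retweets (tweet_id, rt)
SELECT tweet_id, rt FROM (
    -- Everything before the first '}' that follows the label start.
    SELECT tweet_id, substr(tail, 1, instr(tail, '}') - 1) AS rt
    FROM (
        -- tail = the text right after the first '{RT|'.
        SELECT tweet_id, substr(text, instr(text, '{RT|') + 4) AS tail
        FROM tweets
        WHERE text LIKE '%{RT|%'
    )
)
-- Keep only non-empty, all-digit values.
WHERE rt <> '' AND rt NOT GLOB '*[^0-9]*'
"""


def fill_retweets(conn):
    """Populate retweets from tweets in one statement; return the stored rows."""
    conn.execute(FILL_SQL)
    conn.commit()
    return conn.execute(
        "SELECT tweet_id, rt FROM retweets ORDER BY tweet_id"
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (tweet_id INTEGER PRIMARY KEY, text TEXT)")
conn.execute("CREATE TABLE retweets (tweet_id INTEGER, rt TEXT)")
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?)",
    [(1, "foo {RT|123} bar"), (2, "foobar"), (3, "{RT|456} baz"),
     (4, "bazbar {RT|789}"), (5, "bar baz"),
     (6, "a } b {RT|42} c"),   # hypothetical: stray '}' before the label
     (7, "{RT|abc} baz")],     # hypothetical: non-numeric label, skipped
)
result = fill_retweets(conn)
```

The extraction happens entirely inside the database, so no tweet text has to be shipped to the client.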
于 2012-04-06T14:36:45.430 回答

This will work in Oracle:

SELECT tweet_id, REGEXP_SUBSTR(REGEXP_SUBSTR(text, '\{RT\|[0-9]+\}'), '[0-9]+') AS rt FROM tweets WHERE text LIKE '%{RT|%'
answered 2012-04-06T14:14:56.010