sql-info.de
March 8, 2013

PostgreSQL | Capture and store a Twitter search in a single SQL query using twitter_fdw

Note: following changes to the Twitter API, the Twitter FDW no longer works.

One of the many things I've been wanting to do with this site is add a Planet PostgreSQL feed, however as I've mentioned previously this is a custom application and while knocking together a feed reader is pretty routine stuff, it's not going to leave me any more enlightened than I was before.

However recently at the PostgreSQL "Unconference" in Tokyo, one of the talks was by Hitoshi Harada who demonstrated twitter_fdw, and it occurred to me that as Planet PostgreSQL twitters the updates to its own Twitter account, it might be simple to grab the feed that way.

And it is - follow the instructions in the README.md file, execute a query like the below and back comes a list of recent tweets - no setup, login, API key etc. required.

tgg_devel=# SELECT created_at, SUBSTR(text, 0, 72) || '...' FROM twitter WHERE q='from:planetpostgres';
     created_at      |                                  ?column?                                  
---------------------+----------------------------------------------------------------------------
 2013-03-01 18:16:09 | Joshua Drake: Remembering to check the docs: Autovacuum http://t.co/hJU...
 2013-02-28 13:46:23 | Hubert 'depesz' Lubaczewski: Waiting for 9.3 – Add support for piping C...
 2013-02-28 10:16:07 | Chris Travers: The LedgerSMB Service Locator for UDF's:  How it Works h...
 2013-02-28 01:46:32 | Jeff Frost: Script to add the http://t.co/2NBk3V60Ll respository on Deb...
 2013-02-27 01:16:29 | Michael Paquier: Postgres 9.3 feature highlight: pg_xlogdump http://t.c...
 2013-02-27 00:47:28 | David Wheeler (@theory): Sqitch on Windows (and Linux, Solaris, and OS ...
 2013-02-26 12:16:34 | Dimitri Fontaine: HyperLogLog Unions http://t.co/yTZU6BZGdB...
 2013-02-25 22:16:22 | Andrew Dunstan: A few notes from the field on git http://t.co/RdDeEOK72...
 2013-02-25 15:16:32 | Hubert 'depesz' Lubaczewski: Variables in SQL, what, how, when? http://...
 2013-02-25 10:46:31 | Dimitri Fontaine: PostgreSQL HyperLogLog http://t.co/8CpSUcrQiL...
 2013-02-23 17:31:23 | Andrew Dunstan: Learning new technology http://t.co/CyD8QYmVYa...
(11 rows)

It would of course be insane to execute this query live during dynamic page creation, so I set up a cronjob to run every few minutes to poll Twitter using twitter_fdw which populates this table:

CREATE TABLE tweet_cache (
  twitter_id BIGINT NOT NULL PRIMARY KEY,
  twitter_timestamp TIMESTAMP WITHOUT TIME ZONE NOT NULL,
  query TEXT NOT NULL,
  from_user TEXT NOT NULL,
  text TEXT NOT NULL,
  source TEXT
)

It executes a single query to populate/update the table:

WITH new_tweets AS (
  SELECT from_user, text, source, 
         created_at AS twitter_timestamp,
         id AS twitter_id,
         'from:planetpostgres' AS query
    FROM twitter
   WHERE q = 'from:planetpostgres'
)
INSERT INTO tweet_cache
      (twitter_id, query, from_user, text, source, twitter_timestamp)
SELECT twitter_id, query, from_user, text, source, twitter_timestamp
  FROM new_tweets nt
 WHERE NOT EXISTS (
         SELECT NULL FROM tweet_cache tc WHERE tc.twitter_id = nt.twitter_id
       )

(I'm storing the Twitter API query string as I might want to store tweets from different sources so need a way of differentiating them).

And the output - lightly parsed and formatted by the CMS - looks like this:

Planet PostreSQL feed as retrieved from cache table

(The CMS will insert the current version of the cache so you'll always see the most recent Planet PostgreSQL entries here)

The main disadvantage of using this method as an alternative to grabbing the RSS feed directly is that Twitter links to its own URL shortener, which redirects to the postgr.es URL shortener, which finally redirects to the blog post's canonical URL. However that's not a reflection on twitter_fdw, which works as advertised.

A couple of notes :

  • After installation it's necessary to grant permission on the twitter foreign table to any users who require access to it
  • If the Twitter API is not available, twitter_fdw returns an empty set
  • Only the most recent tweets are returned, the total number varies but with @planetpostgres, FDW outputs between 10 and 13 rows. I presume that is determined by the Twitter API, not twitter_fdw.
    Addendum: comment in the twitter_fdw source says  /* API returns at most 15 results by default */

Finallty, to access RSS feeds from PostgreSQL, the Multicorn RSS Foreign Data Wrapper is probably better suited to the task (though I don't - yet - know if it can fetch, parse and store the feed in a single statement).

Posted at 3:29 PM

Comments
Regarding you note concerning the Multicorn RSS FDW, here is what you can do:

CREATE EXTENSION multicorn;

CREATE SERVER rss_srv foreign data wrapper multicorn options (
wrapper 'multicorn.rssfdw.RssFdw'
);

CREATE FOREIGN TABLE postgresqlrss (
"pubDate" timestamp,
description character varying,
title character varying,
link character varying,
guid character varying
) server rss_srv options (
url 'http://planet.postgresql.org/rss20.xml'
);

CREATE TABLE rss_cache (like postgresqlrss);
ALTER TABLE rss_cache add primary key (guid);

INSERT INTO rss_cache (
SELECT *
FROM postgresqlrss r
WHERE not exists (select 1
FROM rss_cache c
WHERE c.guid = r.guid
)
);

9.3 materialized views should be perfect for maintaining such a cache.
Posted by: Ronan Dunklau | 2013-03-08 07:36
Thanks for that, I'll write it up in one of the next posts.
Posted by: Ian Barwick | 2013-03-08 09:35