...making Linux just a little more fun!

PostgreSQL - import of large texts and dynamic content

By Anton Borisov

Introduction

PostgreSQL is an enterprise-level database. It is an open-source software, competes by some features with products from Oracle, so it is no surpise that more and more projects delegate such important duties like data-mining and data handling to PostgreSQL. And what's more important, the architectural design for this database is extremely powerful and conforms to KISS principal very close, so it is really very fun to deal with PGSQL internal programming, as well as maintaining it.

The task

One of the projects that I need to take care of is based on PostgreSQL and Liferay Portal, which in turn is based on Tomcat. Actually, it has a 3-tier architecture, where Tomcat works as a web-server, i.e. it basically is a front-end, that serves requests from around the world. Liferay's web-pages can be as: (a) plain HTML, (b) JSP (Java Server Pages), (c) or can be programmed as servlets (portlets) as well. The latter two scenarios require you to have an IDE (Integrated Development Environment) with JSP-, portlet-, JDBC-bindings deployed. Basically, either JSP, or portlet contains code, that just fetches actual SQL-data from database instance (for example, news_portal) and prepares lovely HTML-page, which shows today's weather forecast, or currency rates. However, you might be interested in generating the same page without doing time consuming efforts like downloading and installing IDE, programming new servlet and deploying it afterall. How to make it happen? Simply execute necessary SQL-requests at backyard, i.e. within operating system space, where Tomcat and PostgreSQL servers reside. You can program it in 10 minutes - in bash, python or any other scripting language. In my case I generate HTML-page, that consists of thousand lines of text, and push it back into Liferay's CMS database engine (news_lportal), so HTML contents of this page is to be displayed by Liferay itself. I also scheduled via cron how often I regenerate this information page, so Liferay would always show up-to-date news, rates, etc.

Data manipulating with PSQL

There's a native client, that comes with PostgreSQL server, called psql. Although psql is console application only it has essentially the same capabilities as it's counterpart - a GUI, GTK-based PgAdmin. If you don't have it installed in your system, please run aptitude (for Debian):

# aptitude install postgresql-client
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Reading extended state information      
Initializing package states... Done
Reading task descriptions... Done  
The following NEW packages will be installed:
  postgresql-client postgresql-client-8.3{a} postgresql-client-common{a} 
0 packages upgraded, 3 newly installed, 0 to remove and 6 not upgraded.
Need to get 2028kB of archives. After unpacking 5276kB will be used.
Do you want to continue? [Y/n/?]

Listing 1. Postgresql package consists of psql as well as other auxiliary utilities

It will install psql, as well as pg* utilities (pg_dump, pg_restore and others).

Surely, you can install a GUI application as well for performing complex tasks, like data analysis:

# aptitude install pgadmin3
PgAdmin

Figure 1. PgAdmin - graphical application for handling SQL-queries

With the help of psql you can quite easily run any SQL-statement, like this:

psql -q -n -h 127.0.0.1 news_lportal -U root -c "select userid, emailaddress from user_"

where used:

	host to connect to - 127.0.0.1
	desired database within PostgreSQL pool - news_lportal
	username, which is granted to execute SQL-command - root
	and SQL-command itself - select userid, emailaddress from user_

Alternatively, you can run psql with update operator, like this:

psql -q -n -h 127.0.0.1 news_lportal -U root -c "update journalarticle set content = '<H1>Hello, World!</H1>' where id_ = 24326"

Where ID with number 24326 is my HTML document previosuly created by CMS-engine on top of Liferay and stored inside PostgreSQL database - news_lportal.

In such a way, you could refresh any information, that is stored inside journalarticle table. The only thing you should remember about is the correct ID for your article.

However, in real life this update trick won't work as it should. I've prepared an update script (import_table.sh), where the contents of table_news.html file should be uploaded into PostgreSQL.

#!/bin/sh
    ct=`cat table_news.html`
    psql -t -l -q -n -h 127.0.0.1 news_lportal -U root -c "update journalarticle set content = '$ct' where id_ = 24326;"

Listing 2. Very simple import script (import_table.sh), first version

Argh! It didn't work out - PostgreSQL client refused to run an update command.

$ ./import_table.sh 
./import_table.sh: line 4: /usr/bin/psql: Argument list too long

At first glance, the file table_news.html seems to be quite good. But a closer look shows another catch-up - the file is a bit too large - 400Kb in size.

$ file table_news.html
table_news.html: UTF-8 Unicode text, with very long lines
$ cat table_news.html | wc
    617    2505  408460

Is there a mechanism to load any text file larger than 2Kb into database? Yes! Luckily, PostgreSQL has import/export functions that will ease communicating with file I/O operations. Let's declare our own procedure get_text_document_portal() that will load into database any text file.

- Function: get_text_document_portal(character varying) 
-- DROP FUNCTION get_text_document_portal(character varying); 
CREATE OR REPLACE FUNCTION get_text_document_portal(p_filename character varying) 
  RETURNS text AS 
$BODY$ SELECT CAST(pg_read_file(E'liferay_import/' || $1 ,0, 100000000) AS TEXT); 
$BODY$ 
  LANGUAGE sql VOLATILE SECURITY DEFINER 
  COST 100; 
ALTER FUNCTION get_text_document_portal(character varying) OWNER TO postgres; 

Listing 3. Our new procedure will call pg_read_file() function and read text file from disk

In order to load a text file into the database named news_lportal, I've written the script below (import_table_2.sh), which takes filename - in this example, table.text - as a parameter for get_text_document_portal() procedure and places it's contents into corresponding field of table journalarticle.

#!/bin/sh
    psql -q -n  -h 127.0.0.1 news_lportal -U root -c "update journalarticle set content = get_text_document_portal('table.text') where id_  = 24326;"

Listing 4. Import script (import_table_2.sh) that triggers our new pgsql-procedure

All you need to do is to change the source HTML file, named table.text and run import_table_2.sh. Please pay attention to a location, where the imported file should be placed - this is a subdirectory liferay_import under /var/lib/postgresql/8.3/main/ tree.

$ ls -l /var/lib/postgresql/8.3/main/
total 48
-rw-------  1 postgres postgres    4 Nov  9 10:20 PG_VERSION
drwx------ 10 postgres postgres 4096 Nov 10 11:16 base
drwx------  2 postgres postgres 4096 Mar  4 16:44 global
drwx------  2 postgres postgres 4096 Dec  3 18:27 liferay_import
drwx------  2 postgres postgres 4096 Nov  9 10:20 pg_clog
drwx------  4 postgres postgres 4096 Nov  9 10:20 pg_multixact
drwx------  2 postgres postgres 4096 Mar  1 13:29 pg_subtrans
drwx------  2 postgres postgres 4096 Nov  9 10:20 pg_tblspc
drwx------  2 postgres postgres 4096 Nov  9 10:20 pg_twophase
drwx------  3 postgres postgres 4096 Mar  4 12:43 pg_xlog
-rw-------  1 postgres postgres  133 Feb 11 22:09 postmaster.opts
-rw-------  1 postgres postgres   53 Feb 11 22:09 postmaster.pid
lrwxrwxrwx  1 root     root       31 Nov  9 10:20 root.crt -> /etc/postgresql-common/root.crt
lrwxrwxrwx  1 root     root       36 Nov  9 10:20 server.crt -> /etc/ssl/certs/ssl-cert-snakeoil.pem
lrwxrwxrwx  1 root     root       38 Nov  9 10:20 server.key -> /etc/ssl/private/ssl-cert-snakeoil.key

Listing 5. Owners' information for PostgreSQL disk storage pool

It is owned by postgres and can be written by this user only. Or, by root account. Of course, you could add an entry into root's crontab, but a good practice is - split jobs between different accounts. Assigning database jobs to postgres only, and every other task trust to, for instance, tomcat account. So how can tomcat user be able to write to liferay_import directory with postgres owner access bits? By making a link - symlink doesn't work, but hardlink will do!

# ln /var/lib/postgresql/8.3/main/liferay_import/table.text	/home/tomcat/db/table.text

Listing 6. Hardlink allows to override owners' limitations provided by symlink

#!/bin/sh

        /home/tomcat/db/prepare_table_news.sh > /home/tomcat/db/table.text
        /home/tomcat/db/import_table_2.sh

Listing 7. Script (mk_db.sh), that prepares arbitrary HTML-document and loads it into database

Hooray! Now I can place an entry into tomcat's crontab and get the news information updated every hour. And this is done from under tomcat account. Really nice.

$ crontab -l
# m h  dom mon dow   command
0 * * * * /home/tomcat/db/mk_db.sh > /dev/null

Listing 8. One entry in tomcat's crontab that should be executed every hour in order to update news

Conclusion

There exist different approaches how to provide up-to-date information when you deal with Liferay Portal and portlet-technology. First way requires to have dedicated developer environment preinstalled (NetBeans IDE with portlets bindings), whilst another way needs only to have a basic shell-scripting knowledge and be able to correctly construct SQL-queries. Of course, the better way is a good full-time developer with hands on IDE, JSR168 / JSR268 portlet standards, that would program whatever web-application you need, especially HTML-page with dynamically changed information. However, you can achieve the same results much quicker - simply rely upon casual Linux console tools.

Resources

[1] http://www.postgresql.org/
[2] "Practical PostgreSQL" - Joshua D. Drake, John C. Worsley, O'Reilly Media


[BIO]

Anton has jumped in into Linux world in 1997, when he first tried a tiny muLinux distribution, being run from a single floppy. Later on, Red Hat and Slackware became his favorite choice. Nowdays, Anton designs. Linux-oriented applications and middleware, and prefers to work with hardware labeled as "Powered by Linux".


Share

Talkback: Discuss this article with The Answer Gang


[BIO]

Anton jumped into Linux world in 1997, when he first tried a tiny muLinux distribution, being run from a single floppy. Later on, Red Hat and Slackware became his favorite choice. Nowdays, Anton designs Linux-oriented applications and middleware, and prefers to work with hardware labeled as "Powered by Linux".


Copyright © 2011, Anton Borisov. Released under the Open Publication License unless otherwise noted in the body of the article.

Published in Issue 186 of Linux Gazette, June 2011

Tux