Bash Script and Node.js Generator for Seeding Postgres Database

February 16th 2020

With the popularity of Node.js and the recent full-stack movement to Javascript, NoSQL databases such as MongoDB have garnered a lot of attention as of late. However, there are still plenty of production applications relying on the performant capabilities of Relational Database Management Systems (RDBMS) such as MySQL and Postgres. Using database adapters and ORMs such as Sequelize, we can combine an RDBMS with the power of Node.js to complete a well-rounded server stack that is performant and scalable.

Today, we’re going to look at using a simple bash script that acts on our Postgres DB to define our database, create tables, and seed them using a generator script. Our simple project will be composed of:

  • A Bash script that executes our commands using psql (the Postgres interactive terminal) and Node.js
  • A SQL file that stores our database definition and table schemas for our Postgres DB
  • A Node.js generator script that uses the faker library to create a CSV file for seeding our database
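
Assuming everything lives in a single project folder, the layout we are working toward looks roughly like this (the folder name is arbitrary, and posts.csv is created later by the generator script):

blog-seed/
  schema.sql        # database and table definitions
  generator.js      # Node.js generator that writes a CSV of fake posts
  seed.sh           # bash script that ties everything together
  posts.csv         # generated seed data (output)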

This simple project could easily be used in a Docker image to deploy and scale across remote hosts, but for this tutorial we will assume that everything is done locally. You will need Postgres installed on your local machine, and you will need a Postgres user with the superuser role in order to run some of the Postgres commands in our tutorial. Mac users can install it with Homebrew via brew install postgres and start the Postgres database as a service to daemonize the process: brew services start postgres. For other operating systems you will want to consult the Postgres documentation for getting started.
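
If you don’t already have a superuser role, and since the generator script later in this tutorial depends on a couple of npm packages, a one-time setup might look something like this (the role name superuser is simply what the rest of this tutorial assumes; substitute your own):

# create a Postgres superuser role for the tutorial
createuser --superuser superuser
# install the Node.js packages the generator script relies on
npm init -y
npm install faker yargs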

For our tutorial we are going to set up a simple database that has two tables: a users table and a posts table, where users have a one-to-many relationship with posts. We will seed the users table with a handful of rows directly in the schema file, and then seed the posts table with our Bash script and the Node.js generator file. Let’s first take a look at the SQL file that defines our database and table schema:

-- schema.sql
-- Since we might run the import many times we'll drop if exists
DROP DATABASE IF EXISTS blog;

CREATE DATABASE blog;

-- Make sure we're using our `blog` database
\c blog

-- We can create our users table
-- (note: `user` is a reserved word in Postgres, so the table is named `users`)
CREATE TABLE IF NOT EXISTS users (
  id SERIAL PRIMARY KEY,
  username VARCHAR,
  email VARCHAR
);

-- Seed a handful of users so the generated posts have valid userId references
INSERT INTO users (username, email)
SELECT 'user' || n, 'user' || n || '@example.com'
FROM generate_series(1, 10) AS n;

-- We can create our posts table
CREATE TABLE IF NOT EXISTS posts (
  id SERIAL PRIMARY KEY,
  userId INTEGER REFERENCES users(id),
  title VARCHAR,
  content TEXT,
  image VARCHAR,
  date DATE DEFAULT CURRENT_DATE
);

Now that our schema is defined we can test importing it into Postgres using the psql utility:

psql postgres < schema.sql

With any luck you should see the database and tables being created. You can also do this as a user that has the CREATEDB privilege, such as:

psql -U superuser postgres < schema.sql
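
To confirm the import worked, you can list the tables in the new blog database (assuming the same superuser role):

psql -U superuser -d blog -c '\dt'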

With our schema.sql file working we can now move on to our generator script, which will generate seed data that we can then COPY into our database. We will use the faker library to generate this data, and we will also look at how to handle the write stream's buffer filling up so we can write large amounts of seed data. We can create our generator.js file and add the required packages to it:

// generator.js
const fs = require('fs')
const faker = require('faker')
const argv = require('yargs').argv

const lines = argv.lines || 1000000
const filename = argv.output || 'posts.csv'
const writeStream = fs.createWriteStream(filename)

We are using the yargs package to handle command line input options for us, and if those aren’t provided we fall back to sensible defaults. We’re also initializing our writeStream so we have a reference to it in our generator. Since our output is a CSV file, we would expect one line in our posts.csv file to look like:

userId,title,content,image,date
3,"First Post Title","Lorem ipsum dolor sed",http://lorempixel.com/640/480/nature,2020-01-15

Note that we’re purposely leaving off our id field because Postgres will auto-generate and auto-increment that field for us so the values are all unique. We also wrap the free-text title and content fields in double quotes so any commas inside the generated text don’t break the CSV. Let’s define a helper function that generates one line (one post entry) that we can use to write into our CSV:

const createPost = () => {
  // pick a userId between 1 and 10 to match the users seeded in schema.sql
  const userId = faker.random.number({ min: 1, max: 10 })
  const title = faker.hacker.phrase()
  const content = faker.lorem.paragraph()
  const image = faker.image.image()
  // format as YYYY-MM-DD so Postgres can parse it as a DATE
  const date = faker.date.recent().toISOString().slice(0, 10)

  // quote the free-text fields so commas in the generated text don't break the CSV
  return `${userId},"${title}","${content}",${image},${date}\n`
}

Now that we have a function that generates one line for our CSV, we can work on the function that loops through the number of lines we want to create, calling our createPost() function and writing each line to our file. For large streams, such as millions of lines of generated data, we need to handle the case where the stream’s internal buffer is full: we listen for the ‘drain’ event to fire and then resume our writing. Let’s stub out our function:

const startWriting = (writeStream, encoding, done) => {
  let i = lines
  function writing(){
    let canWrite = true
    do {
      i--
      //check if i === 0 so we would write and call `done`
      //else call write, capture its return value in `canWrite`, and keep looping
    } while(i > 0 && canWrite)
    if(i > 0 && !canWrite){
      //our buffer for the stream filled and we need to wait for 'drain'
      // Write some more once it drains.
      writeStream.once('drain', writing);
    }
  }
  //initiate our writing function
  writing()
}

Alright, so we have handled the edge case where our writeStream’s buffer fills up: we wait for the ‘drain’ event and then continue with our writing function. Now we need to finish the body of our do...while loop: create a post, capture the return value of write() in canWrite so the loop pauses when the buffer is full, and fire the done callback once we have written all i records:

const startWriting = (writeStream, encoding, done) => {
  let i = lines
  function writing(){
    let canWrite = true
    do {
      i--
      let post = createPost()
      //check if i === 0 so we would write and call `done`
      if(i === 0){
        // we are done so fire callback
        writeStream.write(post, encoding, done)
      }else{
        // we are not done so don't fire callback;
        // capture write()'s return value so we know if the buffer still has room
        canWrite = writeStream.write(post, encoding)
      }
    } while(i > 0 && canWrite)
    if(i > 0 && !canWrite){
      //our buffer for the stream filled and we need to wait for drain
      // Write some more once it drains.
      writeStream.once('drain', writing);
    }
  }
  writing()
}

Now that we have our startWriting function defined we can finish our generator. Let’s put the whole thing together:

// generator.js
const fs = require('fs')
const faker = require('faker')
const argv = require('yargs').argv

const lines = argv.lines || 1000000
const filename = argv.output || 'posts.csv'
const stream = fs.createWriteStream(filename)

const createPost = () => {
  // pick a userId between 1 and 10 to match the users seeded in schema.sql
  const userId = faker.random.number({ min: 1, max: 10 })
  const title = faker.hacker.phrase()
  const content = faker.lorem.paragraph()
  const image = faker.image.image()
  // format as YYYY-MM-DD so Postgres can parse it as a DATE
  const date = faker.date.recent().toISOString().slice(0, 10)

  // quote the free-text fields so commas in the generated text don't break the CSV
  return `${userId},"${title}","${content}",${image},${date}\n`
}

const startWriting = (writeStream, encoding, done) => {
  let i = lines
  function writing(){
    let canWrite = true
    do {
      i--
      let post = createPost()
      //check if i === 0 so we would write and call `done`
      if(i === 0){
        // we are done so fire callback
        writeStream.write(post, encoding, done)
      }else{
        // we are not done so don't fire callback;
        // capture write()'s return value so we know if the buffer still has room
        canWrite = writeStream.write(post, encoding)
      }
    } while(i > 0 && canWrite)
    if(i > 0 && !canWrite){
      //our buffer for the stream filled and we need to wait for drain
      // Write some more once it drains.
      writeStream.once('drain', writing);
    }
  }
  writing()
}

//write our `header` line before we invoke the loop
stream.write(`userId,title,content,image,date\n`, 'utf-8')
//invoke startWriting and pass callback
startWriting(stream, 'utf-8', () => {
  stream.end()
})

Now we can use our generator.js file from the command line to create however many records we want by calling: node generator.js --lines 2000000 --output posts.csv
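
Before wiring the generator into our seed script, it’s worth spot-checking the output with standard shell tools, for example:

# line count should be the number of records requested plus one header line
wc -l posts.csv
# peek at the first few generated records
head -n 3 posts.csv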

We are one step closer to automating the database creation and seeding. The last piece we need is a bash script that calls all of the utilities we created. We will call it seed.sh since it seeds our database. Here we go:

#!/bin/bash

###################################################
# Bash script to create database and seed
###################################################

# Variable Definitions
# Path to the directory this bash script lives in
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"

# Database Variable Definitions
DATABASE="blog"
USER="superuser"

# Output Filename for Faker File
OUTPUT="posts.csv"
FILEPATH="$DIR/$OUTPUT"
# if parameter 1 is not passed as an argument, default the records to be generated to 1000000
LINES=${1:-1000000}

### Import Our Database ###
# Connect to the default postgres database; schema.sql creates `blog` and switches to it
SCHEMA="$DIR/schema.sql"
psql -U "$USER" -d postgres < "$SCHEMA"

### Run Our Generator Script ###
node "$DIR/generator.js" --output="$FILEPATH" --lines="$LINES"

### Import Our posts.csv file to seed the Database ###
# Name the columns explicitly so Postgres fills the `id` column itself
psql -U "$USER" -d "$DATABASE" -c "COPY posts (userId, title, content, image, date) FROM '$FILEPATH' CSV HEADER;"
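
After the script runs, a couple of quick psql queries give a rough sanity check that the seed worked (assuming the superuser role and the table names from this tutorial):

# how many posts made it into the database?
psql -U superuser -d blog -c "SELECT COUNT(*) FROM posts;"
# peek at a few of the seeded rows
psql -U superuser -d blog -c "SELECT * FROM posts LIMIT 5;"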

Now we can call bash seed.sh 1000000 from our project directory and it will DROP/CREATE our database, import the database schema and create our tables, run our generator.js file to generate our CSV of fake data, and finally copy all of the data into our database using the psql COPY command. There you have it: a bash script and Node.js generator for seeding a Postgres database! Until next time, stay curious, stay creative!