JSON, Postgres, PDI... Oh My!


2017-09-21

If you are a developer (web, systems, database, whatever) you will, at some point, end up having to interface with an API or web service. Most of these integrations use JSON to move data back and forth between your call and the web server. JSON is an acronym for JavaScript Object Notation, and with JSON being lightweight and flexible, I don’t see that trend stopping anytime soon. Here are a few notes about how you can work with JSON using Pentaho Data Integration and Postgres.

Not familiar with JSON? Here is a link to the standard: http://www.json.org

Here is how I’m handling JSON elements using PDI and Postgres.

Pentaho Data Integration (PDI)

Using PDI, you can make calls to a web service and capture its results, or maybe get ahold of JSON files from a separate process. Once you have your JSON, you will need to treat it a little differently than a traditional text file or CSV. PDI provides a 'JSON Input' step that can handle basic JSON structures.


The configuration for this step is similar to the 'Get data from XML’ step, except the JSONPath (similar to XPath) is a bit cumbersome. Here is a quick example of a basic JSON snippet and how I’ve mapped the keys over to fields in the input step.

The JSON file:

{
  "name" : "UC",
  "description" : "User Communications",
  "domain" : "click.example.com",
  "owners" : [ "aa5661", "bb7420", "cc4211", "dd8850", "ee2580", "ff07804" ],
  "members" : [ "uu349w", "vv8850", "ww2580", "xx07804", "yy5661", "zz7420" ],
  "totalClicks" : 6393664,
  "lastClicked" : "2017-02-02T11:35:14.799-0800",
  "created" : "2016-07-13T14:32:02.907-0700",
  "defaultlinksttl" : 730
}

And here is how I map those fields using JSONPath.
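
Roughly, the 'Fields' tab of the 'JSON Input' step ends up looking like this (the exact field names and types here are my assumptions, not pulled from the step configuration itself):

       Name        |       Path        |  Type
-------------------+-------------------+---------
 name              | $.name            | String
 description       | $.description     | String
 domain            | $.domain          | String
 owners            | $.owners          | String
 members           | $.members         | String
 totalclicks       | $.totalClicks     | Integer
 lastclicked       | $.lastClicked     | String
 created           | $.created         | String
 defaultlinksttl   | $.defaultlinksttl | Integer
 json_payload      | $                 | String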


Now you have 10 columns containing values extracted from the JSON. You can treat this data just like you would the output of a 'Text file input' step. That last field, json_payload, uses ‘$’ to reference the root object, which returns the full JSON object (everything in the sample above) and places it into a single column. This value is later inserted into a Postgres table (keep reading).

NOTE: you have to use the 'PostgreSQL Bulk Loader' step when loading JSON because the default PDI stream doesn’t have a JSON data type. Using the normal ‘Table output’ step will result in an error.
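
If the bulk loader isn’t an option in your environment, one workaround (my own suggestion, not something from the PDI docs) is to land the payload in a plain text column with ‘Table output’ and let Postgres do the cast into the jsonb table described below:

-- Hypothetical staging table: the raw payload arrives as plain text
CREATE TABLE api_accounts_stage (
  json_payload_text text
);

-- Cast to jsonb while copying into the real table
INSERT INTO api_accounts (json_payload)
SELECT json_payload_text::jsonb
FROM api_accounts_stage;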

More information regarding the JSON Input step can be found here:
http://wiki.pentaho.com/display/EAI/JSON+Input


Postgres with JSON

Postgres has supported JSON natively since 9.2; however, it wasn’t until 9.4 introduced Binary JSON (jsonb) that things got really interesting. To read more about the various JSON functions and operators in Postgres, go here:
https://www.postgresql.org/docs/9.4/static/functions-json.html

Once you have the JSON, you need to store it in a table using the jsonb data type. Here is the DDL for the table to hold the data from the example above.

CREATE TABLE api_accounts(
  name character varying(64),
  description character varying(256),
  domain character varying(256),
  owners character varying(2048),
  members character varying(2048),
  totalclicks bigint,
  lastclicked timestamp without time zone,
  created timestamp without time zone,
  defaultlinksttl integer,
  json_payload jsonb
);
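
The query examples below come back with two accounts, ‘XY Sample’ and ‘AB Sample’. To follow along, assume the table was loaded with a couple of rows shaped like this (the names, domains, and owner lists are taken from the result sets below; everything else is made up):

INSERT INTO public.api_accounts (name, domain, json_payload) VALUES
  ('XY Sample', 'click.example.com',
   '{"name": "XY Sample", "domain": "click.example.com",
     "owners": ["aa5661", "bb7420", "cc4211", "dd8850", "ee2580", "ff07804"]}'),
  ('AB Sample', 'click.example.com',
   '{"name": "AB Sample", "domain": "click.example.com",
     "owners": ["mm5661", "nn7420", "oo4211", "pp8850"]}');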

Query That Data!
Having the JSON data inside a table makes querying it easy (provided your JSON is nicely laid out). Using the example JSON from above, to extract the ‘name’ values you simply do this:

SELECT json_payload->>'name' AS my_name
FROM public.api_accounts;

And you get this:

  my_name
-----------
 XY Sample
 AB Sample
(2 rows)
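
One thing worth calling out: ->> returns the value as text, while -> returns it as jsonb (quotes and all). That difference matters when you compare or chain operators. A quick side-by-side on the same table:

SELECT json_payload->'name'  AS name_as_jsonb   -- jsonb value, e.g. "XY Sample"
     ,json_payload->>'name' AS name_as_text     -- text value, e.g. XY Sample
FROM public.api_accounts;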

You can also use the same methods to filter your results like this:

SELECT name
    ,domain
FROM public.api_accounts
WHERE json_payload->>'name' = 'XY Sample';

Results:

   name    |      domain
-----------+-------------------
 XY Sample | click.example.com
(1 row)

Another fun function is the ability to expand an array from within the JSON object.

SELECT name
     ,jsonb_array_elements_text(json_payload->'owners') AS owners
FROM public.api_accounts;

The jsonb_array_elements_text() function expands the array, creating a new row for each element. The results look like this:

   name    | owners
-----------+---------
 XY Sample | aa5661
 XY Sample | bb7420
 XY Sample | cc4211
 XY Sample | dd8850
 XY Sample | ee2580
 XY Sample | ff07804
 AB Sample | mm5661
 AB Sample | nn7420
 AB Sample | oo4211
 AB Sample | pp8850
(10 rows)
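
If you only need to know how many owners each account has, rather than a row per owner, jsonb_array_length() works on the same array:

SELECT name
     ,jsonb_array_length(json_payload->'owners') AS owner_count
FROM public.api_accounts;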

You can also query your JSON to see if a given key exists, like this:

SELECT name
     ,domain
FROM public.api_accounts
WHERE json_payload ? 'name';

This will return all rows that have a key of ‘name’ in the JSON.

   name    |      domain
-----------+-------------------
 XY Sample | click.example.com
 AB Sample | click.example.com
(2 rows)
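
Closely related is the containment operator @>, which checks for a key and a matching value instead of just the key. It isn’t covered above, but it works the same way against the jsonb column:

SELECT name
FROM public.api_accounts
WHERE json_payload @> '{"domain": "click.example.com"}';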

If you have millions of JSON values and your queries are taking too long, you can create an index using the JSON operators, like this:

CREATE INDEX idx_api_accounts_name
ON public.api_accounts((json_payload->>'name'));

Note the ->> operator, so the index matches the WHERE clause used above. There are a bunch of ways you can index the jsonb data type, including some using GIN indexes. To read more, go here: https://www.postgresql.org/docs/9.4/static/textsearch-indexes.html
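
If you don’t know up front which keys you will be filtering on, a GIN index over the whole jsonb column (a general-purpose option, not specific to this example) supports the ? and @> operators used above:

CREATE INDEX idx_api_accounts_payload
ON public.api_accounts
USING GIN (json_payload);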

Next Steps?

Moving forward, when dealing with JSON, I encourage you to parse out what’s useful, put those values into their own columns, and then save the full JSON object into its own column. This gives you the ability to quickly pull the useful data and still go back and extract other tidbits as needed.
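
For example, if the API later starts returning a key you care about, say a hypothetical 'archived' flag, you can add a column and backfill it straight from the stored payload:

-- 'archived' is a made-up key used only to illustrate backfilling
ALTER TABLE public.api_accounts ADD COLUMN archived boolean;

UPDATE public.api_accounts
SET archived = (json_payload->>'archived')::boolean
WHERE json_payload ? 'archived';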

I look forward to seeing what you can do with this information!
