
Case Study

Project in Big Data Paradigm

Solution to Update Credit Lines

Solution Scope – Business

Business Goal

  • Execute monthly scoring models that suggest increases or decreases of credit lines for bank clients based on their historical financial data
  • Make better credit-assignment decisions by taking into account each customer's financial background, protecting both the bank and the customer during the credit assignment process
  • Execute special credit / new credit product analyses for clients on special occasions and in specific campaigns (Black Friday, Christmas, etc.), increasing bank loans
  • This solution benefits the bank by preventing credit decisions from being taken without informative data on the customer's full financial background. It also protects the bank's customers by avoiding credit lines with no prospect of being repaid.
Solution Scope – Technology

Technological Goal

  • Infosistema was chosen to implement a solution (an ETL process) that extracts unstructured data related to the financial activity of a Latin American bank and delivers it to a leading credit-scoring entity with a SaaS solution
  • The criteria that guided this choice were Infosistema's ability to work with big data paradigms and its greater capacity to deliver a solution meeting its clients' performance standards
  • The capacity to adapt to changing requests, the ability to implement solutions with new tools, and the business know-how acquired in banking and insurance ETL over the past years were further pluses that drove the client's decision to choose Infosistema
Hard Facts
  • Some project numbers:
  • Our team worked more than 5000 hours on this project, 700 of which were fully dedicated to optimizing ETL performance (both in Pentaho Kettle and in Informatica PowerCenter)
  • In production, the standard routine process executes on the first day of each month
  • The process pulls one year of customer financial data from the database
  • The data size of each batch process is approximately 80 GB
  • The ETL performs more than 7000 data manipulations in Informatica PowerCenter
  • More than 30 files are generated; the biggest is a CSV with more than 5000 columns
  • The biggest historical financial datasets take 13 hours to process end-to-end through the ETL
Products and Architecture

Project Components

  • Due to constraints on time-to-market and bank infrastructure availability, a first solution to extract >300 GB of data for preliminary credit model development was implemented in Pentaho Kettle
  • A portal was developed in Java for ad-hoc encryption/decryption of information files exchanged between the bank and the scoring entity
  • The final solution was implemented in Informatica PowerCenter
  • All personal information is encrypted using AES with 128-bit keys, manipulated, and sent to the risk-scoring software, which is off-site (see the sketch after this list)
  • The development servers run Windows Server 2012 – 2.5 GHz – dual core – 32 GB RAM – SSD disks
  • The production servers run Windows Server 2012 – 2.5 GHz – dual core – 64 GB RAM – SSD disks
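
The case study specifies only AES with 128-bit keys, so the cipher mode, padding, and IV handling below are assumptions, as are the class and method names; this is a minimal Java sketch of how the portal might encrypt a field before it is shipped to the scoring entity.

    import javax.crypto.Cipher;
    import javax.crypto.spec.IvParameterSpec;
    import javax.crypto.spec.SecretKeySpec;
    import java.nio.charset.StandardCharsets;
    import java.security.SecureRandom;
    import java.util.Base64;

    // Hypothetical helper: AES-128 field encryption as described in the case study,
    // with CBC mode and a per-value IV assumed (the original does not specify them).
    public final class FieldEncryptor {
        private final SecretKeySpec key;
        private final SecureRandom random = new SecureRandom();

        public FieldEncryptor(byte[] keyBytes) {
            // 16 bytes = the 128-bit key size stated in the case study
            this.key = new SecretKeySpec(keyBytes, "AES");
        }

        public String encrypt(String plaintext) throws Exception {
            byte[] iv = new byte[16];
            random.nextBytes(iv);                       // fresh IV per value
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
            byte[] ct = cipher.doFinal(plaintext.getBytes(StandardCharsets.UTF_8));
            byte[] out = new byte[iv.length + ct.length];
            System.arraycopy(iv, 0, out, 0, iv.length); // prepend IV for the receiver
            System.arraycopy(ct, 0, out, iv.length, ct.length);
            return Base64.getEncoder().encodeToString(out);
        }
    }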

Final Process

Solution Optimizations
  • Replaced encrypted strings with sequential tokens to boost Informatica performance (see the tokenization sketch after this list); made better use of hardware by storing Informatica's I/O-heavy temporary files on SSD disks instead of HDD
  • Java code optimizations – usage of pre-initialized in-memory hash maps with customized load factors
  • Applied Java principles for big data environments – manage object creation, avoid heavy manipulations, use StringBuilder objects for string manipulation (a sketch follows below), question every loop operation, and assess each function's complexity and number of data manipulations
  • Customized Informatica PowerCenter cache sizes for heavy transformations
  • Followed Informatica PowerCenter tuning best practices – filter as much data as possible at the beginning, keep data sorted, avoid unnecessary sorts, joiners, and aggregators, and manage I/O usage
  • Increased quality and optimized development time with unit testing, scripts to evaluate ETL outputs and generate releases, and nightly builds with output generation (a minimal output check is sketched below)
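
A minimal sketch of the token substitution from the first bullet, combined with the pre-sized hash maps from the second: each bulky AES ciphertext is mapped once to a short sequential token that flows through the Informatica mappings in its place. The class and method names are illustrative, not the project's actual code.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch: swap bulky encrypted strings for short sequential tokens.
    public final class TokenDictionary {
        private final Map<String, Long> tokens;
        private long nextToken = 1;

        public TokenDictionary(int expectedEntries) {
            // Pre-initialize capacity and load factor so the map never rehashes
            // mid-batch; 0.5 trades memory for shorter collision chains.
            this.tokens = new HashMap<>((int) (expectedEntries / 0.5f) + 1, 0.5f);
        }

        // Returns the existing token for this value, or assigns the next sequential one.
        public long tokenize(String encryptedValue) {
            return tokens.computeIfAbsent(encryptedValue, k -> nextToken++);
        }
    }

Downstream, the ETL can then join and sort on compact numeric tokens, with a reverse map restoring the original ciphertext only in the final output files.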
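
A sketch of the StringBuilder principle from the third bullet, assuming a semicolon-delimited output row: reusing one buffer avoids allocating a new String on every concatenation when writing rows with thousands of columns. The method name and delimiter are assumptions.

    // Illustrative: build one CSV row by reusing a single StringBuilder,
    // rather than String concatenation that allocates a new String per '+'.
    public static String buildRow(String[] fields, StringBuilder sb) {
        sb.setLength(0);                 // reset the shared buffer, no reallocation
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) sb.append(';');   // the delimiter is an assumption
            sb.append(fields[i]);
        }
        return sb.toString();
    }

The caller would allocate the buffer once, e.g. pre-sized to the expected row width, and pass it to every call for the file.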
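
As one minimal example of the output-evaluation scripts mentioned in the last bullet, a check like the following could compare a freshly generated file against a known-good reference in the nightly build; the class name and the line-count criterion are assumptions, chosen only to illustrate the idea.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.stream.Stream;

    // Hypothetical nightly-build check: flag output regressions early.
    public final class OutputCheck {
        public static boolean sameLineCount(Path generated, Path reference) throws IOException {
            try (Stream<String> gen = Files.lines(generated);
                 Stream<String> ref = Files.lines(reference)) {
                return gen.count() == ref.count();
            }
        }
    }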

Solution Architecture
