Case Studies

Project in Big Data Paradigm

SOLUTION TO UPDATE CREDIT LINES

Solution Scope – Business

Business Goal

  • Execute monthly scoring models that suggest an increase or decrease of credit lines for bank clients based on their historical financial data
  • Make better credit assignment decisions by taking each customer's financial background into account, protecting both the bank and the customer during the credit assignment process
  • Run special credit and new credit product analyses for clients on special occasions and in specific campaigns (Black Friday, Christmas, etc.) to increase bank loans
  • This solution benefits the bank by preventing credit decisions from being made without data on the customer's full financial background. It also protects bank customers by avoiding credit lines that have no prospect of being repaid.
Solution Scope – Technology

Technological Goal

  • Infosistema was chosen to implement a solution (an ETL process) that takes unstructured financial activity data from a Latin American bank and delivers it to a leading credit scoring entity that offers a SaaS solution
  • The criteria that guided this choice were Infosistema's ability to work with big data paradigms and its strong capacity to deliver solutions that meet performance standards for its clients
  • The capacity to adapt to changing requests and implement solutions using new tools, together with the business know-how acquired in banking and insurance ETL over the past years, were a plus and also drove the client's decision to choose Infosistema
Hard Facts
  • Some project numbers:
  • Our team worked more than 5000 hours on this project, 700 of which were fully dedicated to optimizing ETL performance (both in Pentaho Kettle and in Informatica PowerCenter)
  • In production, the standard routine process runs on the first day of each month
  • The process pulls one year of customer financial data from the database
  • The data size of each batch process is approximately 80 GB
  • The ETL performs more than 7000 data manipulations in Informatica PowerCenter
  • More than 30 files are generated, the biggest of which is a CSV with more than 5000 columns
  • Processing the biggest historical financial datasets takes 13 hours (end-to-end in the ETL process)
Products and Architecture

Project components

  • Due to constraints in time-to-market and bank infrastructure availability, a first solution to extract more than 300 GB of data for preliminary credit model development was implemented in Pentaho Kettle
  • A portal was developed in Java for ad-hoc encryption/decryption of information files exchanged between the Bank and the Scoring Entity
  • The final solution was implemented in Informatica PowerCenter
  • All personal information is encrypted using AES with 128-bit keys, processed, and sent to the risk scoring software (which is offsite)
  • The development servers have Windows Server 2012 – 2.5 GHz – dual core – 32 GB – SSD disk
  • The production servers have Windows Server 2012 – 2.5 GHz – dual core – 64 GB – SSD disk
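
As an illustration of the AES encryption with 128-bit keys mentioned above, the following is a minimal Java sketch using the standard javax.crypto API. The cipher mode (GCM), the random 12-byte IV prepended to the ciphertext, the Base64 encoding, and the class name FieldCrypto are all assumptions made for this example; the case study does not state how the project configured AES.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Base64;

// Hypothetical helper: encrypts a single personal-data field with AES-128.
// Mode, IV handling and encoding are assumptions, not the project's actual setup.
final class FieldCrypto {
    private static final int IV_BYTES = 12;   // recommended IV size for GCM
    private static final int TAG_BITS = 128;  // authentication tag length
    private static final SecureRandom RANDOM = new SecureRandom();

    // Generates a fresh 128-bit AES key.
    static SecretKey newKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(128);
            return generator.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("AES is not available", e);
        }
    }

    // Returns Base64(IV || ciphertext+tag); a new random IV is used per call.
    static String encrypt(SecretKey key, String plainText) {
        try {
            byte[] iv = new byte[IV_BYTES];
            RANDOM.nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            byte[] cipherText = cipher.doFinal(plainText.getBytes(StandardCharsets.UTF_8));
            byte[] out = new byte[iv.length + cipherText.length];
            System.arraycopy(iv, 0, out, 0, iv.length);
            System.arraycopy(cipherText, 0, out, iv.length, cipherText.length);
            return Base64.getEncoder().encodeToString(out);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Encryption failed", e);
        }
    }

    // Reverses encrypt(): reads the IV from the first 12 bytes, then decrypts the rest.
    static String decrypt(SecretKey key, String encoded) {
        try {
            byte[] in = Base64.getDecoder().decode(encoded);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, in, 0, IV_BYTES));
            byte[] plain = cipher.doFinal(in, IV_BYTES, in.length - IV_BYTES);
            return new String(plain, StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Decryption failed", e);
        }
    }

    public static void main(String[] args) {
        SecretKey key = newKey();
        String encrypted = encrypt(key, "customer-id-0001");
        System.out.println(decrypt(key, encrypted));
    }
}
```

Because each call draws a new IV, encrypting the same value twice yields different ciphertexts. This is one reason a pipeline might later map encrypted values to stable tokens before joining or sorting on them.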

Final Process

Solution Optimizations
  • Replace encrypted strings with sequential tokens to boost Informatica performance
  • Better hardware usage: SSD disks instead of HDDs, because Informatica performs heavy I/O when storing temporary files
  • Java code optimizations – Usage of pre-initialized in-memory hash maps with customized load factors
  • Applying Java principles for big data environments – Manage object creation, avoid heavy manipulations, use StringBuilder objects for string manipulations, question every use of loops, determine the complexity of functions and evaluate the number of data manipulations
  • Customization and tuning of Informatica PowerCenter cache sizes for heavy transformations
  • Follow Informatica PowerCenter tuning best practices – Filter as much data as possible at the beginning, keep data sorted, avoid unnecessary sorts, joiners and aggregators, manage I/O usage
  • Increase quality and optimize development time by using unit testing, scripts to evaluate ETL outputs and to generate releases, nightly builds and output generation
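
Three of the Java-side optimizations listed above (sequential tokens in place of encrypted strings, pre-sized hash maps with a chosen load factor, and StringBuilder for string assembly) can be sketched together as follows. This is only an illustrative sketch: the class TokenRegistry, its method names, and the sizing numbers are hypothetical and do not come from the project's code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical registry that swaps long encrypted strings for small sequential
// integer tokens, so downstream joins and sorts compare integers, not strings.
final class TokenRegistry {
    private final Map<String, Integer> tokenByValue;
    private final List<String> valueByToken;

    // Pre-sizes the map so that expectedEntries fit without any rehashing:
    // HashMap resizes once size exceeds capacity * loadFactor.
    TokenRegistry(int expectedEntries, float loadFactor) {
        int capacity = (int) Math.ceil(expectedEntries / (double) loadFactor);
        this.tokenByValue = new HashMap<>(capacity, loadFactor);
        this.valueByToken = new ArrayList<>(expectedEntries);
    }

    // Returns the existing token for a value, or assigns the next sequential one.
    int tokenFor(String encryptedValue) {
        Integer existing = tokenByValue.get(encryptedValue);
        if (existing != null) {
            return existing;
        }
        int token = valueByToken.size();
        tokenByValue.put(encryptedValue, token);
        valueByToken.add(encryptedValue);
        return token;
    }

    // Restores the original encrypted string when writing the final output.
    String valueFor(int token) {
        return valueByToken.get(token);
    }

    // Builds one CSV row with a single pre-sized StringBuilder instead of
    // repeated String concatenation, which matters for very wide rows.
    static String toCsvRow(int token, long[] values) {
        StringBuilder row = new StringBuilder(16 + values.length * 12);
        row.append(token);
        for (long value : values) {
            row.append(',').append(value);
        }
        return row.toString();
    }

    public static void main(String[] args) {
        TokenRegistry registry = new TokenRegistry(1_000_000, 0.5f);
        int token = registry.tokenFor("Zm9vYmFyLWVuY3J5cHRlZA==");
        System.out.println(toCsvRow(token, new long[] {100L, 250L, 75L}));
    }
}
```

A lower load factor trades memory for fewer hash collisions. Whether that trade pays off depends on the data volume and the memory available on the server, so the value 0.5 used here is only a placeholder.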

Solution Architecture
