Big Data Business Intelligence for Govt. Agencies Training
Each session is 2 hours
Day-1: Session -1: Business Overview of Why Big Data Business Intelligence in Govt.
Case Studies from NIH, DoE
Big Data adoption rate in Govt. Agencies and how they are aligning their future operations around Big Data Predictive Analytics
Broad Scale Application Area in DoD, NSA, IRS, USDA etc.
Interfacing Big Data with Legacy data
Basic understanding of enabling technologies in predictive analytics
Data Integration & Dashboard visualization
Fraud management
Business rule and fraud detection rule generation
Threat detection and profiling
Cost benefit analysis for Big Data implementation
Day-1: Session-2 : Introduction of Big Data-1
Main characteristics of Big Data: volume, variety, velocity and veracity. MPP architecture for volume.
Data Warehouses – static schema, slowly evolving dataset
MPP Databases like Greenplum, Exadata, Teradata, Netezza, Vertica etc.
Hadoop Based Solutions – no conditions on structure of dataset.
Typical pattern : HDFS, MapReduce (crunch), retrieve from HDFS
Batch- suited for analytical/non-interactive
Velocity : CEP streaming data
Typical choices – CEP products (e.g. InfoSphere Streams, Apama, MarkLogic etc)
Less production ready – Storm/S4
NoSQL Databases – (columnar and key-value): Best suited as analytical adjunct to data warehouse/database
Day-1 : Session -3 : Introduction to Big Data-2
NoSQL solutions
KV Store - Keyspace, Flare, SchemaFree, RAMCloud, Oracle NoSQL Database (OnDB)
KV Store - Dynamo, Voldemort, Dynomite, SubRecord, Mo8onDb, DovetailDB
KV Store (Hierarchical) - GT.M, Caché
KV Store (Ordered) - TokyoTyrant, Lightcloud, NMDB, Luxio, MemcacheDB, Actord
KV Cache - Memcached, Repcached, Coherence, Infinispan, EXtremeScale, JBossCache, Velocity, Terracoqua
Tuple Store - Gigaspaces, Coord, Apache River
Object Database - ZopeDB, db4o, Shoal
Document Store - CouchDB, Cloudant, Couchbase, MongoDB, Jackrabbit, XML-Databases, ThruDB, CloudKit, Persevere, Riak-Basho, Scalaris
Wide Columnar Store - BigTable, HBase, Apache Cassandra, Hypertable, KAI, OpenNeptune, Qbase, KDI
Varieties of Data: Introduction to Data Cleaning issue in Big Data
RDBMS – static structure/schema, doesn’t promote agile, exploratory environment.
NoSQL – semi structured, enough structure to store data without exact schema before storing data
Data cleaning issues
Day-1 : Session-4 : Big Data Introduction-3 : Hadoop
When to select Hadoop?
STRUCTURED - Enterprise data warehouses/databases can store massive data (at a cost) but impose structure (not good for active exploration)
SEMI STRUCTURED data – tough to do with traditional solutions (DW/DB)
Warehousing data = HUGE effort and static even after implementation
For variety & volume of data, crunched on commodity hardware – HADOOP
Commodity H/W needed to create a Hadoop Cluster
Introduction to Map Reduce /HDFS
MapReduce – distribute computing over multiple servers
HDFS – make data available locally for the computing process (with redundancy)
Data – can be unstructured/schema-less (unlike RDBMS)
Developer responsibility to make sense of data
Programming MapReduce = working with Java (pros/cons), manually loading data into HDFS
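The map, shuffle and reduce steps introduced in this session can be sketched in a few lines of plain Python. This is a single-process illustration only, not Hadoop code: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical, and a real cluster would run the map and reduce steps on many servers against data stored in HDFS.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key (done by the framework on a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big insight", "data at scale"]))
```

The point of the pattern is that map_phase and reduce_phase each see only one line or one key at a time, which is what allows the work to be distributed.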
Day-2: Session-1: Big Data Ecosystem-Building Big Data ETL: universe of Big Data Tools-which one to use and when?
Hadoop vs. Other NoSQL solutions
For interactive, random access to data
HBase (column oriented database) on top of Hadoop
Random access to data but restrictions imposed (max 1 PB)
Not good for ad-hoc analytics, good for logging, counting, time-series
Sqoop - Import from databases to Hive or HDFS (JDBC/ODBC access)
Flume – Stream data (e.g. log data) into HDFS
Day-2: Session-2: Big Data Management System
Moving parts, compute nodes start/fail :ZooKeeper - For configuration/coordination/naming services
Complex pipeline/workflow: Oozie – manage workflow, dependencies, daisy chain
Deploy, configure, cluster management, upgrade etc (sys admin) :Ambari
In Cloud : Whirr
Day-2: Session-3: Predictive analytics in Business Intelligence -1: Fundamental Techniques & Machine learning based BI :
Introduction to Machine learning
Learning classification techniques
Bayesian Prediction-preparing training file
Support Vector Machine
KNN p-Tree Algebra & vertical mining
Neural Network
Big Data large variable problem -Random forest (RF)
Big Data Automation problem – Multi-model ensemble RF
Automation through Soft10-M
Text analytic tool-Treeminer
Agile learning
Agent based learning
Distributed learning
Introduction to Open source Tools for predictive analytics : R, RapidMiner, Mahout
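As a small illustration of the classification techniques listed in this session, the following is a minimal k-nearest-neighbour (KNN) sketch in plain Python. It is the textbook KNN, not the p-Tree or vertical-mining variant named above, and the function name knn_predict and the sample records are hypothetical.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Return the majority label among the k training points closest to query.

    training: list of (feature_tuple, label) pairs.
    query: feature tuple with the same number of dimensions.
    """
    # Sort training rows by Euclidean distance to the query point.
    by_distance = sorted(training, key=lambda row: math.dist(row[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    # Majority vote among the k nearest neighbours.
    return Counter(nearest_labels).most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical, already-scaled features for a handful of labelled records.
    training = [
        ((1.0, 1.0), "normal"),
        ((1.2, 0.8), "normal"),
        ((0.9, 1.1), "normal"),
        ((5.0, 5.2), "suspicious"),
        ((5.1, 4.9), "suspicious"),
        ((4.8, 5.0), "suspicious"),
    ]
    print(knn_predict(training, (1.1, 1.0)))
    print(knn_predict(training, (5.0, 5.0)))
```

Because every prediction scans the whole training set, this naive form scales poorly with data volume, which is one reason distributed or indexed variants matter for Big Data.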
Day-2: Session-4 Predictive analytics eco-system-2: Common predictive analytic problems in Govt.
Insight analytic
Visualization analytic
Structured predictive analytic
Unstructured predictive analytic
Threat/fraudster/vendor profiling
Recommendation Engine
Pattern detection
Rule/Scenario discovery –failure, fraud, optimization
Root cause discovery
Sentiment analysis
CRM analytic
Network analytic
Text Analytics
Technology assisted review
Fraud analytic
Real Time Analytic
Day-3 : Session-1 : Real Time and Scalable Analytic Over Hadoop
Why common analytic algorithms fail in Hadoop/HDFS
Apache Hama- for Bulk Synchronous distributed computing
Apache Spark- for cluster computing for real time analytic
CMU GraphLab- Graph based asynchronous approach to distributed computing
KNN p-Algebra based approach from Treeminer for reduced hardware cost of operation
Day-3: Session-2: Tools for eDiscovery and Forensics
eDiscovery over Big Data vs. Legacy data – a comparison of cost and performance
Predictive coding and technology assisted review (TAR)
Live demo of a TAR product (vMiner) to understand how TAR works for faster discovery
Faster indexing through HDFS –velocity of data
NLP or Natural Language processing –various techniques and open source products
eDiscovery in foreign languages-technology for foreign language processing
Day-3 : Session 3: Big Data BI for Cyber Security – Understanding the full 360-degree view, from rapid data collection to threat identification
Understanding basics of security analytics-attack surface, security misconfiguration, host defenses
Network infrastructure/ Large datapipe / Response ETL for real time analytic
Prescriptive vs predictive – Fixed rule based vs auto-discovery of threat rules from Meta data
Day-3: Session 4: Big Data in USDA : Application in Agriculture
Introduction to IoT (Internet of Things) for agriculture: sensor-based Big Data and control
Introduction to Satellite imaging and its application in agriculture
Integrating sensor and image data for fertility of soil, cultivation recommendation and forecasting
Agriculture insurance and Big Data
Crop Loss forecasting
Day-4 : Session-1: Fraud prevention BI from Big Data in Govt-Fraud analytic:
Basic classification of Fraud analytics- rule based vs predictive analytics
Supervised vs unsupervised Machine learning for Fraud pattern detection
Vendor fraud/over charging for projects
Medicare and Medicaid fraud- fraud detection techniques for claim processing
Travel reimbursement frauds
IRS refund frauds
Case studies and live demo will be given wherever data is available.
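The contrast between rule-based and unsupervised fraud detection covered in this session can be sketched as follows. The fixed limit, the claim records and both function names are hypothetical, and the z-score test is only a simple stand-in for unsupervised methods in general, not a technique prescribed by the course.

```python
from statistics import mean, stdev


def rule_based_flags(claims, limit):
    """Rule-based: flag every claim whose amount exceeds a fixed, hand-set limit."""
    return [c["id"] for c in claims if c["amount"] > limit]


def zscore_flags(claims, threshold=2.0):
    """Unsupervised stand-in: flag claims far from the mean of the batch itself."""
    amounts = [c["amount"] for c in claims]
    mu, sigma = mean(amounts), stdev(amounts)
    if sigma == 0:
        return []  # all amounts identical, nothing stands out
    return [c["id"] for c in claims if abs(c["amount"] - mu) / sigma > threshold]


if __name__ == "__main__":
    # Hypothetical reimbursement claims; C10 is unusual but below the fixed limit.
    amounts = [100, 120, 110, 95, 105, 115, 98, 102, 108, 900]
    claims = [{"id": f"C{i}", "amount": a} for i, a in enumerate(amounts, start=1)]
    print(rule_based_flags(claims, limit=1000))
    print(zscore_flags(claims))
```

With these sample figures the fixed rule flags nothing, while the statistical check singles out the one claim that is out of line with the rest, without anyone having chosen a limit in advance.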
Day-4 : Session-2: Social Media Analytic- Intelligence gathering and analysis
Big Data ETL API for extracting social media data
Text, image, meta data and video
Sentiment analysis from social media feed
Contextual and non-contextual filtering of social media feed
Social Media Dashboard to integrate diverse social media
Automated profiling of social media profile
Live demo of each analytic will be given through Treeminer Tool.
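For orientation, sentiment analysis of a social media feed can be sketched with a tiny word-list scorer. This is an assumption-laden toy, unrelated to the Treeminer tool used in the demo: the word lists are hypothetical and far smaller than any production lexicon, and the approach ignores negation, sarcasm and context.

```python
import re

# Hypothetical mini-lexicon for illustration only.
POSITIVE = {"good", "great", "helpful", "fast", "excellent"}
NEGATIVE = {"bad", "slow", "rude", "broken", "terrible"}


def sentiment_score(text):
    """Count positive words minus negative words in one post."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def sentiment_label(text):
    """Map the numeric score to a coarse label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    feed = [
        "Great service, very helpful staff",
        "Slow and rude, the site is broken",
        "Office opens at nine",
    ]
    for post in feed:
        print(sentiment_label(post), "|", post)
```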
Day-4 : Session-3: Big Data Analytic in image processing and video feeds
Image Storage techniques in Big Data- Storage solution for data exceeding petabytes
LTFS and LTO
GPFS-LTFS ( Layered storage solution for Big image data)
Fundamental of image analytics
Object recognition
Image segmentation
Motion tracking
3-D image reconstruction
Day-4: Session-4: Big Data applications in NIH:
Emerging areas of Bio-informatics
Meta-genomics and Big Data mining issues
Big Data Predictive analytic for Pharmacogenomics, Metabolomics and Proteomics
Big Data in downstream Genomics process
Application of Big data predictive analytics in Public health
Big Data Dashboard for quick accessibility of diverse data and display :
Integration of existing application platform with Big Data Dashboard
Big Data management
Case Study of Big Data Dashboard: Tableau and Pentaho
Use Big Data app to push location based services in Govt.
Tracking system and management
Day-5 : Session-1: How to justify Big Data BI implementation within an organization:
Defining ROI for Big Data implementation
Case studies of saving analyst time on data collection and preparation – productivity gains
Case studies of revenue gain from reduced licensed database costs
Revenue gain from location based services
Saving from fraud prevention
An integrated spreadsheet approach to calculate approx. expense vs. Revenue gain/savings from Big Data implementation.
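The spreadsheet-style comparison of expense against gains and savings reduces to a short calculation. The sketch below is a minimal illustration: the line items and all dollar figures are hypothetical, and the function name simple_roi is not from the course material.

```python
def simple_roi(costs, benefits):
    """Return (total_cost, total_benefit, roi), where roi = (benefit - cost) / cost."""
    total_cost = sum(costs.values())
    total_benefit = sum(benefits.values())
    if total_cost <= 0:
        raise ValueError("total cost must be positive")
    return total_cost, total_benefit, (total_benefit - total_cost) / total_cost


if __name__ == "__main__":
    # Hypothetical annual figures in dollars, for illustration only.
    costs = {"hardware": 200_000, "staff and training": 300_000}
    benefits = {
        "licensed database savings": 250_000,
        "fraud prevention savings": 400_000,
        "analyst time saved": 100_000,
    }
    total_cost, total_benefit, roi = simple_roi(costs, benefits)
    print(f"cost={total_cost} benefit={total_benefit} ROI={roi:.0%}")
```

Keeping costs and benefits as named line items mirrors a spreadsheet layout, so individual assumptions can be challenged and changed one at a time.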
Day-5 : Session-2: Step by Step procedure to replace a legacy data system with a Big Data System:
Understanding practical Big Data Migration Roadmap
What important information is needed before architecting a Big Data implementation
What are the different ways of calculating volume, velocity, variety and veracity of data
How to estimate data growth
Case studies
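One simple way to estimate data growth, assuming a constant compound annual growth rate, is sketched below. The constant-rate assumption, the function names and the sample figures are all illustrative; real growth is often uneven and should be checked against measured history.

```python
import math


def projected_volume(current_tb, annual_growth_rate, years):
    """Volume after a number of years at a constant compound growth rate."""
    return current_tb * (1 + annual_growth_rate) ** years


def years_until_capacity(current_tb, capacity_tb, annual_growth_rate):
    """Years until the projected volume reaches capacity_tb."""
    if annual_growth_rate <= 0:
        raise ValueError("growth rate must be positive")
    if capacity_tb <= current_tb:
        return 0.0
    return math.log(capacity_tb / current_tb) / math.log(1 + annual_growth_rate)


if __name__ == "__main__":
    # Hypothetical starting point: 100 TB growing 50% per year.
    print(projected_volume(100, 0.5, 2))
    print(round(years_until_capacity(100, 1000, 0.5), 1))
```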
Day-5: Session 4: Review of Big Data Vendors and review of their products. Q/A session:
Accenture
APTEAN (Formerly CDC Software)
Cisco Systems
Cloudera
Dell
EMC
GoodData Corporation
Guavus
Hitachi Data Systems
Hortonworks
HP
IBM
Informatica
Intel
Jaspersoft
Microsoft
MongoDB (Formerly 10gen)
Mu Sigma
NetApp
Opera Solutions
Oracle
Pentaho
Platfora
QlikTech
Quantum
Rackspace
Revolution Analytics
Salesforce
SAP
SAS Institute
Sisense
Software AG/Terracotta
Soft10 Automation
Splunk
Sqrrl
Supermicro
Tableau Software
Teradata
Think Big Analytics
Tidemark Systems
Treeminer
VMware (Part of EMC)