Saturday, July 9, 2022

Kuwait Data Bank

Introduction

I've been pitching the idea of a Kuwait National Datacenter to government officials and parliament members since 2017, in a holistic manner covering migration of applications, a learning center, audit requirements, security, etc. Unfortunately, those efforts fell on deaf ears.

Fast forward to 2021, and I got the chance to work with a fantastic group of volunteers for the Kuwait Foundation for the Advancement of Sciences (KFAS) to create something more specific: Kuwait Data Bank, an entity that would hold data from all of Kuwait's government entities and government-owned companies, for data analytics and data science.

The group of volunteers spanned multiple disciplines: experts in law, business management, organizational structure, information technology and security. A friend of mine and I covered the information technology (IT) and security aspects.

KFAS gave us 1.5 months, which we extended to a maximum of 2.5 months, to get the initial draft out, and we were done in October or November 2021. We're now in discussion with KFAS to see how to proceed, and hopefully we get to see this project go live at some point!


Project Scope and Goals

We've checked regional and international open data projects, and almost all had very limited samples of data, over inconsistent timespans, and sometimes only one-off snapshots. Our project's aims are ambitious and exceed anything we've checked.

  1. Initially, we'll focus on 1-5 critical reports to the Council of Ministers for decision-making support.
  2. Gradually, as we sanitize data and find the source (or combination of sources) with the most truthful data, we aim to make data pulling and reporting mainstream and real-time.
  3. Data and reports will be available/accessible in this order:
    1. Council of Ministers
    2. Government entities that need help with accurate decision making (access expanded slowly)
    3. Universities in Kuwait
    4. Public access inside of Kuwait
    5. International access to data and/or reports or reporting services
  4. Leverage the latest technologies in graphics card acceleration and Massively Parallel Processing (MPP) databases in software (non-appliance) to keep things agile and portable.

Data Access & Analysis Methodology

  1. Start slow with as few sources of data as possible to deliver the critical reports
  2. Deploy data masking & replication connectors to the various databases at the sources
  3. Anonymize data at the source, then replicate to our organization's repository/repositories
  4. Sanitize data and compare accuracy with help from people at each data source, initially
  5. Run Machine Learning (ML) models on databases built for highly parallel data access
  6. Produce reports or dashboards with results of multiple ML models and compare results
  7. Initially, those reports will be private and delivered only to the Council of Ministers or KFAS, but gradually, the platform will expand to allow real-time access to reports, and later to our anonymized data sets
  8. Data access & reporting may be monetized to help the platform grow and become self-sustaining, in addition to providing services for companies to run analytics on their own data or on our data sets.
  9. Legal aspects of data access, anonymization & privacy, and cooperation from government entities have been addressed in our report/proposal, but I won't get into that here.

Privacy & Anonymity

  1. A primary design aspect is to respect privacy and anonymize data at the source, before it's sent to our repositories/databases
  2. Example: if we were to take everyone's full address, we'd remove the house number, but keep the area and the area's block number
  3. If our systems get compromised, there will be no personally identifiable information (PII) that would cause personal risks
  4. We believe that leaving the data masking (anonymization) in the hands of each government entity giving us access is probably the best approach, so that we can never change what data we receive without manual intervention from the data sources (government entities)
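
To make the address example above a bit more concrete, here's a minimal sketch of what masking at the source could look like. This is not taken from our proposal: the field names (area, block, house_number), the sample values, and the mask_address function are all hypothetical, and the real schema at each government entity will differ. It uses an allowlist, so any field not explicitly approved is dropped before the record leaves the source.

```python
from typing import Mapping

# Hypothetical field names, for illustration only. Only these fields
# are allowed to leave the data source; everything else is dropped.
ALLOWED_ADDRESS_FIELDS = ("area", "block")


def mask_address(address: Mapping[str, str]) -> dict[str, str]:
    """Return a copy of the address containing only allowlisted fields."""
    return {k: address[k] for k in ALLOWED_ADDRESS_FIELDS if k in address}


# Made-up sample record.
full_address = {"area": "Salmiya", "block": "10", "house_number": "42"}
masked = mask_address(full_address)
print(masked)  # {'area': 'Salmiya', 'block': '10'}
```

An allowlist is probably a safer default than a blocklist here: if a source adds a new column later, it stays out of our repositories until someone at that entity approves it.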

There's a lot more to the project, but I'll stop here and then maybe revise things once we see how the project will move later.

It's an ambitious project, which is why we need to grow gradually and cater to specific needs that help the country's decision makers answer crucial questions before making critical decisions.