Master Information Systems

An Information System is a group of related components that function cohesively to achieve a specified goal. The components gather, process and distribute data and information. Examples of information systems include the following:

  • Systems used by insurance companies to store and process information relating to customers.
  • ATMs, and the POS terminals used by shop assistants in grocery stores.
  • Systems used by Human Resources departments to store and process information relating to employees.

[Figure: information system]

Activities:

Input: Raw data

Processing: Transforming raw data into useful information

Output: Moving processed information to the people or activities that use it

Feedback/Control: Output passed back to improve the processing or the inputs

Data can be either qualitative or quantitative. Processing data involves classifying, sorting and aggregating it, performing calculations on it, and then selecting the required data.
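The processing steps listed above can be sketched in a few lines of Python. This is only an illustration: the sales records, the function name process and the min_total threshold are hypothetical and are not taken from any real system.

```python
from collections import defaultdict

# Hypothetical raw data: (region, product, units sold). Values are illustrative only.
raw_sales = [
    ("North", "tea", 12),
    ("South", "coffee", 30),
    ("North", "coffee", 8),
    ("South", "tea", 5),
    ("North", "tea", 20),
]

def process(records, min_total):
    """Classify, aggregate, calculate, sort and then select."""
    # Classify: group the unit counts by region.
    by_region = defaultdict(list)
    for region, _product, units in records:
        by_region[region].append(units)

    # Aggregate and calculate: total and average units per region.
    summary = [
        {"region": region, "total": sum(units), "average": sum(units) / len(units)}
        for region, units in by_region.items()
    ]

    # Sort: largest total first.
    summary.sort(key=lambda row: row["total"], reverse=True)

    # Select: keep only the regions that reach the threshold.
    return [row for row in summary if row["total"] >= min_total]

report = process(raw_sales, min_total=36)
print(report)
```

The raw tuples are the data; the filtered, sorted summary is the information that a manager would actually read.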

Information is the ultimate output of the process. It is data that has significance in the context in which it was processed.

Good information should have the following qualities;

  • Relevant
  • Complete
  • Accurate
  • Clear
  • Consistent
  • Reliable
  • Communicable to the right person
  • Of a manageable volume
  • Timely
  • Furnished at a cost lower than the value of its benefits

Information is considered to have three dimensions:

  • Time: timeliness, currency, frequency, time period
  • Content: accuracy, relevance, completeness, conciseness, scope
  • Form: clarity, detail, order, presentation, media

 

Information systems can help with decision making. Diagrams can be used to present a decision in a structured manner and to make sure that the decision rules are adhered to. One example of such a diagram is a decision tree, as seen below.

[Figure: decision tree]

Decision Behaviour describes how people make decisions and the factors that influence them. There are two types of decision:

  • Structured Decisions
  • Unstructured Decisions

A structured decision tends to involve situations where the rules and constraints influencing the decision are known. Such decisions are usually routine and uncomplicated.

An unstructured decision tends to involve more complex situations, where the rules influencing the decision are complicated or unknown. They usually occur infrequently and rely on the experience of the decision maker.

In information systems, diagrams are used to display a decision in a structured way and to ensure that its rules are defined correctly.
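Because the rules of a structured decision are known in advance, a decision tree for one can be written down directly as code. The sketch below assumes a made-up shop refund policy; the function name, the parameters and the 30-day limit are hypothetical and serve only to show the shape of such a tree.

```python
def refund_decision(has_receipt: bool, days_since_purchase: int, item_damaged: bool) -> str:
    """Walk a small decision tree; each branch is one explicit rule."""
    if not has_receipt:
        return "refuse"          # Rule 1: no receipt, no refund.
    if days_since_purchase > 30:
        return "refuse"          # Rule 2: outside the (assumed) 30-day window.
    if item_damaged:
        return "offer exchange"  # Rule 3: damaged items are exchanged.
    return "refund"              # Default leaf: every rule is satisfied.

print(refund_decision(True, 10, False))
```

An unstructured decision cannot be reduced to a function like this, because its rules are complicated or unknown and it relies on the experience of the decision maker.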

 

 

 

 

Business Intelligence

Business intelligence is a body of information about a business’s customers, competitors, partners, competitive environment and internal operations that gives the business the ability to make effective and important tactical and strategic decisions.

Big data is the huge amount of unstructured and semi-structured data that comes from the web, sensors, stock markets, social media and so on. Big data is of great interest because it can reveal more patterns and interesting anomalies than smaller volumes of data. It has the potential to provide new insight into areas such as financial market activity, weather patterns, consumer behaviour and tidal movements. To obtain value from big data, we need new tools that are able to work with non-traditional data alongside traditional data.

These tools include the following;

  • Data warehouses
  • Data marts
  • Hadoop
  • In-memory computing
  • Analytical platforms


Data warehouse: Stores current and historical data in a standardised form. It also provides analysis and reporting tools.

Data mart: Provides a subset of the data warehouse’s data, with an emphasis on a single subject or line of business.

Hadoop: Provides parallel processing of big data across inexpensive computers. Its main components are:

  • The Hadoop Distributed File System (HDFS).
  • MapReduce, which splits the processing of large data sets into smaller tasks that run in parallel across the nodes of a cluster.
  • HBase, which is a NoSQL database.
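The MapReduce idea can be illustrated with the classic word count. The sketch below runs in a single Python process and does not use the Hadoop API; the function names map_phase, shuffle and reduce_phase and the sample text are hypothetical. In a real cluster each chunk would sit on a different machine and the map and reduce steps would run in parallel.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}

# Each string stands in for a block of data stored on a separate node.
chunks = ["big data big insight", "data beats opinion", "big opinion"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```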

In-memory computing: This uses RAM for data storage so that data can be retrieved at a faster rate. This can cut processing times from hours or days to just seconds.

Analytical platforms: These are high-speed platforms that use both relational and non-relational tools for big data sets. One of these tools is OLAP (Online Analytical Processing). It has the following capabilities:

  • It supports multidimensional analysis of data.
  • It lets users view the same data along multiple dimensions.
  • It can provide instant online answers to ad hoc queries.
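To make "multiple dimensions" concrete, the sketch below sums one measure over different combinations of dimensions, which is the core of an OLAP roll-up and slice. It is a plain-Python illustration under stated assumptions, not an OLAP engine: the fact table, its column names and the functions roll_up and slice_rows are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact table: three dimensions (region, product, quarter) and one measure (sales).
facts = [
    {"region": "East", "product": "nuts", "quarter": "Q1", "sales": 100},
    {"region": "East", "product": "bolts", "quarter": "Q1", "sales": 150},
    {"region": "West", "product": "nuts", "quarter": "Q1", "sales": 80},
    {"region": "East", "product": "nuts", "quarter": "Q2", "sales": 120},
    {"region": "West", "product": "bolts", "quarter": "Q2", "sales": 60},
]

def roll_up(rows, dimensions, measure="sales"):
    """Sum the measure for every combination of the chosen dimensions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)

def slice_rows(rows, **fixed):
    """Keep only the rows where each named dimension has the given value."""
    return [r for r in rows if all(r[d] == v for d, v in fixed.items())]

by_region = roll_up(facts, ["region"])
by_product_quarter = roll_up(facts, ["product", "quarter"])
east_by_product = roll_up(slice_rows(facts, region="East"), ["product"])
print(by_region, by_product_quarter, east_by_product)
```

Each call answers a different ad hoc question from the same rows, which is what viewing the data along several dimensions means in practice.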

Another analytical tool is data mining which performs the following functions;

  • Looks for hidden patterns in sets of data.
  • Generates rules to predict behaviour.
  • Finds associations, sequences and clusters in the data, and produces forecasts.

Text mining is also a common analytical tool; this extracts important elements of information such as facts, opinions and dates from large data sets.


There are six key elements of any effective business intelligence environment. These are the following;

  • Data from the commercial domain
  • The business intelligence infrastructure
  • Business intelligence analytics
  • Managerial users and functions
  • The delivery platform – Management Information System (MIS), Decision Support System (DSS), Executive Support System (ESS)
  • The user interface

The main objective of business intelligence and analytics is to produce the following outputs in real time and with high precision:

  • Production reports – For routine-type decisions e.g. Marketing, human resources, financial accounts
  • Parameterized reports
  • Dashboards to help the user experience
  • Search/report creation
  • Forecasts and scenarios

Predictive analytics

This is the use of various tools to forecast future trends and behaviour. These tools include the following;

  • Statistical analysis
  • Data mining
  • Historical data

Predictive analytics has numerous BI applications for sales, financial markets and fraud detection to name but a few.

Operational and middle managers use an MIS (which runs on data from the TPS, the Transaction Processing System) for routine production reports.

Super users and business analysts use a DSS for more sophisticated analysis, custom reports and semi-structured decisions.

“What-if” analysis, sensitivity analysis, multidimensional analysis (OLAP) and pivot tables are all examples of DSS techniques.
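A what-if or sensitivity analysis simply re-runs a model with changed assumptions and compares the results. The sketch below assumes a deliberately simple profit model; the model, the baseline figures and the function names are hypothetical and only illustrate the technique.

```python
def profit(units, price, unit_cost, fixed_cost):
    """Simple profit model used as the subject of the analysis."""
    return units * (price - unit_cost) - fixed_cost

# Hypothetical baseline assumptions.
baseline = {"units": 1000, "price": 12.0, "unit_cost": 7.0, "fixed_cost": 3000.0}

def what_if(**changes):
    """Re-run the model with some of the baseline assumptions overridden."""
    return profit(**{**baseline, **changes})

def sensitivity(parameter, values):
    """Vary one parameter at a time and record the resulting profit."""
    return {value: what_if(**{parameter: value}) for value in values}

base_profit = what_if()
price_table = sensitivity("price", [10.0, 12.0, 14.0])
print(base_profit, price_table)
```

A spreadsheet data table or pivot table does the same job interactively; the point is that the decision maker controls the assumptions.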

 

 

 

 

Data Mining and Lift and Chi Squared Analysis

Data Mining

Data mining is an analytical process developed to explore big data in order to detect consistent patterns or relationships between variables, and then to validate the results by applying the detected patterns to new subsets of data. One way to engage in data mining is to use the statistical measures Lift and Chi squared to detect levels of interestingness in big data.

Lift measures the dependency (correlation) between two sets of data. For example, the Lift between A and B is:

Lift(A, B) = Sup(A u B) / (Sup(A) * Sup(B))

where Sup is the support (likelihood) function, which is similar to the probability of something happening in a given data set.

Lift(A, B) = 1 => A and B are independent

Lift(A, B) > 1 => A and B are positively correlated

Lift(A, B) < 1 => A and B are negatively correlated

An additional measure for testing whether events are correlated is X^2, or Chi squared:

X^2 = Σ (Observed – Expected)^2 / Expected

General rules:

X^2 = 0 => independent

X^2 > 0 => correlated, either positively or negatively, so an additional test such as Kulczynski is needed.

Worked examples of Lift and Chi squared calculations are given below.

Lift Analysis

               Chips   ^Chips   Total Row
Burgers          600      400        1000
^Burgers         200      200         400
Total Column     800      600        1400

Sup = Support; ^ = not (for example, ^Chips means transactions without Chips).

Burger = B, Chips = C.

Lift(B, C) = Sup(B u C) / (Sup(B) * Sup(C)) = (600/1400) / ((1000/1400) * (800/1400)) = 1.05. This indicates a positive correlation between Burger and Chips.

Lift(B, ^C) = Sup(B u ^C) / (Sup(B) * Sup(^C)) = (400/1400) / ((1000/1400) * (600/1400)) = 0.9333... This indicates a negative correlation between Burger and ^Chips.

Lift(^B, C) = Sup(^B u C) / (Sup(^B) * Sup(C)) = (200/1400) / ((400/1400) * (800/1400)) = 0.875. This indicates a negative correlation between ^Burger and Chips.

Lift(^B, ^C) = Sup(^B u ^C) / (Sup(^B) * Sup(^C)) = (200/1400) / ((400/1400) * (600/1400)) = 1.1666... This indicates a positive correlation between ^Burger and ^Chips.

 

               Shampoo   ^Shampoo   Total Row
Ketchup            100        200         300
^Ketchup           200        400         600
Total Column       300        600         900

K = Ketchup, S = Shampoo.

Lift(K, S) = Sup(K u S) / (Sup(K) * Sup(S)) = (100/900) / ((300/900) * (300/900)) = 1.0. No correlation between K and S.

Lift(K, ^S) = Sup(K u ^S) / (Sup(K) * Sup(^S)) = (200/900) / ((300/900) * (600/900)) = 1.0. No correlation between K and ^S.

Lift(^K, S) = Sup(^K u S) / (Sup(^K) * Sup(S)) = (200/900) / ((600/900) * (300/900)) = 1.0. No correlation between ^K and S.

Lift(^K, ^S) = Sup(^K u ^S) / (Sup(^K) * Sup(^S)) = (400/900) / ((600/900) * (600/900)) = 1.0. No correlation between ^K and ^S.
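The Lift calculations for both tables can be reproduced with a short script. This is a minimal Python sketch, with hypothetical function names lift and lift_table. It relies on the fact that, when each support is a count divided by the total number of transactions, Sup(A u B) / (Sup(A) * Sup(B)) simplifies to (count of both * total) / (count of A * count of B).

```python
def lift(n_both, n_a, n_b, n_total):
    """Lift from raw counts: (n_both / n) / ((n_a / n) * (n_b / n))."""
    return (n_both * n_total) / (n_a * n_b)

def lift_table(a, b, c, d):
    """Lift for all four cells of a 2x2 contingency table laid out as:

             B    ^B
        A    a    b
        ^A   c    d
    """
    n = a + b + c + d
    row_totals = {"A": a + b, "^A": c + d}
    col_totals = {"B": a + c, "^B": b + d}
    cells = {("A", "B"): a, ("A", "^B"): b, ("^A", "B"): c, ("^A", "^B"): d}
    return {
        key: lift(count, row_totals[key[0]], col_totals[key[1]], n)
        for key, count in cells.items()
    }

# Rows are Burgers / ^Burgers, columns are Chips / ^Chips.
burgers_chips = lift_table(600, 400, 200, 200)
# Rows are Ketchup / ^Ketchup, columns are Shampoo / ^Shampoo.
ketchup_shampoo = lift_table(100, 200, 200, 400)

for name, table in (("burgers/chips", burgers_chips), ("ketchup/shampoo", ketchup_shampoo)):
    print(name, {key: round(value, 4) for key, value in table.items()})
```

A value above 1 marks a cell that occurs more often than independence would predict, and a value below 1 marks one that occurs less often.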

 

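Chi Squared Analysis.

Observed values are shown first, with the expected values in parentheses. Each expected value is (row total * column total) / grand total, for example 1000 * 1200 / 1500 = 800.

               Chips        ^Chips      Total Row
Burgers        900 (800)    100 (200)        1000
^Burgers       300 (400)    200 (100)         500
Total Column   1200         300              1500

X^2 = Chi Squared.

X^2 = Σ (Observed – Expected)^2 / Expected

^2 = Power of 2.

O = Observed; E = Expected.

B = Burger; C = Chips.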

X^2(B,C) = (900 – 800)^2/800 = 12.5. As Observed > Expected, we have a positive correlation between B and C.

X^2(B,^C) = (100 – 200)^2/200 = 50.0. As Observed < Expected, we have a negative correlation between B and ^C.

X^2(^B,C) = (300 – 400)^2/400 = 25.0. As Observed < Expected, we have a negative correlation between ^B and C.

X^2(^B,^C) = (200 – 100)^2/100 = 100. As Observed > Expected, we have a positive correlation between ^B and ^C.

The Chi Squared result is the sum of the above 4 values: 12.5 + 50 + 25 + 100 = 187.5. As 187.5 is greater than 0, B and C are correlated, and as Observed > Expected for B and C together, the correlation is positive.

 

               Sausages     ^Sausages   Total Row
Burgers        800 (800)    200 (200)        1000
^Burgers       400 (400)    100 (100)         500
Total Column   1200         300              1500

B = Burger; S = Sausages.

X^2(B,S) = (800 – 800)^2/800 = 0. No correlation between B and S; they are independent of each other.

X^2(B,^S) = (200 – 200)^2/200 = 0. No correlation between B and ^S; they are independent of each other.

X^2(^B,S) = (400 – 400)^2/400 = 0. No correlation between ^B and S; they are independent of each other.

X^2(^B,^S) = (100 – 100)^2/100 = 0. No correlation between ^B and ^S; they are independent of each other.

The Chi Squared result is the sum of the above 4 values: 0 + 0 + 0 + 0 = 0. As the result is 0, we have independence between B and S.
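The Chi squared workings can be checked in code by deriving the expected counts from the row and column totals instead of typing them in. This is a minimal Python sketch, not a statistics library; the function names expected_counts and chi_squared are hypothetical, and it computes only the statistic, not a p-value.

```python
def expected_counts(observed):
    """Expected count per cell under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_squared(observed):
    """Sum of (Observed - Expected)^2 / Expected over every cell of the table."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

# Observed counts only; rows are Burgers / ^Burgers.
burgers_chips = [[900, 100], [300, 200]]
burgers_sausages = [[800, 200], [400, 100]]

print(expected_counts(burgers_chips), chi_squared(burgers_chips))
print(expected_counts(burgers_sausages), chi_squared(burgers_sausages))
```

Printing the expected counts next to the statistic makes it easy to compare each cell with the parenthesised values in the tables.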

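Lift and X^2 become unreliable measures when the data set contains a sizeable number of null transactions (transactions that contain neither of the items being compared), because both depend on the total number of transactions.

The Kulczynski measure is not affected by null transactions, so it can be used to rectify this.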
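The effect of null transactions can be demonstrated with a small experiment. The sketch below assumes the usual definition of the Kulczynski measure, Kulc(A, B) = 0.5 * (P(A|B) + P(B|A)), and reuses the Burger and Chips counts from the Lift example; the function names and the second null count of 20000 are hypothetical.

```python
def lift(n_both, n_a, n_b, n_total):
    """Lift from raw counts; note that it depends on the total transaction count."""
    return (n_both * n_total) / (n_a * n_b)

def kulczynski(n_both, n_a, n_b):
    """Kulc(A, B) = 0.5 * (P(A|B) + P(B|A)); the total count never appears."""
    return 0.5 * (n_both / n_b + n_both / n_a)

# Burgers = A, Chips = B: 600 transactions contain both, 1000 contain A, 800 contain B.
n_both, n_a, n_b = 600, 1000, 800

results = {}
for null_transactions in (200, 20000):
    # Transactions with A or B, plus those containing neither item.
    n_total = (n_a + n_b - n_both) + null_transactions
    results[null_transactions] = (
        lift(n_both, n_a, n_b, n_total),
        kulczynski(n_both, n_a, n_b),
    )

for nulls, (lift_value, kulc_value) in results.items():
    print(nulls, round(lift_value, 3), round(kulc_value, 3))
```

Adding null transactions inflates Lift even though the relationship between the two items has not changed, while the Kulczynski value stays the same.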

 

R Graphics.

R graphics construction:

For my use case I looked at the relationship between GDP and employment rate. I was interested to see how strong the correlation, if any, was between these two data sets for OECD countries.

I obtained my raw data from http://stats.oecd.org/ and http://en.wikipedia.org/wiki/List_of_OECD_countries_by_GDP_per_capita.

I cleaned this raw data and created two CSV files: GDP.txt and Employment rate.txt.

The first file, Employment rate.txt, stores each country and its employment rate for 2012.

The second file, GDP.txt, stores each country and its Gross Domestic Product per capita for 2012.

I then downloaded the R GUI for Windows.

After opening the R programming environment, I created a data frame for each of my CSV files.

gdp <- read.csv("GDP.txt", header = TRUE)

ER <- read.csv("Employment rate.txt", header = TRUE)

I then merged the above data frames and stored them in a new data frame called “countries”.

countries <- merge(x = gdp, y = ER)

The above created a data frame with 3 columns: Country, GDP and Employment_Rate.

When I run print(countries), I obtain the following.

Country          GDP     Employment_Rate
Australia        44407   72.4
Austria          44141   72.5
Belgium          40838   61.8
Canada           42114   71.8
Chile            21486   61.7
CzechRepublic    27527   66.0
Denmark          42787   73.1
Estonia          24260   66.2
Finland          39160   69.4
France           36933   63.8
Germany          41927   72.7
Greece           25987   52.3
Hungary          22635   56.6
Iceland          39117   78.7
Ireland          43803   58.7
Israel           31364   66.0
Italy            34141   56.9
Japan            35482   70.4
Korea            30011   64.2
Luxembourg       89417   64.8
Netherlands      43348   75.3
NewZealand       32888   72.7
Norway           66135   75.8
Poland           22782   59.6
Portugal         25802   62.3
SlovakRepublic   25948   59.8
Slovenia         28482   64.8
Spain            32559   56.6
Sweden           42865   73.7
Switzerland      53641   79.0
Turkey           18328   48.2
UK               35671   69.4
USA              51689   67.0

I then plotted the GDP column against the Employment_Rate column.

plot(countries$GDP, countries$Employment_Rate)

I then fitted a linear regression and drew its line on the plot, showing the positive correlation between Employment Rate and GDP as seen below.

fit <- lm(Employment_Rate ~ GDP, data = countries)

abline(fit)

[Figure: plot and correlation]

I also ran the cor.test function as seen below:

cor.test(countries$GDP, countries$Employment_Rate)

Pearson’s product-moment correlation

data: countries$GDP and countries$Employment_Rate
t = 3.1327, df = 31, p-value = 0.003767
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1768204 0.7135485
sample estimates:
cor
0.4903624

As seen from the above results, I got a p-value of 0.003767, which is well below the usual 0.05 significance level and so indicates that the correlation is statistically significant.

I obtained a value of 0.49 for cor, which indicates a moderate positive correlation between the employment rate and GDP.
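The figures in the cor.test output are linked by simple formulas, so they can be cross-checked from the reported t statistic and degrees of freedom alone. The Python sketch below is only a consistency check, not a replacement for cor.test; it assumes the standard relationships t = r * sqrt(df / (1 - r^2)) and a 95 percent confidence interval built with the Fisher z-transformation.

```python
import math
from statistics import NormalDist

t_stat, df = 3.1327, 31   # values reported in the cor.test output
n = df + 2                # Pearson's test uses df = n - 2, so n = 33 countries

# Recover the correlation coefficient from the t statistic.
r = t_stat / math.sqrt(t_stat ** 2 + df)

# 95 percent confidence interval via the Fisher z-transformation.
z = math.atanh(r)
margin = NormalDist().inv_cdf(0.975) / math.sqrt(n - 3)
ci_low, ci_high = math.tanh(z - margin), math.tanh(z + margin)

print(round(r, 4), round(ci_low, 4), round(ci_high, 4))
```

The recovered value of r agrees with the 0.49 estimate, and n = 33 matches the number of countries in the data frame.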

Information gleaned at a glance.

The information gleaned from the data set is that there is a positive correlation between employment rate and GDP per country.

R graphics proved to be excellent for showing the correlation between GDP and employment rate.

What other ideas/concepts could be represented via R graphics?

This could be expanded to include every country in the world, to see whether the correlation is similar to the above.

More data sets could be developed to test the correlations between:

IQ and GDP

IQ and Employment rate

Health and GDP

Health and Employment

Education and poverty

The above amongst other concepts/ideas would be very interesting to analyse further using R graphics.

[Image: R course completion]

 

 

Google fusion table of the Irish population by county.

[Embedded map: Fusion Table]

Irish county by population Google fusion table link.

Heat map construction:

  • I obtained the raw data from two sources.
  • The first source was http://www.independent.ie/editorial/test/map_lead.kml. This provided the raw KML data, that is, the geometry, for each Irish county.
  • The second source was http://www.cso.ie/en/statistics/population/populationofeachprovincecountyandcity2011/. This provided the population per county. I cleaned up this data and entered it into a Microsoft Excel spreadsheet. The final spreadsheet contained just the following columns: County, Population and Population Density. I calculated the population density using the formula Population / County area.
  • I then created two fusion tables, one from the spreadsheet and one from the KML data file, and saved them on my Google Drive.
  • The fusion table created from the spreadsheet contained three columns: County, Population and Population Density. The fusion table created from the KML data file also contained three columns: Name (of county), Description and Geometry (containing the KML data).
  • I then merged the two fusion tables using the merge tool, keeping the following columns: Name, Population, Population Density and Geometry.
  • I set the location marker to Geometry.
  • In “Feature map” > “Change features styles” > “Polygons” > “Fill Color”, I selected the “Show a gradient” radio button and set the column to “Population Density”.
  • I set the gradient so that the darkest colour marks the counties with the highest population densities.
  • These settings display the varying population density of each county very well.
  • I set the information window that appears when a county is clicked to display the following: County name, Population and Population Density.
  • I also included a population density legend in the bottom right-hand corner.
  • I finished off by publishing my fusion table and changing the visibility setting to “public”.
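The two data preparation steps in the list above, calculating the population density and merging the two tables on the county name, can be sketched in code. This is a plain-Python illustration and not the Fusion Tables merge tool: the county names, populations, areas and geometry strings are placeholders rather than the CSO or KML data, and the Area_km2 column name is an assumption.

```python
# Placeholder rows standing in for the spreadsheet and the KML-derived table.
population_rows = [
    {"County": "County A", "Population": 120000, "Area_km2": 800.0},
    {"County": "County B", "Population": 45000, "Area_km2": 1500.0},
]
geometry_rows = [
    {"Name": "County A", "Description": "", "Geometry": "<Polygon>...</Polygon>"},
    {"Name": "County B", "Description": "", "Geometry": "<Polygon>...</Polygon>"},
]

def add_density(rows):
    """Population Density = Population / County area."""
    return [{**row, "Population Density": row["Population"] / row["Area_km2"]} for row in rows]

def merge_on_county(stats_rows, shape_rows):
    """Join the two tables where County (spreadsheet) equals Name (KML table)."""
    shapes = {row["Name"]: row["Geometry"] for row in shape_rows}
    return [
        {
            "Name": row["County"],
            "Population": row["Population"],
            "Population Density": row["Population Density"],
            "Geometry": shapes[row["County"]],
        }
        for row in stats_rows
        if row["County"] in shapes
    ]

merged = merge_on_county(add_density(population_rows), geometry_rows)
for row in merged:
    print(row["Name"], row["Population"], row["Population Density"])
```

The merge only works if the county names are spelled identically in both tables, which is one reason the raw population data had to be cleaned first.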

Information gleaned at a glance.

  • One can see from the fusion map at a glance which counties have the highest population densities, with just a quick reference to the legend in the bottom right-hand corner.
  • If I am interested in a particular county, I can get more specific information on it by clicking anywhere within that county’s boundary. After clicking, I am shown the following for that county: County name, Population and Population Density.
  • It can be seen quite easily from the map that population density is highest in Leinster, in particular in county Dublin and its surrounding counties. The map also shows high regional population densities in Galway and Cork.
  • This is in agreement with the continuing trend of population increases in Dublin and its surrounding counties and, to a lesser extent, in Galway and Cork.

Implications for now and the future.

  • A vast array of implications and ideas can be drawn from the population densities shown in the map.
  • Education – Future schools may need to be built based on current and emerging trends – map the country’s schools.
  • Rural development – Can accurately target areas that need development.
  • Health – Future hospitals may need to be built based on current and emerging trends – map the country’s hospitals.
  • Road network/infrastructure – Targeted development on specific areas and focusing on key areas for consistent maintenance – map the country’s road network.
  • Communications – Help in adequately facilitating high demand areas.
  • Business planning and mapping – Setting up businesses to take advantage of certain population densities. Helping multi-national companies and foreign direct investment target areas more effectively.