Book - 2.3) Modeling - Analyzing Abstractions
While abstractions reduce the complexity of the descriptions of real-world entities, what remains can often be too large or complicated to readily understand. Imagine, for example, the abstraction of a "household" created by the U.S. Census Bureau that has several tens of properties and tens of millions of instances. Trying to answer questions about "households" using this abstraction is a challenge. We need some techniques to reduce the number of properties and cope with the number of instances.
Two ways of analyzing abstractions involve computing:
- a quantitative summary that reduces a property to a simple numeric description, and
- a visualization that depicts the data of one or two properties in a compact form.
These techniques focus attention on one or two properties at a time. If we think of an abstraction as a table this means that we are working with only one or two columns at a time. In other words, the complete abstraction is simplified so that each instance is represented by only one or two properties. These techniques are often used together and in multiple ways on the complete abstraction to gain understanding of different facets of the abstraction. No one depiction tells us everything that we need to answer questions about the abstraction and the real world that the abstraction represents.
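The idea of working with only one or two columns of a table can be sketched in a few lines of Python. This is a minimal illustration, not part of the Census Bureau data: the `households` list, its property names, and the `select_columns` helper are all invented for the example.

```python
# Hypothetical "household" abstraction: each dict is one instance (one table row).
# The property names and values are invented for illustration.
households = [
    {"id": 1, "state": "VA", "size": 3, "income": 72000},
    {"id": 2, "state": "TX", "size": 1, "income": 41000},
    {"id": 3, "state": "VA", "size": 5, "income": 98000},
]


def select_columns(instances, names):
    """Represent every instance by only the named properties."""
    return [tuple(inst[name] for name in names) for inst in instances]


# One column: each household is reduced to its income.
one_column = [row[0] for row in select_columns(households, ["income"])]

# Two columns: each household is reduced to the pair (size, income).
two_columns = select_columns(households, ["size", "income"])

print(one_column)
print(two_columns)
```

The summaries and visualizations in this section all start from a reduced list like `one_column` or `two_columns`.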
The quantitative summaries that we will use are straightforward calculations. A computer is needed to perform these calculations on a large mass of data. Each quantitative summary is a single number (or a pair of numbers) that tells us something about the property it is summarizing.
A visualization brings to bear our visual ability - one of the most powerful of our senses. The insights from visualizations come from observing meaningful apparent differences, relationships, patterns, or trends in the visualized information. The word "apparent" was used above because statistical analysis is needed to determine if the visual evidence is "significant". In this sense knowledge about statistics is an important adjunct to the knowledge about computation that is the focus of this course. Here we will learn how to use computation to manipulate the data and produce different forms of visualizations. Interpreting the visualizations yields answers to questions about the real-world phenomenon described by the data.
Quantitative Summaries
A simple way to summarize a large amount of data is by computing a single number whose value depends in some way on all of the data values. The table below lists a number of possible ways to summarize data. The average, for example, characterizes all of the data values in a single number. Computing this number involves summing up all of the data values and dividing by the number of values. The average can be computed for a small number of instances or a massive number of instances making it useful for analyzing large abstractions.
Quantitative Summary | Description |
average | the total of a property's value over all instances divided by the number of instances |
maximum, minimum | the largest (smallest) value of a property over all instances |
range | the pair (maximum, minimum) defining the limits of the property's value or the difference maximum-minimum defining the extent of the property's value |
threshold | the number of instances above or below a given value |
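The summaries in the table can each be written as a short Python function. The sketch below is one possible reading of the table; the function names and the `magnitudes` values are invented for illustration and do not come from the Earthquakes data set.

```python
def average(values):
    """Total of the property's values divided by the number of instances."""
    return sum(values) / len(values)


def value_range(values):
    """The pair (maximum, minimum) giving the limits of the property's value."""
    return (max(values), min(values))


def extent(values):
    """The difference maximum - minimum."""
    return max(values) - min(values)


def count_above(values, threshold):
    """Number of instances strictly above the given value."""
    return sum(1 for v in values if v > threshold)


def count_below(values, threshold):
    """Number of instances strictly below the given value."""
    return sum(1 for v in values if v < threshold)


# Invented magnitudes, one per earthquake instance.
magnitudes = [1.2, 2.0, 1.8, 3.5, 1.9, 2.0]

print("average:", average(magnitudes))
print("range:", value_range(magnitudes))
print("extent:", extent(magnitudes))
print("above 2.0:", count_above(magnitudes, 2.0))
print("below 2.0:", count_below(magnitudes, 2.0))
```

Whether a threshold count should include values equal to the threshold is a choice the table leaves open; the sketch excludes them.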
Of course, real-world phenomena are often too intricate to be fully understood by a single number. The average value does tell us something about the instances. When most of the instances are close to the average the average is a good description. However, if there are two large sub-groups each far away from the average then the average is not a very good description. In this class we will use these simple quantitative summaries so that we can focus on learning about computation. You can learn about more complicated and powerful quantitative summaries in areas like data mining, machine learning, or statistics.
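A small experiment shows how two very different collections can share the same average. Both lists below are invented. The `typical_distance` helper (the average absolute distance of the values from their average) is not one of the summaries in the table; it is used here only as a rough way to check how close the instances are to the average.

```python
def average(values):
    return sum(values) / len(values)


def typical_distance(values):
    """Average absolute distance of the values from their average."""
    center = average(values)
    return sum(abs(v - center) for v in values) / len(values)


# Invented data: most instances sit close to the average.
clustered = [48, 50, 52, 49, 51, 50]

# Invented data: two sub-groups, each far away from the average.
two_groups = [10, 11, 9, 10, 90, 91, 89, 90]

for name, values in [("clustered", clustered), ("two_groups", two_groups)]:
    print(name, "average:", average(values),
          "typical distance:", typical_distance(values))
```

Both lists have an average of 50, yet no instance in `two_groups` is anywhere near 50, so the average alone describes that list poorly.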
Visualizing Abstractions
While there are many types of visualizations, we will focus on standard graphical displays that are powerful and easy to use. These are:
- line plot,
- histogram,
- scatter plot, and
- bar chart.
You may have used these graphical displays already in other settings. Each of these graphs will be explained and illustrated using real-world data.
The characteristics of these four kinds of graphs are summarized in the following table and discussed below. As you can see in the table, the visualizations we are using are simple because they use only one or two properties of what might be a very complex phenomenon. You can also see that most of the visualizations use quantitative properties. Quantitative properties are those that are represented by a numeric value (with or without a decimal point). For example, an earthquake could be represented by its magnitude (a number from 0.0 to 10.0 on the Richter scale) or the depth of the earthquake's epicenter (measured in kilometers). The categorical property used in a bar chart will be discussed below.
Graph Name | Shows | Number of Properties | Type of Properties |
Line plot | trend or variation | 1 | quantitative |
Histogram | frequency | 1 | quantitative |
Scatter plot | relationship | 2 | both quantitative |
Bar chart | comparison | 2 | 1 quantitative, 1 categorical |
Examples
The examples that follow use the State Crime data set and the Earthquakes data set. The State Crime data set, from the U.S. Department of Justice, contains information about the occurrence of property and violent crimes in each of the states in the U.S. during the period from 1960 to 2012. There are specific kinds of property and violent crime reported together with "all" (the total of all property crimes or all violent crimes). The occurrence of crime is measured as a rate (number of crimes per 100,000 residents in the state) or as a total (number of crimes regardless of the population size). For consistency, the examples below will use the same measure: the rate of all property crimes. The Earthquakes data set, from the U.S. Geological Survey, contains information about the occurrence of earthquakes around the world. Each earthquake is described by the U.S. state or country in which it occurred and measurements about the earthquake (its magnitude, the depth of its epicenter, etc.).
Line Plot
A line plot shows a possible trend or, when no trend is evident, the variation in the data. Each instance of the phenomenon is described by a single quantitative property. For example,
- if each instance of a recorded earthquake is described by its magnitude the line plot would show any trend (e.g., whether the earthquakes are increasing in magnitude over time, or whether there is a cycle of severe earthquakes followed by a period of low magnitude quakes, or whether a severe earthquake is preceded by a series of smaller magnitude earthquakes). In addition, the line plot might show whether there is a lot of variation among the magnitudes of earthquakes or if they are all of relatively similar magnitudes.
- if property crime in Virginia in a year is described by a crime rate (number of property crimes per 100,000 people) the line plot could show whether the crime rate is increasing, decreasing, or remaining relatively the same. In addition, the line plot might show whether there are significant differences in the property crime rate in different years.
The figure below shows two line plots: the top shows the earthquake scenario described above and the one on the bottom shows the property crime scenario. As you can see, there is no apparent trend in the earthquake scenario but there is considerable variation in magnitudes. However, the property crime data shows a very decided trend.
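To see how a line plot turns a sequence of values into a picture, here is a rough text-only sketch. It is not how the figures in this section were produced. The `text_line_plot` helper and the crime rates are invented for illustration, and the plot is rotated so that the order of the instances runs down the page while the value runs left to right.

```python
def text_line_plot(values, width=40):
    """Return one text line per instance, with a '*' placed by its value."""
    low, high = min(values), max(values)
    span = (high - low) or 1  # avoid dividing by zero when all values are equal
    lines = []
    for index, value in enumerate(values):
        column = round((value - low) / span * (width - 1))
        lines.append(f"{index:>3} |" + " " * column + "*")
    return lines


# Invented yearly property crime rates (crimes per 100,000 residents).
rates = [4200, 4100, 3900, 3600, 3300, 3000, 2800]

for line in text_line_plot(rates, width=30):
    print(line)
```

With these invented rates the stars drift steadily to the left, which is the text equivalent of a downward trend. A sequence with no trend but a lot of variation would scatter the stars back and forth instead.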
Line Plots
Histogram
A histogram shows the frequency - some measure of the occurrences - of a single quantitative property over a set of "bins". A "bin" is defined by a range of values and the set of bins is selected so that each instance falls into exactly one bin (i.e., the ranges of any two bins do not overlap) and the bins cover the entire range of possible values. For example, if the magnitude of earthquakes ranges from 0.0 to 10.0 then any of the following would be a valid set of bins:
- [0.0-5.0], [5.1-10.0]
- [0.0-2.5], [2.6-5.0], [5.1-7.5], [7.6-10.0]
- [0.0-1.0], [1.1-2.0], [2.1-3.0], ..., [9.1-10.0]
The first example has two bins. The first bin is for earthquakes that have a magnitude between 0.0 and 5.0 inclusive and the second bin is for earthquakes that have a magnitude between 5.1 and 10.0 inclusive. Notice that every value from 0.0 to 10.0 falls into exactly one bin (assuming that the magnitude is represented with the precision of a single digit after the decimal point). The second example has four bins and the third example has 10 bins.
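The rule that every instance falls into exactly one bin can be checked in code. The sketch below represents each bin as an inclusive (low, high) pair, as in the examples above, and counts the instances per bin. The helper names and the `magnitudes` values are invented for illustration.

```python
def bin_index(value, bins):
    """Return the position of the single bin whose inclusive range holds value."""
    matches = [i for i, (low, high) in enumerate(bins) if low <= value <= high]
    if len(matches) != 1:
        raise ValueError(f"{value} falls into {len(matches)} bins, expected 1")
    return matches[0]


def count_per_bin(values, bins):
    """Count how many values fall into each bin."""
    counts = [0] * len(bins)
    for value in values:
        counts[bin_index(value, bins)] += 1
    return counts


two_bins = [(0.0, 5.0), (5.1, 10.0)]
four_bins = [(0.0, 2.5), (2.6, 5.0), (5.1, 7.5), (7.6, 10.0)]

# Invented magnitudes, each with one digit after the decimal point.
magnitudes = [1.8, 2.0, 2.6, 5.0, 5.1, 7.5, 9.3]

print(count_per_bin(magnitudes, two_bins))
print(count_per_bin(magnitudes, four_bins))
```

A value such as 5.05 lies between the bins (0.0, 5.0) and (5.1, 10.0), so `bin_index` raises an error for it. That is the one-decimal assumption from the text made visible.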
You can think of picking the bins as changing the focus on how you are looking at the data. With many bins you are "zooming in" on the data and are seeing a lot of trees but it may be hard to see the overall shape of the forest. With few bins you are "zooming out" on the data and may only see one big forest without any important detail. Finding the "right" set of bins to see the shape of the data may require some experimentation.
In the histogram each bin is shown as a rectangle whose height is proportional to some measure of instances in that bin. The measure might be the number of instances in the bin or the average value of the property of all instances in that bin. The rectangles are placed side-by-side in the order of their ranges (typically the smallest range is at the left and the highest range is at the right). The histogram can answer questions like: Are there differences in the number of instances in each range (i.e., are the bars of about the same height or are there notable differences in the heights)? Is there a small number of bins that contain most of the instances? Does the distribution have the general appearance of well known statistical distributions (a bell shaped curve - with the middle bins having most of the instances and smaller numbers at either end; an exponential curve - with one range (either the smallest range or the largest range) having most of the instances with a sharp decline to the opposite range)?
The figure below shows a histogram of the earthquake magnitude data (on the top) and the Virginia property crime data (on the bottom). You can see that the most numerous earthquakes are those with magnitudes in the range of 1.8 to 2.0. There are 60 earthquakes in this bin while there are only 10 earthquakes in the range of 1.0 to 1.2. In the property crime rate histogram you can see that in 11 years the crime rate was at the level of 4,000 crimes (per 100,000 individuals in the population). Note that the histogram does not tell us which 11 years these are: they could be 11 consecutive years or 11 years scattered throughout the period of the data collection.
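The two measures mentioned above, and the rectangles themselves, can be sketched as follows. The bins, the `magnitudes` values, and the helper names are invented for illustration. The rectangles are drawn sideways as rows of `#` characters rather than as upright bars.

```python
def group_by_bin(values, bins):
    """Collect the values that fall into each inclusive (low, high) bin."""
    groups = [[] for _ in bins]
    for value in values:
        for i, (low, high) in enumerate(bins):
            if low <= value <= high:
                groups[i].append(value)
                break
    return groups


def draw_rectangles(bins, heights):
    """One text row per bin, in range order; row length stands for height."""
    return [
        f"[{low:.1f}-{high:.1f}] " + "#" * round(height)
        for (low, high), height in zip(bins, heights)
    ]


bins = [(1.0, 1.4), (1.5, 1.9), (2.0, 2.4), (2.5, 2.9)]

# Invented magnitudes with one digit after the decimal point.
magnitudes = [1.1, 1.6, 1.7, 1.9, 2.0, 2.1, 2.2, 2.4, 2.6, 1.8]

groups = group_by_bin(magnitudes, bins)
counts = [len(group) for group in groups]  # measure 1: number of instances
averages = [sum(g) / len(g) if g else 0.0 for g in groups]  # measure 2: average value

for row in draw_rectangles(bins, counts):
    print(row)
print(averages)
```

With these invented values the two middle bins hold most of the instances, a small-scale version of the bell shape described above.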
Histograms
Notice that the data used in the above line plots and the histograms is exactly the same. What is different is not the data but the way the data is visualized. Each graph gives us a different way to present the data to see possibly meaningful patterns. Finally, notice that each of these graphs only required one quantitative property for each instance (e.g., the magnitude value for an earthquake or the property crime rate for a state in a given year).
Scatter Plot
A scatter plot shows the relationship between two quantitative properties. In this case each instance has two quantitative properties and the scatter plot allows us to see if there is any apparent relationship between these two properties over the set of instances. For example, you often read about medical studies trying to find a relationship between a food (e.g., red meat) and some health condition (e.g., risk of heart attack). In this case, each instance (a person) is represented by two quantitative properties, one measuring the consumption of the food in question and one measuring the occurrence or likelihood of the medical condition. It is important to remember that a "relationship" is not the same thing as "causation": the fact that a clock shows 6AM when the sun rises does not mean that either causes the other to occur. However, relationships are important because they suggest where to look for causation, or allow us to take action as if there were causation even if we do not know the underlying cause.
In a scatter plot each instance is drawn as a single "dot" on the graph. The location of the dot is determined by the instance's two quantitative properties. One property gives the location of the point on the horizontal axis and the other property gives the location of the point on the vertical axis.
The scatter plot can answer questions like: Does the grouping or pattern of "dots" indicate one or more clusters of possibly related elements? Does the alignment of "dots" show a positive relationship where both properties seem to increase (or decrease) together? Does the alignment of "dots" show a negative relationship where the increase (or decrease) in one property is associated with a decrease (or increase) in the other property?
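A rough numeric counterpart to reading the alignment of the dots is sketched below. It uses the covariance of the two properties, a technique this section does not otherwise cover, purely as a stand-in for what the eye does: a positive value means the properties tend to rise together, a negative value means one tends to fall as the other rises. It does not detect clusters and is not a test of significance. The crime rates are invented for illustration.

```python
def covariance(xs, ys):
    """Average product of each pair's distances from the two averages."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)


def describe_relationship(xs, ys):
    """Label the apparent relationship by the sign of the covariance."""
    value = covariance(xs, ys)
    if value > 0:
        return "positive"
    if value < 0:
        return "negative"
    return "none apparent"


# Invented rates per 100,000 residents; one pair per year.
property_rates = [1500, 2500, 3500, 4500]
violent_rates = [180, 250, 310, 380]

# Each instance becomes one dot: (horizontal position, vertical position).
dots = list(zip(property_rates, violent_rates))

print(dots)
print(describe_relationship(property_rates, violent_rates))
```

Reversing one of the lists flips the label to "negative", which matches dots that run from the upper left to the lower right.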
Two examples of scatter plots are shown in the following figure. The top figure shows a scatter plot of two properties of earthquakes in California. This figure shows the relationship between the longitude of an earthquake's location (horizontal axis) and the latitude of an earthquake's location (vertical axis). In this case it can be seen that the earthquakes do not occur with equal likelihood everywhere in California. There appear to be two "bands" or regions where earthquakes tend to cluster. The bottom part of the figure shows the relationship between the rates of property crime (horizontal axis) in Virginia versus the rates of violent crime (vertical axis) in Virginia. This graph suggests that there is a positive relationship between these two types of crime - when one is high the other is also high. Notice that the scales of the two axes are very different: the horizontal axis is from 1500 to 4500 while the vertical axis is from 180 to 380. This difference in scale means that while the rates of the two types of crime are related, property crime is much more prevalent than violent crime.
Bar Chart
A bar chart also uses two properties, one of which is quantitative and the other is categorical.
A categorical property is represented by a set of names denoting distinct groups (or categories). For example, a college student might have the categorical property of "class" whose value is one of "freshman", "sophomore", "junior", or "senior". The categories are distinct, meaning that each student can be classified into exactly one of these four categories (if there are exceptional cases there might also be a catch-all category named "other"). In a bar chart all instances with the same categorical value are grouped together. A single value is used to characterize each category. A rectangle is used to represent each category where the height of the rectangle is proportional to the value characterizing that category.
A bar chart is used to show comparison. Two bar charts are shown in the following figure. The bar chart on the top compares earthquakes in four states. The categories in this case are the four states: Alaska, California, Hawaii, and Washington. The height of each rectangle indicates the number of earthquakes in each category. An alternative bar chart might compare the average magnitude of earthquakes across these states. In this case the categories are the same but the height of the rectangle indicates the average magnitude of earthquakes rather than their number. The bar chart on the bottom compares the average property crime rate in three states: Virginia, Texas, and Florida. In this case the categories are the names of the states. The height of each rectangle indicates the average property crime rate in that state. Alternative bar charts might show the comparison only in a given year but include more states, or show the average property crime rate only for a selected period (2000-2010).
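Grouping instances by a categorical value and characterizing each group with a single value might look like this in Python. The `quakes` list and the helper names are invented for illustration. The two helpers correspond to two possible choices of the characterizing value: the number of instances in a category, and the average of a quantitative property within it.

```python
from collections import defaultdict


def group_by_category(instances, category, value):
    """Map each categorical value to the list of quantitative values in it."""
    groups = defaultdict(list)
    for instance in instances:
        groups[instance[category]].append(instance[value])
    return dict(groups)


def count_by_category(instances, category, value):
    """Rectangle height = number of instances in each category."""
    groups = group_by_category(instances, category, value)
    return {name: len(values) for name, values in groups.items()}


def average_by_category(instances, category, value):
    """Rectangle height = average of the quantitative property per category."""
    groups = group_by_category(instances, category, value)
    return {name: sum(values) / len(values) for name, values in groups.items()}


# Invented earthquake instances: one categorical and one quantitative property.
quakes = [
    {"state": "Alaska", "magnitude": 2.0},
    {"state": "California", "magnitude": 3.0},
    {"state": "Alaska", "magnitude": 4.0},
    {"state": "Hawaii", "magnitude": 1.5},
]

print(count_by_category(quakes, "state", "magnitude"))
print(average_by_category(quakes, "state", "magnitude"))
```

Either dictionary supplies the heights for one bar chart: the keys are the categories and the values are the rectangle heights.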
Bar Charts
It is important not to confuse a histogram with a bar chart even though they are visually similar. Both visualizations use rectangles whose heights represent some value of a collection of instances. However, a critical difference is that in a bar chart the collections are defined by categories while in a histogram the collections are defined by bins. Recall that the bins can be changed but categories cannot be changed - California is California and cannot be anything else. Also, you can interchange the rectangles in a bar chart without changing the meaning. For example, in the comparison of earthquakes in California and Alaska it does not matter which rectangle appears to the right of the other because the comparison is still the same. However, the bins in a histogram cannot be interchanged because the bins are ranges over some quantitative scale.
Summary
The key ideas we have seen are:
- there are techniques for analyzing abstractions using quantitative summaries and visualizations that focus on one or two properties at a time,
- visualization is useful to compactly show data so that the meaning of the data is easier to see,
- there are simple visualizations for observing apparent trends, distributions, relationships, and comparisons, and
- the same data is usually summarized in different ways and/or viewed using multiple visualizations (e.g., both a line plot and a histogram of instances with a single property) to answer different questions.