Book - 7.2) Project - Overview

Outline

There are seven major steps in completing the project. These steps are divided into activities surrounding the data set and those activities related to the computational analysis.

The activities related to the data set are:

Abstraction: Understanding the abstraction of the real-world entities represented by the data set you selected; knowing the organization, structure, and content of the data set, including constructing a data map.

Questions: Deciding on the questions to be answered by the computational analysis of the data set; understanding why the answers to these questions are important and to what audience(s).

Limitations: Listing factors that limit the generality of the possible conclusions that can be drawn from the computational analysis due to restrictions imposed by the data or the method of analysis.

Social Impacts: Identifying possible stakeholders and the benefits or harms that might affect these stakeholders from the results of the project; considering conflicts among the stakeholders.

The activities related to the computational analysis are:

Programming: Designing the algorithms and writing the code to perform the computational analysis of the data set.

Visualizations: Designing and computationally generating informative displays of the data set to answer the exploratory questions.

Conclusions: Using the visualizations to answer the exploratory questions; exploring the implications of these answers; developing a

Each of these seven components will be illustrated in the following sections of this chapter.

The outcomes from the project's activities result in:

a project presentation: a video of approximately 5-6 minutes describing the project. This video is created by adding narration to a set of PowerPoint slides, exporting it in a video format, and uploading it to YouTube.
the project's program: the Python code developed to manipulate the data set and generate the visualizations.

The project presentation contains a description of six of the seven activities described above - all those except the programming itself. A rubric for evaluating the project presentation is given below. The programming done for the project will be evaluated by assessing the quality of explanations given about selected parts of the Python code. The mechanics for this part of the project evaluation will be described in class.

Questions

A key element of the project is posing interesting and meaningful questions that are to be answered by analyzing the data set you have selected. Two general categories of questions are analysis questions and existence questions.

Analysis questions explore the quantitative characteristics of the real-world phenomenon described by the data and the possible relationships between or among these characteristics. Forms of questions of these types are as follows:

Characterization questions

What is the distribution of ... ?
What categories of values are there for ...?
What is the range of ...?
How much variation is there in ...?
What is the likelihood of ...?
What proportion of ... falls into a given category?
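Several of the characterization questions above can be answered with a few lines of Python before any visualization is drawn. The following is a minimal sketch, assuming a small made-up list of labeled temperature readings; the data and the variable names are hypothetical and are not part of any project data set.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical sample: (category label, daily high temperature in degrees F).
readings = [
    ("mild", 68), ("hot", 91), ("mild", 72), ("cold", 41),
    ("hot", 88), ("mild", 70), ("cold", 45), ("mild", 66),
]
values = [temp for _, temp in readings]

# What is the range of ...?
value_range = (min(values), max(values))

# How much variation is there in ...? (population standard deviation)
variation = pstdev(values)

# What categories of values are there, and what proportion falls into each?
counts = Counter(label for label, _ in readings)
proportions = {label: n / len(readings) for label, n in counts.items()}

print("range:", value_range)
print("mean:", mean(values), "std dev:", round(variation, 2))
print("proportions:", proportions)
```

The same summaries could be shown visually, for example with a histogram of the values or a bar chart of the category counts.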

Relationship questions

How is factor x related to factor y?
How are factors x, y, and z related?
How does factor x change over time?
Is there a trend in the data?
How is factor x distributed over space?

Various forms of visualizations help to answer analysis questions. For example, line graphs and histograms are useful ways of showing the variation and distribution of the data; scatter plots are good ways to explore the relationship between two characteristics, revealing trends or clusters of data that suggest interactions between the two characteristics.
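A scatter plot shows the relationship between two characteristics visually; a correlation coefficient summarizes the same relationship as a single number. The sketch below is one possible way to compute it, assuming two made-up lists of paired values. The function name `pearson` and the data are illustrative only.

```python
from math import sqrt

def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: hours studied (factor x) and exam score (factor y).
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 65, 71, 80]

r = pearson(hours, scores)
print(f"correlation: {r:.3f}")
```

A value near 1 or -1 suggests a strong linear trend, while a value near 0 suggests little linear relationship. A strong correlation does not by itself show that one factor causes the other.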

Existence questions explore a unique or critical aspect of the phenomenon. These questions focus on identifying data points that have a particularly interesting combination of properties. Forms of questions of this type are as follows:

Selection questions

What is the best match to ... ?
Are there multiple occurrences of ... ?
Are there distinctive categories of ...?

Capability questions

Is it possible for...?
Under what circumstances will ...?

Different types of visualizations help to explore the answers to these kinds of questions. Line graphs or scatter plots help to identify distinctive points in the data (sometimes called outliers) or identify data within certain ranges of interest. Bar charts and histograms help to explore and compare occurrences of different attributes.
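Distinctive points and ranges of interest can also be picked out computationally. The following is a rough sketch, assuming a simple rule that flags any value more than two standard deviations from the mean; this rule, the function names, and the sample data are assumptions made for illustration, and other outlier rules may suit a given data set better.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return values farther than `threshold` standard deviations from the mean."""
    center = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - center) > threshold * spread]

def in_range(values, low, high):
    """Return the values that fall within the closed interval [low, high]."""
    return [v for v in values if low <= v <= high]

# Hypothetical measurements with one unusually large value.
response_times = [10, 12, 11, 13, 12, 11, 10, 12, 11, 40]

print(find_outliers(response_times))
print(in_range(response_times, 12, 13))
```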

Limitations

Our ability to model the real world using information may be imperfect, especially if the real-world entity is highly complex or not well understood. Thus, the answers to questions that are obtained by analyzing this information model are limited by the imperfections in the data or our analysis of the data. All information models have limitations, so it is important to identify and acknowledge the specific limitations of the model.

Some possible sources of limitations are the following.

Abstraction: A frequent limitation is due to properties that are excluded from the abstraction. The dilemma in forming an abstraction is to anticipate which information properties should be included and which should be excluded. Including too many properties makes the abstraction unfocused and unwieldy. However, a penalty for excluding some properties is that we cannot ask questions or draw conclusions that rely on the excluded data. For example, if we exclude the weight of a patient from a patient abstraction, then we cannot draw conclusions about the role of weight in the patient's health.

Completeness: Another common occurrence is that the values might not be available for some properties in some instances. Gaps in the data can arise for a variety of reasons. If the data is self-reported, the responder may not know the desired value, may not wish to report the value, or may simply have overlooked this part of the report. If the data is generated by sensors (e.g., temperatures, humidities, etc.) or other automated means, the data may be missing because of "mechanical" problems with the sensor (e.g., loss of power to the sensor, loss of network connectivity for the sensor to report, etc.). The more data that is missing, the more the conclusions are limited.
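Before drawing conclusions, it can help to measure how much data is actually missing. The sketch below assumes the data is held as a list of dictionaries in which a missing value is recorded as `None`; that representation, the field names, and the sample records are hypothetical, and a real data set may mark gaps differently (for example with empty strings or special codes).

```python
# Hypothetical sensor records; None marks a value that was not reported.
records = [
    {"station": "A", "temperature": 71.2, "humidity": 40},
    {"station": "B", "temperature": None, "humidity": 42},
    {"station": "C", "temperature": 69.8, "humidity": None},
    {"station": "D", "temperature": None, "humidity": 39},
]

def missing_fraction(rows, field):
    """Return the fraction of rows in which `field` is absent or None."""
    missing = sum(1 for row in rows if row.get(field) is None)
    return missing / len(rows)

for field in ("temperature", "humidity"):
    print(field, missing_fraction(records, field))
```

A property with a large missing fraction is a candidate for the limitations discussion, since conclusions about it rest on only part of the data.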

Precision: In some problems the accuracy of the data might be insufficient to draw conclusions at a very precise level of detail. The lack of precision may be a reflection of the underlying ability to gather the data. This is particularly true when the data is projected or estimated. For example, a number indicating the gross domestic product of the U.S. (in the multiple trillions of dollars) cannot be estimated down to the last penny (or perhaps even the last hundreds of millions of dollars).
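One way to avoid overstating precision is to report an estimate using only as many significant digits as the data can support. The following is a small sketch of that idea; the helper `round_to_significant` and the dollar amount are invented for illustration, and the amount is not an actual GDP figure.

```python
from math import floor, log10

def round_to_significant(value, digits):
    """Round `value` to the given number of significant digits."""
    if value == 0:
        return 0
    # Position of the leading digit, e.g. 13 for a value in the tens of trillions.
    magnitude = floor(log10(abs(value)))
    return round(value, digits - 1 - magnitude)

# Made-up estimate in dollars, written with far more digits than is justified.
estimate = 23_456_789_012_345.67

print(round_to_significant(estimate, 3))
```

Choosing the number of significant digits is itself a judgment about the data and belongs in the discussion of limitations.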

In addition to these general limitations, the data that you are working with may be subject to other limitations.

Some factors are, however, not inherent limitations of the data even though they may be challenges for the person using the data. Some examples that are not inherent limitations of the data are the following:

Complexity. The data set may be very complex, including more detail than is needed to answer the specific questions of the project. While this extra detail may be annoying, it is not a source of limitations on the ability to use the data in answering questions.

Structure. The data might be organized in such a way that extracting the data to answer the project's questions is tedious or awkward. While this may be a frustration for the data's user, it does not inherently limit the questions that can be answered.

Evaluation

As noted above, the project has two major outcomes: the project's presentation and the project's Python code. Each of these outcomes will be evaluated as described further in this section.

Presentation Evaluation

The project's video presentation is evaluated by a rubric that rates as missing, poor, good, or excellent each of the following 8 statements.

The presenter explained the questions to be answered and their importance.
The presenter explained the role of abstraction in defining the information properties relevant to the project’s questions.
The presenter explained the limitations of the available data in answering the questions.
The presenter explained the structure of the data.
The presenter explained the meaning of the visualizations used to answer the questions.
The presenter explained the answers to the questions based on the visualizations.
The presenter explained the social implications of the project.
The presenter communicated effectively.

General Guidelines

In all cases a rating of Missing is given if the required element is not recognizably present. It is the responsibility of the presenter to present each element clearly.

An individual required element may be presented over multiple slides and a single slide may relate to multiple required elements.

It is not important in what order the required elements are addressed in the presentation. However, the presentation should have a logical sequencing that makes the presentation understandable and easily relatable to the rubric elements.

Here are detailed descriptions of these elements.

Specific Explanations

The presenter explained the questions to be answered and their importance.

Element: The presenter demonstrated problem-solving skills by describing the questions the project is intended to answer and identifying the significance of these answers to some group.

Rating:

Poor: states the questions and claims that answering these questions is important.
Good: states the questions and gives an argument for the importance of answering the questions.
Excellent: states the questions and an argument supported by evidence (expert opinion, examples, economic factors, etc.) for the importance of answering the questions.

The presenter explained the role of abstraction in defining the information properties relevant to the project’s questions.

Element: The presenter demonstrated an understanding of the concept of abstraction by explaining the information properties used to model the real-world entities relevant to the project’s questions.

Rating:

Poor: gives a definition of abstraction and names the real-world entities.
Good: defines abstraction and describes characteristics of the real-world entities.
Excellent: defines abstraction, identifies specific information properties of the real-world entities, and explains how these properties are relevant to the project’s questions.

The presenter explained the limitations of the available data in answering the questions.

Element: The presenter demonstrated quantitative reasoning by identifying factors that restrict the scope (comprehensiveness, completeness) or accuracy (precision) of the answers that can be obtained from the available data.

Rating:

Poor: General factors related to the data are described without clear indication of their impact on the answers.
Good: Specific factors are identified but without clear indication of their impact on the answers.
Excellent: Specific factors are identified and the impact of these factors on the answers is clearly explained.

The presenter explained the structure of the data.

Element: The presenter demonstrated an understanding of data structures by explaining the technical organization of the data.

Rating:

Poor: A diagram or other visual representation of the data’s organization is presented and only partially or confusingly described.
Good: A diagram or other visual representation of the data’s organization is presented and well described using general language.
Excellent: A diagram or other visual representation of the data’s organization is presented and described using correct technical language.

The presenter explained the meaning of the visualizations used to answer the questions.

Element: The presenter demonstrated the ability to analyze complex data by explaining the form, content, and significance of the generated visualizations that answer the project’s questions.

Rating:

Poor: Inappropriate, incomplete, or ambiguous visualizations are presented whose relationship to the questions is not clear.
Good: One form of visualization appropriate to the data and the questions is used. Visualizations are not always appropriately titled or labelled. The relationship of the visualizations to the questions is reasonably clear.
Excellent: Multiple forms of visualizations appropriate to the data and the questions are used. Visualizations are appropriately titled and labelled. The relationship of the visualizations to the questions is clearly identified.

The presenter explained the answers to the questions based on the visualizations.

Element: The presenter demonstrated a quantitative reasoning ability by explaining how conclusions based on the visualizations answer the project’s questions.

Rating:

Poor: The implications of the visualizations are not explained or are misinterpreted.
Good: The implications of the visualizations are clearly explained but not directly related to the answer for each question.
Excellent: The implications of the visualizations are clearly explained and directly related to the answer for each question.

The presenter explained the social implications of the project.

Element: The presenter demonstrated an awareness of the social impact of computing by explaining how the results of the project could have positive or negative effects on individuals, groups, or society.

Rating:

Poor: states generally that there are social impacts.
Good: identifies appropriate stakeholders and specific impacts that may be experienced by these stakeholders.
Excellent: identifies stakeholders and impacts, and explains possible conflicts among the stakeholders.

The presenter communicated effectively.

Element: The presenter demonstrated oral and visual communication skills by presenting the project in an effective manner.

Rating:

Poor: The presentation was of inappropriate length for the content or the presenter was not clearly understandable.
Good: The presentation was of appropriate length, the presenter was clearly understandable, but the slides lacked some degree of organization or clarity.
Excellent: The presentation was of an appropriate length; the presenter was clearly understandable, and the slides conveyed information in an organized and clear manner.

Code Evaluation

The Python code developed to analyze the data and produce the visualizations is the second major outcome of the project. The code submitted for the project will be evaluated in the following way. Up to five lines of code will be selected. The selected lines of code will typically involve a significant aspect of the computation. For each selected line of code, the developer will be asked to provide a short explanation of the meaning and purpose of this code in the computation. Each explanation will be assigned a rating of missing, poor, good, or excellent according to this rubric:

missing: a meaningful explanation is not given.
poor: a partial explanation of the technical meaning of the code using general language is given with no or only a vague description of the purpose of this code in the overall computation.
good: a partial explanation of the technical meaning of the code using correct terminology is given together with a general, though not specific, description of the purpose of this code in the overall computation.
excellent: a thorough explanation of the technical meaning of the code using correct terminology is given together with a clear description of the purpose of this code in the overall computation.

An explanation which otherwise meets the requirements for a rating will be downgraded if the explanation also contains incorrect or irrelevant statements, or is inappropriately long.