Faculty of Natural Sciences
School of Computer Science and Mathematics
Assessment Brief
Module |
CSC-40054 |
---|---|
Assessment Component |
Coursework |
Weighting |
50% |
Deadline |
13:00 pm on 2nd March 2024 |
Module Leader |
Wenjuan Zhou |
Office Hours/Meeting Booking Link |
Monday and Tuesday during semester 2 teaching. Email: w.zhou@keele.ac.uk to arrange a meeting. |
1. What is the task for this assessment?
One report should be submitted that covers the following two tasks:
โข Part I: Database management
โข Part II: Data analytics
2. What is required of me in this assessment?
Guidelines |
Detailed guidelines for the tasks are listed below. |
---|---|
Self- assessment checklist Make sure that youโฆ |
โข The required components and weightings of each part are detailed below. Use these lists to structure and focus the content of your report. โข Use the materials discussed in the lectures and tutorials to complete the tasks. โข Ensure that your codes run correctly. |
Three key pieces of advice based on the feedback given to the previous cohort who completed this assignment |
โข Read the task description thoroughly and ask for clarification. โข Read the formatting guidelines and stick to them. โข Start working on your coursework early. Ensure that you understand the tasks and the submission process. |
Formatting Guidelines |
โข Your submission should be in the form of a single zipped file. This zipped file should be named with your Student Number. |
โข The zipped file should contain your report in two formats MS Word and PDF. So, it should contain ONLY two files. Any other submitted file will not be marked. โข For each task, you should include the screenshot of the output after you run the code (when its execution finishes successfully). Failing to include the output screenshot will cause you to lose marks. โข The codes should be included as the appendix at the end of the report. You should clearly specify which code is related to which task. Otherwise, they will be marked zero. โข Screenshots of codes are not accepted in the appendix and will be marked zero. You should include the original code with the correct indentation so anyone can copy and paste the code into an editor and run it. โข If you have written the code, but it doesnโt run correctly, mention it in your report. โข An example of the accepted appendix is shown at the end of the coursework. |
|
---|---|
Referencing Style |
Harvard referencing style |
Assessment Criteria/ Markscheme: |
The assessment criteria and mark scheme are listed separately below |
3. What is the purpose of this assessment?
The following table shows which of the module learning outcomes are being assessed in this assignment. Use this table to help you see where and how to transfer feedback from one assignment to another. Note that your feedback may mention some of these outcomes, but that you will not receive a โmarkโ against each one.
Module Learning Outcomes assessed |
---|
1. Evaluate available data and determine how best to analyse the information available to provide required outcomes. 2. Evaluate machine learning methods in the context of statistical analysis of data representing social or natural systems. 3. Develop advanced applications of statistical data analytics techniques using an advanced specialist programming language. 4. Assess the options of storing, managing and manipulating very large volumes of data in the context of research or business organisations. 5. Assess a range of statistical approaches and apply the correct statistical approaches to extract information from a set of data typically available in a modern business or research organisation. |
Rationale |
โข Parts I assess your ability to access and manipulate databases โข Part II assesses your ability to critically evaluate and apply big data applications, advanced analytics, and statistical modelling techniques appropriate to different types of problems. |
---|
4. What resources might I use to get started?
Lecture slides and your completed in-class activities. However, when needed, you are expected to research and use different resources to complete the tasks. |
---|
Guidelines for Tasks
For the coursework assignment, you will be working on a dataset from a car-sharing company. The dataset contains information about the customersโ demand rate between January 2017 and August 2018. The data were collected on an hourly basis and included the time data such as date, hour, and season as well as weather data such as the weather condition, temperature, humidity, and wind speed. The โdemandโ column represents the customerโs willingness for renting a car for a specific time. Higher demand rates show that customers are more willing to rent a car and vice versa. A complete description of the data is also shown in Table I.
IMPORTANT NOTICE: You should complete the Part I (database management) tasks using only the sqlite3 python module and SQL statements. You shouldnโt use any other python modules for these tasks. However, you can use any modules for Part II (data analytics) tasks or for importing and exporting data.
Download the dataset โCarSharingโ from the KLE and complete the tasks.
Part I: Database Management (50%)
1. Create an SQLite database and import the data into a table named โCarSharingโ. Create a backup table and copy the whole table into it. [5%]
2. Add a column to the CarSharing table named โtemp_categoryโ. This column should contain three string values. If the โfeels-likeโ temperature is less than 10 then the corresponding value in the temp_category column should be โColdโ, if the feels-like temperature is between 10 and 25, the value should be โMildโ, and if the feels-like
temperature is greater than 25, then the value should be โHotโ. [5%]
3. Create another table named โtemperatureโ by selecting the temp, temp_feel, and temp_category columns. Then drop the temp and temp_feel columns from the CarSharing table. [5%]
4. Find the distinct values of the weather column and assign a number to each value. Add another column named โweather_codeโ to the table containing each rowโs assigned weather code. [5%]
5. Create a table called โweatherโ and copy the columns โweatherโ and โweather_codeโ to this table. Then drop the weather column from the CarSharing table. [5%]
6. Create a table called time with four columns containing each rowโs timestamp, hour, weekday name, and month name (Hint: you can use the surftime() function for this
purpose).
[5%]
7. Assume itโs the first day you have started working at this company and your boss Linda sends you an email as follows: โ
Hello, welcome to the team. I hope you enjoy working at this company. Could you please give me a report containing the following information:
a) Please tell me which date and time we had the highest demand rate in 2017. [5%]
b) Give me a table containing the name of the weekday, month, and season in which we had the highest and lowest average demand rates throughout 2017. Please include the calculated average demand values as well. [5%]
c) For the weekday selected in (b), please give me a table showing the average demand rate we had at different hours of that weekday throughout 2017. Please sort the results in descending order based on the average demand rates.
[5%]
d) Please tell me what the weather was like in 2017. Was it mostly cold, mild, or hot? Which weather condition (shown in the weather column) was the most prevalent in 2017? What was the average, highest, and lowest wind speed and humidity for each month in 2017? Please organise this information in two tables for the wind speed and humidity. Please also give me a table showing the average demand rate for each cold, mild, and hot weather in 2017 sorted in descending order based on their average demand rates. [5%]
e) Give me another table showing the information requested in (d) for the month we had the highest average demand rate in 2017 so that I can compare it with other months. [5%]
NOTICE: Full marks for task 7 will be given to solutions that use the CarSharing table after all changes in tasks 1-6 have been made to it.
Part II: Data Analytics (50%)
1. Import the CarSharing table into a CSV file and preprocess it with python. You need to drop duplicate rows and deal with null values using appropriate methods. [5%]
2. Using appropriate hypothesis testing, determine if there is a significant relationship between each column (except the timestamp column) and the demand rate. Report the testsโ results. [5%]
3. Please describe if you see any seasonal or cyclic pattern in the temp, humidity, windspeed, or demand data in 2017. Describe your answers. [5%]
4. Use an ARIMA model to predict the weekly average demand rate. Consider 30 percent of data for testing. [7.5%]
5. Use a random forest regressor and a deep neural network to predict the demand rate and report the minimum square error for each model. Which one is working better? Why? Please describe the reason. [10%]
6. Categorize the demand rate into the following two groups: demand rates greater than the average demand rate and demand rates less than the average demand rate. Use labels 1 and 2 for the first and the second groups, respectively. Now, use three different classifiers to predict the demand ratesโ labels and report the accuracy of all models. Use 30 percent of data for testing. [10%]
7. Assume k is the number of clusters. Set k=2, 3, 4, and 12 and use 2 methods to cluster the temp data in 2017. Which k gives the most uniform clusters? (Clusters are called uniform when the number of samples falling into each cluster is close.) [7.5%]
NOTICE: You should use the original CarSharing dataset for Part II, not the dataset you have modified in Part II.
Guidelines for Preparing the Appendix
The following example shows a well-formatted appendix.
As you can see, PART I and PART II are clearly separated, and the code can be copied and pasted. The following steps show how you can copy your code and paste it into your report in the same way as the above example:
1. Select the code in your IDE using the mouse.
2. Use CTRL+C to copy the code.
3. Use CTRL+V to paste the code into your report.
Assessment Criteria
The criteria are adapted from the University assessment criteria and relate to all parts of this assessment.