CREU 2016-2017: April 2017

Thursday, April 27, 2017

Week 30: 4/22/2017 - 4/26/2017

Work Accomplished:
This week, I printed out the data from Databricks after running the small XML file through Apache Spark. I am in the process of writing the abstract. Next week, I will be presenting this project to the YSU CSIS department as part of my senior capstone. I also wrote the draft of our final CREU report for Jenna to edit.

Outcomes:
Finished the final report, the senior project, and the study as a whole.

Thursday, April 20, 2017

Week 29: 4/12/2017 - 4/19/2017

Work Accomplished:
This week, Jenna helped me import the Kaggle data into Databricks so that we could run Apache Spark on it and train on the data. It was challenging to set up a cluster and get the data to import correctly into each column. The next step is to run the bigger data set and test it so that we can compare the two data sets against one another.

Weekly goal: Finish the dataset testing.

Future goal: Finish testing the datasets and use the keyword results to inform our tags so that we might be able to predict tags on Stack Overflow. We also need to submit our final paper for this project by May 5th.

Wednesday, April 19, 2017

Week 29: 4/12/2017 - 4/19/2017

Work Accomplished

This past week, I worked with Alyssa on using DataBricks to run Apache Spark on a small dataset from the Kaggle competition. Together we moved past the trouble we had last week and finished running Spark on the small dataset. At the kick-off luncheon for the YSU Honors College Journal, I also presented a short summary of our paper, "On Predicting Developer Expertise from Eye Gazes for Bug Fixing Tasks", which was accepted in the journal. This paper was about the results from last year's CREU project.

Goal

Weekly goal(s) - In the next week, I will finish up the analyses we have left for the keyword predictions and start writing our end of the year report.
Long-term goal(s) - Predict keywords by modifying the process of a proposed method for keyword prediction in a Kaggle competition by incorporating eye-tracking. We will also predict keywords without eye-tracking and compare the two keyword sets generated. These keywords inform our tags, so determining them will tell us which pieces of code and/or text in a StackOverflow document are pertinent to tag selection.
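
As a rough illustration of the comparison step in the long-term goal, here is a minimal Python sketch of one way two keyword sets might be compared. It is not our actual analysis: the function name compare_keyword_sets, the example keywords, and the choice of Jaccard overlap as the similarity measure are all assumptions made for illustration.

```python
def compare_keyword_sets(with_gaze, without_gaze):
    """Summarize how two keyword collections overlap.

    Hypothetical helper: keywords are normalized by trimming whitespace
    and lowercasing before they are compared as sets.
    """
    a = {k.strip().lower() for k in with_gaze}
    b = {k.strip().lower() for k in without_gaze}
    shared = a & b
    union = a | b
    # Jaccard overlap: shared keywords divided by all distinct keywords.
    # Two empty sets are treated as identical (overlap of 1.0).
    jaccard = len(shared) / len(union) if union else 1.0
    return {
        "shared": sorted(shared),
        "only_with_gaze": sorted(a - b),
        "only_without_gaze": sorted(b - a),
        "jaccard": jaccard,
    }


# Made-up keywords, not data from the study.
result = compare_keyword_sets(
    ["Python", "list", "sort"],
    ["python", "sort", "lambda"],
)
print(result)
```

The keywords that appear in only one of the two sets are the interesting part, since they would show what eye-tracking adds to or removes from the prediction.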

Outcome(s)

Finished running Spark jobs in DataBricks on the Kaggle competition data
Presented results from a paper we wrote during last year's CREU project at an event for the journal that accepted it

Thursday, April 13, 2017

Week 28: 4/5/2017 - 4/12/2017

Work Accomplished

This past week, I worked with Alyssa on using DataBricks to run Apache Spark on a small dataset from the Kaggle competition. We walked through the DataBricks User Guide (https://docs.databricks.com/user-guide/index.html) together. We were able to get the small dataset loaded into DataBricks and start up a cluster to run our Spark jobs on it. We ran into some trouble because the Community Edition of DataBricks doesn't seem to allow us to run jobs on our cluster, so we will be working through this issue during the next week. There may be a way around this by using a specific type of cluster.

Goal

Weekly goal(s) - In the next week, I will work with Alyssa on figuring out how to run jobs on a cluster in DataBricks, so that we can run Apache Spark on a small dataset from the Kaggle competition. This will introduce me to using DataBricks for keyword prediction, so that we may use it on our data.
Long-term goal(s) - Predict keywords by modifying the process of a proposed method for keyword prediction in a Kaggle competition by incorporating eye-tracking. We will also predict keywords without eye-tracking and compare the two keyword sets generated. These keywords inform our tags, so determining them will tell us which pieces of code and/or text in a StackOverflow document are pertinent to tag selection.

Outcome(s)

Made progress on running Spark jobs in DataBricks by loading Kaggle competition data into it and reading over a user guide.

Wednesday, April 12, 2017

Week 28: 4/5/2017 - 4/12/2017

Work Accomplished:
This week, Jenna helped me to upload our training data into DataBricks, the online platform for running Apache Spark. She was able to solve the issue of our file upload. The Train data from the Kaggle Competition is now running. Jenna and I are going to figure out how to run it through a cluster. I am going to start writing the draft of our paper and gather the rest of the data for analysis.

Weekly Goal: Get the rough draft of our paper written.

Future Goal: Complete data analysis and machine learning. Submit our paper to be published.

Thursday, April 6, 2017

Week 27: 3/29/17 - 4/5/2017

Work Accomplished

This past week, I presented my poster at QUEST (see the image below), a conference hosted at our university for our university's students to present their research. I also checked over Alyssa's generated keywords and generated my own keywords for our keyword predictions without eye-tracking. It was difficult to think of keywords that might be important for tag prediction but that Alyssa hadn't already thought of. Finally, I ran my pre-processing scripts (which I combined into one script) on our collected eye-tracking data.

Goal

Weekly goal(s) - In the next week, I will be using DataBricks to run Apache Spark on a small dataset from the Kaggle competition. This will introduce me to using DataBricks for keyword prediction, so that we may use it on our data.
Long-term goal(s) - Predict keywords by modifying the process of a proposed method for keyword prediction in a Kaggle competition by incorporating eye-tracking. We will also predict keywords without eye-tracking and compare the two keyword sets generated. These keywords inform our tags, so determining them will tell us which pieces of code and/or text in a StackOverflow document are pertinent to tag selection.

Outcome(s)

QUEST poster presentation completed
Checked over Alyssa's keywords and added my own
Ran pre-processing script on collected data

Week 27: 3/29/17 - 4/5/2017

Work Accomplished:
This week, I finished the list of Keyword AOIs from the 9 tasks in our Stack Overflow study. I sent them to Jenna to check, but I found it easier to do a manual analysis than to wait for the data to export. I also presented 'Improving Stack Overflow Tag Prediction', the first half of our study, at Youngstown State University's QUEST, a forum for undergraduate and graduate research, where Jenna and I presented a poster. We are also using Databricks, a virtual analytics platform, to run Apache Spark. I uploaded the train data from the Kaggle competition as a cluster, and I'm going to run a bigger file (also part of the Kaggle competition) to see how the two compare against one another. We are then going to take our participant fixation data and run it in the online forum.

Weekly Goal: Get Alex scheduled to participate in the study.

Future Goal: Start writing the abstract for the second half of our study to be submitted. Finish the data analysis.