CREU 2016-2017: December 2016

Saturday, December 17, 2016

Week 15: 12/06/2016 - 12/13/2016

I have concluded my contribution to the study of how developers read and comprehend Stack Overflow questions when assigning tags. I went in with the objectives of figuring out where developers focus most and which areas of interest (AOIs) are most valuable. What I discovered is that the fixation count and fixation duration distributions were often correlated across the AOIs I defined. I found that when questions get more complex, participants (especially those with more programming experience) spend more time on the code and less time on the title. Overall, those with more experience rely increasingly on the code to assign tags, while those with less experience tend to rely on plain text such as the title and description. Keywords were an important feature of the questions: participants fixated on them early and often revisited them. I hope this will be useful in creating a weighting system for tag prediction as my team moves on to that. My complete study presentation and write-up can be found on the Research page of our website.

Tomorrow is graduation for me. I had a really great time working on this project and working in the field of eye-tracking was very interesting for me. I am excited to move into my career and take all this valuable knowledge with me.

Week 15: 12/06/2016 - 12/13/2016

This past week has been really hectic with finals and graduate school applications. I worked on and completed my end-of-year report for the CREU program. I also refreshed my memory on Spark Machine Learning through a tutorial Dr. Lazar suggested we try, and I looked over my notes from my summer internship, where I learned how to use Spark Machine Learning for the first time. The Spark Programming Guide offers a more comprehensive tutorial on Spark ML, which I spent time looking over. It is easy to get up and running quickly with Spark, and it offers a comprehensive set of machine learning algorithms. Spark will be an excellent tool for our data analytics phase.

I also finished seven graduate school applications this week! I am looking forward to getting feedback in late February or early March.

Tuesday, December 6, 2016

Week 14: 11/29/2016 - 12/06/2016

This week I attended the MLH Local Hack Day at YSU on Saturday. Jenna gave an interactive talk on setting up Spark, along with a Scala code tutorial covering some simple loops and functions. I was a little intimidated by the syntax, but I think I will adjust to the language over time. I viewed her PowerPoint presentation on running Tweet data through Spark, and we are meeting up on Friday to continue the learning process. We now have 6 participants in our study and will soon be moving on to the machine learning phase of our project. I am eager to move forward with this phase as I continue to learn more in my field.

Week 14: 11/29/2016 - 12/06/2016

This week I concluded the data collection phase. I was able to capture participants with a wide range of C/C++ experience. All the participants were YSU students, and the majors represented were Computer Science and Electrical Engineering. While I was hoping to get other majors from the CSIS department to compare gaze data, I think having 2 majors will be enough for a comparison. The process went smoothly: I was able to keep all the collected gaze data, and I also learned a lot just from moderating. After only a visual analysis (i.e. looking over gaze-data representations, no tools), I can already identify a few trends. For example, those who have less experience with C/C++ use the title and question text more than the code to assign tags, especially for the more complicated tasks. It also seems that those with more C/C++ experience were better able to assign tags that relate to the solution of the question rather than to obvious things found directly in the text or code, which was expected. I plan to incorporate simple observations like these, as I think they are useful in interpreting how tags were selected. In this upcoming week I will do the following:
1. Analyze data as a whole - use Tobii to look into fixation count, fixation duration, and time to first fixation. I hope to compare how people considered oracle (positive) tags vs. the distractor (negative) tags in coming to their tag selections.
2. Compare data from different levels of experience - consider how people came to the correct/incorrect conclusions based on their experience levels and try to determine common trends.
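
As a rough illustration of the three metrics named in step 1, here is a minimal Python sketch. It is not Tobii's software or its export format: the fixation records, the AOI names, and the `aoi_metrics` function are all made up for illustration, and it assumes each fixation has already been mapped to a single AOI.

```python
from collections import defaultdict

# Hypothetical fixation records (not real study data):
# (start time in ms from stimulus onset, duration in ms, AOI name)
fixations = [
    (120, 200, "title"),
    (340, 450, "code"),
    (800, 180, "title"),
    (1000, 300, "keywords"),
    (1320, 600, "code"),
]


def aoi_metrics(fixations):
    """Compute per-AOI fixation count, total fixation duration,
    and time to first fixation from a list of fixation records."""
    metrics = defaultdict(
        lambda: {
            "fixation_count": 0,
            "total_duration_ms": 0,
            "time_to_first_fixation_ms": None,
        }
    )
    # Sorting by start time guarantees the first record seen for an AOI
    # is its earliest fixation.
    for start_ms, duration_ms, aoi in sorted(fixations):
        m = metrics[aoi]
        m["fixation_count"] += 1
        m["total_duration_ms"] += duration_ms
        if m["time_to_first_fixation_ms"] is None:
            m["time_to_first_fixation_ms"] = start_ms
    return dict(metrics)


metrics = aoi_metrics(fixations)
for aoi, m in metrics.items():
    print(aoi, m)
```

Running the same function separately on each experience group would give per-group tables that can be compared side by side, which is one possible way to approach step 2.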

Furthermore, I want to use the gathered data to determine keywords that should award higher weights to suggested tags. I think this is something that will be helpful, especially in the future when applying the machine learning algorithms.

Week 14: 11/29/2016 - 12/06/2016

This past week I prepared and presented an introductory tutorial on Scala and Apache Spark for YSU's Local Hack Day. I demonstrated how to set up a Scala and Spark program using Maven in IntelliJ. I explained basic programming concepts in Scala, such as data types, classes, objects, functions, and anonymous functions. I also explained basic concepts of the Apache Spark Streaming API, such as SparkContexts and StreamingContexts. All of the concepts discussed in my workshop can be found in a PowerPoint presentation on the LHD Workshop tab of my website: http://jlwise.people.ysu.edu/ . This work is important to our CREU project because we will be using Scala and Spark ML to analyze the data Ali collected on software developers viewing Stack Overflow questions. I also learned how to install Scala and Spark in IntelliJ without the need to download the Scala and Spark libraries directly.

Here is a photo of me presenting at Local Hack Day:

Friday, December 2, 2016

Week 13: 11/22/2016 - 11/29/2016

This past week I read the paper "Multi-Label Classification: An Overview" and summarized it as a PowerPoint presentation. The paper explains 14 different multi-label classification algorithms (although many of the algorithms overlap or are variations of each other). It also experimentally compares three of the algorithms using the Hamming Loss and accuracy metrics, and it determined that the transformation algorithms PT3 and PT4 predict best according to accuracy and Hamming Loss, respectively. I learned how multi-label classification differs from single-label classification: multi-label classification assigns multiple prediction labels to a single data sample, instead of a single prediction label to a single data sample. Multi-label classification makes the most sense for the Stack Overflow tag prediction analysis we will be performing next semester, because we want to label a question with multiple tags ("prediction labels").
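
To make the two metrics concrete, here is a minimal Python sketch, assuming the common set-based definitions: Hamming Loss is the fraction of labels that are wrongly predicted (missing or extra), averaged over samples, and multi-label accuracy is the size of the intersection of true and predicted labels divided by the size of their union, averaged over samples. The label space, the tag sets, and the function names are hypothetical and are not taken from the paper or from our data.

```python
# Hypothetical label space and data, for illustration only.
LABELS = ["c++", "pointers", "arrays", "templates"]

# Each question has a set of true tags and a set of predicted tags.
true_tags = [{"c++", "pointers"}, {"c++", "arrays"}, {"templates"}]
predicted_tags = [{"c++", "pointers"}, {"c++"}, {"c++", "templates"}]


def hamming_loss(true_sets, pred_sets, n_labels):
    """Average fraction of labels that differ between truth and prediction."""
    total = 0.0
    for y, z in zip(true_sets, pred_sets):
        # Symmetric difference: labels that are missing or wrongly added.
        total += len(y ^ z) / n_labels
    return total / len(true_sets)


def accuracy(true_sets, pred_sets):
    """Average of |intersection| / |union| per sample."""
    total = 0.0
    for y, z in zip(true_sets, pred_sets):
        union = y | z
        # Treat two empty sets as a perfect match.
        total += len(y & z) / len(union) if union else 1.0
    return total / len(true_sets)


print("Hamming Loss:", hamming_loss(true_tags, predicted_tags, len(LABELS)))
print("Accuracy:", accuracy(true_tags, predicted_tags))
```

Lower Hamming Loss is better, while higher accuracy is better, so the two metrics can rank algorithms differently.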