Our dataset shows that over 25% of employees worked in the private sector. Using histograms, I checked the relationship between gender and education_level and found that most of the males had a higher education level than the females. I then checked the relationship between enrolled_university and relevent_experience and found that most candidates have relevant experience in the field, and that those who are not enrolled in university tend to have more experience. The dataset has already been divided into training and testing sets. This demand, together with plenty of opportunities, gives greater flexibility to those who are lucky enough to work in the field. Next, we need to convert the categorical data to a numeric format because most sklearn estimators cannot handle string categories directly. Here is the link: https://www.kaggle.com/datasets/arashnic/hr-analytics-job-change-of-data-scientists. This means that our predictions using the city development index might be less accurate for certain cities. The task is to classify employees into a staying or leaving category using predictive classification models. MICE (Multiple Imputation by Chained Equations) is a multiple imputation method; it is generally better than a single imputation method such as mean imputation. Information related to demographics, education, and experience is available from candidates' signup and enrollment. Three of our columns (experience, last_new_job and company_size) had mostly numerical values, but some values did not. We debated whether to drop the relevent_experience column, which has only two kinds of entries (Has relevant experience and No relevant experience), since the experience column contains more detailed information about experience. SMOTE works by selecting examples that are close in the feature space, drawing a line between them, and generating a new sample at a point along that line. Initially, we used logistic regression as our model.
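The SMOTE interpolation step described above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the function name `smote_sample`, the toy points, and the choice of k are hypothetical, and a real project would more likely rely on an existing oversampling library than on this hand-written version.

```python
import random

def smote_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating between a minority
    example and one of its k nearest minority neighbours."""
    if rng is None:
        rng = random.Random(0)
    base = rng.choice(minority)
    # Rank the other minority examples by squared Euclidean distance to base
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    # Pick a random position on the line segment between the two points
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]

# Toy minority-class points (hypothetical, already numeric)
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_sample(minority, k=2, rng=random.Random(42))
```

Because the new point always lies on a segment between two existing minority examples, it stays inside the region the minority class already occupies.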
This allows the company to reduce cost and time and to improve the quality of training, course planning, and the categorization of candidates. We believe that our analysis will pave the way for further research on the subject, given its significance to employers around the world. The number of men is higher than the number of women and others. The accuracy score is also observed to be the highest, although it is not our desired scoring metric. Using the pd.get_dummies function, we one-hot encoded the nominal features, which allowed the categorical data to be interpreted by the model. I also wanted to see how the categorical features related to the target variable. For the third model, we used a gradient boosting classifier. It relies on the intuition that the best possible next model, when combined with the previous models, minimizes the overall prediction error. These are the 4 most important features of our model. Further work can be pursued on one inference question: which features are in turn affected by an employee's decision to leave or remain at their current job? This project includes data analysis, machine learning modeling, and visualization using SHAP, with 13 features and 19158 records. I do not own the dataset, which is publicly available on Kaggle. I finished by making a quick heatmap, which led me to conclude that the actual relationship between these variables is weak; that is why I kept getting weak results. This project is a graduation requirement: PandasGroup_JC_DS_BSD_JKT_13_Final Project. Feature engineering: the baseline model helps us think about the relationship between the predictor and response variables. For details of the dataset, please visit its page on Kaggle. It contains 14 columns. Note: in the train data, there is one human error in the company_size column. This is therefore one important factor for a company to consider when deciding on a location to begin in or relocate to.
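The one-hot encoding step mentioned above can be illustrated with a small sketch. The toy frame below is an assumption made for illustration only; the text does not list which nominal features were actually encoded, so `gender` is used here simply because it is a nominal column in the dataset.

```python
import pandas as pd

# Toy frame with one nominal and one numeric column (values are made up)
df = pd.DataFrame({
    "gender": ["Male", "Female", "Other", "Male"],
    "training_hours": [36, 47, 83, 52],
})

# Replace the nominal column with one 0/1 indicator column per category
encoded = pd.get_dummies(df, columns=["gender"], dtype=int)
```

Each category becomes its own indicator column (`gender_Female`, `gender_Male`, `gender_Other`), so no artificial ordering is imposed on the categories.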
Many people sign up for the training. Features: city_development_index: development index of the city (scaled); relevent_experience: relevant experience of the candidate; enrolled_university: type of university course enrolled in, if any; education_level: education level of the candidate; major_discipline: education major discipline of the candidate; experience: candidate's total experience in years; company_size: number of employees in the current employer's company; last_new_job: difference in years between the previous job and the current job; target: 0 = not looking for a job change, 1 = looking for a job change. Determine a suitable metric to rate the performance of the model. Recommendation: the data suggests that employees with a STEM major discipline are more likely to leave than those from other disciplines (Business, Humanities, Arts, Other). Explore the people who join the company's data science training and their interest in changing jobs or becoming data scientists at the company. Looking at the categorical variables, though, experience and being a full-time student appear to be good indicators. This dataset is designed to understand the factors that lead a person to leave their current job, for HR research as well, and involves using model(s) to predict the probability that a candidate will look for a new job or will work for the company, as well as interpreting the factors that affect the employee's decision. We used this final model to increase our AUC-ROC to 0.8. A big advantage of the gradient boosting classifier is that it calculates the importance of each feature for the model and ranks them. This exploratory analysis takes a basic look at the publicly available data to see the behaviour and unravel what is happening in the market, using the HR Analytics: Job Change of Data Scientists dataset found on Kaggle. Note that after imputing, I round the imputed label-encoded categories so they can be decoded as valid categories.
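The rounding step described above (imputed codes are continuous, while category codes must be integers) can be sketched as follows. The helper name `decode_imputed`, the category order, and the imputed values are illustrative assumptions, not the author's actual code.

```python
# Assumed ordinal order of education_level categories (illustrative)
categories = ["Primary School", "High School", "Graduate", "Masters", "Phd"]

def decode_imputed(values, categories):
    """Round continuous imputed codes to the nearest valid integer code
    and map each code back to its category label."""
    decoded = []
    for v in values:
        code = int(round(v))
        # Clip to the valid code range so decoding never fails
        code = max(0, min(len(categories) - 1, code))
        decoded.append(categories[code])
    return decoded

# Imputation can produce fractional or slightly out-of-range codes
imputed = [1.8, 2.2, -0.3, 4.6]
labels = decode_imputed(imputed, categories)
```

Clipping matters because an imputer can produce values just outside the observed range of codes, which rounding alone would not repair.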
The training dataset, with 20133 observations, is used for model building, and the built model is validated on the validation dataset of 8629 observations. The gradient boosting classifier gave us the highest accuracy and AUC-ROC score. Problem Statement: The model I created shows an AUC (area under the curve) of 0.75. What I wanted to see, though, are the coefficients produced by the model: they intuitively show that years of experience are one of the indicators of job movement as a data scientist. Synthetically oversampling the data using the Synthetic Minority Oversampling Technique (SMOTE) results in the best-performing logistic regression model, as seen from the highest F1 and recall scores. So we need a new method that can reduce cost (money and time) and increase the probability of success in order to reduce cost per hire (CPH). This is the violin plot for the numeric variable city_development_index (CDI) and target. Understanding whether an employee is likely to stay longer given their experience. Recommendation: the data suggests that employees who have been in the company for less than a year, or for 1 or 2 years, are more likely to leave than someone who has been in the company for 4+ years. Calculating how likely employees are to move to a new job in the near future. Predict the probability that a candidate will look for a new job or will work for the company, and interpret the factors that affect the employee's decision.
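To make the F1 and recall scores mentioned above concrete, here is a minimal sketch that computes them from raw labels. The function name and the toy labels are assumptions for illustration; they are not results from this project.

```python
def recall_and_f1(y_true, y_pred):
    """Compute recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, f1

# Toy imbalanced example: 6 stayers (0) and 4 job seekers (1)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
recall, f1 = recall_and_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case accuracy is 0.7 while recall is only 0.5, which is one reason accuracy alone can be a misleading metric on an imbalanced target.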
Recommendation: This could be due to various reasons. People with more experience (11+ years) are probably good candidates to screen for when hiring for training, as they are more likely to stay and work for the company. There is also a need to explore why people with less than one year or 1-5 years of experience are more likely to leave. The target isn't included in the test set, but a file with the test target values is available for related tasks. A company which is active in Big Data and Data Science wants to hire data scientists from among the people who successfully pass some courses conducted by the company. From this dataset, we assume that the course is free video learning. What is the maximum city development index? Interpret the model(s) in a way that illustrates which features affect the candidate's decision. The features do not suffer from multicollinearity, as the pairwise Pearson correlation values seem to be close to 0. The dataset is imbalanced and most features are categorical (nominal, ordinal, binary), some with high cardinality. Questionnaire (a list of questions to identify candidates who will work for the company or will look for a new job). Before jumping into the data visualization, it's good to take a look at the meaning of each feature. We can see that the dataset includes numerical and categorical features, some of which have high cardinality. More than 70% of the people have relevant experience. In our case, company_size and company_type contain the most missing values, followed by gender and major_discipline. In the end, the HR department has more options to recruit with the same budget compared with the old method, and also has more time to focus on candidate qualifications and bring the best candidates to the company. AUC-ROC tells us how well the model is capable of distinguishing between classes. I have also tried Random Forest.
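The statement above that AUC-ROC measures how well a model separates the classes can be read as a pairwise ranking: it is the fraction of (positive, negative) pairs in which the positive example gets the higher score. A minimal sketch of that interpretation, with hypothetical scores, is:

```python
def auc_roc(y_true, scores):
    """AUC as the probability that a random positive example is scored
    higher than a random negative example (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities (made up for illustration)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_roc(y_true, scores)  # 3 of the 4 pairs are ranked correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation. The quadratic loop is fine for a sketch but would be slow on the full dataset.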
The company wants to know which of these candidates really want to work for the company after training and which are looking for new employment, because this helps to reduce cost and time and to improve the quality of training, course planning, and the categorization of candidates. Reducing cost and increasing the probability that a candidate is hired lowers the cost per hire and makes the recruitment process more efficient. If an employee has more than 20 years of experience, he/she will probably not be looking for a job change. HR Analytics: Job Change of Data Scientists. Question 2. Missing-value imputation can be part of your pipeline as well. Don't label encode null values, since I want to keep missing data marked as null for imputing later. The source of this dataset is Kaggle. All the code can be found at the GitHub link. The following models are built and evaluated. Furthermore, we wanted to understand whether a greater number of job seekers came from developed areas. And since these companies had varying sizes (number of employees), we decided to see whether size has an impact on an employee's decision to call it quits at their current place of employment. In order to control for the size of the target groups, I made a function that draws a stack plot to visualize correlations between variables. Exploring the numerical features in the data: what is the correlation between the city development index and training hours? A company is interested in understanding the factors that may influence a data scientist's decision to stay with a company or switch jobs. https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists/tasks?taskId=3015. There are 3 things that I looked at. When creating our model, one category may override the others because it accounts for 88% of all major_discipline values. Second, some of the features are similarly imbalanced, such as gender. XGBoost and LightGBM have good accuracy scores of more than 90%.
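The advice above about not label encoding null values can be sketched like this. The helper name `label_encode_keep_nulls` and the sample values are hypothetical; the idea is only that missing entries pass through untouched so an imputer can still recognise them later.

```python
def label_encode_keep_nulls(values):
    """Map each non-null category to an integer code and leave
    missing entries (None) as None for later imputation."""
    classes = sorted({v for v in values if v is not None})
    mapping = {c: i for i, c in enumerate(classes)}
    codes = [mapping[v] if v is not None else None for v in values]
    return codes, classes

# Toy column with missing entries (values are illustrative)
values = ["Pvt Ltd", None, "NGO", "Pvt Ltd", "Public Sector", None]
codes, classes = label_encode_keep_nulls(values)
```

Returning `classes` alongside the codes keeps the mapping available, so imputed codes can be decoded back to category labels afterwards.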
Using model(s) built on the current credentials, demographics, and experience data, you need to predict the probability that a candidate is looking for a new job or will work for the company, and interpret the factors that affect the employee's decision. There is a great approach for the first step. Hence, to reduce the cost of training, the company wants to predict which candidates are really interested in working for the company and which may look for new employment once trained. This dataset is a typical example of class imbalance, a problem that is handled using SMOTE (Synthetic Minority Oversampling Technique). Using a missingness matrix, you can very quickly find the pattern of missingness in the dataset. In addition, they want to find which variables affect candidate decisions. The city development index ranks cities according to their infrastructure, waste management, health, education, and city product.
The prediction target is severely imbalanced (far more target=0 than target=1). MICE is used to fill in the missing values in the features. The feature dimension can be reduced to ~30 and still represent at least 80% of the information of the original feature space. Of course, there is a lot of work left to drive this analysis further if time permits.
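The dimensionality claim above can be read as picking the smallest number of components whose cumulative explained variance reaches 80%. The text does not say which reduction technique was used, so the sketch below assumes a PCA-style decomposition; the function name and the variance values are hypothetical.

```python
def components_for_variance(explained_variance, threshold=0.80):
    """Return the smallest number of components whose cumulative share
    of the total variance reaches the threshold."""
    total = sum(explained_variance)
    cumulative = 0.0
    ordered = sorted(explained_variance, reverse=True)
    for k, var in enumerate(ordered, start=1):
        cumulative += var
        if cumulative / total >= threshold:
            return k
    return len(ordered)

# Hypothetical per-component variances, not values from this dataset
variances = [4.0, 3.0, 2.0, 0.5, 0.5]
k = components_for_variance(variances)  # cumulative shares: 0.4, 0.7, 0.9
```

Applied to the real feature space, the same rule is what would yield a figure such as the ~30 components quoted above.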
