Machine Learning for Resume Matching

Resume Matcher

Business leaders at companies around the world spend too much time sifting through resumes to identify candidates for open positions. Time is wasted, hiring managers are left frustrated, and candidates come away with a negative impression of the company. There is an opportunity to use advanced analytics and Natural Language Processing (NLP) to better match job opportunities with qualified candidates.

Wouldn’t it be nice to spend less time reviewing resumes, and more time interacting with appealing candidates? Different companies place emphasis on the skills listed in the resume, colleges, location, work history, or any combination of these factors.

Mosaic Data Science, a leading big data consulting company, has built a solution that ranks potential candidates with a probabilistic model, using machine learning techniques and NLP to reduce the time that Human Resources (HR) professionals, business leaders, and hiring managers spend reviewing resumes. The classification model gives higher preference to candidates with a higher likelihood of receiving a job offer.

The tool uses a supervised machine learning classification model to rank submitted resumes and recommend which ones should be given further consideration. As training input, the model takes the text of prior years’ resumes (the more the better), candidate tracking data, and the decisions made for those candidates as to whether or not they were selected for a phone screen.

The performance of the classification model will be optimized based on standard classification evaluation metrics, such as sensitivity and specificity. Mosaic will work with stakeholders to determine the combination of metrics and objectives that best aligns with the business processes and costs for candidate ranking. Pre-processing will include parsing and tagging the resumes and applying a simple ontology to identify work history, location, degree program, specific skills, etc.
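As a minimal sketch of the two metrics named above, sensitivity is the share of truly selected candidates that the model also selects, and specificity is the share of truly rejected candidates that the model also rejects. The function name and the sample labels below are illustrative assumptions, not Mosaic's implementation or data.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels.

    Assumption: 1 means "selected for a phone screen", 0 means "not selected".
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives

    # Guard against division by zero when a class is absent.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Made-up labels for eight historical resumes (illustration only).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Which of the two matters more depends on the business cost of missing a strong candidate versus spending time on a weak one, which is why the trade-off is set together with stakeholders.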

The models will use term frequency-inverse document frequency (tf-idf) features on both uni-grams and bi-grams, as well as other specific features based on school distance, prestige, and desirability of specific skills from the ontology. Mosaic will compare the performance of multiple modeling approaches, including logistic regression, naïve Bayes, support vector machines, and random forests, and will select the approach that performs best on the tagged data set according to the selected metrics. Accepted cross-validation practices will be followed to ensure that the model performance generalizes to resumes outside of the training set.
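To make the tf-idf features concrete, here is a small standard-library sketch that builds uni-gram and bi-gram terms and weights them by tf × log(N / df). The tokenizer, the unsmoothed idf formula, and the three sample resumes are assumptions chosen for illustration; a production pipeline would likely rely on an established library with its own weighting variant.

```python
import math
import re
from collections import Counter


def ngrams(text):
    """Lowercase the text and return its uni-grams followed by its bi-grams."""
    # Keep '+' and '#' so skills such as "c++" or "c#" survive tokenization.
    tokens = re.findall(r"[a-z0-9+#]+", text.lower())
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def tfidf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    term_lists = [ngrams(doc) for doc in documents]
    n_docs = len(documents)

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for terms in term_lists:
        doc_freq.update(set(terms))

    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        total = len(terms) or 1  # avoid division by zero for empty text
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Made-up resume snippets (illustration only).
resumes = [
    "Python developer with machine learning experience",
    "Java developer with project management experience",
    "Machine learning researcher, Python and statistics",
]
vectors = tfidf(resumes)

# Terms unique to one resume get the largest weights for that resume.
for term, weight in sorted(vectors[1].items(), key=lambda kv: -kv[1])[:3]:
    print(f"{term}: {weight:.3f}")
```

Bi-grams let the features distinguish phrases such as "machine learning" or "project management" from their individual words, which a uni-gram-only keyword search cannot do.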

In scoring mode, the model will use stored parameters to compute scores for new candidates.
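Scoring mode can be pictured roughly as follows, assuming for illustration that logistic regression was the selected approach. The parameter values, feature names, and candidate identifiers are hypothetical; in practice the stored parameters would come from the training run and the features from the pre-processing step.

```python
import json
import math

# Hypothetical stored parameters, as they might be saved after training.
STORED_PARAMS = json.dumps({
    "intercept": -1.0,
    "weights": {"python": 1.2, "machine learning": 0.8, "sql": 0.5},
})


def score(features, params):
    """Logistic score: sigmoid(intercept + sum of weight * feature value)."""
    z = params["intercept"] + sum(
        params["weights"].get(name, 0.0) * value  # unseen features add nothing
        for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


def rank(candidates, params):
    """Return (candidate_id, score) pairs sorted from highest to lowest score."""
    scored = [(cid, score(feats, params)) for cid, feats in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


params = json.loads(STORED_PARAMS)

# Made-up feature vectors for three new candidates (illustration only).
new_candidates = {
    "cand_a": {"python": 1.0, "sql": 1.0},
    "cand_b": {"java": 1.0},
    "cand_c": {"python": 1.0, "machine learning": 1.0, "sql": 1.0},
}

ranking = rank(new_candidates, params)
for candidate_id, probability in ranking:
    print(f"{candidate_id}: {probability:.3f}")
```

Because only the stored parameters are needed at this stage, new resumes can be scored without retraining the model.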

A typical engagement is structured as follows:

  • Build the Mosaic project team’s understanding of the business objectives and requirements for the project
  • Define key business metrics for assessing the opportunity and ROI on the project
  • Review current methodology for decisions used to rank candidates
  • Define how insights and predictions from the model will be integrated into the hiring process
  • Perform a preliminary data assessment, integration, and basic analysis to identify any problems with data cleanliness, completeness, or alignment between sources
  • Discuss additional analyses to be performed in the future
  • Discuss possible modeling approaches for ranking candidates
  • Define technical requirements for a potential solution that align with business objectives and requirements
  • Specify how further collaboration will occur during the project (regular update meetings with the stakeholder team, ad hoc collaboration between scheduled updates, etc.)


One client saw matching accuracy improve by over 40% after switching from a keyword search model to our resume matcher!