This function maps semi-structured vacancy data to the official ESCO-ISCO classification. It matches multilingual free text against the ESCO occupations vocabulary and takes advantage of the hierarchical structure of the ESCO-ISCO mapping.

classify_occupation(
  corpus,
  id_col = "id",
  text_col = "text",
  lang = "en",
  num_leaves = 10,
  isco_level = 3,
  max_dist = 0.1,
  string_dist = NULL
)

## Arguments

| Argument | Description |
|---|---|
| corpus | A data.frame or a data.table that contains the id and the text variables. |
| id_col | The name of the id variable. |
| text_col | The name of the text variable. |
| lang | The language that the text is in. |
| num_leaves | The number of occupations/neighbors that are kept when matching. |
| isco_level | The ISCO level of the suggested occupations. Can be 1, 2, 3 or 4 for ISCO occupations, or NULL, which returns ESCO occupations. |
| max_dist | String distance used for fuzzy matching. The amatch function from the stringdist package is used. |
| string_dist | String dissimilarity measurement. Available string distance metrics: stringdist-metrics. |

## Value

If isco_level is NULL, a data.table with the id, the preferred label and the suggested ESCO occupation URIs (num_leaves predictions for each id). Otherwise, a data.table with the id, the preferred label and the suggested ISCO group of the requested level (one for each id).

## Details

First, the input text is cleansed and tokenized. The tokens are then matched with the ESCO occupations vocabulary, which is built from the preferred and alternative labels of the occupations. The matched tokens are joined with the tf-idf weighted tokens of the ESCO occupations, and the sum of the tf-idf scores is used to retrieve the suggested occupations.

Technically speaking, the suggested ESCO occupations are retrieved by solving the optimization problem

$$\arg\max_d\left\{\vec{u}_{binary}\cdot \vec{u}_d\right\}$$

where $\vec{u}_{binary}$ is the binary representation of a query in the ESCO-vocabulary space and $\vec{u}_d$ is the normalized vector of ESCO occupation $d$, generated by the tf-idf numerical statistic.

If an ISCO level is specified, the k-nearest neighbors algorithm is used to determine the suggested occupation: the retrieved ESCO occupations act as neighbors, and the ISCO group is chosen by a plurality vote among them at the corresponding hierarchical level.
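The scoring and voting steps described above can be illustrated with a minimal base R sketch. It is not the package's implementation: the vocabulary, the tf-idf weights, the occupation names (occ_a to occ_d) and their ISCO codes are all made up for illustration, and the sketch assumes that an ISCO group at level n is the first n digits of the 4-digit code.

```r
# Toy vocabulary and hypothetical occupations (illustrative, not real ESCO data).
vocab <- c("architect", "cashier", "engineer", "priest", "software")

# Rows: occupations, columns: vocabulary tokens; made-up tf-idf weights.
tfidf <- rbind(
  occ_a = c(0.9, 0.0, 0.3, 0.0, 0.0),
  occ_b = c(0.0, 0.0, 0.8, 0.0, 0.6),
  occ_c = c(0.0, 1.0, 0.0, 0.0, 0.0),
  occ_d = c(0.7, 0.0, 0.0, 0.0, 0.0)
)
colnames(tfidf) <- vocab

# Normalize each occupation vector to unit length (the u_d vectors).
tfidf <- tfidf / sqrt(rowSums(tfidf^2))

# Assumed 4-digit ISCO code of each hypothetical occupation.
isco_code <- c(occ_a = "2161", occ_b = "2512", occ_c = "5230", occ_d = "2162")

# Binary query vector (u_binary): 1 if the vocabulary token occurs in the query.
tokens <- c("junior", "architect", "engineer")
u_binary <- as.numeric(vocab %in% tokens)

# Dot product of the query with every occupation vector.
scores <- drop(tfidf %*% u_binary)

# Keep the num_leaves best-scoring occupations as neighbors.
num_leaves <- 3
neighbors <- names(sort(scores, decreasing = TRUE))[seq_len(num_leaves)]

# Plurality vote among the neighbors at the requested ISCO level.
isco_level <- 3
groups <- substr(isco_code[neighbors], 1, isco_level)
votes <- table(groups)
suggested <- names(votes)[which.max(votes)]
```

In this toy setup the neighbors are occ_a, occ_d and occ_b, and the vote selects group 216 because two of the three neighbors fall into it. Ties are resolved here by which.max, which picks the first group in sorted order; how the package breaks ties is not stated in this page.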

Before the suggestions are returned, the preferred label of each suggested occupation is added to the result, using the occupations_bundle and isco_occupations_bundle as look-up tables.

## References

van der Loo, M. P. J. (2014). The stringdist package for approximate string matching. R Journal, 6(1), 111-122.

Gweon, H., Schonlau, M., Kaczmirek, L., Blohm, M., & Steiner, S. (2017). Three Methods for Occupation Coding Based on Statistical Learning. Journal of Official Statistics, 33(1), 101-122.

Turrell, A., Speigner, B. J., Djumalieva, J., Copple, D., & Thurgood, J. (2019). Transforming Naturally Occurring Text Data Into Economic Statistics: The Case of Online Job Vacancy Postings.

ESCO Service Platform - The ESCO Data Model documentation

## Examples

corpus <- data.frame(
  id = 1:3,
  text = c(
    "Junior Architect Engineer",
    "Cashier at McDonald's",
    "Priest at St. Martin Catholic Church"
  )
)
classify_occupation(corpus = corpus, isco_level = 3, lang = "en", num_leaves = 5)
#>    id iscoGroup                                          preferredLabel
#> 1:  1       214 Engineering professionals (excluding electrotechnology)
#> 2:  2       523                              Cashiers and ticket clerks
#> 3:  3       263                      Social and religious professionals