
SEPTEMBER 2024 | Volume 45, Issue 3

From Text to Metadata: Automated Product Tagging with Python and Natural Language Processing

Aayushi Verma

Data Science Fellow
Institute for Defense Analyses
Alexandria, VA

Omar Agha Khan

Data Science Fellow
Institute for Defense Analyses
Alexandria, VA

DOI: 10.61278/itea.45.3.1007

Abstract

The Institute for Defense Analyses (IDA) produces a variety of research deliverables such as reports, memoranda, slides, and other formats for our sponsors. Quickly summarizing keywords from these products for efficient retrieval of information on given research topics poses a challenge. IDA has numerous initiatives for tagging products with IDA-defined taxonomies of research terms, but tagging is a manual, time-consuming process that must be repeated periodically to cover newer products. To address this challenge, we developed a Python-based automated tagging pipeline. In this article, we introduce the mechanics of this pipeline, highlight current results, and discuss future applications for analyzing IDA’s research in terms of these tags.

Keywords: Natural language processing; metadata curation; automation; Python; keyword extraction

1. Introduction

The Institute for Defense Analyses (IDA) is a U.S.-based non-profit organization that manages multiple federally funded research and development centers (FFRDCs). IDA works with sponsoring federal departments and agencies to address challenging questions about U.S. national security and science policy across a range of topics and domains. IDA’s research is typically delivered to its sponsors as products, such as reports or memoranda.

As data is a core component of its objective and rigorous analyses, IDA is building and implementing a data strategy to develop solid, sustainable infrastructures and practices for leveraging data to efficiently respond to sponsor and business needs. The strategy promotes a data-first approach through multiple initiatives: making data accessible, creating common language around data, streamlining data flows, investing in data literacy, and improving and measuring the effects of these initiatives.

A key focus of IDA’s data strategy is to enable better, data-driven insights about the organization from an operations perspective. To support this goal, IDA has implemented two key initiatives: research taxonomies, and a newly developed automatic tagging pipeline that reduces the time and effort needed to tag IDA products according to these taxonomies.

Research Taxonomies

A key initiative of the data strategy is the development of a set of research taxonomies. These taxonomies form an organized hierarchy categorizing IDA’s research across eight broad domains derived from its four core competencies (see Figure 1). The taxonomies are an evolving set of terms governed by IDA’s Research Taxonomy Governance committee, which assesses the relevance of current and proposed terms and their structure in the hierarchies.

Figure 1. Taxonomies Derived from IDA’s Four Core Competencies

The taxonomy terms are metadata that describe the research. Since the development of these taxonomies began in 2021, IDA researchers have used them to tag research products and projects (Verma 2023), and future work on this initiative involves tagging researchers, research divisions, sponsors, and more.

To begin tagging research, IDA authors manually tagged a subset of their deliverables from the last five years with terms from the research taxonomies. These efforts created a dataset of the tagged products, with the taxonomy terms serving as metadata. This dataset has been immensely useful for quantifying IDA’s research and has led to the development of a dashboard to answer specific meta-questions about IDA’s research (e.g., “Which authors have published the most products for a given research area?” or “What is the research profile of an author?”) (Verma 2023).

However, the manual tagging efforts are periodic and require mass organization and coordination. The efforts are also time-intensive, and they add to the administrative burden of researchers, limiting their time for performing research. Because products are mostly written documents (such as reports and memoranda), we utilized natural language processing methods to automate the tagging of products.

In this article, we describe how we developed an automated tagging pipeline, outline future work, and discuss the organizational impact. The paper is organized as follows: Section 2 (Background) provides relevant background information and a brief literature review, Section 3 (Dataset) describes the data we used in our analysis, Section 4 (Methods and Results) describes how we conducted our analysis and the results we obtained, Section 5 (Discussion) details future areas of work we have identified and how we aim to address quality control issues, and Section 6 concludes this paper.

2. Background

IDA is not the only organization which has faced the issue of categorizing research documents with descriptive keywords. More generally, keyword extraction is an active area of academic research in natural language processing, used in similar applications like metadata curation, document summarization, topic modeling, and other applications (Bharti and Babu 2017; Choochaiwattana 2010; Firoozeh et al. 2020; Gopan et al. 2020; Hasan et al. 2017; Miah et al. 2021; Papagiannopoulou and Tsoumakas 2020; Thushara, Krishnapriya, and Nair 2017).

Commercial software, such as Google Cloud Natural Language API1, IBM Watson Natural Language Understanding2, TextRazor3, AYLIEN4, and Lexalytics5, performs text analytics and natural language processing, including keyword extraction. Alternatively, natural language processing software libraries such as nltk (Bird, Klein, and Loper 2009), scikit-learn (Pedregosa et al. 2011), spaCy (Honnibal et al. 2020), rake-nltk6, and quanteda (Benoit et al. 2018) are available in popular programming languages like Python and R.

We selected the hands-on approach, implementing our solution in Python with the nltk library (Bird, Klein, and Loper 2009) for keyword extraction. We chose this approach for several reasons: availability of and familiarity with pre-existing code, the need to integrate the automated tagging pipeline into IDA’s existing technology ecosystem, and time constraints for demonstrating a proof of concept.

This paper addresses the problem of extracting keywords from research documents and mapping them to an organization-specific taxonomy of terms, creating a set of tags that serve as metadata for these documents in an automated process.

Therefore, we devised two approaches for our pipeline. The first approach, direct tagging, extracts keywords from our products using the Term Frequency–Inverse Document Frequency (TF-IDF) scoring method and matches them to terms from our research taxonomies. TF-IDF scores are a key metric for assessing the relevance of a word in a document relative to its frequency both in the document itself and across the entire corpus of documents (Bird, Klein, and Loper 2009; Manning, Raghavan, and Schütze 2008). This is the approach we have implemented and describe in this paper.

The second approach, indirect tagging, will use methods like complex statistical measures and machine learning for inferring semantic meaning from the text. This is future work, which we discuss in Section 5 (Discussion).

 

1 https://cloud.google.com/natural-language
2 https://www.ibm.com/products/natural-language-understanding
3 https://www.textrazor.com/
4 https://aylien.com/
5 https://www.lexalytics.com/
6 https://github.com/csurfer/rake-nltk

3. Dataset

We used a small subset of IDA’s unclassified products published between 2011 and 2023 (1,474 products) as our raw dataset (Figure 2). Many of the papers from 2011–2015 were not available for our use, so the majority of the documents in this subset were published between 2016 and 2023. These numbers do not reflect the actual number of products published by IDA during these years.

Figure 2. Number of Documents in the Dataset by Year

We were unable to use 4.4% (65) of these products because the files were either inaccessible (e.g., password protected or in non-standard pdf formats) or non-parseable (e.g., .html, .xlsx, .jpg, .R files). In this context, “non-parseable” means that extracting text from these formats was outside the scope of the project. The remaining products (95.6%) were easily parseable pdf files, ranging from 1 to 799 pages, with an average of 70.84 pages per document.

4. Methods and Results

For this analysis, we used natural language processing methods implemented in Python to develop a pipeline (see Figure 3) for automating the extraction of text from the products in our dataset and accurately tagging them with terms from the taxonomies that best describe the research in the products. We selected Python for this pipeline over other comparable languages like R due to pre-existing code and familiarity with Python for natural language processing.

Figure 3. Visualization of the Auto-Tagging Pipeline

The first step of this pipeline is to ingest the pdf files from their storage location (Fig. 3, Step 1). We stored our dataset of 1,474 documents on an internal network folder and directed the pipeline to read pdf files from this dataset. In future iterations of this pipeline, we will instead ingest the pdf files in a serialized format from an internal database that integrates with other internal publication systems, as indicated by the dotted line in Fig. 3.

During this step, our code uses the logging Python package, which logs any errors that occur with pertinent details, such as the names and file paths of pdf files that our pipeline was unable to read due to inaccessible or non-parseable formats.
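As a minimal sketch of this ingestion step, the snippet below reads every pdf in a folder and logs any failures with the logging package; it assumes PyMuPDF (imported as fitz, the package we ultimately selected in Section 4) is used for reading, and the folder path, log file name, and logger name are illustrative placeholders rather than the pipeline's actual configuration.

import logging
from pathlib import Path

import fitz  # PyMuPDF

logging.basicConfig(filename="auto_tagging.log", level=logging.INFO)
logger = logging.getLogger("auto_tagger")


def ingest_pdfs(folder: str) -> dict:
    """Read every pdf in a folder, returning {file path: extracted text}."""
    texts = {}
    for pdf_path in Path(folder).glob("*.pdf"):
        try:
            with fitz.open(pdf_path) as doc:
                texts[str(pdf_path)] = "".join(page.get_text() for page in doc)
        except Exception as exc:
            # Log unreadable files (e.g., password protected or non-standard
            # formats) with pertinent details, as described above.
            logger.error("Could not read %s: %s", pdf_path, exc)
    return texts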

The next step of the process is to extract the text from these pdfs (Fig. 3, Step 2), and then clean the extracted unstructured text (Fig. 3, Step 3). The key steps of text cleaning involve tokenizing7, stemming8, lemmatizing9, and removing stop words10 and special characters (Bird, Klein, and Loper 2009).

We used Python nltk (Bird, Klein, and Loper 2009) and scikit-learn (Pedregosa et al. 2011) packages for these cleaning steps. We found frequent occurrences of boilerplate corporate language in these products, as shown in Figure 4. As these words do not provide insight about the research, we created a custom stop words dictionary to remove them from the text.

 

7 Tokenization is the process of splitting the input text into individual words and punctuation, each of which are called tokens (Bird, Klein, and Loper 2009). For example, the input text string ‘This is a sentence’ is tokenized to obtain ‘This’, ‘is’, ‘a’, ‘sentence’, each of which are tokens.
8 Stemming is the process of removing suffixes from word endings, for example, ‘processing’ is stemmed to ‘process’.
9 Similarly, lemmatization is the process of using the base or dictionary form of a word, called a lemma (Manning, Raghavan, and Schütze 2008). For example, each of ‘is’, ‘am’, and ‘are’ are lemmatized to their base lemma, ‘be’.
10 Stop words are very frequently used words whose inclusion does not add meaning to the semantic analysis, for example, ‘the’, ‘and’, ‘when’, etc.

 

‘executive summary’, ‘project leader’, ‘this page is intentionally blank’, ‘approved for public release’, ‘distribution is unlimited’, ‘about this publication’, ‘the work was conducted by the institute for defense analyses ida’, ‘institute for defense analyses mark center drive  alexandria virginia’, ‘copyright license under the clause’, ‘us government pursuant’, ‘approved’, ‘report’

Figure 4. Examples of Boilerplate Language Found in Extracted Text
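A minimal sketch of these cleaning steps with nltk follows; it assumes the relevant nltk corpora (punkt, stopwords, wordnet) have already been downloaded, and the custom stop phrases are examples drawn from Figure 4 rather than the full dictionary used in the pipeline.

import re

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

CUSTOM_STOP_PHRASES = [
    "this page is intentionally blank",
    "approved for public release",
    "distribution is unlimited",
]


def clean_text(raw_text: str) -> list:
    """Lowercase, strip boilerplate and special characters, tokenize, remove stop words, lemmatize."""
    text = raw_text.lower()
    for phrase in CUSTOM_STOP_PHRASES:
        text = text.replace(phrase, " ")
    text = re.sub(r"[^a-z\s]", " ", text)  # drop special characters and digits
    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text)
    return [lemmatizer.lemmatize(tok) for tok in tokens
            if tok not in stop_words and len(tok) > 2]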

We then obtained term frequency–inverse document frequency (TF-IDF) scores for the cleaned text of each document (Fig. 3, Step 4). As noted in Section 2, TF-IDF assesses the relevance of a word in a document relative to its frequency both in the document itself and across the entire corpus of documents (Bird, Klein, and Loper 2009; Manning, Raghavan, and Schütze 2008).
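A minimal sketch of this scoring step using scikit-learn’s TfidfVectorizer is shown below; the two toy documents stand in for the cleaned product texts.

from sklearn.feature_extraction.text import TfidfVectorizer

cleaned_docs = [
    "mars lunar exploration space mission payload",   # placeholder cleaned text, document 1
    "network cybersecurity threat assessment model",  # placeholder cleaned text, document 2
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(cleaned_docs)  # rows: documents, columns: vocabulary
terms = vectorizer.get_feature_names_out()

# Highest-scoring terms for the first document.
scores = tfidf_matrix[0].toarray().ravel()
top_terms = sorted(zip(terms, scores), key=lambda pair: pair[1], reverse=True)[:5]
print(top_terms)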

We assessed two Python packages, PDFplumber (Singer-Vine et al. 2024) and PyMuPDF (Artifex Software, Inc. 2024), for the quality of their extraction of unstructured text from pdf files. We compared the time each package took to extract text from documents, the length of the extracted text, and the most important words returned by the TF-IDF scoring method.

We selected a random subset of five products from our dataset, and used both PDFplumber and PyMuPDF to extract the text from the pdf files. Table 1 shows how long (measured in seconds) each package took to extract text from each pdf. PyMuPDF was fastest in all cases.

Table 1. Time to Extract Text (Measured in Seconds)

Document PDFplumber PyMuPDF Time Difference
Test Document 1 19.99 0.43 19.56
Test Document 2 5.15 0.10 5.05
Test Document 3 8.46 0.15 8.31
Test Document 4 1.21 0.12 1.09
Test Document 5 2.66 0.06 2.60

 

Table 2 shows the amount of text extracted from each document by both packages, as measured by number of characters. In all cases, PyMuPDF extracted more characters.

Table 2. Length of Text Extracted (Measured in Number of Characters)

Document PDFplumber PyMuPDF Character Difference
Test Document 1 512260 530503 -18243
Test Document 2 108002 110890 -2888
Test Document 3 161914 165704 -3790
Test Document 4 16511 17905 -1394
Test Document 5 39091 40252 -1161
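The comparison itself can be reproduced with a short script along the following lines; the test file name is a placeholder, and both packages are assumed to be installed.

import time

import fitz  # PyMuPDF
import pdfplumber


def extract_with_pymupdf(path):
    with fitz.open(path) as doc:
        return "".join(page.get_text() for page in doc)


def extract_with_pdfplumber(path):
    with pdfplumber.open(path) as pdf:
        return "".join(page.extract_text() or "" for page in pdf.pages)


for name, extractor in [("PyMuPDF", extract_with_pymupdf), ("PDFplumber", extract_with_pdfplumber)]:
    start = time.perf_counter()
    text = extractor("test_document_1.pdf")  # placeholder file name
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.2f} s, {len(text)} characters extracted")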

We also compared the TF-IDF scores for each document by package. For simplicity, Figure 5 displays the difference (delta) in the TF-IDF scores returned by PDFplumber and PyMuPDF for two of the five test documents. The graph on the right shows highly variable differences for one document; however, the TF-IDF differences in the graph on the left are negligible, indicating that both packages tend to identify similar keywords with similar TF-IDF importance scores.

Figure 5. TF-IDF Delta Scores for Two Test Documents

We found that PyMuPDF produced the best results in terms of time to extract text from documents, length of text extracted, and relevance of important terms and thus used it in our pipeline.

As we created this pipeline to tag products with terms from IDA’s research taxonomies, we filtered the list of extracted keywords against terms from the taxonomies using the direct tagging method (Figure 3, Step 5) described in Section 2 (Background).

In the direct tagging process, we cleaned and lemmatized the taxonomies and kept only words from the document that exactly matched taxonomic terms. IDA has developed an internal application programming interface (API) to access and facilitate the governance of the research taxonomies. We use this API to pull the taxonomies from their location on the internal enterprise database and store them in an organized hierarchy for the direct tagging process.
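Conceptually, the direct-tagging step reduces to an exact-match filter between a document’s TF-IDF keywords and the lemmatized taxonomy terms, as in the sketch below; the taxonomy terms and scores shown are illustrative, since the real terms are pulled from the internal API.

taxonomy_terms = {"mars", "space", "lunar", "exploration"}  # placeholder lemmatized taxonomy terms


def direct_tags(keyword_scores: dict) -> dict:
    """Keep only keywords that exactly match a taxonomy term, with their TF-IDF scores."""
    return {term: score for term, score in keyword_scores.items() if term in taxonomy_terms}


print(direct_tags({"mars": 0.41, "space": 0.33, "budget": 0.12}))
# {'mars': 0.41, 'space': 0.33}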

Next, we combined this dataset of products, their directly tagged terms from the research taxonomies, and their TF-IDF importance scores with other metadata about the product, such as product file path, date published, and author names (Figure 3, Step 6), to create an enriched product dataset.
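One illustrative way to build such an enriched dataset is a pandas join between the tag records and the product metadata; the column names and values below are hypothetical, not the pipeline’s actual schema.

import pandas as pd

tags = pd.DataFrame({
    "product_id": [101, 101],
    "tag": ["mars", "space"],
    "tfidf_score": [0.41, 0.33],
})
metadata = pd.DataFrame({
    "product_id": [101],
    "file_path": ["/products/doc_101.pdf"],
    "date_published": ["2022-06-01"],
    "authors": ["A. Verma; O. Agha Khan"],
})
enriched = tags.merge(metadata, on="product_id", how="left")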

We then performed quality assurance checks, such as ensuring that the directly tagged terms matched terms from the taxonomies and examining anomalous TF-IDF scores (Figure 3, Step 7). In this process, we generated visualizations of the outputs from the direct-tagging process to support our quality assurance checks. Figure 6 shows an example visualization for a single test document, depicting the TF-IDF scores returned by the algorithm for keywords in this document that match terms from the research taxonomies. From this figure, we can see the algorithm determined the most important keywords in the document to be ‘mars’, ‘space’, ‘lunar’, ‘exploration’, etc., which exactly match terms in the research taxonomies.

Figure 6. Example Visualization of Auto-Tagging Results for a Single Test Document
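A chart similar to Figure 6 can be produced with a few lines of matplotlib (one plausible plotting choice, not necessarily the one used for the figure); the tag scores below are illustrative.

import matplotlib.pyplot as plt

tag_scores = {"exploration": 0.22, "lunar": 0.28, "space": 0.33, "mars": 0.41}  # illustrative values

plt.barh(list(tag_scores.keys()), list(tag_scores.values()))
plt.xlabel("TF-IDF score")
plt.title("Taxonomy-matched keywords for a single test document")
plt.tight_layout()
plt.show()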

Finally, we stored this dataset in our internal SQL Server database for use by other applications in IDA’s technology infrastructure (Figure 3, Step 8).
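A hedged sketch of this storage step using pandas and SQLAlchemy is shown below; the connection string, driver, and table name are placeholders rather than IDA’s actual configuration.

from sqlalchemy import create_engine

# Placeholder connection string for a SQL Server instance via the pyodbc driver.
engine = create_engine("mssql+pyodbc://my_server/my_database?driver=ODBC+Driver+17+for+SQL+Server")

# 'enriched' is the DataFrame from the enrichment sketch above.
enriched.to_sql("auto_tagged_products", engine, if_exists="append", index=False)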

5. Discussion

This pipeline is a minimum viable product and the first step toward a fully automated and functional pipeline. We implemented only the direct tagging approach and identified several areas for further work.

The direct tagging process was successful in extracting key terms and their TF-IDF importance scores from products; however, we must also develop robust mechanisms for ensuring these extracted key terms reflect reality. We are currently working to address the following questions about the tagging quality.

  • Do the pipeline’s tags accurately describe the body of work?
  • Would a human tag the product with the same or similar terms?
  • Is the tagging process consistent across different lines of research?
  • How do we consider context of sentences, paragraphs, areas of research?
  • How do we deal with ambiguous terms like “space,” “land,” etc.?

To address the first question, ‘Do the pipeline’s tags accurately describe the body of work?’, we aim to devise a scheme for assessing how well the output tags reflect the body of work. These tags should ideally represent the main line of research described in the product to an appropriate degree of detail.

To address the second question, ‘Would a human tag the product with the same or similar terms?’, we aim to use the results of previous manual tagging (described in Section 1, Research Taxonomies) as a point of comparison for our auto-tagging algorithm. Specifically, we will compare the tags applied manually to a product with the tags produced for the same product by the auto-tagging algorithm. For this analysis, we will initially ask subject-matter experts (SMEs) at IDA to provide feedback on the differences in tags and, based on their feedback, aim to create an automated method for quantitatively assessing how likely a human author would be to tag the product with the tags returned by the auto-tagging algorithm.
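One simple candidate metric for this comparison (an illustration, not the authors’ established method) is the Jaccard index over the manual and automatic tag sets:

def jaccard_similarity(manual_tags: set, auto_tags: set) -> float:
    """Share of tags the two sets have in common, out of all tags used by either."""
    if not manual_tags and not auto_tags:
        return 1.0
    return len(manual_tags & auto_tags) / len(manual_tags | auto_tags)


print(jaccard_similarity({"space", "lunar", "propulsion"}, {"space", "lunar", "mars"}))  # 0.5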

Similarly, the third question, ‘Is the tagging process consistent across different lines of research?’, requires assessing the quality of tags returned for products pertaining to different areas of research. This quality measure is inherently tied to the research taxonomies, which evolve over time; some parts of the taxonomies are better defined, better organized, and richer in terms than others. As a result, the auto-tagging algorithm may return more tags for a product whose body of work falls in a well-defined, term-rich part of the taxonomies than for a product whose work falls in a less well-defined part. We aim to investigate methodologies for assessing and addressing this quality issue in future work.

The fourth and fifth questions, ‘How do we consider context of sentences, paragraphs, and areas of research?’ and ‘How do we deal with ambiguous terms like “space,” “land,” etc.?’, concern the semantics of the English language. If our auto-tagging algorithm can understand the context of the body of work, of paragraphs and sentences, and of ambiguous terms such as ‘space’ or ‘land’, it will tag products with more suitable terms. To address this quality issue, we will move toward indirect methods of tagging, namely machine learning methods that better capture context, as mentioned in Section 2 (Background).

This is especially important because, although the TF-IDF method is the most elementary approach to keyword extraction (Firoozeh et al. 2020), it is ultimately just a statistical measure11 of how often terms appear in a document relative to the rest of the corpus; it cannot infer semantic meaning, which is necessary for extracting complex terms from the text (Papagiannopoulou and Tsoumakas 2020; Bharti and Babu 2017).

In terms of machine learning methods for keyword extraction, there are two approaches: supervised and unsupervised. The supervised approach requires labeled training data; that is, the text of our training corpus must be labeled with whether or not each term is a keyword. Many algorithms can then classify a product’s text into keywords, for example the Keyphrase Extraction Algorithm (KEA), which uses a Naïve Bayes classifier and returns the probability of extracted phrases being keywords (Frank et al. 1999), support vector machine classifiers (Zhang et al. 2006), neural networks such as Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) (Augenstein et al. 2017; Wang and Zhang 2017), and more.

Unsupervised methods for keyword extraction do not depend on labeled training data. Some examples include the Rapid Automatic Keyword Extraction (RAKE) algorithm (Rose et al. 2010; Thushara, Krishnapriya, and Nair 2017), N-grams and parts-of-speech (POS) tagging (Hulth 2003), the TextRank algorithm (Mihalcea and Tarau 2004), Latent Dirichlet Allocation (LDA) (Blei, Ng, and Jordan 2003), and more.
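As a small illustration of an unsupervised method, the rake-nltk package cited earlier can extract ranked keyphrases in a few lines (assuming rake-nltk and the nltk stopwords and punkt corpora are installed); the input sentence is a placeholder.

from rake_nltk import Rake

rake = Rake()  # uses nltk English stop words and punctuation by default
rake.extract_keywords_from_text(
    "Lunar exploration missions require novel propulsion and life support systems."
)
print(rake.get_ranked_phrases_with_scores()[:3])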

11 Other statistical measures for keyword extraction include co-occurrence-based features, which measure how often terms appear together (Firoozeh et al. 2020), and similarity scores which measure the distance between vectors of terms using measures like the cosine similarity, Dice coefficient, and Jaccard index (Firoozeh et al. 2020; Manning, Raghavan, and Schütze 2008).

Because of IDA’s prior manual tagging efforts, we have a set of products with human-labeled tags from IDA’s research taxonomies. We can use this dataset for our future work on machine learning methods for keyword extraction to better grasp the complex semantics of the English language and of IDA’s research taxonomies, starting with the supervised machine learning approaches identified.

Therefore, in our future work on this pipeline, we hope to address the identified quality assurance questions and move toward indirect methods of tagging (i.e., utilizing machine learning methods to understand the context and semantics of the text). We also hope to work with subject matter experts to help validate and refine the auto-tagging results and to inform the quality assurance process.

We also aim to deploy this pipeline on our internal software ecosystem to automatically tag new products. Once deployed, the pipeline will present researchers with a base set of these automatically generated terms and allow them to simply add extra terms or remove irrelevant ones. This will reduce the administrative burden on product authors to tag their products, and this human-in-the-loop capability will ensure some human oversight of the automatically generated product terms.

6. Conclusion

We demonstrated the development of an automatic tagging pipeline for directly tagging IDA products with terms from our research taxonomies. We compared and used several Python packages for implementing natural language processing methods to extract text from products, clean and pre-process the text, extract relevant key terms from products, assess their importance scores, and directly match terms from IDA’s research taxonomies. Future work would include developing quality assurance processes in cooperation with subject matter experts, developing indirect tagging methods, and deploying this capability for IDA-wide use.

Acknowledgements

The authors would like to thank Heather Wojton, Matthew Avery, and Shawn Whetstone for their support of this project, and Jordan Marcusse, Sara Jordan, and William Doane for their technical review of this project. The authors would also like to thank the ITEA editorial team for their invaluable review of this article.

References

Artifex Software, Inc. (2024). PyMuPDF (Version 1.23.26) [Computer software]. Available at https://github.com/pymupdf/PyMuPDF.
Augenstein, Isabelle, Mrinal Das, Sebastian Riedel, Lakshmi Vikraman, and Andrew McCallum. “SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Scientific Publications.” In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), edited by Steven Bethard, Marine Carpuat, Marianna Apidianaki, Saif M. Mohammad, Daniel Cer, and David Jurgens, 546-555. Vancouver, Canada: Association for Computational Linguistics, 2017. https://doi.org/10.18653/v1/S17-2091.

Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. “Quanteda: An R Package for the Quantitative Analysis of Textual Data.” Journal of Open Source Software 3, no. 30 (2018): 774. https://doi.org/10.21105/joss.00774.

Bharti, Santosh Kumar, and Korra Sathya Babu. “Automatic Keyword Extraction for Text Summarization: A Survey.” Last modified 2017. arXiv. https://arxiv.org/abs/1704.03242.

Bird, Steven, Ewan Klein, and Edward Loper. “Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit”. Sebastopol, CA: O’Reilly Media, Inc., 2009.

Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet Allocation.” The Journal of Machine Learning Research 3 (2003): 993-1022.

Choochaiwattana, Worasit. “Usage of Tagging for Research Paper Recommendation.” 2010 3rd International Conference on Advanced Computer Theory and Engineering (ICACTE), 439-442. Chengdu, China, 2010. https://doi.org/10.1109/ICACTE.2010.5579321.

Fenniak, Mathieu, Matthew Stamy, pubpub-zz, Martin Thoma, Matthew Peveler, exiledkingcc, and pypdf Contributors. (2024). pypdf (Version 4.2.0) [Computer software]. https://pypi.org/project/pypdf/.

Firoozeh, Nazanin, Adeline Nazarenko, Fabrice Alizon, and Béatrice Daille. “Keyword Extraction: Issues and Methods.” Natural Language Engineering 26, no. 3 (2020): 259–91. https://doi.org/10.1017/S1351324919000457.

Frank, E., G. W. Paynter, I. H. Witten, C. Gutwin, and C. G. Nevill-Manning. “Domain-Specific Keyphrase Extraction.” In Proceedings of the 16th International Joint Conference on Artificial Intelligence, 668-673. Stockholm, Sweden, 1999. San Francisco, USA: Morgan Kaufmann Publishers.

Gopan, E., S. Rajesh, G. R. Vishnu, A. Raj R., and M. G. Thushara. “Comparative Study on Different Approaches in Keyword Extraction.” In 2020 Fourth International Conference on Computing Methodologies and Communication (ICCMC), 70-74. Erode, India, 2020. https://doi.org/10.1109/ICCMC48092.2020.ICCMC-00013.

Hasan, H. M. Mahedi, F. Sanyal, D. Chaki, and M. H. Ali. “An Empirical Study of Important Keyword Extraction Techniques from Documents.” In 2017 1st International Conference on Intelligent Systems and Information Management (ICISIM), 91-94. Aurangabad, India, 2017. https://doi.org/10.1109/ICISIM.2017.8122154.

Honnibal, Matthew, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. spaCy: Industrial-Strength Natural Language Processing in Python. 2020. https://doi.org/10.5281/zenodo.1212303.

Hulth, Anette. “Improved Automatic Keyword Extraction Given More Linguistic Knowledge.” In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, 216-223. 2003. https://aclanthology.org/W03-1028.

Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. “Introduction to Information Retrieval”. Cambridge University Press, 2008.

Miah, M. S. U., J. Sulaiman, S. Azad, K. Z. Zamli, and R. Jose. “Comparison of Document Similarity Algorithms in Extracting Document Keywords from an Academic Paper.” In 2021 International Conference on Software Engineering & Computer Systems and 4th International Conference on Computational Science and Information Management (ICSECS-ICOCSIM), 631-636. Pekan, Malaysia, 2021. https://doi.org/10.1109/ICSECS52883.2021.00121.

Mihalcea, Rada, and Paul Tarau. “TextRank: Bringing Order into Text.” In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, edited by Dekang Lin and Dekai Wu, 404-411. Barcelona, Spain: Association for Computational Linguistics, July 2004. https://aclanthology.org/W04-3252.

Papagiannopoulou, Eleni, and Grigorios Tsoumakas. “A Review of Keyphrase Extraction.” WIREs Data Mining and Knowledge Discovery 10 (2020): e1339. https://doi.org/10.1002/widm.1339.

Pedregosa, Fabian, Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Edouard Duchesnay. “Scikit-learn: Machine Learning in Python.” Journal of Machine Learning Research 12 (2011): 2825-2830.

Rose, Stuart, Dave Engel, Nick Cramer, and Wendy Cowley. “Automatic Keyword Extraction from Individual Documents.” In Text Mining, edited by Michael W. Berry and Jacob Kogan. 2010. https://doi.org/10.1002/9780470689646.ch1.

Singer-Vine, J., & The pdfplumber contributors. (2024). pdfplumber (Version 0.10.4) [Computer software]. https://github.com/jsvine/pdfplumber

Thushara, M. G., M. S. Krishnapriya, and S. S. Nair. “A Model for Auto-Tagging of Research Papers Based on Keyphrase Extraction Methods.” In 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), 1695-1700. Udupi, India, 2017. https://doi.org/10.1109/ICACCI.2017.8126087.

Verma, Aayushi. “I-TREE: A Tool for Characterizing Research Using Taxonomies.” The ITEA Journal 44, no. 3 (2023).

Wang, Y., and J. Zhang. “Keyword Extraction from Online Product Reviews Based on Bi-Directional LSTM Recurrent Neural Network.” In 2017 IEEE International Conference on Industrial Engineering and Engineering Management (IEEM), 2241-2245. Singapore, 2017. https://doi.org/10.1109/IEEM.2017.8290290.

Zhang, K., H. Xu, J. Tang, and J. Li. “Keyword Extraction Using Support Vector Machine.” In Advances in Web-Age Information Management. WAIM 2006. Lecture Notes in Computer Science, edited by J.X. Yu, M. Kitsuregawa, and H.V. Leong, vol. 4016, 85-96. Berlin, Heidelberg: Springer, 2006. https://doi.org/10.1007/11775300_8.

Appendix

This is a larger, reader-friendly version of Fig. 3.

Figure A-1. Larger Visualization of the Auto-Tagging Pipeline

Author Biographies

Aayushi Verma is a Data Science Fellow at the Institute for Defense Analyses (IDA), where she collaborates with the Chief Data Officer to drive IDA’s data strategy. She has developed numerous data pipelines and visualization dashboards to bring data-driven insights to staff. Her data science interests include machine learning/deep learning, image processing, and extracting stories from data. Aayushi holds an M.S. in Data Science from Pace University, and a B.Sc. (Hons.) in Astrophysics from the University of Canterbury.

Omar Agha Khan is a Data Science Fellow at the Institute for Defense Analyses (IDA), where he contributes to both the data initiative, led by the Chief Data Officer and the execution, operations, and planning needs of the Operational Evaluation Division (OED) at IDA. By integrating data pipelines and developing engaging dashboards and apps, he has enhanced the staff’s ability to derive actionable insights from data. Omar’s data science interests include machine learning/deep learning, natural language processing, and large language models. Omar holds an M.S. in Data Science and a B.Sc. in Biological Sciences from the University of Maryland Baltimore County.
