Product Classification API Part 2: Data Preparation

 

This article is part 2 of a series on building a product classification API, focusing on how to prepare product titles (and short descriptions) for model training. It discusses how to measure data purity, including how to identify and exclude mis-categorized products, then walks through the title preparation steps: encoding as ASCII, lowercasing, tokenizing, and removing stop words, numeric words, short words, and duplicate words, before finally excluding empty titles. These steps help improve data quality and ensure the model is trained on accurate data.

🔍 Measuring data purity: purity is measured as the proportion of products with the same title and the same category out of the total number of products. Higher purity indicates cleaner data and helps identify and exclude mis-categorized products.

🎯 Title preprocessing steps: encoding titles as ASCII, lowercasing, tokenizing with a custom regular expression, removing stop words, numeric words, short words, and duplicate words, and excluding empty titles. These steps aim to improve the information density and accuracy of the titles.

🧹 Removing noise: dropping stop words (e.g., "free", "international"), numeric words (e.g., the "7" in "iPhone 7"), and short words (e.g., "TX") reduces noise in titles, making it easier for the model to learn a product's key features.

🔄 Excluding impure data: finally, products that share the same title but differ in category are flagged as "impure". Since we cannot tell which of them are categorized correctly, they are excluded from model training, which improves the model's ability to generalize.

This post is part 2 of the series on building a product classification API. The API is available for demo here. Part 1 available here; Part 3 available here. (GitHub repository)

Update: API discontinued to save on cloud cost.

In part 1, we focused on data acquisition and formatting the categories. Here, we’ll focus on preparing the product titles (and short description, if you want) before training our model.

This is part of a series of posts on building a product classification API.

Measuring data purity

We’ll have products within our data that are categorized incorrectly. How do we exclude these mis-categorized products from our training set?

Here’s one approach: If two products have the same title but different category, we assume that at least one of the products is mis-categorized (and the data is dirty).

Extending on the above, as we take steps to prepare our data, we’ll be measuring data “purity” at each step. In this instance, purity is defined as:

    Purity = number of products with the same title and same category / total number of products

This measures the proportion of products that have the same title and same category in our data. The higher the purity, the cleaner we can assume our data to be.
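
As a rough sketch of how this metric could be computed, assuming the products sit in a pandas DataFrame with a title column and a category column (the DataFrame layout and column names are assumptions for illustration, not the post's actual code):

def measure_purity(df, title_col='title', category_col='category'):
    """Proportion of products whose title maps to exactly one category.

    Assumes a pandas DataFrame with one row per product; a title seen with
    two or more categories marks all of its products as potentially mis-categorized.
    """
    # Number of distinct categories observed for each title
    categories_per_title = df.groupby(title_col)[category_col].nunique()

    # Titles that only ever appear with a single category are "pure"
    pure_titles = categories_per_title[categories_per_title == 1].index

    # Share of products (rows) whose title is pure
    return df[title_col].isin(pure_titles).mean()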

At the end of the data preparation, we’ll be able to identify which products are “impure”. Given that we’re unable to distinguish between correctly and incorrectly categorized products, we’ll exclude them from the training of the model.

Preparing the title (and short descriptions)

The titles need a bit of cleaning and preparation before we can train our model on them. In the next steps, we’ll go through some sample data cleaning and preparation procedures.

Encoding titles as ascii

It’s not uncommon to find non-ascii characters in data, sometimes due to sellers trying to add a touch of class to their product (e.g., Crème brûlée), or due to errors in scraping the data (e.g., &quot, &amp, &nbsp).

Thus, before doing any further processing, we'll ensure titles are properly encoded so that Crème brûlée -> Creme brulee, åöûëî -> aouei, and HTML entities such as &quot;, &amp;, and &nbsp; are unescaped into ", &, and a plain space.

Here’s the approach I took:

# Python 2 imports; HTML_PARSER is assumed to be an HTMLParser instance defined once at module level
import unicodedata
from HTMLParser import HTMLParser

HTML_PARSER = HTMLParser()


# Function to encode string
def encode_string(title, parser=HTML_PARSER):
    """ (str) -> str
    Returns a string that is encoded as ascii

    :param title:
    :return:

    >>> encode_string('Crème brûlée')
    'Creme brulee'
    >>> encode_string('åöûëî')
    'aouei'
    >>> encode_string('Crème brûlée " &  ')
    'Creme brulee " & '
    """
    try:
        encoded_title = unicodedata.normalize('NFKD', unicode(title, 'utf-8', 'ignore')).encode('ascii', 'ignore')
        encoded_title = parser.unescape(encoded_title).encode('ascii', 'ignore')
    except TypeError:  # if title is missing and a float
        encoded_title = 'NA'

    return encoded_title

There’s quite a bit going on in the code above, so let’s examine it piece by piece:

# Raw title as a utf-8 byte string (Python 2)
x = 'Cr\xc3\xa8me & br\xc3\xbbl\xc3\xa9e'; print x

# Convert titles into unicode
x = unicode(x, 'utf-8', 'ignore'); print x
>>> Crème & brûlée

# Normalize unicode (errors may crop up if this is not done)
x = unicodedata.normalize('NFKD', x); print x
>>> Crème & brûlée

# Encode unicode into ascii
x = x.encode('ascii', 'ignore'); print x
>>> Creme & brulee

# Parse html
x = HTML_PARSER.unescape(x).encode('ascii', 'ignore'); print x
>>> Creme & brulee

Lowercasing titles

Lowercasing titles is a fairly standard step in text processing. We’ll lowercase all title characters before proceeding.
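
There's nothing fancy here; a minimal sketch of this step is just Python's built-in str.lower:

def lowercase_title(title):
    """Return the title with all characters lowercased."""
    return title.lower()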

Tokenizing titles

One common way to tokenize text is via nltk.tokenize. I tried it and found it to be significantly slower than plain regular expressions. In addition, writing our own regex tokeniser gives us the flexibility to exclude certain characters from being used as split characters.

For example, we don't want the following words/phrases to be split on the punctuation character shown in brackets. Intuitively, these punctuation characters provide essential information; empirically, keeping them led to greater accuracy during model training and validation.

    - hyphen-words (-)
    - 0.9 (.)
    - 20% (%)
    - black/red (/)

Here’s how we write our own tokeniser:

import re


# Tokenize strings
def tokenize_title_string(title, excluded='-/.%'):
    """ (str) -> list(str)
    Returns a list of string tokens given a string.
    It will exclude the following characters from the tokenization: - / . %

    :param title:
    :return:

    >>> tokenize_title_string('hello world', '-.')
    ['hello', 'world']
    >>> tokenize_title_string('test hyphen-word 0.9 20% green/blue', '')
    ['test', 'hyphen', 'word', '0', '9', '20', 'green', 'blue']
    >>> tokenize_title_string('test hyphen-word 0.9 20% green/blue', '-.')
    ['test', 'hyphen-word', '0.9', '20', 'green', 'blue']
    >>> tokenize_title_string('test hyphen-word 0.9 20% green/blue', '-./%')
    ['test', 'hyphen-word', '0.9', '20%', 'green/blue']
    """
    return re.split("[^" + excluded + "\w]+", title)

Removing stop words

After tokenising our titles, we can proceed to remove stop words. The trick is in which stop words to remove. For the product classification API, I found a combination of the following to work well:

    - Stop words: nltk.corpus.stopwords
    - Colours: matplotlib.colors.cnames.keys
    - Self-defined: We also define some words that come across as spam, such as "free", "international", etc.
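
A rough sketch of combining these sources into a single set (the spam words below are illustrative examples from this post, not the full hand-curated list):

from matplotlib import colors
from nltk.corpus import stopwords  # requires a one-off nltk.download('stopwords')

# Illustrative spam words; the actual list was tuned by hand
SPAM_WORDS = {'free', 'international', 'intl', 'export', 'buyincoins'}

STOP_WORDS = set(stopwords.words('english')) | set(colors.cnames.keys()) | SPAM_WORDS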

At this point, after tokenising the titles, the tokens are stored in a list. We can remove stop words easily and cleanly via list comprehension, like so:

# Remove stopwords from string
def remove_words_list(title, words_to_remove):
    """ (list(str), set) -> list(str)
    Returns a list of tokens where the stopwords/spam words/colours have been removed

    :param title:
    :param words_to_remove:
    :return:

    >>> remove_words_list(['python', 'is', 'the', 'best'], STOP_WORDS)
    ['python', 'best']
    >>> remove_words_list(['grapes', 'come', 'in', 'purple', 'and', 'green'], STOP_WORDS)
    ['grapes', 'come']
    >>> remove_words_list(['spammy', 'title', 'intl', 'buyincoins', 'export'], STOP_WORDS)
    ['spammy', 'title']
    """
    return [token for token in title if token not in words_to_remove]

Removing words that are solely numeric

We’ll also remove words that are solely numeric. Intuitively, an iPhone 7, iPhone 8, or iPhone 21 should all be categorized as a mobile phone, and the numeric suffix does not add any useful information for categorizing it better. Can you think of a product where removing the numerics would put it in a different category?

Similar to above, removing numerics can be accomplished easily via list comprehension:

# Remove words that are fully numeric
def remove_numeric_list(title):
    """ (list(str)) -> list(str)
    Remove words which are fully numeric

    :param title:
    :return:

    >>> remove_numeric_list(['A', 'B2', '1', '123', 'C'])
    ['A', 'B2', 'C']
    >>> remove_numeric_list(['1', '2', '3', '123'])
    []
    """
    return [token for token in title if not token.isdigit()]

Removing words with too few characters

We also remove words that have character length below a certain threshold. E.g., if the threshold is two, then single character words are removed; if the threshold is three, then words with two characters are removed.

To an untrained eye (like mine), double character words like “TX”, “AB”, “GT” don’t add much informational value to the title—though there are exceptions like “3M”. Via cross-validation, I found that removing these words led to increased accuracy.

Here’s how we remove these double character words—you can change the word length threshold to suit your needs:

# Remove words with character count below threshold from string
def remove_chars(title, word_len=2):
    """ (list(str), int) -> list(str)
    Returns a list of str (tokenized titles) where tokens of character length <= word_len are removed.

    :param title:
    :param word_len:
    :return:

    >>> remove_chars(['what', 'remains', 'of', 'a', 'word', '!', ''], 1)
    ['what', 'remains', 'of', 'word']
    >>> remove_chars(['what', 'remains', 'of', 'a', 'word', '!', '', 'if', 'word_len', 'is', '2'], 2)
    ['what', 'remains', 'word', 'word_len']
    """
    return [token for token in title if len(token) > word_len]

Removing duplicated words

Next, we exclude duplicated words in titles. Sometimes, titles have duplicate words because sellers apply search engine optimisation (SEO) to their products to make them easier to find. However, these duplicate words do not provide any additional information for categorizing products.

We can remove duplicate tokens by converting the token list to a token set—yes, this removes any sequential information in the title. However, we’re only doing this step to identify impure products that should not be used in training our model. During the actual data preparation, we will exclude this step.

Converting a list to a set shouldn’t be too difficult, right? I’ll leave that to the reader.
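
For completeness, one minimal way to do it (a sketch; token order is not preserved, which is acceptable since this version of the title is only used to flag impure products):

def remove_duplicate_tokens(title):
    """Return the unique tokens in a tokenized title (order not preserved)."""
    return list(set(title))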

Removing empty titles

Lastly, after performing all the cleaning and preparation above, there may be some titles that have no text left. (This means that those titles only contained stop words, numerics, or words with < 3 character length.) We’ll exclude these products as well.
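
If titles are kept as token lists, a sketch of this filter could look like the following (the (tokens, category) pair structure is illustrative, not the post's exact data layout):

def drop_empty_titles(products):
    """Keep only products whose cleaned title still has at least one token."""
    return [(tokens, category) for tokens, category in products if tokens]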

Excluding titles that are impure

After doing the above, we’re left with titles in their most informationally rich and dense form. In this case, we’re confident that products with identical titles and categories are correctly categorized, while products with identical titles but different categories have at least one error among them (i.e., are impure).

Since we have no ground truth on which of the impure products are correctly or incorrectly categorized, we’ll discard all of them and not use them to train our model.
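
Here’s a sketch of how the impure products could be flagged and dropped, again assuming a pandas DataFrame with a cleaned-title column and a category column (the column names are assumptions):

def drop_impure_products(df, title_col='clean_title', category_col='category'):
    """Drop products whose cleaned title appears with more than one category."""
    # Count the distinct categories seen for each cleaned title
    categories_per_title = df.groupby(title_col)[category_col].nunique()

    # A title mapping to two or more categories is "impure"
    impure_titles = categories_per_title[categories_per_title > 1].index

    # Keep only products whose cleaned title is pure
    return df[~df[title_col].isin(impure_titles)]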

Conclusion

Whew! That’s a lot of work just to clean titles! Nonetheless, we’re largely done with the data preparation steps.
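
As a recap, chaining the functions above into a single per-title pipeline might look roughly like this (a sketch; STOP_WORDS and the parameter values are assumptions based on the steps described):

def prepare_title(title, words_to_remove=STOP_WORDS, word_len=2):
    """Clean a raw product title into a list of informative tokens."""
    encoded = encode_string(title)                        # ascii-encode and unescape html
    tokens = tokenize_title_string(encoded.lower())       # lowercase, then tokenize keeping - / . %
    tokens = remove_words_list(tokens, words_to_remove)   # drop stop words, colours, spam words
    tokens = remove_numeric_list(tokens)                  # drop purely numeric tokens
    tokens = remove_chars(tokens, word_len)               # drop very short tokens
    return tokens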

Next, we’ll cover the framework for making this product classifier available online via a simple web UI. This will involve the following:

    - Writing a class to take in titles, prepare them, and categorize them
    - Writing a simple Flask app

If you found this useful, please cite this write-up as:

Yan, Ziyou. (Dec 2016). Product Classification API Part 2: Data Preparation. eugeneyan.com. https://eugeneyan.com/writing/product-categorization-api-part-2-data-preparation/.

or

@article{yan2016preparation,
  title   = {Product Classification API Part 2: Data Preparation},
  author  = {Yan, Ziyou},
  journal = {eugeneyan.com},
  year    = {2016},
  month   = {Dec},
  url     = {https://eugeneyan.com/writing/product-categorization-api-part-2-data-preparation/}
}
