https://eugeneyan.com/rss Sep 30, 19:15
Building a Product Classification API: Implementation and Deployment

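This article is the third part of a series on building a product classification API, and focuses on creating a custom Python class to classify product titles. It details the design of the TitleCategorize class, including initialization, title preprocessing, and the categorization method. It also demonstrates how to use a Python decorator to time the API function, and how to build a simple web application with the Flask framework to expose the API. It closes by mentioning the follow-up steps of building the machine learning model and deploying the API to a web server, and provides code examples and deployment suggestions to help readers understand how the API is implemented and deployed.

🔹 **Custom classification class design:** The article shows how to create a Python class named `TitleCategorize` that takes a product title, preprocesses it (encoding, lowercasing, tokenizing, removing stop words and numbers, etc.), feeds it to a pre-trained classification model, and returns the three most relevant product categories and their confidence.

🔹 **Title preprocessing pipeline:** The `prepare` method of `TitleCategorize` walks through the preprocessing steps: `encode_string` for HTML entity encoding, `lower()` for lowercasing, `tokenize_title_string` for tokenization, `remove_words_list` to remove stop words, `remove_numeric_list` to remove numbers, `remove_chars` to remove specific characters, and `singularize_list` to singularize words, so that input titles are processed the same way as during model training.

🔹 **API function and timing decorator:** A wrapper function named `title_categorize` simplifies use of the `TitleCategorize` class, and a `@timer` decorator is introduced. The decorator records the start and end time of a function call, then computes and logs the total elapsed time in milliseconds, which removes duplicated code and makes the code easier to maintain.

🔹 **Building the Flask web app:** To make the product classification API user-facing, the article shows how to build a simple web app with Flask. With two routes, `/` (home page) and `/categorize_web` (classification endpoint), users can enter a product title in a web page and get the classification results along with the processing time.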

This post is part 3, and the last, of the series on building a product classification API. The API is available for demo here. Parts 1 and 2 are available here and here. (GitHub repository)

Update: API discontinued to save on cloud cost.

In part 1, we focused on acquiring the data, and cleaning and formatting the categories. Then in part 2, we cleaned and prepared the product titles (and short description) before training our model on the data. In this post, we’ll focus on writing a custom class for the API and building an app around it.

The desired end result is a webpage where users can enter a product title and get the top three most appropriate categories for it, like so.

Input: Title. Output: Suggested categories.

Creating a TitleCategorize Class

In most data science work using Python, we seldom have to write our own data structures or classes. Python is rich in useful data structures like dicts, sets, lists, etc. Also, thanks to Wes McKinney, most data wrangling can be done with one main data structure/class, the pandas dataframe.

For the API, what data structure should we use?

We can continue to use the pandas dataframe and perform all our operations on it. However, we don’t need something so heavy-duty (with fast indexing, joins, etc.). Perhaps we should write our own class instead.

Before writing any code, let’s think about how we expect the API to work:

    1. User provides a title as input
    2. Title is cleaned and prepared via the approach described in post 2 (title preparation for new input titles should be the same as in the model training process)
    3. Prepared title is provided as input to the classification model
    4. Classification model returns top x categories and associated probabilities

Based on the above, this is what our TitleCategorize class should do:

    1. Take a title string as input
    2. Clean and prepare the title string
    3. Input the prepared title string to the classification model
    4. Return results from the classification model

Looks simple enough. Here’s what our class looks like:

# The helper functions (encode_string, tokenize_title_string, etc.), HTML_PARSER,
# STOP_WORDS, get_score, model, and logger are assumed to be defined elsewhere
# in the project (title preparation is described in part 2).

class TitleCategorize:
    """
    Class to predict product category given a product title.
    """

    def __init__(self, title):
        self.title = title

    def prepare(self, excluded='-.'):
        """ (str) -> TitleCategorize
        Prepares the title via the same process used to clean titles for training.
        The prepared tokens are stored in self.title; self is returned to allow chaining.
        :return:
        >>> TitleCategorize('Crème brûlée " &  ').prepare().title
        ['creme', 'brulee']
        >>> TitleCategorize('test hyphen-word 0.9 20% green/blue').prepare().title
        ['test', 'hyphen-word', '0.9']
        >>> TitleCategorize('grapes come in purple and green').prepare().title
        ['grapes', 'come']
        >>> TitleCategorize('what remains of a word ! if wordlen is 2').prepare().title
        ['remains', 'word', 'wordlen']
        """
        self.title = encode_string(self.title, HTML_PARSER)
        self.title = self.title.lower()
        self.title = tokenize_title_string(self.title, excluded)
        self.title = remove_words_list(self.title, STOP_WORDS)
        self.title = remove_numeric_list(self.title)
        self.title = remove_chars(self.title, 1)
        self.title = singularize_list(self.title)
        logger.info('Title after preparation: {}'.format(self.title))
        return self

    def categorize(self):
        """ (TitleCategorize) -> dict
        Categorizes prepared title and returns a dictionary of form {1: 'Cat1', 2: 'Cat2', 3: 'Cat3'}
        :return:
        >>> TitleCategorize('This is a bookshelf with wood and a clock').prepare().categorize()
        {1: 'Electronics -> Home Audio -> Stereo Components -> Speakers -> Bookshelf Speakers',
        2: 'Electronics -> Computers & Accessories -> Data Storage -> USB Flash Drives',
        3: 'Home & Kitchen -> Furniture -> Home Office Furniture -> Bookcases'}
        """
        # Get the top 3 categories from the model, then key them by rank (1, 2, 3)
        result_list = get_score(self.title, model, 3)
        result_dict = dict()
        for i, category in enumerate(result_list):
            result_dict[i + 1] = category
        return result_dict
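The categorize method depends on get_score, which is not shown in this post. As a rough sketch of what a top-k lookup like that might do, here is a self-contained example. It assumes the model's output is available as a plain mapping of category name to probability; `top_k_categories` and `example_scores` are hypothetical names and made-up values, not part of the original code.

```python
import heapq


def top_k_categories(scores, k=3):
    """Return the k highest-scoring categories as {rank: category}.

    `scores` is a hypothetical mapping of category name -> probability;
    the post's real `get_score` and `model` are not shown.
    """
    best = heapq.nlargest(k, scores.items(), key=lambda item: item[1])
    # Ranks start at 1, mirroring the dictionary built in `categorize`.
    return {rank: category for rank, (category, _) in enumerate(best, start=1)}


# Made-up probabilities, for illustration only.
example_scores = {
    'Home & Kitchen -> Furniture': 0.15,
    'Electronics -> Speakers': 0.55,
    'Electronics -> Data Storage': 0.25,
    'Toys & Games': 0.05,
}

print(top_k_categories(example_scores))
# {1: 'Electronics -> Speakers', 2: 'Electronics -> Data Storage', 3: 'Home & Kitchen -> Furniture'}
```

Using heapq.nlargest avoids sorting every category when only the top few are needed, which may matter when there are thousands of categories.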

Here’s a breakdown of the class methods:

    - Init method initialises the class with the title string provided
    - Prepare method… well, prepares the title string via encoding, lowercasing, tokenizing, etc.
    - Categorize method then inputs the prepared title to the classification model and returns results in a dictionary
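The helper functions used by the prepare method come from earlier in the series and are not reproduced here. To make the steps concrete, below is a minimal standard-library sketch of a similar pipeline. Everything in it is an assumption for illustration: `prepare_title`, the tiny stop-word set, the tokenizing regex, and the minimum token length are stand-ins rather than the author's implementation, and singularization is left out.

```python
import html
import re
import unicodedata

# Tiny illustrative stop-word set; the real STOP_WORDS is not shown in the post.
STOP_WORDS = {'a', 'and', 'in', 'is', 'of', 'the', 'with'}


def prepare_title(title, min_len=2):
    """Decode, lowercase, tokenize, and filter a product title (stand-in)."""
    # Decode HTML entities (e.g. '&amp;' -> '&'), then strip accents.
    text = html.unescape(title)
    text = unicodedata.normalize('NFKD', text)
    text = ''.join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower()
    # Keep runs of letters/digits, allowing '-' and '.' inside a token.
    tokens = re.findall(r'[a-z0-9]+(?:[-.][a-z0-9]+)*', text)
    # Remove stop words, purely numeric tokens, and very short tokens.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [t for t in tokens if not t.isdigit()]
    tokens = [t for t in tokens if len(t) >= min_len]
    return tokens


print(prepare_title('Crème brûlée &quot; &amp; &nbsp;'))
# ['creme', 'brulee']
```

Whatever the exact steps are, the point made earlier still holds: new titles need to go through the same preparation as the titles used to train the model.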

Wrapping the class in a function

We can further simplify the use of the TitleCategorize class by wrapping it in a function. This lets us use the class via a simple function call, and lets us wrap it with other utility functions (such as a time logger).

@timer
def title_categorize(title):
    """ (str) -> dict
    Initializes given title as Title class and returns a dictionary of top 3 options.
    :param title:
    :return:
    """
    # Note: once decorated with @timer, a call returns (result, elapsed_time)
    result = TitleCategorize(title).prepare().categorize()
    return result

Timing how long the API takes

If you’ve used the product classification API (here), you’ll notice it displays the time taken to return a result. Code profiling and logging can be useful in improving and monitoring the performance of an API.

One way to log the time is by adding code to track the start time and end time of the function, and then getting the difference. Something like this:

import datetime


def title_categorize(title):
    """ (str) -> dict
    Initializes given title as Title class and returns a dictionary of top 3 options.
    :param title:
    :return:
    """
    start_time = datetime.datetime.now()
    result = TitleCategorize(title).prepare().categorize()
    end_time = datetime.datetime.now()
    # Convert the elapsed timedelta to milliseconds
    elapsed_time = end_time - start_time
    elapsed_time = elapsed_time.total_seconds() * 1000
    logger.debug('Time taken: {} ms'.format(elapsed_time))
    return result

However, if you have multiple APIs, this means duplicating this “timer” code for each API, violating the DRY (Don’t Repeat Yourself) principle. It also adds a lot of code to your wrapper functions. And what if you decide to change the time format? You’ll have to edit as much “timer” code as you have wrapper functions.

Fortunately, Python’s decorators allow us to write a utility timer once, and decorate our functions with it. This explains the @timer in the title_categorize() function above. Here’s what the timer decorator looks like:

import datetime


def timer(function_to_time):
    """
    Decorator that times the duration to get result from function
    :param function_to_time:
    :return:
    """
    def wrapper(*args, **kwargs):
        start_time = datetime.datetime.now()
        # Pass keyword arguments through as well as positional ones
        result = function_to_time(*args, **kwargs)
        end_time = datetime.datetime.now()
        elapsed_time = end_time - start_time
        elapsed_time = elapsed_time.total_seconds() * 1000
        logger.debug('Time taken: {} ms'.format(elapsed_time))
        # The decorated function now returns a (result, elapsed_time) tuple
        return result, elapsed_time
    return wrapper
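One consequence of this decorator is easy to miss: a decorated function no longer returns just its result, but a (result, elapsed time) tuple that the caller has to unpack. The sketch below shows that from the calling side. It is a variant rather than the code above: it swaps in time.perf_counter for datetime and adds functools.wraps so the decorated function keeps its name and docstring, and `shout` is a made-up stand-in for title_categorize.

```python
import functools
import time


def timer(function_to_time):
    """Return (result, elapsed_ms) for each call of the wrapped function."""
    @functools.wraps(function_to_time)  # keep the original name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = function_to_time(*args, **kwargs)
        elapsed_ms = (time.perf_counter() - start) * 1000
        return result, elapsed_ms
    return wrapper


@timer
def shout(text, suffix='!'):
    """Toy stand-in for title_categorize."""
    return text.upper() + suffix


# Callers unpack two values: the function's own result and the time taken.
result, elapsed_ms = shout('bookshelf', suffix='?')
print(result)  # BOOKSHELF?
print('Time taken: {:.3f} ms'.format(elapsed_ms))
```

Returning the elapsed time alongside the result is what lets a web page display it, at the cost of changing the function's return type for every caller.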

Building the (Flask) app

Okay, the class and wrapper function required for the product classification API are created. Next, how can we expose it in a user-friendly manner?

One way is to build a simple Flask app. Flask makes it easy to quickly create web applications. I won’t go into the details of Flask in this post—you can find out more here.

Writing the routes (i.e., URLs)

First, we’ll need to create the routes.py. This is where you list URLs for your web app. For now, we’ll just have the home page (/) and product classification page (/categorize_web).

from flask import render_template, request

# `app` (the Flask instance), `logger`, and `title_categorize` are defined elsewhere in the project


@app.route('/')
def index():
    return render_template('index.html')


@app.route('/categorize_web', methods=['GET', 'POST'])
def categorize_web():
    """
    Returns top three category options for the title in web.
    If input form is empty, returns result suggesting user to type something in input form.
    :return:
    """
    if request.method == 'POST':
        # Read the posted values
        _title = request.form['title'].encode('utf-8')  # encode to utf 8
        logger.info('Title form input: {}'.format(_title))
        # Categorize the title; the @timer decorator makes this return (result, elapsed_time)
        result, elapsed_time = title_categorize(_title)
    else:
        result, elapsed_time = {0: 'Type something in the product title field.'}, 0
    return render_template('categorize_web.html', result=result, elapsed_time=elapsed_time)

How the categorize_web route works is simple. If a user has entered and submitted a title, the title_categorize function is called with the title as input, and the result is returned in categorize_web.html. If it is the user’s first visit to the page (and no title is submitted), a GET request is triggered and a placeholder result is returned.

Many scenarios can occur on this page. What if the user presses submit without entering a title? What if there’s no result for the title provided? With some simple logic, you can handle these cases—I’ve not included them here to keep things simple.
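As a rough idea of what that logic could look like, here is a framework-free sketch of the two cases mentioned: an empty submission and a title with no result. It is not the code behind the demo. `categorize_or_placeholder`, `fake_categorize`, and the "no match" message are hypothetical, and the function assumes the categorizer returns a (result, elapsed time) pair as the timer decorator does.

```python
PROMPT = {0: 'Type something in the product title field.'}
NO_MATCH = {0: 'No category found for this title.'}  # hypothetical message


def categorize_or_placeholder(title, categorize):
    """Return (result, elapsed_ms), falling back to a placeholder result.

    `categorize` is any callable returning (dict, elapsed_ms), such as a
    function wrapped by a timing decorator.
    """
    if title is None or not title.strip():
        # Nothing submitted, or only whitespace: skip the model entirely.
        return PROMPT, 0
    result, elapsed_ms = categorize(title.strip())
    if not result:
        # The model returned no categories for this title.
        return NO_MATCH, elapsed_ms
    return result, elapsed_ms


def fake_categorize(title):
    """Stand-in for the real API: knows only one product."""
    known = {'bookshelf': {1: 'Home & Kitchen -> Furniture'}}
    return known.get(title, {}), 1.5


print(categorize_or_placeholder('   ', fake_categorize))
print(categorize_or_placeholder('bookshelf', fake_categorize))
print(categorize_or_placeholder('zzz', fake_categorize))
```

Keeping this branching in a plain function, separate from the route, means it can be tested without starting the web app.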

Trivia: Why is the URL categorize_web instead of simply categorize? I had initially built the API as an HTTP POST-only API to be accessed via curl, and this original API has the URL categorize.

Creating a shiny front-end

After setting up the routes, we’ll also need to set up the HTML for each of the URLs. Writing about HTML could make up an entire piece on its own, and there are many good blogs out there. This post will not cover the HTML aspects of datagene.io (and I’ll probably not write about HTML ever).

The HTML for datagene.io was not too difficult to set up, and is mainly based on bootstrap.

3, 2, 1, Blast off!

TitleCategorize class? Check.

Flask app? Check.

HTML? Check.

Now we’re ready to start our API. Flask makes starting the API simple. All you have to do is import the app, and start it like so:

from app.routes import app

if __name__ == '__main__':
    app.run()

Your product classification API will then be running on localhost:5000. Here’s how it might look:

Image: Product classification input empty

SWEET!

Conclusion

And there you have it—how to create your own product classification API and expose it.

For the sake of simplicity, we did not cover the machine learning aspects of building a product classifier. There are many good articles on machine learning available and there was no need to duplicate content.

In addition, we didn’t cover how to expose the API on the web. To do so, you’ll need to set it up on a web server (I use AWS) and expose the port. Sounds simple, but I found it trickier than initially thought.

I hope you enjoyed and learned from this three-part series. Any feedback is welcome!

If you found this useful, please cite this write-up as:

Yan, Ziyou. (Feb 2017). Product Categorization API Part 3: Creating an API. eugeneyan.com. https://eugeneyan.com/writing/product-categorization-api-part-3-creating-an-api/.

or

@article{yan2017api,
  title   = {Product Categorization API Part 3: Creating an API},
  author  = {Yan, Ziyou},
  journal = {eugeneyan.com},
  year    = {2017},
  month   = {Feb},
  url     = {https://eugeneyan.com/writing/product-categorization-api-part-3-creating-an-api/}
}