MapReduce with MongoDB and Python[ZT]
Time: 2023-09-14 08:58:22
From Artificial Intelligence in Motion, by Marcel Pinheiro Caraciolo (the images published on Artificial Intelligence in Motion are hosted outside the firewall, so they have been moved to cnblogs)
Hi all,
In this post, I'll present a demonstration of a map-reduce example with MongoDB and server-side JavaScript. Since I've been working with this technology recently, I thought it would be useful to present a simple example of how it works and how to integrate it with Python.
But what is MongoDB?
For those who don't know what MongoDB is or the basics of how to use it, it is important to explain a little about the No-SQL movement. Currently, there are several databases that break with the requirements of traditional relational database systems. The main characteristics shared by many No-SQL databases are:
- SQL commands are not used as the query API (examples of what is used instead include JSON, BSON, etc.).
- Atomic operations are not guaranteed.
- They are distributed and horizontally scalable.
- Schemas do not have to be predefined (schemaless).
- Data is stored in non-tabular form (e.g. key-value, object, graph).
Although it is not so obvious, No-SQL is an abbreviation of "Not Only SQL". This new approach has been making a lot of noise since 2009. You can find more information about it here and here. It is important to note that non-relational databases are not a complete replacement for relational databases. You need to know the pros and cons of each approach and decide which is most appropriate for the scenario you're facing.
MongoDB is one of the most popular No-SQL databases today and is what this article focuses on. It is a schemaless, document-oriented, high-performance, scalable database that uses key-value concepts to store documents as JSON-structured documents. It also includes some relational database features such as indexes and dynamic queries. It is used in production today by more than 40 websites, including web services such as SourceForge, GitHub, Electronic Arts and The New York Times.
One of the features I like most in MongoDB is Map-Reduce. In the next section I will explain how it works, illustrated with a simple example using MongoDB and Python. If you want to install MongoDB or get more information, you can download it here and read a nice tutorial here.
Map-Reduce
MapReduce is a programming model for processing and generating large data sets. It is a framework introduced by Google to support parallel computation over large data sets spread across clusters of computers. MapReduce is now a popular model in distributed computing, inspired by the map and reduce functions commonly used in functional programming. It is data-oriented and processes data in two primary steps: Map and Reduce. On top of that, the query is executed on several data sources simultaneously. The step that applies the request to each item of the input data set is called Map, and the step that aggregates the intermediate results of the mapping function into a consolidated result is called Reduce. The MapReduce paper, which has more details, can be read here.
Today there are several implementations of MapReduce, such as Hadoop, Disco and Skynet. The most famous is Hadoop, an open-source project implemented in Java. MongoDB has an implementation similar in spirit to Hadoop, with all input coming from a collection and output going to a collection. As a practical definition, Map-Reduce in MongoDB is useful for batch manipulation of data and for aggregation operations. In real scenarios, where you would have used GROUP BY in SQL, map/reduce is the equivalent tool in MongoDB.
Now that we have introduced Map-Reduce, let's see how to access MongoDB from Python.
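Before touching the database, the two steps described in this section can be sketched in plain Python. This is only an illustration of the model, not MongoDB's implementation: the function names (map_phase, group_by_key, reduce_phase, word_count) and the sample sentences are my own assumptions.

```python
import re
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map: turn one document into (key, value) pairs, here (word, 1)."""
    return [(word.lower(), 1) for word in re.findall(r"\w+", document)]


def group_by_key(pairs):
    """Collect the intermediate values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: consolidate all values of one key into a single result."""
    return reduce(lambda a, b: a + b, values)


def word_count(documents):
    """Run map over every document, group by key, then reduce each group."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = group_by_key(pairs)
    return {key: reduce_phase(key, values) for key, values in grouped.items()}


# Hypothetical sample input, not the data used in the article.
sentences = ["the quick brown fox", "the lazy dog", "the fox jumps"]
counts = word_count(sentences)
print(counts["the"], counts["fox"], counts["dog"])  # 3 2 1
```

Each call to map_phase depends on a single document only, which is what would allow the map step to run on several machines at once.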
PyMongo
PyMongo is a Python distribution containing tools for working with MongoDB, and is the recommended way to work with MongoDB from Python. It's easy to install and to use. See here how to install and use it.
Map-Reduce in Action
Now let's see Map-Reduce in action. To demonstrate map-reduce, I've decided to use one of the classic problems solved with it: counting word frequency across a series of documents. It's a simple problem and is well suited to a map-reduce query. I've decided to use two samples for this task. The first is a list of simple sentences to illustrate how map-reduce works. The second is Obama's 2009 speech on his election as president. It will be used to show a real example, illustrated by the code.
Let's consider the diagram below to help demonstrate how map-reduce could be distributed. It shows four sentences that are split into words and grouped by the map function, and then reduced independently (aggregated) by the reduce function. This is interesting because it means our query can be distributed across separate nodes (computers), resulting in a faster word frequency count. It's also important to note that the example below shows a balanced tree, but it could be unbalanced or even show some redundancy.
Map-Reduce Distribution
Some notes you need to know before developing your map and reduce functions:
- The MapReduce engine may invoke reduce functions iteratively; thus, these functions must be idempotent. That is, the following must hold for your reduce function: for all k, vals: reduce(k, [reduce(k, vals)]) == reduce(k, vals)
- Currently, the return value from a reduce function cannot be an array (it's typically an object or a number).
- If you need to perform an operation only once, use a finalize function.
Let's go now to the code. For this task, I'll use PyMongo, which has support for Map/Reduce. As I said earlier, the input text will be Obama's speech, which by the way has many repeated words. Take a look at the tag cloud (a cloud of words in which each word's font size reflects its frequency) of Obama's speech.
Obama's Speech in 2009
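The idempotency requirement on reduce functions noted earlier is easy to check mechanically. The sketch below is plain Python rather than server-side JavaScript, and the names reduce_sum, reduce_len and is_idempotent are hypothetical helpers of mine. It contrasts a reduce that sums the emitted counts with one that merely counts list entries, a common mistake that breaks as soon as the engine feeds a reduced value back in.

```python
def reduce_sum(key, values):
    """Sums the emitted counts; safe to apply to its own output."""
    return sum(values)


def reduce_len(key, values):
    """Counts list entries; wrong once a value is already a partial total."""
    return len(values)


def is_idempotent(reduce_fn, key, values):
    """Check reduce(k, [reduce(k, vals)]) == reduce(k, vals) for one input."""
    once = reduce_fn(key, values)
    return reduce_fn(key, [once]) == once


vals = [1, 1, 1]
print(is_idempotent(reduce_sum, "fox", vals))  # True
print(is_idempotent(reduce_len, "fox", vals))  # False
```

A check like this only tests the inputs you give it, so it can expose a broken reduce function but cannot prove that one is correct.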
To write our map and reduce functions, MongoDB allows clients to send JavaScript map and reduce implementations that are evaluated and run on the server. Here is our map function.
wordMap.js
As you can see, the this variable refers to the context from which the function is called. That is, MongoDB will call the map function on each document in the collection we are querying, with this pointing to that document, so a field of the document such as text can be accessed by calling this.text. The map function doesn't return a list; instead it calls an emit function, which it expects to be defined. The parameters of this function (key, value) are grouped with the intermediate results of other map evaluations that have the same key, as (key, [value1, value2]), and passed to the reduce function that we will define now.
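To make the role of emit concrete, here is a small Python analogue of the map step. It is a sketch under my own assumptions and not the code that runs on the server: the document argument plays the part of this, word_map and run_map are hypothetical names, and the {"count": 1} value shape is just one reasonable choice.

```python
import re
from collections import defaultdict


def word_map(document, emit):
    """Analogue of a map function: `document` stands in for `this`."""
    text = document.get("text")
    if not text:
        return  # nothing to emit for documents without a text field
    for word in re.findall(r"\w+", text):
        emit(word, {"count": 1})  # one intermediate value per occurrence


def run_map(documents):
    """Call word_map on every document and group emitted values by key."""
    grouped = defaultdict(list)

    def emit(key, value):
        grouped[key].append(value)

    for doc in documents:
        word_map(doc, emit)
    return dict(grouped)


# Hypothetical sample documents.
docs = [{"text": "yes we can"}, {"text": "yes we did"}, {"title": "no text"}]
intermediate = run_map(docs)
print(intermediate["yes"])  # [{'count': 1}, {'count': 1}]
```

The grouped structure has the (key, [value1, value2]) shape that is handed to the reduce function.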
wordReduce.js
The reduce function must reduce a list of a chosen type to a single value of that same type; it must be associative and commutative, so that it doesn't matter how the mapped items are grouped. Now let's code our word count example using the PyMongo client, passing the map/reduce functions to the server.
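The "same type in, same type out" rule can be illustrated with a Python analogue of a word-count reduce. As before, this is a sketch with a hypothetical name (word_reduce) and an assumed {"count": n} value shape; it shows that reducing everything at once and reducing two partial groups give the same answer.

```python
def word_reduce(key, values):
    """Takes a list of {"count": n} objects and returns one {"count": n}."""
    return {"count": sum(value["count"] for value in values)}


values = [{"count": 1} for _ in range(5)]

# All five intermediate values reduced in one call.
all_at_once = word_reduce("we", values)

# The same values reduced as two partial groups, then reduced again.
partial_a = word_reduce("we", values[:2])
partial_b = word_reduce("we", values[2:])
in_two_steps = word_reduce("we", [partial_a, partial_b])

print(all_at_once, in_two_steps)  # {'count': 5} {'count': 5}
```

Because the output has the same shape as each input value, a partial result can be mixed freely with raw emitted values in a later reduce call.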
mapReduce.py
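A minimal sketch of what such a script could look like today is shown below. It rests on several assumptions of mine: PyMongo 4 or later (where Collection.map_reduce was removed, so the mapReduce database command is issued through db.command), a server on localhost that still accepts mapReduce (recent MongoDB releases deprecate it in favour of the aggregation pipeline), a collection named texts, and an input file named obama_speech.txt. The JavaScript bodies are my own illustration of a word-count map and reduce.

```python
# Hypothetical JavaScript sources sent to the server as strings.
WORD_MAP_JS = r"""
function () {
    var words = this.text.match(/\w+/g);
    if (words === null) { return; }
    for (var i = 0; i < words.length; i++) {
        emit(words[i], {count: 1});
    }
}
"""

WORD_REDUCE_JS = """
function (key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += values[i].count;
    }
    return {count: total};
}
"""


def build_documents(lines):
    """One document per non-empty line, stored under the `text` field."""
    return [{"text": line.strip()} for line in lines if line.strip()]


def word_count(db, collection_name="texts"):
    """Run the mapReduce database command and return {word: count}."""
    from bson.code import Code  # bson is installed together with PyMongo

    result = db.command(
        "mapReduce",
        collection_name,
        map=Code(WORD_MAP_JS),
        reduce=Code(WORD_REDUCE_JS),
        out={"inline": 1},  # return results in the reply, not a collection
    )
    return {row["_id"]: int(row["value"]["count"]) for row in result["results"]}


def main(path="obama_speech.txt", uri="mongodb://localhost:27017"):
    """Load a text file into MongoDB and print the ten most frequent words."""
    from pymongo import MongoClient

    db = MongoClient(uri)["test"]
    db.texts.drop()  # start from an empty collection on every run
    with open(path, encoding="utf-8") as handle:
        db.texts.insert_many(build_documents(handle))
    counts = word_count(db)
    for word, count in sorted(counts.items(), key=lambda kv: -kv[1])[:10]:
        print(word, count)
```

Calling main() with a running MongoDB instance and a non-empty text file would print the most frequent words; the imports are kept inside the functions so the helpers can be read and reused without a database at hand.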
Let's see the result now:
And it works! :D With a Map-Reduce function, the word frequency count is efficient and can perform even better in a distributed environment. This brief experiment shows the potential of the map-reduce model for distributed computing, especially on large data sets.
All code used in this article can be downloaded here.
My next posts will be about performance evaluation of machine learning techniques. Stay tuned!
Marcel Caraciolo
References
http://nosql.mypopescu.com/post/394779847/mongodb-tutorial-mapreduce
http://fredzvt.wordpress.com/2010/04/24/no-sql-mongodb-from-introduction-to-high-level-usage-in-csharp-with-norm/
Zheng Yun: a veteran with ☑ mobile data services × 6 years, ☑ semantic aggregation × 4 years, ☑ O2O × 5 years.