Monday, December 30, 2013

[repost ]Understanding the Internal Message Buffers of Storm

original:http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/ When you are optimizing the performance of your Storm topologies it helps to understand how Storm’s internal message queues are configured and put to use. In this short article I will explain and illustrate how Storm version 0.8/0.9 implements the intra-worker communication that happens within a worker process and its associated executor threads. Internal [...]
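For reference, a minimal sketch of tuning the four buffer settings the article discusses, using the backtype.storm.Config constants of Storm 0.8/0.9. The spout/bolt wiring is omitted and the numeric values are illustrative starting points to benchmark, not recommendations:

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.topology.TopologyBuilder;

    public class BufferTuningSketch {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // ... setSpout/setBolt calls omitted ...

            Config conf = new Config();
            // Executor-level Disruptor ring buffers; sizes must be powers of two.
            conf.put(Config.TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE, 16384);
            conf.put(Config.TOPOLOGY_EXECUTOR_SEND_BUFFER_SIZE, 16384);
            // Worker-level queues on the inter-worker (network) boundary.
            conf.put(Config.TOPOLOGY_RECEIVER_BUFFER_SIZE, 8);
            conf.put(Config.TOPOLOGY_TRANSFER_BUFFER_SIZE, 32);

            StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());
        }
    }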



via WordPress http://blog.newitfarmer.com/big_data/streams/storm/13838/repost-understanding-the-internal-message-buffers-of-storm

[repost ]Bootstrapping a Java Project With Gradle, TestNG, Mockito and Cobertura for Eclipse and Jenkins

original:http://www.michael-noll.com/blog/2013/01/25/bootstrapping-a-java-project-with-gradle/ When starting out with a fresh Java project one of the nuisances you have to deal with is setting up your build and test environment. It’s even more troublesome if you are trying to switch from Maven to Gradle for your builds. In this article I will provide you with a bootstrap Java project [...]
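Not from the article, but as a taste of the test stack it wires together, here is a minimal TestNG test using Mockito (the test class and scenario are invented for illustration):

    import static org.mockito.Mockito.*;
    import java.util.List;
    import org.testng.Assert;
    import org.testng.annotations.Test;

    public class MockedListTest {
        @Test
        public void mockedListRemembersStubbing() {
            @SuppressWarnings("unchecked")
            List<String> names = mock(List.class);       // Mockito dynamic mock
            when(names.get(0)).thenReturn("storm");      // stub the call
            Assert.assertEquals(names.get(0), "storm");  // TestNG assertion
            verify(names).get(0);                        // interaction check
        }
    }

Run with `gradle test`; Cobertura coverage and the Eclipse project files are handled by the plugins the article configures.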



via WordPress http://blog.newitfarmer.com/java/others-java/13835/repost-bootstrapping-a-java-project-with-gradle-testng-mockito-and-cobertura-for-eclipse-and-jenkins

[repost ]Running a Multi-Broker Apache Kafka 0.8 Cluster on a Single Node

original:http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/ In this article I describe how to install, configure and run a multi-broker Apache Kafka 0.8 (trunk) cluster on a single machine. The final setup consists of one local ZooKeeper instance and three local Kafka brokers. We will test-drive the setup by sending messages to the cluster via a console producer and receive those [...]
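The core of such a setup (the standard Kafka 0.8 recipe, not quoted from the article; paths and ports are illustrative) is one copy of config/server.properties per broker, differing only in three settings:

    # config/server-1.properties        # config/server-2.properties
    broker.id=1                         broker.id=2
    port=9092                           port=9093
    log.dirs=/tmp/kafka-logs-1          log.dirs=/tmp/kafka-logs-2

Each broker is started with `bin/kafka-server-start.sh` pointed at its own file, and all of them share the single local ZooKeeper instance via zookeeper.connect.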



via WordPress http://blog.newitfarmer.com/message/kafka/13831/rpost-running-a-multi-broker-apache-kafka-0-8-cluster-on-a-single-node

[repost ]Finding Bugs at High Speed: Kafka Thread Safety

original:http://engineering.gnip.com/kafka-thread-safety/ Whenever you are adding a new component to a system there are going to be integration pains. What do all these log lines mean? Is this an error I have to worry about, or is it part of normal operation? Sometimes these questions take a while to answer, particularly if the tools you are [...]



via WordPress http://blog.newitfarmer.com/message/kafka/13829/repost-finding-bugs-at-high-speed-kafka-thread-safety

[repost ]Custom Event Batching with Apache Kafka

original:http://engineering.gnip.com/custom-event-batching-with-apache-kafka/ Kafka provides a fair amount of configuration and pluggability out of the box. With configuration settings, you can control many features on the producer, broker, and consumer. Beyond simple settings, further customization can be found in several pluggable interfaces. Looking at the producer side in particular, control over things like synchronous vs. asynchronous behavior, [...]
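Before the pluggable interfaces the article explores, much of the batching behavior is reachable through plain settings. A minimal sketch against the Kafka 0.8 Java producer API (broker address, topic, and the tuning values are placeholders, not the article's code):

    import java.util.Properties;
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class AsyncProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("metadata.broker.list", "localhost:9092");
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            props.put("producer.type", "async");        // buffer and send in batches
            props.put("batch.num.messages", "200");     // messages per batch
            props.put("queue.buffering.max.ms", "500"); // max time to buffer

            Producer<String, String> producer =
                    new Producer<String, String>(new ProducerConfig(props));
            producer.send(new KeyedMessage<String, String>("events", "hello"));
            producer.close();
        }
    }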



via WordPress http://blog.newitfarmer.com/message/kafka/13825/repost-custom-event-batching-with-apache-kafka

[repost ]Distributed Cluster In-Memory Data Technology Drives a Technical Revolution at 12306

original:http://server.chinabyte.com/151/12820151.shtml The China Railway Customer Service Center website (www.12306.cn) is one of the world's largest real-time transaction systems, comparable to Amazon.com, and comes under enormous load at holiday peaks, above all during the Spring Festival travel rush. Statistics show that during the early-2012 rush, 20 million people visited the site each day and daily page views peaked at 1.4 billion; the flood of simultaneous traffic brought 12306 close to paralysis. The Institute of Computing Technology at the China Academy of Railway Sciences, the builder of the 12306 online ticketing system, urgently needed a solution.

The result: queries more than 75x faster. In March 2012 China Railway Corporation (the former Ministry of Railways) began studying how to overhaul 12306, and in June 2012 it chose the Pivotal GemFire distributed in-memory computing platform, with implementation provided by project lead Wang Mingzhe of the Academy and IISI Information Technology Co. under the direction of institute head Zhu Jiansheng. Phase one attacked 12306's main bottleneck, the remaining-ticket query system; the code rework was finished in September and the system went live. During the National Day 2012 booking peak, users could log in to 12306 and, although tickets were still hard to get, remaining-ticket queries were noticeably fast. In October 2012, phase two rebuilt the order query system (customers looking up their own booking records) on GemFire; by the 2013 Spring Festival peak, remaining-ticket queries, order lookups, and order placement were all fast.

According to operational records, after the overhaul a dozen or so x86 servers delivered the remaining-ticket computation and query capacity that previously took dozens of minicomputers, and the worst-case single query dropped from roughly 15 seconds to under 0.2 seconds, better than a 75x improvement. Where the old system nearly collapsed under the extreme concurrency of the 2012 rush, the new one sustains over ten thousand concurrent queries per second, peaking at 26,000 queries/second (charted in the original post). The order query system, which previously managed only 300-400 queries/second and could absorb high concurrency only by sharding the database, now reaches upwards of 10,000 queries/second with query latency held around 20 milliseconds. The new architecture also scales elastically on demand: as concurrency grows, x86 servers can be added dynamically while response times stay at the millisecond level.

A technical revolution that leaps three generations. Gains this large cannot come from incremental tuning; they demand a fundamentally new approach with leverage on performance, which 12306 found in the GemFire distributed in-memory data platform. GemFire uses cloud virtualization to pool the memory of many x86 servers into a resource pool of up to tens of terabytes, loads all data into memory, and computes there; the computation itself never reads or writes disk, and data is only written out periodically, synchronously or asynchronously. GemFire keeps multiple copies of the data across the distributed cluster, so if any machine fails the data survives on the others, with the on-disk copy as a further backup; it can also persist in-memory data to traditional relational databases, Hadoop storage, and other file systems.

The bottleneck of current computing architectures is storage: processor speed doubles on a Moore's-law cadence while disk speed grows slowly, opening a gap as large as 100,000x. That gap explains why GemFire can lift system performance so sharply. By the relationship between computation and storage, architectures fall into four generations. First, disk-based single systems, which read data from disk during computation; minicomputers and mainframes push this design to its limit. Second, disk-based distributed clusters, which still read from disk but spread the data across many servers' disks to raise aggregate throughput; many large Internet and e-commerce companies rely on massive x86 deployments of this kind to absorb high concurrency. Third, memory-based single systems, which hold the whole database in memory so computation never reads disk; traditional in-memory databases of this type solve access speed for enterprise applications but cannot scale to massive data or massive concurrency. Fourth, memory-based distributed clusters; GemFire is such a system, with parallel computing as a key technique, so performance scales linearly by adding servers on top of in-memory computing.

12306 previously ran on a Unix minicomputer architecture; rebuilding it with GemFire on a Linux/x86 server cluster meant jumping three generations at once. Moving from minicomputers to large-memory x86 clusters raised performance by an order of magnitude at far lower cost. GemFire is part of Pivotal's enterprise big-data PaaS platform, which has three layers: a cloud infrastructure layer (Cloud Fabric), a big-data infrastructure layer (Data Fabric), and an application development layer (Application Fabric). GemFire sits in the data layer alongside the Greenplum database; the cloud layer is Cloud Foundry, and the application layer comprises the Spring Framework, RabbitMQ, and others.



via WordPress http://blog.newitfarmer.com/big_data/big-data_store/gemfire/13824/repost-%e5%88%86%e5%b8%83%e5%bc%8f%e9%9b%86%e7%be%a4%e5%86%85%e5%ad%98%e6%95%b0%e6%8d%ae%e6%8a%80%e6%9c%af%e5%bc%95%e9%a2%8612306%e6%8a%80%e6%9c%af%e9%9d%a9%e5%91%bd

[repost ]12306: Distributed In-Memory Data Technology Makes Queries 75x Faster

original:http://www.ctocio.com.cn/cloud/120/12820120.shtml Background and requirements: The China Railway Customer Service Center website (www.12306.cn) is one of the world's largest real-time transaction systems, comparable to Amazon.com, and comes under enormous load at holiday peaks, above all during the Spring Festival travel rush. Statistics show that during the early-2012 rush, 20 million people visited the site each day and daily page views peaked at 1.4 billion; the flood of simultaneous traffic brought 12306 close to paralysis. The Institute of Computing Technology at the China Academy of Railway Sciences, the builder of the 12306 online ticketing system, urgently needed a solution.

The result: queries more than 75x faster. In March 2012 China Railway Corporation (the former Ministry of Railways) began studying how to overhaul 12306, and in June 2012 it chose the Pivotal GemFire distributed in-memory computing platform, with implementation provided by project lead Wang Mingzhe of the Academy and IISI Information Technology Co. under the direction of institute head Zhu Jiansheng. Phase one attacked 12306's main bottleneck, the remaining-ticket query system; the code rework was finished in September and the system went live. During the National Day 2012 booking peak, users could log in to 12306 and, although tickets were still hard to get, remaining-ticket queries were noticeably fast. In October 2012, phase two rebuilt the order query system (customers looking up their own booking records) on GemFire; by the 2013 Spring Festival peak, remaining-ticket queries, order lookups, and order placement were all fast. According to operational records, after the overhaul a dozen or so x86 servers delivered the remaining-ticket computation and query capacity that previously took dozens of minicomputers, and the worst-case single query dropped from roughly 15 seconds to under 0.2 seconds, better than a 75x improvement; the new system sustains over ten thousand concurrent queries per second, peaking at 26,000 queries/second. The order query system, which previously managed only 300-400 queries/second and could absorb high concurrency only by sharding the database, now reaches upwards of 10,000 queries/second with latency held around 20 milliseconds. The new architecture scales elastically: as concurrency grows, x86 servers can be added dynamically while response times stay at the millisecond level.

The technology it had been dreaming of: one leap across three generations. Gains this large cannot come from incremental tuning; they demand a new approach with leverage on performance, which 12306 found in the GemFire distributed in-memory data platform. GemFire uses cloud virtualization to pool the memory of many x86 servers into a resource pool of up to tens of terabytes, loads all data into memory, and computes there; the computation never reads or writes disk, and data is only written out periodically, synchronously or asynchronously. GemFire keeps multiple copies of the data across the cluster, so a failed machine loses nothing, with the on-disk copy as a further backup; it can also persist in-memory data to traditional relational databases, Hadoop storage, and other file systems. The bottleneck of current architectures is storage: processor speed doubles on a Moore's-law cadence while disk speed grows slowly, a gap as large as 100,000x, which explains why GemFire lifts performance so sharply. By the relationship between computation and storage, architectures fall into four generations: disk-based single systems (minicomputers and mainframes); disk-based distributed clusters (the massive x86 deployments of large Internet and e-commerce companies); memory-based single systems (traditional in-memory databases, fast but unable to scale to massive data or concurrency); and memory-based distributed clusters, of which GemFire is one, scaling performance linearly by adding servers on top of in-memory computing. 12306 previously ran on a Unix minicomputer architecture; rebuilding it with GemFire on a Linux/x86 cluster meant jumping three generations at once, an order of magnitude more performance at far lower cost. GemFire is part of Pivotal's enterprise big-data PaaS platform, whose three layers are the cloud infrastructure layer (Cloud Fabric, built on Cloud Foundry), the big-data infrastructure layer (Data Fabric, home to GemFire and the Greenplum database), and the application development layer (Application Fabric, the Spring Framework, RabbitMQ, and others).

On the GemFire-based overhaul, Zhu Jiansheng, deputy director of the Institute of Computing Technology at the China Academy of Railway Sciences, commented: "The rework solved the peak-load concurrency problem that had troubled us for so long, so that people across the country no longer complain for technical reasons, and we can finally breathe. Pivotal GemFire's distributed cluster in-memory data technology was key to the whole effort. We also thank Pivotal and the implementation team for keeping the old system running smoothly during the rework, migrating smoothly from the old system to the new, and bringing the new system online quickly."



via WordPress http://blog.newitfarmer.com/big_data/big-data_store/gemfire/13823/rpost-12306%ef%bc%9a%e5%88%86%e5%b8%83%e5%bc%8f%e5%86%85%e5%ad%98%e6%95%b0%e6%8d%ae%e6%8a%80%e6%9c%af%e4%b8%ba%e6%9f%a5%e8%af%a2%e6%8f%90%e9%80%9f75%e5%80%8d

[repost ]A Guide To The Kafka Protocol

original:https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol Introduction Overview Preliminaries Network Partitioning and bootstrapping Partitioning Strategies Batching Versioning and Compatibility The Protocol Protocol Primitive Types Notes on reading the request format grammars Common Request and Response Structure Requests Responses Message sets Compression The APIs Metadata API Metadata Request Metadata Response Produce API Produce Request Produce Response Fetch API Fetch Request Fetch [...]
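To make the "Common Request and Response Structure" concrete: every request is size-prefixed and starts with the same header fields. A hand-rolled sketch of framing just that header (api key 3 is the metadata request; the request payload itself is omitted):

    import java.nio.ByteBuffer;
    import java.nio.charset.Charset;

    public class RequestHeaderSketch {
        public static void main(String[] args) {
            byte[] clientId = "demo".getBytes(Charset.forName("UTF-8"));
            ByteBuffer body = ByteBuffer.allocate(2 + 2 + 4 + 2 + clientId.length);
            body.putShort((short) 3);                // ApiKey (int16)
            body.putShort((short) 0);                // ApiVersion (int16)
            body.putInt(42);                         // CorrelationId, echoed in the response
            body.putShort((short) clientId.length);  // ClientId as an int16-prefixed string
            body.put(clientId);
            body.flip();

            ByteBuffer framed = ByteBuffer.allocate(4 + body.remaining());
            framed.putInt(body.remaining());         // Size prefix (int32)
            framed.put(body);
            framed.flip();                           // ready to write to the socket
        }
    }

Field order and widths follow the wiki page's grammar; everything past the header is elided here.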



via WordPress http://blog.newitfarmer.com/message/kafka/13820/repost-a-guide-to-the-kafka-protocol

[repost ]kafka Performance testing

original:https://cwiki.apache.org/confluence/display/KAFKA/Performance+testing Required Metrics Client Side Measurements Common Stats GC Log Analysis Server side metrics Log analysis Miscellaneous Phase I: Perf Tools Phase II: Automation Phase III: Correctness 0.8 Performance testing Producer throughput It would be worthwhile to automate our performance testing to act as a generic integration test suite. The goal of this would be [...]



via WordPress http://blog.newitfarmer.com/architecture/performance/13817/repost-kafka-performance-testing

[repost ]Kafka Internals

original:https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Internals This page is meant to give a high-level introduction to the Kafka code base and the major subsystems. It is meant to help you learn the code base. Here is an overview of a few of the subsystems: Kafka API layer LogManager and Log ReplicaManager ZookeeperConsumerConnector Here is a basic diagram of how these [...]



via WordPress http://blog.newitfarmer.com/message/kafka/13815/repost-kafka-internals

[repost ]Kafka Replication

original:https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Replication Kafka Replication High-level Design Replica placements Initial placement Incrementally add brokers online Take brokers offline Data replication Related work Synchronous replication Writes Reads Failure scenarios Follower failure Leader failure Asynchronous replication Open Issues Kafka Replication Detailed Design Kafka Replication High-level Design The purpose of adding replication in Kafka is for stronger durability and higher [...]



via WordPress http://blog.newitfarmer.com/message/kafka/13813/repost-kafka-replication

[repost ]KafkaSpout is not receiving anything from Kafka

original:http://stackoverflow.com/questions/17807292/kafkaspout-is-not-receiving-anything-from-kafka I am trying to rig up a Kafka-Storm “Hello World” system. I have Kafka installed and running, when I send data with the Kafka producer I can read it with the Kafka console consumer. I took the Chapter 02 example from the “Getting Started With Storm” O’Reilly book, and modified it to use [...]
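For orientation only, a sketch of how a storm-kafka spout of that era is typically wired up. The class names follow the storm.kafka package of the Clojars builds, but constructors and the scheme field changed between versions, so treat the exact signatures as assumptions to verify against your jar:

    import backtype.storm.topology.TopologyBuilder;
    import storm.kafka.KafkaConfig;
    import storm.kafka.KafkaSpout;
    import storm.kafka.SpoutConfig;
    import storm.kafka.StringScheme;

    public class KafkaSpoutWiring {
        public static void main(String[] args) {
            SpoutConfig cfg = new SpoutConfig(
                    new KafkaConfig.ZkHosts("localhost:2181", "/brokers"),
                    "test-topic",   // must match the topic the console producer writes to
                    "/kafkastorm",  // ZooKeeper root where offsets are stored
                    "discovery");   // consumer id
            cfg.scheme = new StringScheme();
            cfg.forceStartOffsetTime(-2); // -2 = earliest offset, -1 = latest

            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("kafka", new KafkaSpout(cfg), 1);
        }
    }

A frequent cause of the symptom in the question is the spout starting from the latest offset while the test messages were sent earlier, hence the forceStartOffsetTime(-2) hint.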



via WordPress http://blog.newitfarmer.com/message/kafka/13811/repost-kafkaspout-is-not-receiving-anything-from-kafka

Sunday, December 29, 2013

[repost ]kafka 0.7 Performance Results

original:http://kafka.apache.org/07/performance.html The following tests give some basic information on Kafka throughput as the number of topics, consumers and producers and overall data size varies. Since Kafka nodes are independent, these tests are run with a single producer, consumer, and broker machine. Results can be extrapolated for a larger cluster. We run producer and consumer tests [...]



via WordPress http://blog.newitfarmer.com/architecture/performance/13808/repost-kafka-0-7-performance-results

Thursday, December 26, 2013

[repost ]all versions of storm/storm-kafka

original:https://clojars.org/storm/storm-kafka/versions 0.9.0-wip16a-scala292 0.9.0-wip15b-scala292 0.9.0-wip7-scala292-multischeme 0.9.0-wip6-scala292-multischeme 0.9.0-wip15-scala292 0.8.0-wip4 0.8.0-wip3 0.7.6-dynamic-json-SNAPSHOT 0.7.5-dynamic-json 0.7.4-dynamic-json-SNAPSHOT 0.7.2-snaptmp8 0.7.2-snaptmp7 0.7.2-snaptmp6 0.7.2-snaptmp5 0.7.2-snaptmp4 0.7.2-snaptmp3 0.7.2-snaptmp2 0.7.2-snaptmp 0.7.2-snap2 0.7.2-snap 0.7.2-SNAPSHOT 0.7.1-SNAPSHOT 0.7.1-kafka7-SNAPSHOT 0.7.0-kafka7-SNAPSHOT 0.7.0-SNAPSHOT
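These are Clojars artifacts; taking the newest coordinate from the list above, a build can pull the spout in as follows (check which Scala/Kafka pairing your cluster needs):

    Leiningen:  [storm/storm-kafka "0.9.0-wip16a-scala292"]

    Maven:
        <dependency>
          <groupId>storm</groupId>
          <artifactId>storm-kafka</artifactId>
          <version>0.9.0-wip16a-scala292</version>
        </dependency>

with https://clojars.org/repo added as a repository.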



via WordPress http://blog.newitfarmer.com/message/kafka/13794/repost-all-versions-of-stormstorm-kafka

Saturday, December 21, 2013

[repost ]The Log: What every software engineer should know about real-time data’s unifying abstraction

original:http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying I joined LinkedIn about six years ago at a particularly interesting time. We were just beginning to run up against the limits of our monolithic, centralized database and needed to start the transition to a portfolio of specialized distributed systems. This has been an interesting experience: we built, deployed, and run to this day [...]



via WordPress http://blog.newitfarmer.com/architecture/distributed/13783/repost-the-log-what-every-software-engineer-should-know-about-real-time-datas-unifying-abstraction

[repost ]How to use Kafka and Avro

original:http://stackoverflow.com/questions/8298308/how-to-use-kafka-and-avro I’m trying to use Avro for messages being read from/written to Kafka. However, I’m not sure if I’m encoding the data correctly. Does anyone have an example of using the Avro binary encoder to encode/decode data that will not be used via RPC, such as writing it to a file or, in this [...]
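The non-RPC path the question asks about boils down to Avro's binary encoder. A minimal round-trip sketch (Avro 1.7-era API; the record schema is a stand-in, and the byte[] is what you would hand to a Kafka producer):

    import java.io.ByteArrayOutputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.*;
    import org.apache.avro.io.*;

    public class AvroRoundTrip {
        public static void main(String[] args) throws Exception {
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Msg\",\"fields\":"
                + "[{\"name\":\"body\",\"type\":\"string\"}]}");

            GenericRecord rec = new GenericData.Record(schema);
            rec.put("body", "hello kafka");

            // Encode to bytes with the binary (non-RPC) encoder.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(schema).write(rec, enc);
            enc.flush();
            byte[] bytes = out.toByteArray();

            // Decode on the consuming side with the matching decoder.
            BinaryDecoder dec = DecoderFactory.get().binaryDecoder(bytes, null);
            GenericRecord back = new GenericDatumReader<GenericRecord>(schema).read(null, dec);
            System.out.println(back.get("body"));
        }
    }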



via WordPress http://blog.newitfarmer.com/message/kafka/13778/repost-how-to-use-kafka-and-avro

[repost ]Fixing "OSError: [Errno 2] No such file or directory" When Starting Storm

original:http://blog.chinaunix.net/uid-1757778-id-3920971.html

    [root@localhost bin]# ./storm ui
    Traceback (most recent call last):
      File "./storm", line 409, in <module>
        main()
      File "./storm", line 406, in main
        (COMMANDS.get(COMMAND, unknown_command))(*ARGS)
      File "./storm", line 288, in ui
        jvmopts = parse_args(confvalue("ui.childopts", cppaths)) + [
      File "./storm", line 60, in confvalue
        p = sub.Popen(command, stdout=sub.PIPE)
      File "/usr/lib64/python2.6/subprocess.py", line 639, in __init__
        errread, errwrite) [...]



via WordPress http://blog.newitfarmer.com/big_data/streams/storm/13774/repost-%e5%90%af%e5%8a%a8storm%e6%97%b6%e5%87%ba%e7%8e%b0oserror-errno-2-no-such-file-or-directory-%e8%a7%a3%e5%86%b3%e6%96%b9%e6%a1%88

Friday, December 20, 2013

[repost ]Apache Kafka Default Encoder Not Working

original:http://stackoverflow.com/questions/19017422/apache-kafka-default-encoder-not-working I am using Kafka 0.8 beta, and I am just trying to mess around with sending different objects, serializing them using my own encoder, and sending them to an existing broker configuration. For now I am trying to get just DefaultEncoder working. I have the broker and everything setup and working for StringEncoder, but [...]
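A common resolution for this class of problem (mine, not necessarily the thread's): kafka.serializer.DefaultEncoder is a byte[] pass-through, so the producer and its messages must be typed byte[] rather than String. A sketch against the 0.8 producer API (broker address and topic are placeholders):

    import java.util.Properties;
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class DefaultEncoderSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("metadata.broker.list", "localhost:9092");
            // DefaultEncoder hands byte[] through untouched; the key serializer
            // defaults to serializer.class, so keys travel as byte[] too.
            props.put("serializer.class", "kafka.serializer.DefaultEncoder");

            Producer<byte[], byte[]> producer =
                    new Producer<byte[], byte[]>(new ProducerConfig(props));
            byte[] payload = mySerializedObject(); // output of your own encoding step
            producer.send(new KeyedMessage<byte[], byte[]>("objects", payload));
            producer.close();
        }

        // Placeholder for whatever serialization the application uses.
        private static byte[] mySerializedObject() { return new byte[] { 1, 2, 3 }; }
    }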



via WordPress http://blog.newitfarmer.com/message/kafka/13770/repost-apache-kafka-default-encoder-not-working

Monday, December 16, 2013

[repost ]oAuth with Scribe for LinkedIn

original:http://stackoverflow.com/questions/10265743/oauth-with-scribe-for-linkedin-accesstoken-issu question : I’m using scribe for logging into LinkedIn in my application. I would like to know if there is a way to automate the process of getting accessToken so that the user doesn’t have to enter the Verifier token. Possible? If yes, may i get a little help with the same? Answer: You [...]
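For context, the manual flow the question wants to automate looks like this with the Scribe 1.x API (the keys are placeholders, and the verifier string is exactly what the user would normally paste in):

    import org.scribe.builder.ServiceBuilder;
    import org.scribe.builder.api.LinkedInApi;
    import org.scribe.model.Token;
    import org.scribe.model.Verifier;
    import org.scribe.oauth.OAuthService;

    public class LinkedInOAuthSketch {
        public static void main(String[] args) {
            OAuthService service = new ServiceBuilder()
                    .provider(LinkedInApi.class)
                    .apiKey("YOUR_API_KEY")
                    .apiSecret("YOUR_API_SECRET")
                    .build();

            Token requestToken = service.getRequestToken();
            System.out.println("Authorize at: " + service.getAuthorizationUrl(requestToken));

            // The step the question wants to skip:
            Verifier verifier = new Verifier("PASTED_VERIFIER");
            Token accessToken = service.getAccessToken(requestToken, verifier);
            System.out.println("Access token: " + accessToken.getToken());
        }
    }

In OAuth 1.0a the verifier step is the point of the protocol; genuine automation means registering a callback URL so your server receives the verifier after the redirect, not bypassing the step.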



via WordPress http://blog.newitfarmer.com/security/oauth-security/13746/repost-oauth-with-scribe-for-linkedin

Friday, December 13, 2013

[repost ]The Relationship Between the Binomial, Exponential, and Poisson Distributions

original:http://my.oschina.net/u/347414/blog/129195 1. The Poisson distribution. Published by the French mathematician Siméon-Denis Poisson in 1838. If X follows a Poisson distribution with parameter λ, written X ~ P(λ), its probability function is P(X = k) = λ^k e^(−λ) / k!, k = 0, 1, 2, … The parameter λ is the average rate at which the random event occurs per unit time (or unit area). Statistically, the Poisson distribution applies when three conditions are met: (1) the event is rare, and the probability of two or more occurrences in a small interval tends to 0; (2) occurrences are independent and do not affect one another; (3) the occurrence rate is stable. The Poisson distribution mainly describes the count of rare events per unit time (or space), for example: the number of emissions from a radioactive substance per unit time; the number of bacteria in a unit volume of thoroughly mixed water; the number of certain insects per unit of space in the field. 2. The binomial distribution. Written ξ ~ B(n, p); expectation Eξ = np, variance Dξ = npq (with q = 1 − p). 3. The relationship between the binomial and Poisson distributions (the origin of the Poisson distribution: Poisson's law of small numbers). In n Bernoulli trials, if the number of trials n is large, the success probability p is small, and the product λ = np is moderate, then the distribution of the number of occurrences can be approximated by a Poisson distribution; in fact, the binomial distribution can be viewed as the Poisson distribution's counterpart in discrete time. Recall the definition of e and of the binomial distribution, set p = λ/n, and take the limit as n tends to infinity (the computation is spelled out below). 4. The Poisson and exponential distributions. A Poisson process is an important stochastic process, suited to describing the number of random events per unit time. In a Poisson process, the interval between the k-th and (k+1)-th events follows an exponential distribution, because the probability that the (k+1)-th event occurs within a period of length t after the k-th equals 1 minus the probability that no event occurs in that period; by the definition of the Poisson process, the probability of no event in a period of length t is e^(−λt), so the probability that the (k+1)-th event arrives within time t is 1 − e^(−λt), which is the exponential distribution. This also shows the memorylessness of the Poisson process. 5. Maximum likelihood estimation (the derivation appears only as images in the original). 6. The exponential distribution approaches 0 much more slowly than the power distribution, so it has a long tail and is often regarded as a long-tailed distribution; the in- and out-degrees of web page links follow such a distribution.
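Spelling out the limit in point 3, which the original carries only as images (this is the standard computation, not the post's own notation):

    P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}, \qquad p = \frac{\lambda}{n}

    \binom{n}{k}\left(\frac{\lambda}{n}\right)^{k}\left(1-\frac{\lambda}{n}\right)^{n-k}
      = \frac{n(n-1)\cdots(n-k+1)}{n^k}\cdot\frac{\lambda^k}{k!}
        \cdot\left(1-\frac{\lambda}{n}\right)^{n}\left(1-\frac{\lambda}{n}\right)^{-k}
      \;\longrightarrow\; \frac{\lambda^k}{k!}\,e^{-\lambda} \qquad (n \to \infty)

since the first factor tends to 1 and \left(1-\frac{\lambda}{n}\right)^{n} \to e^{-\lambda}; the binomial probabilities converge to the Poisson probabilities.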



via WordPress http://blog.newitfarmer.com/ai/math/13737/repost-%e4%ba%8c%e9%a1%b9%e5%88%86%e5%b8%83%e3%80%81%e6%8c%87%e6%95%b0%e5%88%86%e5%b8%83%e4%b8%8e%e6%b3%8a%e6%9d%be%e5%88%86%e5%b8%83%e7%9a%84%e5%85%b3%e7%b3%bb

[repost ]Data Processing API in Apache Tez

original:http://hortonworks.com/blog/expressing-data-processing-in-apache-tez/ This post is the second in our series on the motivations, architecture and performance gains of Apache Tez for data processing in Hadoop. The series has the following posts: Apache Tez: A New Chapter in Hadoop Data Processing Data Processing API in Apache Tez Runtime API in Apache Tez Writing a Tez Input/Processor/Output Apache [...]



via WordPress http://blog.newitfarmer.com/big_data/big-data_computing/tez/13726/repost-data-processing-api-in-apache-tez

[repost ]Introduction to the Hadoop Software Ecosystem

original:http://www.revelytix.com/?q=content/hadoop-ecosystem Via A. Griffins When Hadoop 1.0.0 was released by Apache in 2011, comprising mainly HDFS and MapReduce, it soon became clear that Hadoop was not simply another application or service, but a platform around which an entire ecosystem of capabilities could be built. Since then, dozens of self-standing software projects have sprung into being [...]



via WordPress http://blog.newitfarmer.com/big_data/big-data-production/hadoop/13708/repost-introduction-to-the-hadoop-software-ecosystem

[repost ]Spark Summit 2013





via WordPress http://blog.newitfarmer.com/others/13703/repost-spark-summit-2013

Thursday, December 12, 2013

[repost ]Large scale graph processing with apache giraph

original:https://speakerdeck.com/fs111/large-scale-graph-processing-with-apache-giraph



via WordPress http://blog.newitfarmer.com/anls/graph-analytics/giraph/13694/repost-large-scale-graph-processing-with-apache-giraph-2

[repost ]Large Scale Graph Processing with Apache Giraph

original:http://de.slideshare.net/sscdotopen/large-scale Large Scale Graph Processing with Apache Giraph from sscdotopen



via WordPress http://blog.newitfarmer.com/anls/graph-analytics/giraph/13692/repost-large-scale-graph-processing-with-apache-giraph

[repost ]apigee.com console

original:https://apigee.com/console/linkedin



via WordPress http://blog.newitfarmer.com/security/oauth-security/13682/repost-apigee-com-console

Wednesday, December 11, 2013

[repost ]Machine Learning Library (MLlib)

original:http://spark.incubator.apache.org/docs/latest/mllib-guide.html MLlib is a Spark implementation of some common machine learning (ML) functionality, as well as associated tests and data generators. MLlib currently supports four common types of machine learning problem settings, namely, binary classification, regression, clustering and collaborative filtering, as well as an underlying gradient descent optimization primitive. This guide will outline the functionality supported [...]



via WordPress http://blog.newitfarmer.com/ai/machine-learning/mllib/13672/repost-machine-learning-library-mllib

[repost ]GraphLab Topic Modeling

original:http://docs.graphlab.org/topic_modeling.html The topic modeling toolkit contains a collection of applications targeted at clustering documents and extracting topical representations. The resulting topical representation can be used as a feature space in information retrieval tasks and to group topically related words and documents. Currently the text modeling toolkit implements a fast asynchronous collapsed Gibbs sampler for the [...]



via WordPress http://blog.newitfarmer.com/ai/machine-learning/graphlab-machine-learning/13670/repost-graphlab-topic-modeling

[repost ]Large-scale graph computing at Google

original:http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html If you squint the right way, you will notice that graphs are everywhere. For example, social networks, popularized by Web 2.0, are graphs that describe relationships among people. Transportation routes create a graph of physical connections among geographical locations. Paths of disease outbreaks form a graph, as do games among soccer teams, computer network topologies, [...]



via WordPress http://blog.newitfarmer.com/big-data_graph/13656/repost-large-scale-graph-computing-at-google

[repost ]Graph Databases

original:http://adam.heroku.com/past/2010/3/15/graph_databases/ Graph databases are a type of datastore which treat the relationship between things as equally important to the things themselves. Examples of datasets that are natural fits for graph databases: Friend links on a social network “People who bought this also bought…” Amazon-style recommendation engines The world wide web In graph database parlance, a [...]



via WordPress http://blog.newitfarmer.com/nosql/graph-store/13650/repost-graph-databases-2

[repost ]5 Graph Databases to Consider

original:http://readwrite.com/2011/04/20/5-graph-databases-to-consider Of the major categories of NoSQL databases – document-oriented databases, key-value stores and graph databases – we’ve given the least attention to graph databases on this blog. That’s a shame, because as many have pointed out it may become the most significant category. Graph databases apply graph theory to the storage of information about [...]



via WordPress http://blog.newitfarmer.com/nosql/graph-store/neo4j-nosql/13651/repost-5-graph-databases-to-consider-2

Tuesday, December 10, 2013

[repost ]Why Are GPUs, Not CPUs, Increasingly What You Need, for Tasks Like Mining and Even Password Cracking?

original:http://www.zhihu.com/question/21231074/answer/20701124 From a Jandan article, "iOS hotspot passwords are not random and can be cracked in about a minute," I noticed this line: part of their success also comes down to advances in cracking hardware; the GPUs of four AMD Radeon 7970 cards working together can finish the crack in 50 seconds. That revived a question I had about Bitcoin mining, where every source says mining speed comes down to how strong your GPU is. I had always understood the CPU to be the core of computing speed, so why is GPU computing becoming ever more popular?

Answer (Cascade, perfectionist): I dedicate this answer to my first graphics card, an nVidia Riva TNT2. Long ago, around 2000, graphics cards were still called graphics accelerators. Anything called an "accelerator" is not a core component, much like the M7 coprocessor Apple uses today: better to have, fine to live without, as long as something provided basic graphics output to drive a display. Before that, dedicated graphics processors were found only in high-end workstations and home game consoles. Later, as the PC spread, games evolved, and a market hegemon like Windows emerged to simplify the work of graphics hardware vendors, the graphics processor, that is, the graphics card, gradually became ubiquitous.

To understand how a GPU differs from a CPU, first understand what the GPU was designed to do. Modern GPUs cover every aspect of graphics display; take one simple direction as an example. You may have seen the spinning cube from an old DirectX test. Displaying that cube takes many steps. Simplify it: think of it as a wireframe without the textured faces, then simplify further to just the eight corner points (a cube has eight vertices). The problem reduces to making those eight points rotate. When you create the cube you already have coordinates for the eight vertices, each stored as a vector, at minimum three-dimensional. In linear algebra, the "rotate" transform is a matrix, and rotating a vector means multiplying the vector by that matrix. Rotating the cube is therefore just eight vector-by-matrix multiplications. The computation is not complicated; unfolded, it is a handful of products summed together. The issue is volume: eight points take eight multiplications, and 2000 points take 2000. This is part of the GPU's job, vertex transformation, and it is the simplest part; there is a pile of messier work besides.

Most GPU work is like this: a huge amount of computation, little technical depth, repeated many, many times. If your job were a few hundred million additions and multiplications on numbers under one hundred, the best plan would be to hire a few dozen primary-school pupils and split the work among them; it has no technical content anyway, pure manual labor. The CPU is the old professor: he can do integrals and derivatives, but his salary equals twenty schoolchildren's. If you were Foxconn, which would you hire? The GPU uses many simple compute units to finish an enormous volume of computation, a pure human-wave strategy. The strategy rests on one premise: pupil A's work does not depend on pupil B's; the tasks are mutually independent. Many computation-heavy problems have this property, for example the password cracking and mining you mention, and much of graphics computation. They decompose into many identical small tasks, each handed to one pupil. Other tasks have "flow" dependencies, like a blind date: the two sides must first meet and like each other's look before anything proceeds; you cannot have someone collecting the marriage certificate before the first meeting. Those more intricate problems are the CPU's work.

In short, CPUs and GPUs were designed for different original workloads, so their designs differ substantially. Some tasks resemble the problems GPUs were originally built for, so GPUs now compute them. GPU speed is set by how many pupils you hire; CPU speed is set by how formidable a professor you engage. At complex tasks the professor crushes the pupils, but at simple ones he cannot match their numbers. Today's GPUs can also take on somewhat more complex work, upgraded to middle- or high-school level, so to speak, but they still need the CPU to feed the data to their mouths before they begin; in the end the CPU still manages them. As for how mining or password cracking gets carved into tasks simple enough for schoolchildren, that is the programmer's job. So if anyone tells you again that programming is mere manual labor, you may slap them. Thanks for the invite.



via WordPress http://blog.newitfarmer.com/architecture/performance/13647/repost-%e4%b8%ba%e4%bb%80%e4%b9%88%e7%8e%b0%e5%9c%a8%e6%9b%b4%e5%a4%9a%e9%9c%80%e8%a6%81%e7%94%a8%e7%9a%84%e6%98%af-gpu-%e8%80%8c%e4%b8%8d%e6%98%af-cpu%ef%bc%8c%e6%af%94%e5%a6%82%e6%8c%96%e7%9f%bf

[repost ] Continuous Time Bayesian Network Reasoning and Learning Engine 1.1.1

original:http://mloss.org/revision/view/1480/ Description: CTBN-RLE is a continuous time Bayesian network reasoning and learning engine. A continuous time Bayesian network (CTBN) provides a compact (factored) description of a continuous-time Markov process. This software provides libraries and executables for most of the algorithms developed for CTBNs. For learning, CTBN-RLE implements structure and parameter learning for both complete and [...]



via WordPress http://blog.newitfarmer.com/ai/machine-learning/13645/repost-continuous-time-bayesian-network-reasoning-and-learning-engine-1-1-1

[repost ]Lucene Query Syntax

original:http://www.solrtutorial.com/solr-query-syntax.html Lucene has a custom query syntax for querying its indexes. Unless you explicitly specify an alternative query parser such as DisMax or eDisMax, you’re using the standard Lucene query parser by default. Here are some query examples demonstrating the query syntax. Keyword matching Search for word “foo” in the title field. title:foo Search for [...]
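For quick reference, a few more expressions in the same standard Lucene syntax (the field names are illustrative, not from the article):

    title:"foo bar"          exact phrase
    title:foo AND body:bar   boolean operators (also OR, NOT, +required, -prohibited)
    title:fo?o  title:foo*   single- and multi-character wildcards
    title:foo~               fuzzy match
    price:[10 TO 20]         inclusive range; {10 TO 20} is exclusive
    title:foo^4              boost this term's relevance score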



via WordPress http://blog.newitfarmer.com/ai/ai-ir/lucene/13636/repost-lucene-query-syntax

[repost ]Java Serializable Object to Byte Array

original:http://stackoverflow.com/questions/2836646/java-serializable-object-to-byte-array Prepare bytes to send:

    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    ObjectOutput out = null;
    try {
        out = new ObjectOutputStream(bos);
        out.writeObject(yourObject);
        byte[] yourBytes = bos.toByteArray();
        ...
    } finally {
        try {
            if (out != null) {
                out.close();
            }
        } catch (IOException ex) {
            // ignore close exception
        }
        try {
            bos.close();
        } catch (IOException ex) [...]
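The elided remainder of the answer is the reverse direction; a sketch (not the answer's verbatim code) of reading the object back from the same yourBytes array:

    ByteArrayInputStream bis = new ByteArrayInputStream(yourBytes);
    ObjectInput in = null;
    try {
        in = new ObjectInputStream(bis);
        Object o = in.readObject();
    } finally {
        try {
            if (in != null) {
                in.close();
            }
        } catch (IOException ex) {
            // ignore close exception
        }
    }

ByteArrayInputStream.close() is a no-op, so only the ObjectInput needs closing.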



via WordPress http://blog.newitfarmer.com/java/others-java/13623/repost-java-serializable-object-to-byte-array

[repost ]Converting Java objects to byte array, JSON and XML

original:http://syntx.co/languages-frameworks/converting-java-objects-to-byte-array-json-and-xml/ Converting a Java object (a process known as serialization) to various forms such as XML, JSON, or a byte array and back into java objects is a very common requirement. This post is intended to be a quick reference for you to easily make these conversions. Java Object to Byte Array and Back Lets [...]
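Purely as an illustration of the JSON leg (not the article's own code; it may use a different library), here is the round trip with Jackson 2, where Person is a stand-in POJO:

    import com.fasterxml.jackson.databind.ObjectMapper;

    public class JsonSketch {
        public static class Person {
            public String name;  // public fields keep the sketch short
            public int age;
            public Person() {}   // Jackson needs a no-arg constructor
            public Person(String name, int age) { this.name = name; this.age = age; }
        }

        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            byte[] json = mapper.writeValueAsBytes(new Person("Ada", 36)); // object -> bytes
            Person back = mapper.readValue(json, Person.class);            // bytes -> object
            System.out.println(back.name + " / " + back.age);
        }
    }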



via WordPress http://blog.newitfarmer.com/java/others-java/13621/repost-converting-java-objects-to-byte-array-json-and-xml

[repost ]Graph Databases: Trends in the Web of Data by Marko Rodriguez on Sep 18, 2010

original:http://www.slideshare.net/slidarko/graph-databases-trends-in-the-web-of-data Relational databases are perhaps the most commonly used data management systems. In relational databases, data is modeled as a collection of disparate tables. In order to unify the data within these [...]



via WordPress http://blog.newitfarmer.com/nosql/graph-store/13619/repost-graph-databases-trends-in-the-web-of-data-by-marko-rodriguez-on-sep-18-2010

[repost ]Syntax-Semantic Mapping for General Intelligence: Language Comprehension as Hypergraph Homomorphism, Language Generation as Constraint Satisfaction

original:http://agi-conference.org/2012/wp-content/uploads/2012/12/paper_40.pdf



via WordPress http://blog.newitfarmer.com/ai/nlp/13617/repost-syntax-semantic-mapping-for-general-intelligence-language-comprehension-as-hypergraph-homomorphism-language-generation-as-constraint-satisfaction

[repost ]BBM draft errata

original:http://wiki.opencog.org/w/BBM_draft_errata Please list errata and comments about the BBM draft here. Please insert the errata/comments in the wiki page section corresponding to the relevant chapter. If no section on this wiki page exists for that chapter yet, just create that section using wiki syntax. When relevant, please indicate the specific page to which the correction [...]



via WordPress http://blog.newitfarmer.com/ai/ai-common/13614/repost-bbm-draft-errata

[repost ]OpenCog Theory

original:http://opencog.org/theory/ General Intelligence via Cognitive Synergy OpenCog is a diverse assemblage of cognitive algorithms, each embodying their own innovations — but what makes the overall architecture powerful is its careful adherence to the principle of cognitive synergy. The human brain consists of a host of subsystems carrying out particular tasks — some more specialized, some [...]



via WordPress http://blog.newitfarmer.com/ai/cognitive-computing/13587/repost-opencog-theory

[repost ]A Comparison of 7 Graph Databases

original:http://nosql.mypopescu.com/post/40759505554/a-comparison-of-7-graph-databases The main page of InfiniteGraph, a graph database commercialized by Objectivity, features an interesting comparison of 7 graph databases (InfiniteGraph, Neo4j, AllegroGraph, Titan, FlockDB, Dex, OrientDB) based on 16 criteria: licensing, source, scalability, graph model, schema model, API, query method, platforms, consistency, concurrency (distributed processing), partitioning, extensibility, visualizing tools, storage back end/persistency, language, backup/restore. [...]



via WordPress http://blog.newitfarmer.com/nosql/graph-store/13584/repost-a-comparison-of-7-graph-databases

[repost ]Graph Databases, The Web of Data Storage Engines by Pere Urbón-Bayes on Feb 05, 2011

original:http://www.slideshare.net/purbon/graph-databases-the-web-of-data-storage-engines Graph Databases, The Web of Data Storage Engines from Pere Urbón-Bayes



via WordPress http://blog.newitfarmer.com/nosql/graph-store/13578/repost-graph-databases-the-web-of-data-storage-engines-by-pere-urbon-bayeson-feb-05-2011

Monday, December 09, 2013

[repost ]HypergraphDB slideshare by Jan Drozen on Oct 28, 2012

original:http://www.slideshare.net/Drozi1/hypergraphdb-14923552 A gentle introduction to the HypergraphDB database. HypergraphDB from Jan Drozen



via WordPress http://blog.newitfarmer.com/nosql/graph-store/hypergraphdb/13568/repost-hypergraphdb-slideshare

[repost ]Is HyperGraphDB an Object-Oriented Database?

original:http://kobrix.blogspot.com/2010/02/is-hypergraphdb-object-oriented.html Back in the 90s, the “killer” of RDBMs were presumed to be the ODBMs. Today it is NoSQL. Why are RDBMs a prey to be killed, and why should any other approach be a voracious predator rather than a gentle companion has never been clear to me. Industry fads are always a bit ridiculous [...]



via WordPress http://blog.newitfarmer.com/nosql/graph-store/hypergraphdb/13563/repost-is-hypergraphdb-an-object-oriented-database

[repost ]HyperGraphDB › Comparison to neo4j

original:https://groups.google.com/forum/#!topic/hypergraphdb/dZU_Ol-9b_M Hello, General question from a person who just found out about HyperGraphDB and who has been watching what seems almost like hype around neo4j: How does HyperGraphDB compare to neo4j, technically speaking? Thanks, Otis P.S. Nice Wiki documentation! ---> I can’t really comment on the technical merits of neo4j since I haven’t used it. [...]



via WordPress http://blog.newitfarmer.com/nosql/graph-store/hypergraphdb/13561/repost-hypergraphdb-comparison-to-neo4j

Sunday, December 08, 2013

[repost ]gwtwiki (Bliki engine):MediaWikiAPISupport

original:https://code.google.com/p/gwtwiki/wiki/MediaWikiAPISupport Helper classes for the api.php Connecting through a HTTP proxy Troubleshooting Example – Get all members of a category Example – Get all interlanguage links Example – Get the image URL Example – Get the raw wiki content for all category members Example – Get the raw wiki content for all pages contained in [...]



via WordPress http://blog.newitfarmer.com/ai/kr/mediawiki-kr/13545/repost-gwtwiki-bliki-enginemediawikiapisupport

Saturday, December 07, 2013

[repost ]Kafka: Using the High Level Consumer

original:https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example Why use the High Level Consumer Sometimes the logic to read messages from Kafka doesn’t care about handling the message offsets, it just wants the data. So the High Level Consumer is provided to abstract most of the details of consuming events from Kafka. First thing to know is that the High Level Consumer [...]
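The skeleton the page builds up, sketched against the Kafka 0.8 high-level consumer API (connection settings, group, and topic are placeholders):

    import java.util.*;
    import kafka.consumer.*;
    import kafka.javaapi.consumer.ConsumerConnector;

    public class HighLevelConsumerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "localhost:2181");
            props.put("group.id", "demo-group");

            ConsumerConnector connector =
                    Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

            Map<String, Integer> topicCounts = new HashMap<String, Integer>();
            topicCounts.put("events", 1); // one stream (and usually one thread) per topic

            Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                    connector.createMessageStreams(topicCounts);

            ConsumerIterator<byte[], byte[]> it = streams.get("events").get(0).iterator();
            while (it.hasNext()) {
                // Offsets are tracked in ZooKeeper on the consumer's behalf.
                System.out.println(new String(it.next().message()));
            }
        }
    }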



via WordPress http://blog.newitfarmer.com/message/kafka/13540/repost-kafkausing-the-high-level-consumer

Friday, December 06, 2013

[repost ]Distilling Free-Form Natural Laws from Experimental Data

original:http://www.uvm.edu/~cmplxsys/newsevents/pdfs/2009/schmidt2009a.pdf Updated information and services related to this article, including high-resolution figures, are available in the online version at http://www.sciencemag.org/cgi/content/full/324/5923/81; Supporting Online Material can be found at http://www.sciencemag.org/cgi/content/full/324/5923/81/DC1, along with a list of selected additional articles on the Science Web sites related to this [...]



via WordPress http://blog.newitfarmer.com/big_data/big-data-others/13528/repost-distilling-free-form-natural-laws-from-experimental-data

[repost ]Foursquare :How we built our Model Training Engine

original:http://engineering.foursquare.com/2013/12/05/how-we-built-our-model-training-engine/ At Foursquare, we have large-scale machine-learning problems. From choosing which venue a user is trying to check in at based on a noisy GPS signal, to serving personalized recommendations, discounts, and promoted updates to users based on where they or their friends have been, almost every aspect of the app uses machine-learning in some [...]



via WordPress http://blog.newitfarmer.com/ai/data-mining/13521/repost-foursquare-how-we-built-our-model-training-engine

[repost ]Samza Documentation: Background, Concepts, Architecture

original:http://samza.incubator.apache.org/learn/documentation/0.7.0/ Introduction Background Concepts Architecture Comparisons Introduction MUPD8 Storm API Overview Javadocs Container TaskRunner Streams Checkpointing State Management Metrics Windowing Event Loop JMX Jobs JobRunner Configuration Packaging YARN Jobs Logging Background This page provides some background about stream processing, describes what Samza is, and why it was built. What is messaging? Messaging systems are a [...]



via WordPress http://blog.newitfarmer.com/big_data/streams/samza/13519/repost-samza-documentbackgroundconceptsarchitecture

Wednesday, December 04, 2013

[repost ]Quickly Building a Logging Platform with logstash + elasticsearch + kibana

original:http://www.cnblogs.com/buzzlight/p/logstash_elasticsearch_kibana_log.html Log analysis and monitoring hold an important place in system development, and the more complex the system, the more they matter. Common requirements include: searching log details by keyword; monitoring how the system is running; statistics and analysis, such as call counts, execution times and success rates per interface; automatic notifications triggered by anomalous data; and data mining on top of logs. Problems many teams run into with logs: developers cannot log in to production servers to read detailed logs and must go through operations, which costs time and effort; log data is scattered across multiple systems and hard to find; log volume is huge and queries are slow; one call crosses several systems, making it hard to locate the data quickly in all their logs; and the data is not real-time enough. Well-known heavyweight open-source trace systems include Facebook Scribe, Cloudera Flume, Twitter Zipkin, and Storm. These projects are powerful but too complex for many teams, and awkward to configure and deploy; until a system grows past a certain scale, a lightweight, download-and-run combination such as logstash + elasticsearch + kibana (LEK) is recommended. The most common needs for logs are collection, querying, and display, which correspond exactly to what logstash, elasticsearch, and kibana do. logstash deploys simply (download one jar and you can use it), and its processing logic is a simple pipeline: inputs >> codecs >> filters >> outputs, each stage served by plugins. logstash supports the common log types, integrates easily with other monitoring systems, and can output data to zabbix, nagios, email, and so on. Redis is recommended as the input buffer queue, and statistics can be pushed to graphite for visualization (see the metrics demo, statsd, and graphite; reference material: cookbook, doc, demo). elasticsearch is an open-source search engine based on Lucene that has developed quickly in recent years; its main traits are real time, distributed, high availability, document oriented [...]
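A minimal config of the recommended shape, redis as the input buffer and elasticsearch as the output, in logstash's own syntax (hosts and the key are placeholders; option names follow the logstash 1.x era the post describes):

    # indexer.conf
    input {
      redis {
        host      => "127.0.0.1"
        data_type => "list"      # pop log events off a redis list
        key       => "logstash"
      }
    }
    output {
      elasticsearch {
        host => "127.0.0.1"
      }
    }

Run with `java -jar logstash.jar agent -f indexer.conf`; kibana then points at the same elasticsearch index.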



via WordPress http://blog.newitfarmer.com/devops/logging-devops/13497/repost-%e4%bd%bf%e7%94%a8logstashelasticsearchkibana%e5%bf%ab%e9%80%9f%e6%90%ad%e5%bb%ba%e6%97%a5%e5%bf%97%e5%b9%b3%e5%8f%b0

Tuesday, December 03, 2013

[origin ]Install Storm Cluster 0.8.2

    [dpeuser@dpev209 ~]$ cd /opt
    [dpeuser@dpev209 opt]$ sudo mkdir storm
    [dpeuser@dpev209 opt]$ ls
    cloudera     eclipse        hadoop  hbase-data  IBM   Kerberos  soft  steer  sun   tivoli   wsa
    DoOnceLinux  elasticsearch  hbase   ibm         jdk7  neo4j     solr  storm  test  tomcat7  zookeeper
    [dpeuser@dpev209 opt]$ cd storm/
    [dpeuser@dpev209 storm]$ ls
    [dpeuser@dpev209 storm]$ sudo wget https://www.dropbox.com/s/fl4kr7w0oc8ihdw/storm-0.8.2.zip
    --2013-12-03 23:50:38-- https://www.dropbox.com/s/fl4kr7w0oc8ihdw/storm-0.8.2.zip
    Resolving www.dropbox.com... 108.160.166.20
    Connecting to www.dropbox.com|108.160.166.20|:443... [...]
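After unpacking, the cluster is pointed at ZooKeeper and Nimbus through conf/storm.yaml. A minimal sketch for Storm 0.8.2 (host names and paths are placeholders for this environment, not values from the transcript):

    storm.zookeeper.servers:
      - "dpev209"
    nimbus.host: "dpev209"
    storm.local.dir: "/opt/storm/data"
    supervisor.slots.ports:
      - 6700
      - 6701

Then run `bin/storm nimbus`, `bin/storm supervisor`, and `bin/storm ui` on the respective machines.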



via WordPress http://blog.newitfarmer.com/big_data/streams/storm/13489/origin-install-storm-cluster-0-8-2

[repost ]Inspecting Obj-C parameters in gdb

original:http://www.clarkcox.com/blog/2009/02/04/inspecting-obj-c-parameters-in-gdb/ Since the addition of i386 and x86_64 to the Mac OS’s repertoire several years back, remembering which registers are used for what has become difficult, and this can complicate the debugging of code for which you have no symbols. So here is my cheat-sheet (posted here, mostly so that I can find it again [...]



via WordPress http://blog.newitfarmer.com/programming/objective-c/13477/repost-inspecting-obj-c-parameters-in-gdb

Sunday, December 01, 2013

[repost ]NIPS 2013 papers

original:http://cs.stanford.edu/people/karpathy/nips2013/ Below every paper are the top 100 most-occurring words in that paper, colored according to an LDA topic model with k = 7. (It looks like 0 = reinforcement learning, 1 = deep learning, 2 = structured learning?, 3 = optimization?, 4 = graphical models, 5 = theory, 6 = neuroscience) Toggle LDA [...]



via WordPress http://blog.newitfarmer.com/ai/machine-learning/13463/repost-nips-2013-papers

Labels: