Hajussüsteemide (uurimis)seminar / Weekly seminar on Distributed Systems
Seminar: MTAT.08.014 Distributed Systems / Hajussüsteemide seminar 8EAP (2+2+2+2) and MTAT.08.024 12EAP (3+3+3+3) + MTAT.08.019 Distributed Systems Research Seminar / Hajussüsteemide uurimisseminar 20EAP (5+5+5+5) (PhD level). Scheduled for the spring semester 2012: Wed 14:15, Liivi 2 - 315 (Fri 14:15, L2-315)
Organization and requirements
Final presentations of the Distributed Systems Seminar will take place on Friday, 25 May, starting at 10:15 in Liivi 2 - 403
------------------------------------------------------------
15.02.2012: First meeting, topic selection
22.02.2012: Finalizing the topic choice, consultation on how to proceed.
------------------------------------------------------------
Deliverables:
- Written report or Blog
- Final presentation
Responsible persons: Satish Srirama, Ulrich Norbisrath and Eero Vainikko
Topic Areas (Concerned persons):
1. Parallel Scientific Applications and Concurrent Computing (Eero Vainikko, Oleg Batrashev)
2. Cloud computing (Pelle Jakovits, Vladimir Sor, Jyrmo Mehine, Briti Deb)
3. Mobile computing (Huber Flores)
4. Friend to Friend (F2F) computing (Artjom Lind)
5. Exploratory search (George Singer, Dimitri Danilov, Tõnu Tamme)
Topics
Possible themes with some suggested materials to start with:
1. Parallel Scientific Applications and Concurrent Computing (eero at ut.ee)
- Computing for multicore systems (olegus at ut.ee)
- Intel Thread Building Blocks (TBB) - task parallelism in C++
- Intel Array Building Blocks (ArBB) - sophisticated data parallelism in C++
- any other library for multicore systems
- Concurrent programming languages (olegus at ut.ee)
- The Go Programming Language from Google: channels as first-class objects
- any other language for concurrent programming
- Languages for scientific computing (all Partitioned Global Address Space (PGAS) languages)
- Chapel -- http://chapel.cray.com/
- Fortress -- http://projectfortress.java.net/
- X10 -- http://x10-lang.org/
- Denoising 2D and 3D images with the help of sparse solvers
- Programming on PlayStation 3
- Interval arithmetic for assessing algorithm accuracy
- Interactive interfaces to parallel computing systems
- Latest trends and environments for GPU Computing
- Latest trends in multicore CPU developments
- Latest trends in algorithm development for shared memory architectures, lock-free data structures etc.
- Forest simulations (- project in collaboration with Steffen M. Noe, Agricultural University)
2. Cloud Computing (satish.srirama at ut.ee):
2.1 Parallel computing -- Pelle Jakovits (Responsible person)
- SAC - Single Assignment C - a distributed array programming language predominantly suited for application areas such as numerically intensive applications and signal processing.
- Q - Overview of using Q array processing programming language and kdb+ database for distributed computing. (kdb+ is proprietary but kdb+ 32 bit is usable for non-commercial use)
- CloudSwitch - It provides migration and deployment of virtual machines to the cloud and allows the migrated servers to be integrated with existing systems running locally or in other clouds. Additionally, CloudSwitch provides secure channels and firewall virtual machines that can be deployed to secure the communication and data between a local cluster and virtual machines running on different public cloud platforms. Do a literature survey of papers written about CloudSwitch and its features, and compare CloudSwitch to other similar tools and related work that try to support a similar cloud migration and deployment process.
- High Performance Computing support from different public cloud providers and their comparison. Amazon EC2 provides High Performance Computing type instances for scientists, research groups and companies who require high quality dedicated servers for resource-demanding computing tasks and experiments. What other public cloud providers have similar services, what options can scientists choose from, and how do these options compare to each other (pricing, availability, size of available resources, performance, etc.)?
- Pricing of virtual machines in different public cloud providers. Providing public cloud services is a competitive business. Every year new providers enter the market to provide similar services to Amazon EC2, Azure, Google, GoGrid, etc. The diversity of cloud providers leads to a practical question: how well do the cloud providers perform when compared to each other? Do a literature survey into existing research into comparisons of different cloud providers and their services. Also, compare the results you found to the current state of the market, pricing, offered services, while also taking account of newest cloud providers in the market.
- HaLoop – Alternative MapReduce framework for large scale distributed cloud computing. How does it differ from Hadoop MapReduce, what are its disadvantages and advantages?
- Spark – Alternative MapReduce framework for large scale distributed cloud computing. How does it differ from Hadoop MapReduce, what are its disadvantages and advantages?
- JPREGEL - A Java BSP based large scale distributed graph computing framework. It is based on Google's proprietary Pregel framework.
- Hadoop cloud computing projects (check Cloudera for installing Hadoop subprojects using Cloudera CDH3):
- Hadoop on Amazon EC2 spot instances. Do a literature survey of existing studies and implement a use-case Hadoop application on Amazon EC2 that takes a portion (30% - 50%) of its resources from temporary spot instances. What are the advantages and disadvantages of using spot instances, is it possible to reduce costs by using them, and what percentage of spot instances can be used without overly compromising fault tolerance?
- Pydoop – Python MapReduce and HDFS API for Hadoop. It allows writing MapReduce applications for Hadoop in Python.
- HDFS & HBase – HDFS is the Hadoop distributed file system for storing huge data sets on clouds, based on the Google File System. HBase is the Hadoop distributed database; it provides random, real-time read/write access to distributed data in the cloud.
- Hive – Hive is a data warehouse infrastructure built on top of Hadoop that provides tools for summarization, ad-hoc querying and analysis of large datasets. What exactly are the differences between Hive and Pig, another data query language based on Hadoop?
- Whirr - Apache Whirr is a set of libraries for running cloud services. You can use Whirr to run Cloudera CDH3 clusters on cloud providers' clusters, such as Amazon Elastic Compute Cloud (Amazon EC2). There's no need to install the RPMs for CDH3 or do any configuration; a working cluster will start immediately with one command. It's ideal for running temporary Hadoop clusters to carry out a proof of concept, or to run a few one-time jobs. When you are finished, you can destroy the cluster and all of its data with one command.
- Mountable HDFS – Replicated mountable Hadoop distributed file system. Study its usability for replicated network storage and its performance and latency, and compare it to other similar solutions.
- MapReduce efficiency – Study of what can affect the parallel efficiency of applications adapted to the Hadoop MapReduce framework.
- ZooKeeper - Centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services for distributed applications. Implement a ZooKeeper service on SciCloud and a simple distributed demo application that uses ZooKeeper for configuration.
- Chukwa - Data collection and logging system for monitoring large distributed systems. It allows displaying, monitoring and analyzing the results. Study and implement Chukwa for SciCloud and demonstrate it.
- Oozie – Workflow solution to manage and coordinate jobs running on Hadoop, including HDFS, Pig and MapReduce.
- Flume – Data collection framework for Hadoop. Flume is a distributed service that makes it very easy to collect and aggregate your data into a persistent store such as HDFS. Flume can read data from almost any source – log files, Syslog packets, the standard output of any Unix process – and can deliver it to a batch processing system like Hadoop or a real-time data store like HBase.
- Bulk Synchronous Parallel (BSP) - distributed computing model. Introduce the model and do a literature survey of cloud computing frameworks that use the BSP model. What advantages does it have over the MapReduce model for distributed computing?
- Graph computing on the Cloud - Overview of the different distributed cloud frameworks for graph processing. Additionally, the student can implement one of the graph computing frameworks on SciCloud and demonstrate it.
- Cascading (http://www.cascading.org/) - Cascading is a Java API for defining and executing complex, scale-free, and fault-tolerant data processing workflows on a Hadoop cluster.
- Mesos (http://github.com/mesos) - Mesos is a platform for sharing commodity clusters among multiple diverse cluster computing frameworks, such as Hadoop, MPI, and web services. Additionally, the student should implement it on SciCloud and demonstrate it.
- MapReduce in data mining. Implementing some data mining algorithms in MapReduce.
- Pervasive cloud services:
- The seminar student should look at accessing several cloud services from mobile phones. The student should also find a suitable application to demonstrate on the phones.
- The mobile technology can be chosen by the student. Interesting platforms are Android, iPhone and J2ME.
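To make the BSP topic above more concrete, here is a minimal single-process sketch of the superstep cycle (local computation, message exchange, barrier). It is only an illustration under assumptions: the function name `bsp_max`, the toy graph and the "propagate the maximum value" task are hypothetical and not taken from any of the frameworks listed here. A real BSP framework would run the vertices on different machines and use a real barrier.

```python
from collections import defaultdict


def bsp_max(graph, values):
    """Spread the largest vertex value through a graph in BSP supersteps.

    graph  -- dict mapping each vertex to a list of neighbour vertices
    values -- dict mapping each vertex to its initial number
    Returns the final values and the number of supersteps executed.
    """
    values = dict(values)
    active = set(graph)   # vertices whose value changed in the last superstep
    supersteps = 0
    while active:
        # Communication phase: every active vertex sends its value to its
        # neighbours. Messages are only buffered here, not yet read.
        inbox = defaultdict(list)
        for vertex in active:
            for neighbour in graph[vertex]:
                inbox[neighbour].append(values[vertex])
        # Barrier: in this simulation the barrier is simply the end of the
        # loop above, so no message is read before all have been sent.
        # Computation phase: each vertex processes the messages it received.
        active = set()
        for vertex, messages in inbox.items():
            best = max(messages)
            if best > values[vertex]:
                values[vertex] = best
                active.add(vertex)
        supersteps += 1
    return values, supersteps


if __name__ == "__main__":
    toy_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    print(bsp_max(toy_graph, {"a": 3, "b": 6, "c": 2}))
```

The computation stops when no vertex changes its value during a superstep, which is the usual termination condition in Pregel-style graph frameworks.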
2.3 Mobile Cloud -- Huber Flores (Responsible person)
- A queue middleware mechanism (with mathematical analysis/model) for providing QoS in the invocation of cloud services from the mobile.
- Comparison of Load Balancing techniques with multiple tools (HAProxy, etc) for MCM (Mobile Cloud Middleware) Framework. (Using tools such as Tsung, JMeter, etc.)
- Data mining Analysis on sensor data (accelerometer, magnetic field, etc.) for the provisioning of context-aware services for mobile phones (using cloud services such as Hadoop, etc)
- Creating a graphic composition interface of cloud services in HTML 5 (services are retrieved from a middleware framework (MCM)). This composition is executed by the middleware for the invocation from mobile phones.
- Implementation of real-time communication protocols such as XMPP between the mobile phone and the cloud. (or mobile phone and the MCM)
- Data forensics on the mobile phone. The idea is to move all the content from the physical device to an emulator so that the emulator can be analyzed.
- Create a Mobile Desktop using the Web browser (HTML 5) for monitoring resources in the cloud. Buckets can be mapped to folders, EC2 EBS drives as storage devices, etc. (an Example http://dev.sencha.com/deploy/ext-4.0.0/examples/sandbox/sandbox.html)
- Migration from native Android applications to HTML5 applications: a tool that takes a native Android application and transforms it into an HTML5 application.
- Analysis of performance of mobile peer-to-peer file sharing using Mobile Host for Android.
- Creating a SyncML communication with MCM for handling invocation of cloud services from the mobile.
- Extending a Synchronization engine for the communication peer-to-peer using mobile phones.
2.4 Enterprise Cloud -- Vladimir Sor (Responsible person)
- Version rollout – step by step upgrade of applications. (volli at ut.ee)
  How Facebook, Google and others test new versions of their applications on a subset of users before affecting all users. Techniques, technologies, use cases, examples.
- Relocation techniques on cloud – how to analyze and implement application/database relocation on the fly. (volli at ut.ee)
  For example, if you have cloud instances in Europe, how do you detect that the majority of requests come in from the States, and how do you relocate the instances to the U.S.? How to set up load balancers to support this? Are there any existing services for that, e.g. by Amazon?
- JInspired and Costicity (http://costicity.jinspired.com). (volli at ut.ee)
  What does it do and how does it work? Is it possible to try it out?
- How to find memory leaks with http://www.appdynamics.com/ (a Lite version is available for download)? (volli at ut.ee)
  Cover web apps and desktop apps (Hadoop-like apps?)
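One common way to expose a new version to only a subset of users, as in the version rollout topic above, is deterministic bucketing by a hash of the user ID. The sketch below is a hedged illustration of that general technique, not a description of how any particular company does it; the names `in_rollout` and `new-ui` are hypothetical.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Return True if the user belongs to the first `percent` % of buckets.

    The bucket depends only on the feature name and the user ID, so the
    same user always gets the same answer for the same percentage.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10000      # stable bucket in 0..9999
    return bucket < percent * 100


if __name__ == "__main__":
    users = [f"user{i}" for i in range(1000)]
    early = {u for u in users if in_rollout(u, "new-ui", 10)}
    wider = {u for u in users if in_rollout(u, "new-ui", 50)}
    # Raising the percentage only adds users; nobody is switched back.
    print(len(early), len(wider), early <= wider)
```

Because the bucket is derived from a hash rather than stored, no per-user state is needed, and raising the percentage step by step keeps every earlier user on the new version.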
2.5 Network analysis on the Cloud -- Briti Deb (Responsible person)
- Mining Biological Data towards Inferring Gene Regulatory Networks.
- Clustering and Community Detection in Networks.
- Network Analysis for Business Applications.
- Detection of Network Motifs in Social / Biological Networks.
2.6 Data storage, text mining and web services on the cloud -- Jyrmo Mehine (Responsible Person)
- Solving a data mining problem using Pig or Hive, measuring performance.
  Pig and Hive are tools built on top of the Apache Hadoop MapReduce framework. Pig is meant mainly for data processing and preparation, while Hive is a tool for data warehousing, querying and presentation. We are interested in seeing some text mining algorithms implemented using one or both of these tools. We want to determine the feasibility of these tools for different data analysis tasks and also to measure the performance on large data in a distributed deployment. We are also interested in measuring productivity gains when using these tools compared to using plain Hadoop MapReduce.
- Common Crawl analysis.
  Common Crawl maintains an open repository of web crawl data. We can mine this raw data for interesting information (trends, word frequencies etc.) using distributed data processing tools available to us (MapReduce, Pig, Hive, Mahout).
- Writing an automated tool for data migration from SQL to NoSQL.
  In recent years the NoSQL movement has been picking up speed and many companies are switching to non-relational databases. We want a tool that can take an existing SQL database and automatically transfer all the data from it to a NoSQL system. An incomplete list of NoSQL systems to try: Cassandra, Redis, CouchDB, MongoDB, HBase.
- Comparing the performance of different NoSQL database systems.
  Database access speeds are currently very important in computing. We want to test different NoSQL systems to find out which performs best in different use cases and under different distribution conditions. What are the strengths and weaknesses of different NoSQL systems?
- Web services on the cloud.
  There are data-heavy enterprise software development and integration projects which we are interested in solving using web services.
- Overview of distributed version control systems.
  Distributed version control systems have some features in common with distributed databases. It would be interesting to see the main features of DVCSs outlined and explained.
- Collaborative infrastructure
  We want to create web-based project management and planning software (similar to Trac) to keep track of our research projects. Exact details of the system are still to be worked out. In addition to basic task management functionality the following keywords are worth considering:
  - Evidence-based scheduling
  - Collaborative plaintext editing (similar to Google Docs)
  - Git version control integration
  - RESTful plugin architecture
- HTML5 based service oriented modeling application.
  When designing enterprise software a diagram is often created to visualize the architecture. To simplify this process we need a point-and-click interface for compiling the diagram. The interface should have icons representing familiar elements of an architecture, e.g. database, web service, message queue etc. The system should allow exporting and importing a diagram in an XML-based format. Automatic code generation from the diagram would be an interesting extension to this project.
- Building a web application without a server-side language using JavaScript and CouchDB
  The CouchDB non-relational database has a rich HTTP API and a self-contained web server. The CouchDB programming model allows storing and serving HTML and JavaScript code. This means that it is possible to create rich AJAX applications that access the database from the web front end without the need for a server-side programming language. We are interested in evaluating the feasibility of this model, in terms of programming effort, compared to classical three-tier web application architectures.
- Graph database survey
  We are interested in learning more about different graph databases, e.g. Neo4J and InfiniteGraph. We want to see how some problems, e.g. collaborative filtering, can be solved using graph databases.
- Web feed trend analysis
  We can apply text mining and knowledge discovery methods to data accumulated from different sources on the web (blogs, Twitter, newspapers etc). We can study different distributed data processing tools (Pig, Hive, Mahout etc) when analysing this data.
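As a starting point for the SQL to NoSQL migration topic above, the following sketch shows one possible first step: reading every table of a relational database and turning each row into a JSON-style document. It is a simplified illustration under assumptions. It uses SQLite from the Python standard library as the source database, the function name `export_documents` is hypothetical, and the step of writing the documents into a concrete NoSQL system (CouchDB, MongoDB etc.) is deliberately left out.

```python
import json
import sqlite3


def export_documents(conn):
    """Yield (table_name, document) pairs, one plain dict per table row.

    Note: this sets conn.row_factory so that rows can be read by column name.
    """
    conn.row_factory = sqlite3.Row
    tables = [row["name"] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'")]
    for table in tables:
        # Table names come from the database schema itself, not from user
        # input, so quoting them is sufficient for this sketch.
        for row in conn.execute(f'SELECT * FROM "{table}"'):
            yield table, dict(row)


if __name__ == "__main__":
    # Hypothetical example data, created in memory for the demonstration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)",
                     [("Ann",), ("Bob",)])
    for table, document in export_documents(conn):
        print(table, json.dumps(document))
```

A real migration tool would also have to decide how to map foreign keys and joins onto the target data model (embedded documents, column families, key prefixes), which is where most of the design work in this topic lies.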
3. Friend to Friend (F2F) computing (Artjom Lind)
- Computing Engines
- mpi4py on Friend-to-Friend Computing.
  Currently the Python computing engine provides the standard F2F API communication interfaces (f2fSend/f2fReceive). It is essential to also provide the general methods specified in MPI (Message Passing Interface) in order to make F2F convenient for developers who are used to MPI. Python is well suited for prototyping, hence in the initial version we port the Python version of MPI (mpi4py) to F2F and, if that succeeds, proceed to low-level languages.
  Supervised by Artjom Lind and Ulrich Norbisrath
- Friend-to-Friend Sage. (<- Currently taken by Raivo Eriksoo)
  Sage is an open source math system (an alternative to Matlab, Magma, Maple) which allows solving complex mathematical and algebraic tasks using a Python command line interface or an intuitive GUI front-end. The goal is to make it possible to parallelize and distribute Sage computing tasks on multiple machines using F2F Computing.
  Supervised by Artjom Lind and Eero Vainikko
- LLVM for Friend-to-Friend Computing
  LLVM is a low level virtual machine infrastructure which allows running native code independently of the target architecture. It compiles the source code to an intermediate form, which is then distributed to the computing nodes, where it is converted and linked into machine-dependent code. Currently F2F supports LLVM v2.5, but 2.7 is already out. The goal is to explore the new features of LLVM 2.7 and extend the corresponding F2F computing engine.
  Supervised by Artjom Lind and Ulrich Norbisrath
- JVM on Friend-to-Friend Computing
  The JVM (Java Virtual Machine) allows running Java byte code independently of the platform architecture. Java is a popular mainstream programming language, and its support is crucial to make F2F a successful distributed computing platform. The goal is to research the different JVM implementations (List of known JVMs), select the most suitable one and use it to implement the F2F JVM computing engine.
  Supervised by Artjom Lind and Ulrich Norbisrath
- Sandboxing / Virtualization support for Friend-to-Friend Computing (<- Currently taken by Mati Vait)
  Sandboxing is another crucial aspect of any distributed platform. Different virtualization technologies are used to secure remote code execution. Platform (hardware) virtualization is one of them: it hides the physical characteristics of a computing platform from users and shows another, abstract computing platform instead. The goal is to design and implement an F2F computing module with sandboxing support, based on existing virtualization solutions (QEMU, KVM and VDE).
  Supervised by Artjom Lind and Ulrich Norbisrath
- Communication Modules
- SSH Server relay communication adapter
  The Peer-to-Peer network in F2F Computing is bootstrapped through an existing secure authenticated channel, currently provided by one of the Instant Messaging protocols. However, this is not the only possibility: if we consider the users of one SSH server (e.g. math.ut.ee), we can relay the communication between two of these users over the math server using SSH clients and UNIX pipes (consider this specification as a base). This solution will allow users to join the F2F network without having any Instant Messaging accounts.
  Supervised by Artjom Lind and Ulrich Norbisrath
- IP communication reestablishment after failure
  Currently the Peer-to-Peer network in F2F Computing depends on Instant Messaging protocols, and after a communication failure the bootstrap has to be repeated. This should not be necessary for short network failures. Restoring communication should be quick enough not to reduce the performance of distributed applications. A number of solutions are possible; one of them is the local host cache (used in Gnutella), which keeps track of all hosts ever connected and makes the bootstrap process much quicker. The goal is to develop a similar solution for the F2F Peer-to-Peer network.
  Supervised by Artjom Lind and Ulrich Norbisrath
- IPv6 communication for Friend-to-Friend Computing (<- Currently taken by Mihhail Klimenko)
  The IPv4 address space is now officially depleted and the Internet is focusing on IPv6. Will IPv6 expand faster now? Will it put an end to Network Address Translation? The advantages of IPv6 over IPv4 are obvious, hence there is a need to support the modern IP communication protocol in Friend-to-Friend computing.
  Supervised by Artjom Lind and Ulrich Norbisrath
- Core Features
- DHT resource index
- Globally Unique ID numbers
- Secure authentication of Peers in computing groups
- Front-Ends
- More messenger plug-ins
- Distributed Applications
- Swarm based content sharing application
- Distributed video rendering
- Multi-player games
- Voice and Video over F2F
- VPN over F2F
- VDE over F2F
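The local host cache mentioned in the "IP communication reestablishment after failure" topic above can be sketched as a small data structure that remembers peer addresses and returns them freshest first, so that a peer can try direct reconnection before falling back to the full Instant Messaging bootstrap. This is a hedged sketch only: the class name `HostCache`, its methods and the size limit are assumptions made for illustration and are not part of the F2F code base or of the Gnutella specification.

```python
import json
import time


class HostCache:
    """Remember peer addresses seen so far, most recently seen first."""

    def __init__(self, max_hosts=100):
        self.max_hosts = max_hosts      # assumed size limit, not from F2F
        self._seen = {}                 # (host, port) -> last-seen timestamp

    def record(self, host, port, now=None):
        """Note that a peer at (host, port) was reachable at time `now`."""
        self._seen[(host, port)] = time.time() if now is None else now
        if len(self._seen) > self.max_hosts:
            # Drop the entry that has not been seen for the longest time.
            oldest = min(self._seen, key=self._seen.get)
            del self._seen[oldest]

    def candidates(self):
        """Addresses to try when reconnecting, freshest first."""
        return sorted(self._seen, key=self._seen.get, reverse=True)

    def dumps(self):
        """Serialize the cache so that it survives a restart."""
        return json.dumps([[h, p, t] for (h, p), t in self._seen.items()])

    @classmethod
    def loads(cls, text, max_hosts=100):
        """Rebuild a cache from the output of dumps()."""
        cache = cls(max_hosts)
        for host, port, seen_at in json.loads(text):
            cache.record(host, port, now=seen_at)
        return cache


if __name__ == "__main__":
    cache = HostCache(max_hosts=2)
    cache.record("10.0.0.1", 4000, now=1)
    cache.record("10.0.0.2", 4000, now=2)
    cache.record("10.0.0.3", 4000, now=3)   # evicts the oldest entry
    print(cache.candidates())
```

After a short network failure a peer would walk through `candidates()` and attempt to reconnect to each address in turn, repeating the full bootstrap only if none of them answers.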
4. Exploratory search (George Singer, Dimitri Danilov, Tõnu Tamme)
- Collaborative search
- Search Patterns
- New Development in Search Engines
- The Vision of Ted Nelson (the inventor of the internet?)
- Xanadu (and Udanax)
- Graph Based Information Storage
- New Search Interfaces in Mozilla (practical and theoretical topics available)
- Machine learning based topic modeling in text documents (using the program Mallet).
- More topics on demand

