Hadoop Spark (MapR) - addFile: how does it work?


I'm trying to understand how this works. Say I have 10 directories on HDFS, containing hundreds of files, which I want to process with Spark.

In the book Fast Data Processing with Spark:

This file needs to be available on all the nodes in the cluster, which is not much of a problem in local mode. If you are in distributed mode, you will want to use Spark's addFile functionality to copy the file to all the machines in your cluster.

I can't understand this: will Spark create a copy of the file on each node? What I want is for it to read the files that are present in that directory (if that directory exists on that node). Sorry, I'm a bit confused about how to handle this situation in Spark. Regards.


The section you are referring to introduces SparkContext::addFile in a misleading context. It is titled "Loading data into an RDD", but it immediately diverges from that goal and introduces SparkContext::addFile as a way to get data into Spark. Over the next few pages it introduces some actual ways to get data into an RDD, such as SparkContext::parallelize and SparkContext::textFile. These resolve your concern: they split the data among the nodes rather than copying all of the data to every node.
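As a minimal sketch of the difference, assuming a hypothetical HDFS directory `hdfs:///data/logs` and a master supplied by `spark-submit`: `textFile` accepts a directory or wildcard and returns an RDD whose partitions are processed in parallel across the cluster, so no node needs a full copy of the input.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LoadIntoRdd {
  def main(args: Array[String]): Unit = {
    // The master URL is expected to come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("load-into-rdd"))

    // Hypothetical path: every file matching the pattern becomes part of
    // one RDD of lines, split into partitions handled by different tasks.
    val lines = sc.textFile("hdfs:///data/logs/*")
    println(s"partitions: ${lines.partitions.length}, lines: ${lines.count()}")

    // parallelize distributes a collection that already lives in the driver.
    val numbers = sc.parallelize(1 to 1000)
    println(numbers.reduce(_ + _))

    sc.stop()
  }
}
```

Neither call copies the whole dataset to every machine; each task reads only the partition assigned to it.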

A real production use case for SparkContext::addFile is to make a configuration file available to a library that can only be configured from a file on disk. For example, when using MaxMind's GeoIP Legacy API, you might configure the lookup object (as a field on some class) for use in a distributed map like this:

  @transient lazy val geoIp = new LookupService("GeoIP.dat", LookupService.GEOIP_MEMORY_CACHE | LookupService.GEOIP_CHECK_CACHE)

Outside your map function, you need to make GeoIP.dat available like this:

  sc.addFile("/path/to/GeoIP.dat")

Spark will then make the file available in the current working directory of the tasks on all nodes.
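The sketch below shows one way the two pieces could fit together. It assumes the legacy MaxMind `com.maxmind.geoip.LookupService` class is on the classpath; the `GeoLookup` holder object, the application name, and the sample IP addresses are hypothetical. It uses `SparkFiles.get`, which returns the absolute local path of a file distributed with `addFile`, so the code does not depend on the task's working directory.

```scala
import com.maxmind.geoip.LookupService
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

// Hypothetical holder: the lazy val is initialised once per executor JVM,
// the first time a task touches it, so the LookupService is never serialized.
object GeoLookup {
  lazy val service = new LookupService(
    SparkFiles.get("GeoIP.dat"), // local path of the copy fetched on this node
    LookupService.GEOIP_MEMORY_CACHE | LookupService.GEOIP_CHECK_CACHE)
}

object AddFileExample {
  def main(args: Array[String]): Unit = {
    // The master URL is expected to come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("addfile-example"))

    // Ship the database file to every node that runs tasks for this job.
    sc.addFile("/path/to/GeoIP.dat")

    // Sample input; in practice this would be an RDD loaded with textFile.
    val ips = sc.parallelize(Seq("8.8.8.8", "1.1.1.1"))
    val countries = ips.map(ip => ip -> GeoLookup.service.getCountry(ip).getName)
    countries.collect().foreach(println)

    sc.stop()
  }
}
```

Note that the data being processed still enters Spark as an RDD; only the small lookup database travels through `addFile`.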

So, in contrast with Daniel Durbaus's answer, there are reasons beyond experimentation to use SparkContext::addFile. Also, I can't find anything in the documentation that would lead one to believe the function is not production-ready. However, I agree that it is not what you want to use to load the data you are trying to process, unless it is for experimentation in the interactive Spark REPL, because it does not create an RDD.
